diff mbox series

[ovs-dev,v10] netdev-dpdk: add control plane protection support

Message ID 20230417123735.730402-1-rjarry@redhat.com
State Superseded
Headers show
Series [ovs-dev,v10] netdev-dpdk: add control plane protection support | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation fail test: fail

Commit Message

Robin Jarry April 17, 2023, 12:37 p.m. UTC
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.

Use the RTE flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. The RSS redirection table is re-programmed to
only use the other Rx queues. The RSS table size is stored in the
netdev_dpdk structure at port initialization to avoid requesting the
information again when changing the port configuration.

The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty, in exchange for preventing link flapping.

This feature must be enabled per port on specific protocols via the
cp-protection option. This option takes a comma-separated list of
protocol names. It is only supported on ethernet ports. This feature is
experimental.

If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control plane packets. If the
hardware cannot satisfy the requested number of Rx queues, the
last Rx queue will be assigned for control plane. If only one Rx queue
is available, the cp-protection feature will be disabled. If the
hardware does not support the RTE flow matchers/actions, the feature
will be disabled.

It cannot be enabled when other_config:hw-offload=true as it may
conflict with the offloaded RTE flows. Similarly, if hw-offload is
enabled, cp-protection will be forcibly disabled on all ports.

Example use:

 ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
   set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
   set interface phy0 options:cp-protection=lacp -- \
   set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
   set interface phy1 options:cp-protection=lacp

As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.

To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two ports bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.

   +----------------------+
   |         DUT          |
   |+--------------------+|
   ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
   ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
   || patch10    patch11 ||
   |+---|-----------|----+|
   |    |           |     |
   |+---|-----------|----+|
   || patch00    patch01 ||
   ||  tag:10    tag:20  ||
   ||                    ||
   ||       br-phy       || default flow, action=NORMAL
   ||                    ||
   ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
   ||    phy0   phy1     ||
   |+------|-----|-------+|
   +-------|-----|--------+
           |     |
   +-------|-----|--------+
   |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
   |         lag          | mode trunk VLANs 10, 20
   |                      |
   |        switch        |
   |                      |
   |  vlan 10    vlan 20  |  mode access
   |   port2      port3   |
   +-----|----------|-----+
         |          |
   +-----|----------|-----+
   |   tgen0      tgen1   |  Random traffic that is properly balanced
   |                      |  across the bond ports in both directions.
   |  traffic generator   |
   +----------------------+

Without cp-protection, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the Rx queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link, which causes the above-line-rate traffic to be dropped.

 ~# ovs-appctl lacp/show-stats bond0
 ---- bond0 statistics ----
 member: phy0:
   TX PDUs: 347246
   RX PDUs: 14865
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 168
   Link Defaulted: 0
   Carrier Status Changed: 0
 member: phy1:
   TX PDUs: 347245
   RX PDUs: 14919
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 147
   Link Defaulted: 1
   Carrier Status Changed: 0

When cp-protection is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.

This feature may be considered a form of "QoS". However, it does not
work by limiting the rate of traffic explicitly. It only guarantees
that some protocols have a lower chance of being dropped because the
PMD cores cannot keep up with regular traffic.

The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future, but it would expose users to more potential issues if
they accidentally redirected all traffic to the control plane queue.

Cc: Anthony Harivel <aharivel@redhat.com>
Cc: Christophe Fontaine <cfontain@redhat.com>
Cc: David Marchand <david.marchand@redhat.com>
Cc: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
---
v9 -> v10:

* If hw-offload is enabled, cp-protection will now be forcibly disabled
  on all ports.

 Documentation/topics/dpdk/phy.rst |  83 ++++++++
 NEWS                              |   3 +
 lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml              |  27 +++
 4 files changed, 421 insertions(+), 2 deletions(-)

Comments

Aaron Conole May 23, 2023, 1:32 p.m. UTC | #1
Robin Jarry <rjarry@redhat.com> writes:

> Some control protocols are used to maintain link status between
> forwarding engines (e.g. LACP). When the system is not sized properly,
> the PMD threads may not be able to process all incoming traffic from the
> configured Rx queues. When a signaling packet of such protocols is
> dropped, it can cause link flapping, worsening the situation.

I think one issue I have with this is that the name is a bit
misleading.  Control plane, from OVS perspective, would be like OpenFlow
communications.  This is more like a traffic steering mechanism.

Maybe it would help to call it something like "traffic-based-rps" or
something like that (but not clear what would be best).  It is really a
way to steer specific traffic to a distinct rxq.

WDYT?

I didn't look much at the code this time - maybe Kevin has taken a look?

> Use the RTE flow API to redirect these protocols into a dedicated Rx
> queue. The assumption is made that the ratio between control protocol
> traffic and user data traffic is very low and thus this dedicated Rx
> queue will never get full. The RSS redirection table is re-programmed to
> only use the other Rx queues. The RSS table size is stored in the
> netdev_dpdk structure at port initialization to avoid requesting the
> information again when changing the port configuration.
>
> The additional Rx queue will be assigned a PMD core like any other Rx
> queue. Polling that extra queue may introduce increased latency and
> a slight performance penalty, in exchange for preventing link flapping.
>
> This feature must be enabled per port on specific protocols via the
> cp-protection option. This option takes a comma-separated list of
> protocol names. It is only supported on ethernet ports. This feature is
> experimental.
>
> If the user has already configured multiple Rx queues on the port, an
> additional one will be allocated for control plane packets. If the
> hardware cannot satisfy the requested number of Rx queues, the
> last Rx queue will be assigned for control plane. If only one Rx queue
> is available, the cp-protection feature will be disabled. If the
> hardware does not support the RTE flow matchers/actions, the feature
> will be disabled.
>
> It cannot be enabled when other_config:hw-offload=true as it may
> conflict with the offloaded RTE flows. Similarly, if hw-offload is
> enabled, cp-protection will be forcibly disabled on all ports.
>
> Example use:
>
>  ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
>    set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
>    set interface phy0 options:cp-protection=lacp -- \
>    set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
>    set interface phy1 options:cp-protection=lacp
>
> As a starting point, only one protocol is supported: LACP. Other
> protocols can be added in the future. NIC compatibility should be
> checked.
>
> To validate that this works as intended, I used a traffic generator to
> generate random traffic slightly above the machine capacity at line rate
> on a two ports bond interface. OVS is configured to receive traffic on
> two VLANs and pop/push them in a br-int bridge based on tags set on
> patch ports.
>
>    +----------------------+
>    |         DUT          |
>    |+--------------------+|
>    ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
>    ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
>    || patch10    patch11 ||
>    |+---|-----------|----+|
>    |    |           |     |
>    |+---|-----------|----+|
>    || patch00    patch01 ||
>    ||  tag:10    tag:20  ||
>    ||                    ||
>    ||       br-phy       || default flow, action=NORMAL
>    ||                    ||
>    ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
>    ||    phy0   phy1     ||
>    |+------|-----|-------+|
>    +-------|-----|--------+
>            |     |
>    +-------|-----|--------+
>    |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
>    |         lag          | mode trunk VLANs 10, 20
>    |                      |
>    |        switch        |
>    |                      |
>    |  vlan 10    vlan 20  |  mode access
>    |   port2      port3   |
>    +-----|----------|-----+
>          |          |
>    +-----|----------|-----+
>    |   tgen0      tgen1   |  Random traffic that is properly balanced
>    |                      |  across the bond ports in both directions.
>    |  traffic generator   |
>    +----------------------+
>
> Without cp-protection, the bond0 links are randomly switching to
> "defaulted" when one of the LACP packets sent by the switch is dropped
> because the Rx queues are full and the PMD threads did not process them
> fast enough. When that happens, all traffic must go through a single
> link, which causes the above-line-rate traffic to be dropped.
>
>  ~# ovs-appctl lacp/show-stats bond0
>  ---- bond0 statistics ----
>  member: phy0:
>    TX PDUs: 347246
>    RX PDUs: 14865
>    RX Bad PDUs: 0
>    RX Marker Request PDUs: 0
>    Link Expired: 168
>    Link Defaulted: 0
>    Carrier Status Changed: 0
>  member: phy1:
>    TX PDUs: 347245
>    RX PDUs: 14919
>    RX Bad PDUs: 0
>    RX Marker Request PDUs: 0
>    Link Expired: 147
>    Link Defaulted: 1
>    Carrier Status Changed: 0
>
> When cp-protection is enabled, no LACP packet is dropped and the bond
> links remain enabled at all times, maximizing the throughput. Neither
> the "Link Expired" nor the "Link Defaulted" counters are incremented
> anymore.
>
> This feature may be considered a form of "QoS". However, it does not work by
> limiting the rate of traffic explicitly. It only guarantees that some
> protocols have a lower chance of being dropped because the PMD cores
> cannot keep up with regular traffic.
>
> The choice of protocols is limited on purpose. This is not meant to be
> configurable by users. Some limited configurability could be considered
> in the future, but it would expose users to more potential issues if
> they accidentally redirected all traffic to the control plane queue.
>
> Cc: Anthony Harivel <aharivel@redhat.com>
> Cc: Christophe Fontaine <cfontain@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>
> Cc: Kevin Traynor <ktraynor@redhat.com>
> Signed-off-by: Robin Jarry <rjarry@redhat.com>
> ---
> v9 -> v10:
>
> * If hw-offload is enabled, cp-protection will now be forcibly disabled
>   on all ports.
>
>  Documentation/topics/dpdk/phy.rst |  83 ++++++++
>  NEWS                              |   3 +
>  lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml              |  27 +++
>  4 files changed, 421 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
> index 4b0fe8dded3a..733c4251bca9 100644
> --- a/Documentation/topics/dpdk/phy.rst
> +++ b/Documentation/topics/dpdk/phy.rst
> @@ -131,6 +131,89 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
>  for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
>  information on configuring PMD threads, refer to :doc:`pmd`.
>  
> +Control Plane Protection
> +------------------------
> +
> +.. warning:: This feature is experimental.
> +
> +Some control protocols are used to maintain link status between forwarding
> +engines. In SDN environments, these packets share the same physical network
> +as the user data traffic.
> +
> +When the system is not sized properly, the PMD threads may not be able to
> +process all incoming traffic from the configured Rx queues. When a signaling
> +packet of such protocols is dropped, it can cause link flapping, worsening the
> +situation.
> +
> +Some physical NICs can be programmed to put these protocols in a dedicated
> +hardware Rx queue using the rte_flow__ API.
> +
> +__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
> +
> +The currently supported control plane protocols are:
> +
> +``lacp``
> +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
> +
> +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
> +
> +.. warning::
> +
> +   This feature is not compatible with all NICs. Refer to the DPDK
> +   `compatibility matrix`__ and vendor documentation for more details.
> +
> +   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
> +
> +Control plane protection must be enabled on specific protocols per port. The
> +``cp-protection`` option requires a comma-separated list of protocol names::
> +
> +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
> +        options:cp-protection=lacp
> +
> +.. note::
> +
> +   If multiple Rx queues are already configured, regular RSS (Receive Side
> +   Scaling) queue balancing is done on all but the extra control plane
> +   protection queue.
> +
> +.. tip::
> +
> +   You can check if control plane protection is supported on a port with the
> +   following command::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
> +
> +   This will also show in ``ovs-vswitchd.log``::
> +
> +      INFO|dpdk-p0: cp-protection: redirecting lacp traffic to queue 2
> +      INFO|dpdk-p0: cp-protection: applying rss on queues 0-1
> +
> +   If the hardware does not support redirecting control plane traffic to
> +   a dedicated queue, it will be explicit::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection=unsupported, driver_name=...}
> +
> +   More details can often be found in ``ovs-vswitchd.log``::
> +
> +      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
> +
> +To disable control plane protection on a port, use the following command::
> +
> +   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
> +
> +You can see that it has been disabled in ``ovs-vswitchd.log``::
> +
> +   INFO|dpdk-p0: cp-protection: disabled
> +
> +.. warning::
> +
> +   This feature is mutually exclusive with ``other_config:hw-offload`` as it
> +   may conflict with the offloaded RTE flows. If both are enabled,
> +   ``cp-protection`` will be forcibly disabled.
> +
>  .. _dpdk-phy-flow-control:
>  
>  Flow Control
> diff --git a/NEWS b/NEWS
> index b6418c36e95e..43338edaf5aa 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -21,6 +21,9 @@ Post-v3.1.0
>       * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
>         with the --hw-rawio-access command line option.  This allows the
>         process extra privileges when mapping physical interconnect memory.
> +     * New experimental "cp-protection=<protocol>" option to redirect certain
> +       protocols (for now, only LACP) to a dedicated hardware queue using
> +       RTE flow.
>     - SRv6 Tunnel Protocol
>       * Added support for userspace datapath (only).
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index fb0dd43f75c5..f33c6613adea 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -415,6 +415,10 @@ enum dpdk_hw_ol_features {
>      NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
>  };
>  
> +enum dpdk_cp_prot_flags {
> +    DPDK_CP_PROT_LACP = 1 << 0,
> +};
> +
>  /*
>   * In order to avoid confusion in variables names, following naming convention
>   * should be used, if possible:
> @@ -504,11 +508,18 @@ struct netdev_dpdk {
>           * so we remember the request and update them next time
>           * netdev_dpdk*_reconfigure() is called */
>          int requested_mtu;
> +        /* User input for n_rxq + an optional control plane protection queue
> +         * (see netdev_dpdk_reconfigure). This field is different from the
> +         * other requested_* fields as it may contain a different value than
> +         * the user input. */
>          int requested_n_txq;
>          int requested_n_rxq;
>          int requested_rxq_size;
>          int requested_txq_size;
>  
> +        /* User input for n_rxq (see dpdk_set_rxq_config). */
> +        int user_n_rxq;
> +
>          /* Number of rx/tx descriptors for physical devices */
>          int rxq_size;
>          int txq_size;
> @@ -534,6 +545,13 @@ struct netdev_dpdk {
>  
>          /* VF configuration. */
>          struct eth_addr requested_hwaddr;
> +
> +        /* Requested control plane protection flags,
> +         * from the enum set 'dpdk_cp_prot_flags'. */
> +        uint64_t requested_cp_prot_flags;
> +        uint64_t cp_prot_flags;
> +        size_t cp_prot_flows_num;
> +        struct rte_flow **cp_prot_flows;
>      );
>  
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -1310,9 +1328,14 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>      netdev->n_rxq = 0;
>      netdev->n_txq = 0;
>      dev->requested_n_rxq = NR_QUEUE;
> +    dev->user_n_rxq = NR_QUEUE;
>      dev->requested_n_txq = NR_QUEUE;
>      dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
>      dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
> +    dev->requested_cp_prot_flags = 0;
> +    dev->cp_prot_flags = 0;
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
>  
>      /* Initialize the flow control to NULL */
>      memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
> @@ -1487,6 +1510,8 @@ common_destruct(struct netdev_dpdk *dev)
>      ovs_mutex_destroy(&dev->mutex);
>  }
>  
> +static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
> +
>  static void
>  netdev_dpdk_destruct(struct netdev *netdev)
>  {
> @@ -1494,6 +1519,9 @@ netdev_dpdk_destruct(struct netdev *netdev)
>  
>      ovs_mutex_lock(&dpdk_mutex);
>  
> +    /* Destroy any rte flows to allow RXQs to be removed. */
> +    dpdk_cp_prot_unconfigure(dev);
> +
>      rte_eth_dev_stop(dev->port_id);
>      dev->started = false;
>  
> @@ -1908,8 +1936,8 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
>      int new_n_rxq;
>  
>      new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> -    if (new_n_rxq != dev->requested_n_rxq) {
> -        dev->requested_n_rxq = new_n_rxq;
> +    if (new_n_rxq != dev->user_n_rxq) {
> +        dev->user_n_rxq = new_n_rxq;
>          netdev_request_reconfigure(&dev->up);
>      }
>  }
> @@ -1931,6 +1959,48 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
>      }
>  }
>  
> +static void
> +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
> +                        const struct smap *args, char **errp)
> +{
> +    const char *arg = smap_get_def(args, "cp-protection", "");
> +    char *token, *saveptr, *buf;
> +    uint64_t flags = 0;
> +
> +    buf = xstrdup(arg);
> +    token = strtok_r(buf, ",", &saveptr);
> +    while (token) {
> +        if (strcmp(token, "lacp") == 0) {
> +            flags |= DPDK_CP_PROT_LACP;
> +        } else {
> +            VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                          "unknown protocol '%s'",
> +                          netdev_get_name(netdev), token);
> +        }
> +        token = strtok_r(NULL, ",", &saveptr);
> +    }
> +    free(buf);
> +
> +    if (flags && dev->type != DPDK_DEV_ETH) {
> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                      "is only supported on ethernet ports",
> +                      netdev_get_name(netdev));
> +        flags = 0;
> +    }
> +
> +    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                      "is incompatible with hw-offload",
> +                      netdev_get_name(netdev));
> +        flags = 0;
> +    }
> +
> +    if (flags != dev->requested_cp_prot_flags) {
> +        dev->requested_cp_prot_flags = flags;
> +        netdev_request_reconfigure(netdev);
> +    }
> +}
> +
>  static int
>  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>                         char **errp)
> @@ -1950,6 +2020,8 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>      ovs_mutex_lock(&dpdk_mutex);
>      ovs_mutex_lock(&dev->mutex);
>  
> +    dpdk_cp_prot_set_config(netdev, dev, args, errp);
> +
>      dpdk_set_rxq_config(dev, args);
>  
>      dpdk_process_queue_size(netdev, args, "n_rxq_desc",
> @@ -3819,9 +3891,12 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      struct rte_eth_dev_info dev_info;
> +    size_t cp_prot_flows_num;
> +    uint64_t cp_prot_flags;
>      const char *bus_info;
>      uint32_t link_speed;
>      uint32_t dev_flags;
> +    int n_rxq;
>  
>      if (!rte_eth_dev_is_valid_port(dev->port_id)) {
>          return ENODEV;
> @@ -3833,6 +3908,9 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>      link_speed = dev->link.link_speed;
>      dev_flags = *dev_info.dev_flags;
>      bus_info = rte_dev_bus_info(dev_info.device);
> +    cp_prot_flags = dev->cp_prot_flags;
> +    cp_prot_flows_num = dev->cp_prot_flows_num;
> +    n_rxq = netdev->n_rxq;
>      ovs_mutex_unlock(&dev->mutex);
>      ovs_mutex_unlock(&dpdk_mutex);
>  
> @@ -3875,6 +3953,19 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>                          ETH_ADDR_ARGS(dev->hwaddr));
>      }
>  
> +    if (cp_prot_flags) {
> +        if (!cp_prot_flows_num) {
> +            smap_add(args, "cp_protection", "unsupported");
> +        } else {
> +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
> +            if (n_rxq > 2) {
> +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
> +            } else {
> +                smap_add(args, "rss_queues", "0");
> +            }
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -5178,16 +5269,203 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
>      .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
>  };
>  
> +static int
> +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
> +                      const struct rte_flow_item items[],
> +                      const char *desc)
> +{
> +    const struct rte_flow_attr attr = { .ingress = 1 };
> +    const struct rte_flow_action actions[] = {
> +        {
> +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
> +            .conf = &(const struct rte_flow_action_queue) {
> +                .index = dev->up.n_rxq - 1,
> +            },
> +        },
> +        { .type = RTE_FLOW_ACTION_TYPE_END },
> +    };
> +    struct rte_flow_error error;
> +    struct rte_flow *flow;
> +    size_t num;
> +    int err;
> +
> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
> +    if (err) {
> +        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
> +                  netdev_get_name(&dev->up), desc,
> +                  error.message ? error.message : "");
> +        goto out;
> +    }
> +
> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
> +    if (flow == NULL) {
> +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
> +                  netdev_get_name(&dev->up), desc,
> +                  error.message ? error.message : "");
> +        err = rte_errno;
> +        goto out;
> +    }
> +
> +    num = dev->cp_prot_flows_num + 1;
> +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
> +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
> +    dev->cp_prot_flows_num = num;
> +
> +    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
> +              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
> +out:
> +    return err;
> +}
> +
> +#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
> +
> +static int
> +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
> +{
> +    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
> +    struct rte_eth_dev_info info;
> +    int err;
> +
> +    rte_eth_dev_info_get(dev->port_id, &info);
> +
> +    if (info.reta_size % rss_n_rxq != 0 &&
> +        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
> +        /*
> +         * Some drivers set reta_size equal to the total number of rxqs that
> +         * are configured when it is a power of two. Since we are actually
> +         * reconfiguring the redirection table to exclude the last rxq, we may
> +         * end up with an imbalanced redirection table. For example, such
> +         * configuration:
> +         *
> +         *   options:n_rxq=3 options:cp-protection=lacp
> +         *
> +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
> +         *
> +         *   [0, 1, 2, 3]
> +         *
> +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
> +         *
> +         *   [0, 1, 2, 0]
> +         *
> +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
> +         *
> +         * Work around that corner case by forcing a bigger redirection table
> +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
> +         * and when reta_size is less than 128. This value seems to be
> +         * supported by most of the drivers that also support rte flow.
> +         */
> +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
> +    }
> +
> +    memset(reta_conf, 0, sizeof(reta_conf));
> +    for (uint16_t i = 0; i < info.reta_size; i++) {
> +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
> +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
> +        reta_conf[idx].mask |= 1ULL << shift;
> +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
> +    }
> +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
> +    if (err < 0) {
> +        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
> +                  netdev_get_name(&dev->up), err);
> +    }
> +
> +    return err;
> +}
> +
> +static int
> +dpdk_cp_prot_configure(struct netdev_dpdk *dev)
> +{
> +    int err = 0;
> +
> +    if (dev->up.n_rxq < 2) {
> +        err = ENOTSUP;
> +        VLOG_WARN("%s: cp-protection: not enough available rx queues",
> +                  netdev_get_name(&dev->up));
> +        goto out;
> +    }
> +
> +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
> +        const struct rte_flow_item items[] = {
> +            {
> +                .type = RTE_FLOW_ITEM_TYPE_ETH,
> +                .spec = &(const struct rte_flow_item_eth){
> +                    .type = htons(ETH_TYPE_LACP),
> +                },
> +                .mask = &(const struct rte_flow_item_eth){
> +                    .type = htons(0xffff),
> +                },
> +            },
> +            { .type = RTE_FLOW_ITEM_TYPE_END },
> +        };
> +        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
> +        if (err) {
> +            goto out;
> +        }
> +    }
> +
> +    if (dev->cp_prot_flows_num) {
> +        /* Reconfigure RSS reta in all but the cp protection queue. */
> +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
> +        if (!err) {
> +            if (dev->up.n_rxq == 2) {
> +                VLOG_INFO("%s: cp-protection: redirected other traffic to "
> +                          "rx queue 0", netdev_get_name(&dev->up));
> +            } else {
> +                VLOG_INFO("%s: cp-protection: applied rss on rx queue 0-%u",
> +                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
> +            }
> +        }
> +    }
> +
> +out:
> +    return err;
> +}
> +
> +static void
> +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
> +{
> +    struct rte_flow_error error;
> +
> +    if (!dev->cp_prot_flows_num) {
> +        return;
> +    }
> +
> +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
> +
> +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
> +        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
> +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
> +                     netdev_get_name(&dev->up),
> +                     error.message ? error.message : "");
> +        }
> +    }
> +    free(dev->cp_prot_flows);
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
> +}
> +
>  static int
>  netdev_dpdk_reconfigure(struct netdev *netdev)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    bool try_cp_prot;
>      int err = 0;
>  
>      ovs_mutex_lock(&dev->mutex);
>  
> +    try_cp_prot = dev->requested_cp_prot_flags != 0;
> +    dev->requested_n_rxq = dev->user_n_rxq;
> +    if (try_cp_prot) {
> +        dev->requested_n_rxq += 1;
> +    }
> +
>      if (netdev->n_txq == dev->requested_n_txq
>          && netdev->n_rxq == dev->requested_n_rxq
> +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
>          && dev->mtu == dev->requested_mtu
>          && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
>          && dev->rxq_size == dev->requested_rxq_size
> @@ -5200,6 +5478,9 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>          goto out;
>      }
>  
> +retry:
> +    dpdk_cp_prot_unconfigure(dev);
> +
>      if (dev->reset_needed) {
>          rte_eth_dev_reset(dev->port_id);
>          if_notifier_manual_report();
> @@ -5224,6 +5505,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>      dev->txq_size = dev->requested_txq_size;
>  
>      rte_free(dev->tx_q);
> +    dev->tx_q = NULL;
>  
>      if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
>          err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
> @@ -5255,6 +5537,23 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>       */
>      dev->requested_hwaddr = dev->hwaddr;
>  
> +    if (try_cp_prot) {
> +        err = dpdk_cp_prot_configure(dev);
> +        if (err) {
> +            /* No hw support, disable & recover gracefully. */
> +            try_cp_prot = false;
> +            /*
> +             * The extra queue must be explicitly removed here to ensure that
> +             * it is unconfigured immediately.
> +             */
> +            dev->requested_n_rxq = dev->user_n_rxq;
> +            goto retry;
> +        }
> +    } else {
> +        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
> +    }
> +    dev->cp_prot_flags = dev->requested_cp_prot_flags;
> +
>      dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>      if (!dev->tx_q) {
>          err = ENOMEM;
> @@ -5467,6 +5766,13 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
>      dev = netdev_dpdk_cast(netdev);
>      ovs_mutex_lock(&dev->mutex);
>      if (dev->type == DPDK_DEV_ETH) {
> +        if (dev->requested_cp_prot_flags) {
> +            VLOG_WARN("%s: disabling cp-protection as it is "
> +                      "mutually exclusive with hw-offload.",
> +                      netdev_get_name(netdev));
> +            dev->requested_cp_prot_flags = 0;
> +            netdev_request_reconfigure(netdev);
> +        }
>          /* TODO: Check if we able to offload some minimal flow. */
>          ret = true;
>      }
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index edb5eafa04c3..7ba8bf192d62 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3491,6 +3491,33 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          <p>This option may only be used with dpdk VF representors.</p>
>        </column>
>  
> +      <column name="options" key="cp-protection"
> +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
> +        <p>
> +          Allocate an extra Rx queue for control plane packets of the specified
> +          protocol(s).
> +        </p>
> +        <p>
> +          If the user has already configured multiple Rx queues via
> +          <code>options:n_rxq</code> on the port, an additional one will be
> +          allocated for control plane packets. If the hardware cannot satisfy
> +          the requested number of Rx queues, the last Rx queue will be
> +          assigned to the control plane. If only one Rx queue is available or
> +          if the hardware does not support the RTE flow matchers/actions
> +          required to redirect the selected protocols,
> +          <code>cp-protection</code> will be disabled.
> +        </p>
> +        <p>
> +          This feature is mutually exclusive with
> +          <code>other_config:hw-offload</code> as it may conflict with the
> +          offloaded RTE flows. If both are enabled, <code>cp-protection</code>
> +          will be forcibly disabled.
> +        </p>
> +        <p>
> +          Disabled by default.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="tx-steering"
>                type='{"type": "string",
>                       "enum": ["set", ["thread", "hash"]]}'>
Kevin Traynor May 23, 2023, 2:05 p.m. UTC | #2
On 23/05/2023 14:32, Aaron Conole wrote:
> Robin Jarry <rjarry@redhat.com> writes:
> 
>> Some control protocols are used to maintain link status between
>> forwarding engines (e.g. LACP). When the system is not sized properly,
>> the PMD threads may not be able to process all incoming traffic from the
>> configured Rx queues. When a signaling packet of such protocols is
>> dropped, it can cause link flapping, worsening the situation.
> 
> I think one issue I have with this is that the name is a bit
> misleading.  Control plane, from OVS perspective, would be like OpenFlow
> communications.  This is more like a traffic steering mechanism.
> 
> Maybe it would help to call it something like "traffic-based-rps" or
> something like that (but not clear what would be best).  It is really a
> way to steer specific traffic to a distinct rxq.
> 
> WDYT?
> 
> I didn't look much at the code this time - maybe Kevin has taken a look?
> 

I'm looking :-)

>> Use the RTE flow API to redirect these protocols into a dedicated Rx
>> queue. The assumption is made that the ratio between control protocol
>> traffic and user data traffic is very low and thus this dedicated Rx
>> queue will never get full. The RSS redirection table is re-programmed to
>> only use the other Rx queues. The RSS table size is stored in the
>> netdev_dpdk structure at port initialization to avoid requesting the
>> information again when changing the port configuration.
>>
>> The additional Rx queue will be assigned a PMD core like any other Rx
>> queue. Polling that extra queue may introduce increased latency and
>> a slight performance penalty, with the benefit of preventing link flapping.
>>
>> This feature must be enabled per port on specific protocols via the
>> cp-protection option. This option takes a comma-separated list of
>> protocol names. It is only supported on ethernet ports. This feature is
>> experimental.
>>
>> If the user has already configured multiple Rx queues on the port, an
>> additional one will be allocated for control plane packets. If the
>> hardware cannot satisfy the requested number of Rx queues, the last Rx
>> queue will be assigned to the control plane. If only one Rx queue is
>> available, or if the hardware does not support the required RTE flow
>> matchers/actions, the feature will be disabled.
>>
>> It cannot be enabled when other_config:hw-offload=true as it may
>> conflict with the offloaded RTE flows. Similarly, if hw-offload is
>> enabled, cp-protection will be forcibly disabled on all ports.
>>
>> Example use:
>>
>>   ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
>>     set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
>>     set interface phy0 options:cp-protection=lacp -- \
>>     set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
>>     set interface phy1 options:cp-protection=lacp
>>
>> As a starting point, only one protocol is supported: LACP. Other
>> protocols can be added in the future. NIC compatibility should be
>> checked.
>>
>> To validate that this works as intended, I used a traffic generator to
>> generate random traffic slightly above the machine capacity at line rate
>> on a two ports bond interface. OVS is configured to receive traffic on
>> two VLANs and pop/push them in a br-int bridge based on tags set on
>> patch ports.
>>
>>     +----------------------+
>>     |         DUT          |
>>     |+--------------------+|
>>     ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
>>     ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
>>     || patch10    patch11 ||
>>     |+---|-----------|----+|
>>     |    |           |     |
>>     |+---|-----------|----+|
>>     || patch00    patch01 ||
>>     ||  tag:10    tag:20  ||
>>     ||                    ||
>>     ||       br-phy       || default flow, action=NORMAL
>>     ||                    ||
>>     ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
>>     ||    phy0   phy1     ||
>>     |+------|-----|-------+|
>>     +-------|-----|--------+
>>             |     |
>>     +-------|-----|--------+
>>     |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
>>     |         lag          | mode trunk VLANs 10, 20
>>     |                      |
>>     |        switch        |
>>     |                      |
>>     |  vlan 10    vlan 20  |  mode access
>>     |   port2      port3   |
>>     +-----|----------|-----+
>>           |          |
>>     +-----|----------|-----+
>>     |   tgen0      tgen1   |  Random traffic that is properly balanced
>>     |                      |  across the bond ports in both directions.
>>     |  traffic generator   |
>>     +----------------------+
>>
>> Without cp-protection, the bond0 links are randomly switching to
>> "defaulted" when one of the LACP packets sent by the switch is dropped
>> because the RX queues are full and the PMD threads did not process them
>> fast enough. When that happens, all traffic must go through a single
>> link, and since the offered load exceeds a single link's line rate,
>> packets are dropped.
>>
>>   ~# ovs-appctl lacp/show-stats bond0
>>   ---- bond0 statistics ----
>>   member: phy0:
>>     TX PDUs: 347246
>>     RX PDUs: 14865
>>     RX Bad PDUs: 0
>>     RX Marker Request PDUs: 0
>>     Link Expired: 168
>>     Link Defaulted: 0
>>     Carrier Status Changed: 0
>>   member: phy1:
>>     TX PDUs: 347245
>>     RX PDUs: 14919
>>     RX Bad PDUs: 0
>>     RX Marker Request PDUs: 0
>>     Link Expired: 147
>>     Link Defaulted: 1
>>     Carrier Status Changed: 0
>>
>> When cp-protection is enabled, no LACP packet is dropped and the bond
>> links remain enabled at all times, maximizing the throughput. Neither
>> the "Link Expired" nor the "Link Defaulted" counters are incremented
>> anymore.
>>
>> This feature may be considered a form of "QoS". However, it does not work by
>> limiting the rate of traffic explicitly. It only guarantees that some
>> protocols have a lower chance of being dropped because the PMD cores
>> cannot keep up with regular traffic.
>>
>> The choice of protocols is limited on purpose. This is not meant to be
>> configurable by users. Some limited configurability could be considered
>> in the future but it would expose to more potential issues if users are
>> accidentally redirecting all traffic in the control plane queue.
>>
>> Cc: Anthony Harivel <aharivel@redhat.com>
>> Cc: Christophe Fontaine <cfontain@redhat.com>
>> Cc: David Marchand <david.marchand@redhat.com>
>> Cc: Kevin Traynor <ktraynor@redhat.com>
>> Signed-off-by: Robin Jarry <rjarry@redhat.com>
>> ---
>> v9 -> v10:
>>
>> * If hw-offload is enabled, cp-protection will now be forcibly disabled
>>    on all ports.
>>
>>   Documentation/topics/dpdk/phy.rst |  83 ++++++++
>>   NEWS                              |   3 +
>>   lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
>>   vswitchd/vswitch.xml              |  27 +++
>>   4 files changed, 421 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
>> index 4b0fe8dded3a..733c4251bca9 100644
>> --- a/Documentation/topics/dpdk/phy.rst
>> +++ b/Documentation/topics/dpdk/phy.rst
>> @@ -131,6 +131,89 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
>>   for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
>>   information on configuring PMD threads, refer to :doc:`pmd`.
>>   
>> +Control Plane Protection
>> +------------------------
>> +
>> +.. warning:: This feature is experimental.
>> +
>> +Some control protocols are used to maintain link status between forwarding
>> +engines. In SDN environments, these packets share the same physical network
>> +as the user data traffic.
>> +
>> +When the system is not sized properly, the PMD threads may not be able to
>> +process all incoming traffic from the configured Rx queues. When a signaling
>> +packet of such protocols is dropped, it can cause link flapping, worsening the
>> +situation.
>> +
>> +Some physical NICs can be programmed to put these protocols in a dedicated
>> +hardware Rx queue using the rte_flow__ API.
>> +
>> +__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
>> +
>> +The currently supported control plane protocols are:
>> +
>> +``lacp``
>> +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
>> +
>> +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
>> +
>> +.. warning::
>> +
>> +   This feature is not compatible with all NICs. Refer to the DPDK
>> +   `compatibility matrix`__ and vendor documentation for more details.
>> +
>> +   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
>> +
>> +Control plane protection must be enabled on specific protocols per port. The
>> +``cp-protection`` option requires a comma-separated list of protocol names::
>> +
>> +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
>> +        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
>> +        options:cp-protection=lacp
>> +
>> +.. note::
>> +
>> +   If multiple Rx queues are already configured, regular RSS (Receive Side
>> +   Scaling) queue balancing is done on all but the extra control plane
>> +   protection queue.
>> +
>> +.. tip::
>> +
>> +   You can check if control plane protection is supported on a port with the
>> +   following command::
>> +
>> +      $ ovs-vsctl get interface dpdk-p0 status
>> +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
>> +
>> +   This will also show in ``ovs-vswitchd.log``::
>> +
>> +      INFO|dpdk-p0: cp-protection: redirected lacp traffic to rx queue 2
>> +      INFO|dpdk-p0: cp-protection: applied rss on rx queue 0-1
>> +
>> +   If the hardware does not support redirecting control plane traffic to
>> +   a dedicated queue, it will be explicit::
>> +
>> +      $ ovs-vsctl get interface dpdk-p0 status
>> +      {cp_protection=unsupported, driver_name=...}
>> +
>> +   More details can often be found in ``ovs-vswitchd.log``::
>> +
>> +      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
>> +
>> +To disable control plane protection on a port, use the following command::
>> +
>> +   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
>> +
>> +You can see that it has been disabled in ``ovs-vswitchd.log``::
>> +
>> +   INFO|dpdk-p0: cp-protection: disabled
>> +
>> +.. warning::
>> +
>> +   This feature is mutually exclusive with ``other_config:hw-offload`` as it
>> +   may conflict with the offloaded RTE flows. If both are enabled,
>> +   ``cp-protection`` will be forcibly disabled.
>> +
>>   .. _dpdk-phy-flow-control:
>>   
>>   Flow Control
>> diff --git a/NEWS b/NEWS
>> index b6418c36e95e..43338edaf5aa 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -21,6 +21,9 @@ Post-v3.1.0
>>        * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
>>          with the --hw-rawio-access command line option.  This allows the
>>          process extra privileges when mapping physical interconnect memory.
>> +     * New experimental "cp-protection=<protocol>" option to redirect certain
>> +       protocols (for now, only LACP) to a dedicated hardware queue using
>> +       RTE flow.
>>      - SRv6 Tunnel Protocol
>>        * Added support for userspace datapath (only).
>>   
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index fb0dd43f75c5..f33c6613adea 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -415,6 +415,10 @@ enum dpdk_hw_ol_features {
>>       NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
>>   };
>>   
>> +enum dpdk_cp_prot_flags {
>> +    DPDK_CP_PROT_LACP = 1 << 0,
>> +};
>> +
>>   /*
>>    * In order to avoid confusion in variables names, following naming convention
>>    * should be used, if possible:
>> @@ -504,11 +508,18 @@ struct netdev_dpdk {
>>            * so we remember the request and update them next time
>>            * netdev_dpdk*_reconfigure() is called */
>>           int requested_mtu;
>> +        /* User input for n_rxq + an optional control plane protection queue
>> +         * (see netdev_dpdk_reconfigure). This field is different from the
>> +         * other requested_* fields as it may contain a different value than
>> +         * the user input. */
>>           int requested_n_txq;
>>           int requested_n_rxq;
>>           int requested_rxq_size;
>>           int requested_txq_size;
>>   
>> +        /* User input for n_rxq (see dpdk_set_rxq_config). */
>> +        int user_n_rxq;
>> +
>>           /* Number of rx/tx descriptors for physical devices */
>>           int rxq_size;
>>           int txq_size;
>> @@ -534,6 +545,13 @@ struct netdev_dpdk {
>>   
>>           /* VF configuration. */
>>           struct eth_addr requested_hwaddr;
>> +
>> +        /* Requested control plane protection flags,
>> +         * from the enum set 'dpdk_cp_prot_flags'. */
>> +        uint64_t requested_cp_prot_flags;
>> +        uint64_t cp_prot_flags;
>> +        size_t cp_prot_flows_num;
>> +        struct rte_flow **cp_prot_flows;
>>       );
>>   
>>       PADDED_MEMBERS(CACHE_LINE_SIZE,
>> @@ -1310,9 +1328,14 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>>       netdev->n_rxq = 0;
>>       netdev->n_txq = 0;
>>       dev->requested_n_rxq = NR_QUEUE;
>> +    dev->user_n_rxq = NR_QUEUE;
>>       dev->requested_n_txq = NR_QUEUE;
>>       dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
>>       dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
>> +    dev->requested_cp_prot_flags = 0;
>> +    dev->cp_prot_flags = 0;
>> +    dev->cp_prot_flows_num = 0;
>> +    dev->cp_prot_flows = NULL;
>>   
>>       /* Initialize the flow control to NULL */
>>       memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
>> @@ -1487,6 +1510,8 @@ common_destruct(struct netdev_dpdk *dev)
>>       ovs_mutex_destroy(&dev->mutex);
>>   }
>>   
>> +static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
>> +
>>   static void
>>   netdev_dpdk_destruct(struct netdev *netdev)
>>   {
>> @@ -1494,6 +1519,9 @@ netdev_dpdk_destruct(struct netdev *netdev)
>>   
>>       ovs_mutex_lock(&dpdk_mutex);
>>   
>> +    /* Destroy any rte flows to allow RXQs to be removed. */
>> +    dpdk_cp_prot_unconfigure(dev);
>> +
>>       rte_eth_dev_stop(dev->port_id);
>>       dev->started = false;
>>   
>> @@ -1908,8 +1936,8 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
>>       int new_n_rxq;
>>   
>>       new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>> -    if (new_n_rxq != dev->requested_n_rxq) {
>> -        dev->requested_n_rxq = new_n_rxq;
>> +    if (new_n_rxq != dev->user_n_rxq) {
>> +        dev->user_n_rxq = new_n_rxq;
>>           netdev_request_reconfigure(&dev->up);
>>       }
>>   }
>> @@ -1931,6 +1959,48 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
>>       }
>>   }
>>   
>> +static void
>> +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
>> +                        const struct smap *args, char **errp)
>> +{
>> +    const char *arg = smap_get_def(args, "cp-protection", "");
>> +    char *token, *saveptr, *buf;
>> +    uint64_t flags = 0;
>> +
>> +    buf = xstrdup(arg);
>> +    token = strtok_r(buf, ",", &saveptr);
>> +    while (token) {
>> +        if (strcmp(token, "lacp") == 0) {
>> +            flags |= DPDK_CP_PROT_LACP;
>> +        } else {
>> +            VLOG_WARN_BUF(errp, "%s options:cp-protection "
>> +                          "unknown protocol '%s'",
>> +                          netdev_get_name(netdev), token);
>> +        }
>> +        token = strtok_r(NULL, ",", &saveptr);
>> +    }
>> +    free(buf);
>> +
>> +    if (flags && dev->type != DPDK_DEV_ETH) {
>> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
>> +                      "is only supported on ethernet ports",
>> +                      netdev_get_name(netdev));
>> +        flags = 0;
>> +    }
>> +
>> +    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
>> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
>> +                      "is incompatible with hw-offload",
>> +                      netdev_get_name(netdev));
>> +        flags = 0;
>> +    }
>> +
>> +    if (flags != dev->requested_cp_prot_flags) {
>> +        dev->requested_cp_prot_flags = flags;
>> +        netdev_request_reconfigure(netdev);
>> +    }
>> +}
>> +
>>   static int
>>   netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>>                          char **errp)
>> @@ -1950,6 +2020,8 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>>       ovs_mutex_lock(&dpdk_mutex);
>>       ovs_mutex_lock(&dev->mutex);
>>   
>> +    dpdk_cp_prot_set_config(netdev, dev, args, errp);
>> +
>>       dpdk_set_rxq_config(dev, args);
>>   
>>       dpdk_process_queue_size(netdev, args, "n_rxq_desc",
>> @@ -3819,9 +3891,12 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>>   {
>>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>       struct rte_eth_dev_info dev_info;
>> +    size_t cp_prot_flows_num;
>> +    uint64_t cp_prot_flags;
>>       const char *bus_info;
>>       uint32_t link_speed;
>>       uint32_t dev_flags;
>> +    int n_rxq;
>>   
>>       if (!rte_eth_dev_is_valid_port(dev->port_id)) {
>>           return ENODEV;
>> @@ -3833,6 +3908,9 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>>       link_speed = dev->link.link_speed;
>>       dev_flags = *dev_info.dev_flags;
>>       bus_info = rte_dev_bus_info(dev_info.device);
>> +    cp_prot_flags = dev->cp_prot_flags;
>> +    cp_prot_flows_num = dev->cp_prot_flows_num;
>> +    n_rxq = netdev->n_rxq;
>>       ovs_mutex_unlock(&dev->mutex);
>>       ovs_mutex_unlock(&dpdk_mutex);
>>   
>> @@ -3875,6 +3953,19 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>>                           ETH_ADDR_ARGS(dev->hwaddr));
>>       }
>>   
>> +    if (cp_prot_flags) {
>> +        if (!cp_prot_flows_num) {
>> +            smap_add(args, "cp_protection", "unsupported");
>> +        } else {
>> +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
>> +            if (n_rxq > 2) {
>> +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
>> +            } else {
>> +                smap_add(args, "rss_queues", "0");
>> +            }
>> +        }
>> +    }
>> +
>>       return 0;
>>   }
>>   
>> @@ -5178,16 +5269,203 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
>>       .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
>>   };
>>   
>> +static int
>> +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
>> +                      const struct rte_flow_item items[],
>> +                      const char *desc)
>> +{
>> +    const struct rte_flow_attr attr = { .ingress = 1 };
>> +    const struct rte_flow_action actions[] = {
>> +        {
>> +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
>> +            .conf = &(const struct rte_flow_action_queue) {
>> +                .index = dev->up.n_rxq - 1,
>> +            },
>> +        },
>> +        { .type = RTE_FLOW_ACTION_TYPE_END },
>> +    };
>> +    struct rte_flow_error error;
>> +    struct rte_flow *flow;
>> +    size_t num;
>> +    int err;
>> +
>> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
>> +    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
>> +    if (err) {
>> +        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
>> +                  netdev_get_name(&dev->up), desc,
>> +                  error.message ? error.message : "");
>> +        goto out;
>> +    }
>> +
>> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
>> +    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
>> +    if (flow == NULL) {
>> +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
>> +                  netdev_get_name(&dev->up), desc,
>> +                  error.message ? error.message : "");
>> +        err = rte_errno;
>> +        goto out;
>> +    }
>> +
>> +    num = dev->cp_prot_flows_num + 1;
>> +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
>> +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
>> +    dev->cp_prot_flows_num = num;
>> +
>> +    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
>> +              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
>> +out:
>> +    return err;
>> +}
>> +
>> +#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
>> +
>> +static int
>> +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
>> +{
>> +    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
>> +    struct rte_eth_dev_info info;
>> +    int err;
>> +
>> +    rte_eth_dev_info_get(dev->port_id, &info);
>> +
>> +    if (info.reta_size % rss_n_rxq != 0 &&
>> +        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
>> +        /*
>> +         * Some drivers set reta_size equal to the total number of rxqs that
>> +         * are configured when it is a power of two. Since we are actually
>> +         * reconfiguring the redirection table to exclude the last rxq, we may
>> +         * end up with an imbalanced redirection table. For example, such a
>> +         * configuration:
>> +         *
>> +         *   options:n_rxq=3 options:cp-protection=lacp
>> +         *
>> +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
>> +         *
>> +         *   [0, 1, 2, 3]
>> +         *
>> +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
>> +         *
>> +         *   [0, 1, 2, 0]
>> +         *
>> +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
>> +         *
>> +         * Work around that corner case by forcing a bigger redirection table
>> +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
>> +         * and when reta_size is less than 128. This value seems to be
>> +         * supported by most of the drivers that also support rte flow.
>> +         */
>> +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
>> +    }
>> +
>> +    memset(reta_conf, 0, sizeof(reta_conf));
>> +    for (uint16_t i = 0; i < info.reta_size; i++) {
>> +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
>> +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
>> +        reta_conf[idx].mask |= 1ULL << shift;
>> +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
>> +    }
>> +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
>> +    if (err < 0) {
>> +        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
>> +                  netdev_get_name(&dev->up), err);
>> +    }
>> +
>> +    return err;
>> +}
>> +
>> +static int
>> +dpdk_cp_prot_configure(struct netdev_dpdk *dev)
>> +{
>> +    int err = 0;
>> +
>> +    if (dev->up.n_rxq < 2) {
>> +        err = ENOTSUP;
>> +        VLOG_WARN("%s: cp-protection: not enough available rx queues",
>> +                  netdev_get_name(&dev->up));
>> +        goto out;
>> +    }
>> +
>> +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
>> +        const struct rte_flow_item items[] = {
>> +            {
>> +                .type = RTE_FLOW_ITEM_TYPE_ETH,
>> +                .spec = &(const struct rte_flow_item_eth){
>> +                    .type = htons(ETH_TYPE_LACP),
>> +                },
>> +                .mask = &(const struct rte_flow_item_eth){
>> +                    .type = htons(0xffff),
>> +                },
>> +            },
>> +            { .type = RTE_FLOW_ITEM_TYPE_END },
>> +        };
>> +        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
>> +        if (err) {
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    if (dev->cp_prot_flows_num) {
>> +        /* Reconfigure RSS reta in all but the cp protection queue. */
>> +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
>> +        if (!err) {
>> +            if (dev->up.n_rxq == 2) {
>> +                VLOG_INFO("%s: cp-protection: redirected other traffic to "
>> +                          "rx queue 0", netdev_get_name(&dev->up));
>> +            } else {
>> +                VLOG_INFO("%s: cp-protection: applied rss on rx queue 0-%u",
>> +                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
>> +            }
>> +        }
>> +    }
>> +
>> +out:
>> +    return err;
>> +}
>> +
>> +static void
>> +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
>> +{
>> +    struct rte_flow_error error;
>> +
>> +    if (!dev->cp_prot_flows_num) {
>> +        return;
>> +    }
>> +
>> +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
>> +
>> +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
>> +        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
>> +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
>> +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
>> +                     netdev_get_name(&dev->up),
>> +                     error.message ? error.message : "");
>> +        }
>> +    }
>> +    free(dev->cp_prot_flows);
>> +    dev->cp_prot_flows_num = 0;
>> +    dev->cp_prot_flows = NULL;
>> +}
>> +
>>   static int
>>   netdev_dpdk_reconfigure(struct netdev *netdev)
>>   {
>>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    bool try_cp_prot;
>>       int err = 0;
>>   
>>       ovs_mutex_lock(&dev->mutex);
>>   
>> +    try_cp_prot = dev->requested_cp_prot_flags != 0;
>> +    dev->requested_n_rxq = dev->user_n_rxq;
>> +    if (try_cp_prot) {
>> +        dev->requested_n_rxq += 1;
>> +    }
>> +
>>       if (netdev->n_txq == dev->requested_n_txq
>>           && netdev->n_rxq == dev->requested_n_rxq
>> +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
>>           && dev->mtu == dev->requested_mtu
>>           && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
>>           && dev->rxq_size == dev->requested_rxq_size
>> @@ -5200,6 +5478,9 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>           goto out;
>>       }
>>   
>> +retry:
>> +    dpdk_cp_prot_unconfigure(dev);
>> +
>>       if (dev->reset_needed) {
>>           rte_eth_dev_reset(dev->port_id);
>>           if_notifier_manual_report();
>> @@ -5224,6 +5505,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>       dev->txq_size = dev->requested_txq_size;
>>   
>>       rte_free(dev->tx_q);
>> +    dev->tx_q = NULL;
>>   
>>       if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
>>           err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
>> @@ -5255,6 +5537,23 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>        */
>>       dev->requested_hwaddr = dev->hwaddr;
>>   
>> +    if (try_cp_prot) {
>> +        err = dpdk_cp_prot_configure(dev);
>> +        if (err) {
>> +            /* No hw support, disable & recover gracefully. */
>> +            try_cp_prot = false;
>> +            /*
>> +             * The extra queue must be explicitly removed here to ensure that
>> +             * it is unconfigured immediately.
>> +             */
>> +            dev->requested_n_rxq = dev->user_n_rxq;
>> +            goto retry;
>> +        }
>> +    } else {
>> +        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
>> +    }
>> +    dev->cp_prot_flags = dev->requested_cp_prot_flags;
>> +
>>       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>>       if (!dev->tx_q) {
>>           err = ENOMEM;
>> @@ -5467,6 +5766,13 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
>>       dev = netdev_dpdk_cast(netdev);
>>       ovs_mutex_lock(&dev->mutex);
>>       if (dev->type == DPDK_DEV_ETH) {
>> +        if (dev->requested_cp_prot_flags) {
>> +            VLOG_WARN("%s: disabling cp-protection as it is "
>> +                      "mutually exclusive with hw-offload.",
>> +                      netdev_get_name(netdev));
>> +            dev->requested_cp_prot_flags = 0;
>> +            netdev_request_reconfigure(netdev);
>> +        }
>>           /* TODO: Check if we able to offload some minimal flow. */
>>           ret = true;
>>       }
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index edb5eafa04c3..7ba8bf192d62 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -3491,6 +3491,33 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>>           <p>This option may only be used with dpdk VF representors.</p>
>>         </column>
>>   
>> +      <column name="options" key="cp-protection"
>> +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
>> +        <p>
>> +          Allocate an extra Rx queue for control plane packets of the specified
>> +          protocol(s).
>> +        </p>
>> +        <p>
>> +          If the user has already configured multiple Rx queues via
>> +          <code>options:n_rxq</code> on the port, an additional one will be
>> +          allocated for control plane packets. If the hardware cannot satisfy
>> +          the requested number of Rx queues, the last Rx queue will
>> +          be assigned for control plane. If only one Rx queue is available or
>> +          if the hardware does not support the RTE flow matchers/actions
>> +          required to redirect the selected protocols,
>> +          <code>cp-protection</code> will be disabled.
>> +        </p>
>> +        <p>
>> +          This feature is mutually exclusive with
>> +          <code>other_options:hw-offload</code> as it may conflict with the
>> +          offloaded RTE flows. If both are enabled, <code>cp-protection</code>
>> +          will be forcibly disabled.
>> +        </p>
>> +        <p>
>> +          Disabled by default.
>> +        </p>
>> +      </column>
>> +
>>         <column name="other_config" key="tx-steering"
>>                 type='{"type": "string",
>>                        "enum": ["set", ["thread", "hash"]]}'>
>
Robin Jarry May 23, 2023, 2:16 p.m. UTC | #3
Aaron Conole, May 23, 2023 at 15:32:
> I think one issue I have with this is that the name is a bit
> misleading.  Control plane, from OVS perspective, would be like OpenFlow
> communications.  This is more like a traffic steering mechanism.
>
> Maybe it would help to call it something like "traffic-based-rps" or
> something like that (but not clear what would be best).  It is really a
> way to steer specific traffic to a distinct rxq.
>
> WDYT?

Actually, "packet-steering" was one of the first ideas I had for this
feature, but I thought it may be confusing. I am wary of reusing the
"rps" acronym as it may be confused with a Linux-specific feature:

https://docs.kernel.org/networking/scaling.html#rps-receive-packet-steering

The feature introduced in this patch is relying on RTE flow which has
nothing to do with Linux. However, I agree that the name should reflect
that we are steering traffic into a dedicated rxq. How about these
ideas?

    options:isolated-rxq=lacp,...
    options:rxq-isolate=lacp,...
    options:rxq-steering=lacp,...

I personally prefer "rxq-isolate".
Ilya Maximets May 23, 2023, 8:04 p.m. UTC | #4
On 5/23/23 16:16, Robin Jarry wrote:
> Aaron Conole, May 23, 2023 at 15:32:
>> I think one issue I have with this is that the name is a bit
>> misleading.  Control plane, from OVS perspective, would be like OpenFlow
>> communications.  This is more like a traffic steering mechanism.
>>
>> Maybe it would help to call it something like "traffic-based-rps" or
>> something like that (but not clear what would be best).  It is really a
>> way to steer specific traffic to a distinct rxq.
>>
>> WDYT?
> 
> Actually, "packet-steering" was one of the first ideas I had for this
> feature, but I thought it may be confusing. I am wary of reusing the
> "rps" acronym as it may be confused with a Linux-specific feature:
> 
> https://docs.kernel.org/networking/scaling.html#rps-receive-packet-steering
> 
> The feature introduced in this patch is relying on RTE flow which has
> nothing to do with Linux. However, I agree that the name should reflect
> that we are steering traffic into a dedicated rxq. How about these
> ideas?
> 
>     options:isolated-rxq=lacp,...
>     options:rxq-isolate=lacp,...
>     options:rxq-steering=lacp,...
> 
> I personally prefer "rxq-isolate".

'rxq-isolate' will be confused with 'other_config:pmd-rxq-isolate'.
Same likely goes for the 'isolated-rxq'.

'rxq-steering' may be confused with 'other_config:tx-steering'.
But this can be argued that it's essentially similar functionality,
so should be named similarly.  Maybe we can double down on that and
use options:rx-steering=rss|rss+lacp|...  with 'rss' being a default
configuration?

Best regards, Ilya Maximets.
Robin Jarry May 24, 2023, 7:18 a.m. UTC | #5
Ilya Maximets, May 23, 2023 at 22:04:
> 'rxq-isolate' will be confused with 'other_config:pmd-rxq-isolate'.
> Same likely goes for the 'isolated-rxq'.
>
> 'rxq-steering' may be confused with 'other_config:tx-steering'.
> But this can be argued that it's essentially similar functionality,
> so should be named similarly.  Maybe we can double down on that and
> use options:rx-steering=rss|rss+lacp|...  with 'rss' being a default
> configuration?

I like that idea. Instead of "rss", how about "hash" to match what
tx-steering is using?

So the value of options:rx-steering would be "hash" followed by an
arbitrary list of preset protocols (for now, only "lacp") all separated
by "+". It may also open the door for other base steering methods than
"hash" ("rr" for round-robin some day?).

If that is OK, I can prepare a v11 with everything renamed.
Ilya Maximets May 24, 2023, 10:41 a.m. UTC | #6
On 5/24/23 09:18, Robin Jarry wrote:
> Ilya Maximets, May 23, 2023 at 22:04:
>> 'rxq-isolate' will be confused with 'other_config:pmd-rxq-isolate'.
>> Same likely goes for the 'isolated-rxq'.
>>
>> 'rxq-steering' may be confused with 'other_config:tx-steering'.
>> But this can be argued that it's essentially similar functionality,
>> so should be named similarly.  Maybe we can double down on that and
>> use options:rx-steering=rss|rss+lacp|...  with 'rss' being a default
>> configuration?
> 
> I like that idea. Instead of "rss", how about "hash" to match what
> tx-steering is using?

I'm not sure if it's better to use 'rss' or 'hash' in this context.
Aaron, Kevin, what do you think?

> 
> So the value of options:rx-steering would be "hash" followed by an
> arbitrary list of preset protocols (for now, only "lacp") all separated
> by "+". It may also open the door for other base steering methods than
> "hash" ("rr" for round-robin some day?).
> 
> If that is OK, I can prepare a v11 with everything renamed.

Might make sense for the code review.  Kevin was looking at the
patch, IIUC.

Best regards, Ilya Maximets.
Kevin Traynor May 24, 2023, 2:06 p.m. UTC | #7
On 24/05/2023 11:41, Ilya Maximets wrote:
> On 5/24/23 09:18, Robin Jarry wrote:
>> Ilya Maximets, May 23, 2023 at 22:04:
>>> 'rxq-isolate' will be confused with 'other_config:pmd-rxq-isolate'.
>>> Same likely goes for the 'isolated-rxq'.
>>>
>>> 'rxq-steering' may be confused with 'other_config:tx-steering'.
>>> But this can be argued that it's essentially similar functionality,
>>> so should be named similarly.  Maybe we can double down on that and
>>> use options:rx-steering=rss|rss+lacp|...  with 'rss' being a default
>>> configuration?
>>
>> I like that idea. Instead of "rss", how about "hash" to match what
>> tx-steering is using?
> 
> I'm not sure if it's better to use 'rss' or 'hash' in this context.
> Aaron, Kevin, what do you think?
> 

Hmm, not sure on this one. I have a feeling that having a 'hash' mode 
for tx-steering that only applies to vhost devices, and 'hash' mode for 
rx-steering that only applies to NICs means people will miss the 
subtlety and try to enable the wrong hash mode on the wrong device :-) 
So 'rss' might make it more distinct.

>>
>> So the value of options:rx-steering would be "hash" followed by an
>> arbitrary list of preset protocols (for now, only "lacp") all separated
>> by "+". It may also open the door for other base steering methods than
>> "hash" ("rr" for round-robin some day?).
>>
>> If that is OK, I can prepare a v11 with everything renamed.

Any reason for '+'? I think commas are used more frequently. If it
needed to be extended in future for some extra config, then adding
'key:value' pairs would be seamless. e.g.
=rss:<reta size?>, lacp:<num of queues?>
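For illustration, one token of such a comma-separated 'key:value' scheme
could be split along these lines (helper name and buffer sizes invented
for this sketch):

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Split one "key:value" token; the value may be absent ("lacp" alone).
 * Copies the pieces into caller-provided buffers. Illustrative only. */
static void
split_kv(const char *tok, char *key, size_t ksz, char *val, size_t vsz)
{
    const char *colon = strchr(tok, ':');

    if (colon) {
        snprintf(key, ksz, "%.*s", (int) (colon - tok), tok);
        snprintf(val, vsz, "%s", colon + 1);
    } else {
        snprintf(key, ksz, "%s", tok);
        val[0] = '\0';
    }
}
```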

> 
> Might make sense for the code review.  Kevin was looking at the
> patch, IIUC.
> 
> Best regards, Ilya Maximets.
>
Kevin Traynor May 24, 2023, 2:13 p.m. UTC | #8
On 17/04/2023 13:37, Robin Jarry wrote:
> Some control protocols are used to maintain link status between
> forwarding engines (e.g. LACP). When the system is not sized properly,
> the PMD threads may not be able to process all incoming traffic from the
> configured Rx queues. When a signaling packet of such protocols is
> dropped, it can cause link flapping, worsening the situation.
> 
> Use the RTE flow API to redirect these protocols into a dedicated Rx
> queue. The assumption is made that the ratio between control protocol
> traffic and user data traffic is very low and thus this dedicated Rx
> queue will never get full. The RSS redirection table is re-programmed to
> only use the other Rx queues. The RSS table size is stored in the
> netdev_dpdk structure at port initialization to avoid requesting the
> information again when changing the port configuration.
> 
> The additional Rx queue will be assigned a PMD core like any other Rx
> queue. Polling that extra queue may introduce increased latency and
> a slight performance penalty at the benefit of preventing link flapping.
> 
> This feature must be enabled per port on specific protocols via the
> cp-protection option. This option takes a comma-separated list of
> protocol names. It is only supported on ethernet ports. This feature is
> experimental.
> 
> If the user has already configured multiple Rx queues on the port, an
> additional one will be allocated for control plane packets. If the
> hardware cannot satisfy the requested number of Rx queues, the
> last Rx queue will be assigned for control plane. If only one Rx queue
> is available, the cp-protection feature will be disabled. If the
> hardware does not support the RTE flow matchers/actions, the feature
> will be disabled.
> 
> It cannot be enabled when other_config:hw-offload=true as it may
> conflict with the offloaded RTE flows. Similarly, if hw-offload is
> enabled, cp-protection will be forcibly disabled on all ports.
> 
> Example use:
> 
>   ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
>     set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
>     set interface phy0 options:cp-protection=lacp -- \
>     set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
>     set interface phy1 options:cp-protection=lacp
> 
> As a starting point, only one protocol is supported: LACP. Other
> protocols can be added in the future. NIC compatibility should be
> checked.
> 
> To validate that this works as intended, I used a traffic generator to
> generate random traffic slightly above the machine capacity at line rate
> on a two ports bond interface. OVS is configured to receive traffic on
> two VLANs and pop/push them in a br-int bridge based on tags set on
> patch ports.
> 
>     +----------------------+
>     |         DUT          |
>     |+--------------------+|
>     ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
>     ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
>     || patch10    patch11 ||
>     |+---|-----------|----+|
>     |    |           |     |
>     |+---|-----------|----+|
>     || patch00    patch01 ||
>     ||  tag:10    tag:20  ||
>     ||                    ||
>     ||       br-phy       || default flow, action=NORMAL
>     ||                    ||
>     ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
>     ||    phy0   phy1     ||
>     |+------|-----|-------+|
>     +-------|-----|--------+
>             |     |
>     +-------|-----|--------+
>     |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
>     |         lag          | mode trunk VLANs 10, 20
>     |                      |
>     |        switch        |
>     |                      |
>     |  vlan 10    vlan 20  |  mode access
>     |   port2      port3   |
>     +-----|----------|-----+
>           |          |
>     +-----|----------|-----+
>     |   tgen0      tgen1   |  Random traffic that is properly balanced
>     |                      |  across the bond ports in both directions.
>     |  traffic generator   |
>     +----------------------+
> 
> Without cp-protection, the bond0 links are randomly switching to
> "defaulted" when one of the LACP packets sent by the switch is dropped
> because the RX queues are full and the PMD threads did not process them
> fast enough. When that happens, all traffic must go through a single
> link which causes above line rate traffic to be dropped.
> 
>   ~# ovs-appctl lacp/show-stats bond0
>   ---- bond0 statistics ----
>   member: phy0:
>     TX PDUs: 347246
>     RX PDUs: 14865
>     RX Bad PDUs: 0
>     RX Marker Request PDUs: 0
>     Link Expired: 168
>     Link Defaulted: 0
>     Carrier Status Changed: 0
>   member: phy1:
>     TX PDUs: 347245
>     RX PDUs: 14919
>     RX Bad PDUs: 0
>     RX Marker Request PDUs: 0
>     Link Expired: 147
>     Link Defaulted: 1
>     Carrier Status Changed: 0
> 
> When cp-protection is enabled, no LACP packet is dropped and the bond
> links remain enabled at all times, maximizing the throughput. Neither
> the "Link Expired" nor the "Link Defaulted" counters are incremented
> anymore.
> 
> This feature may be considered as "QoS". However, it does not work by
> limiting the rate of traffic explicitly. It only guarantees that some
> protocols have a lower chance of being dropped because the PMD cores
> cannot keep up with regular traffic.
> 
> The choice of protocols is limited on purpose. This is not meant to be
> configurable by users. Some limited configurability could be considered
> in the future but it would expose to more potential issues if users are
> accidentally redirecting all traffic in the control plane queue.
> 

Hi Robin,

I tested combinations of enabling/disabling cp-proto and enabling hwol. 
It is working as expected and hwol feature always has a clear priority, 
regardless of the order they are enabled in.

I didn't test lacp traffic, but I was able to see that the rxq was 
added/removed for it and that rss was working on the other rxqs.

I also tested combinations of different numbers of rxqs, adding/deleting 
rxqs, enabling/disabling cp-proto etc and working fine. I also tested 
enabling cp-proto and changing rxqs in the same command, working fine.

A few small comments below.

thanks,
Kevin.

> Cc: Anthony Harivel <aharivel@redhat.com>
> Cc: Christophe Fontaine <cfontain@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>
> Cc: Kevin Traynor <ktraynor@redhat.com>
> Signed-off-by: Robin Jarry <rjarry@redhat.com>
> ---
> v9 -> v10:
> 
> * If hw-offload is enabled, cp-protection will now be forcibly disabled
>    on all ports.
> 
>   Documentation/topics/dpdk/phy.rst |  83 ++++++++
>   NEWS                              |   3 +
>   lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
>   vswitchd/vswitch.xml              |  27 +++
>   4 files changed, 421 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
> index 4b0fe8dded3a..733c4251bca9 100644
> --- a/Documentation/topics/dpdk/phy.rst
> +++ b/Documentation/topics/dpdk/phy.rst
> @@ -131,6 +131,89 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
>   for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
>   information on configuring PMD threads, refer to :doc:`pmd`.
>   
> +Control Plane Protection
> +------------------------
> +
> +.. warning:: This feature is experimental.
> +
> +Some control protocols are used to maintain link status between forwarding
> +engines. In SDN environments, these packets share the same physical network
> +as the user data traffic.
> +
> +When the system is not sized properly, the PMD threads may not be able to
> +process all incoming traffic from the configured Rx queues. When a signaling
> +packet of such protocols is dropped, it can cause link flapping, worsening the
> +situation.
> +
> +Some physical NICs can be programmed to put these protocols in a dedicated
> +hardware Rx queue using the rte_flow__ API.
> +
> +__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
> +
> +The currently supported control plane protocols are:
> +
> +``lacp``
> +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
> +
> +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
> +
> +.. warning::
> +
> +   This feature is not compatible with all NICs. Refer to the DPDK
> +   `compatibility matrix`__ and vendor documentation for more details.
> +
> +   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
> +
> +Control plane protection must be enabled on specific protocols per port. The
> +``cp-protection`` option requires a comma-separated list of protocol names::
> +
> +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
> +        options:cp-protection=lacp
> +
> +.. note::
> +
> +   If multiple Rx queues are already configured, regular RSS (Receive Side
> +   Scaling) queue balancing is done on all but the extra control plane
> +   protection queue.
> +
> +.. tip::
> +
> +   You can check if control plane protection is supported on a port with the
> +   following command::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
> +
> +   This will also show in ``ovs-vswitchd.log``::
> +
> +      INFO|dpdk-p0: cp-protection: redirecting lacp traffic to queue 2
> +      INFO|dpdk-p0: cp-protection: applying rss on queues 0-1
> +
> +   If the hardware does not support redirecting control plane traffic to
> +   a dedicated queue, it will be explicit::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection=unsupported, driver_name=...}
> +
> +   More details can often be found in ``ovs-vswitchd.log``::
> +
> +      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
> +
> +To disable control plane protection on a port, use the following command::
> +
> +   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
> +
> +You can see that it has been disabled in ``ovs-vswitchd.log``::
> +
> +   INFO|dpdk-p0: cp-protection: disabled
> +
> +.. warning::
> +
> +   This feature is mutually exclusive with ``other_options:hw-offload`` as it
> +   may conflict with the offloaded RTE flows. If both are enabled,
> +   ``cp-protection`` will be forcibly disabled.
> +
>   .. _dpdk-phy-flow-control:
>   
>   Flow Control
> diff --git a/NEWS b/NEWS
> index b6418c36e95e..43338edaf5aa 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -21,6 +21,9 @@ Post-v3.1.0
>        * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
>          with the --hw-rawio-access command line option.  This allows the
>          process extra privileges when mapping physical interconnect memory.
> +     * New experimental "cp-protection=<protocol>" option to redirect certain
> +       protocols (for now, only LACP) to a dedicated hardware queue using
> +       RTE flow.
>      - SRv6 Tunnel Protocol
>        * Added support for userspace datapath (only).
>   
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index fb0dd43f75c5..f33c6613adea 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -415,6 +415,10 @@ enum dpdk_hw_ol_features {
>       NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
>   };
>   
> +enum dpdk_cp_prot_flags {
> +    DPDK_CP_PROT_LACP = 1 << 0,
> +};
> +
>   /*
>    * In order to avoid confusion in variables names, following naming convention
>    * should be used, if possible:
> @@ -504,11 +508,18 @@ struct netdev_dpdk {
>            * so we remember the request and update them next time
>            * netdev_dpdk*_reconfigure() is called */
>           int requested_mtu;
> +        /* User input for n_rxq + an optional control plane protection queue
> +         * (see netdev_dpdk_reconfigure). This field is different from the
> +         * other requested_* fields as it may contain a different value than
> +         * the user input. */

^ I don't think this comment is really needed. requested_rxq_size can 
also be adjusted and they don't always come from user input, sometimes 
just the OVS default.

>           int requested_n_txq;
>           int requested_n_rxq;
>           int requested_rxq_size;
>           int requested_txq_size;
>   
> +        /* User input for n_rxq (see dpdk_set_rxq_config). */
> +        int user_n_rxq;
> +

I also put in a similar 3rd stage status in recent patches for 
rx/tx-descriptors but managed to get rid of it. You might be able to 
remove this and just check the cp-proto flags and do a +1/-1 where needed?
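As a rough illustration of that suggestion (function name invented, not
from the patch), the user-facing queue count could be derived on demand
instead of stored in a separate field:

```c
#include <stdbool.h>

/* Recover the user-configured rxq count from the device total by
 * subtracting the steering queue when the feature is enabled.
 * Illustrative sketch of the suggested +1/-1 approach. */
static int
user_n_rxq(int requested_n_rxq, bool cp_prot_enabled)
{
    return cp_prot_enabled ? requested_n_rxq - 1 : requested_n_rxq;
}
```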

>           /* Number of rx/tx descriptors for physical devices */
>           int rxq_size;
>           int txq_size;
> @@ -534,6 +545,13 @@ struct netdev_dpdk {
>   
>           /* VF configuration. */
>           struct eth_addr requested_hwaddr;
> +
> +        /* Requested control plane protection flags,
> +         * from the enum set 'dpdk_cp_prot_flags'. */
> +        uint64_t requested_cp_prot_flags;
> +        uint64_t cp_prot_flags;
> +        size_t cp_prot_flows_num;
> +        struct rte_flow **cp_prot_flows;
>       );
>   
>       PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -1310,9 +1328,14 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>       netdev->n_rxq = 0;
>       netdev->n_txq = 0;
>       dev->requested_n_rxq = NR_QUEUE;
> +    dev->user_n_rxq = NR_QUEUE;
>       dev->requested_n_txq = NR_QUEUE;
>       dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
>       dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
> +    dev->requested_cp_prot_flags = 0;
> +    dev->cp_prot_flags = 0;
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
>   
>       /* Initialize the flow control to NULL */
>       memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
> @@ -1487,6 +1510,8 @@ common_destruct(struct netdev_dpdk *dev)
>       ovs_mutex_destroy(&dev->mutex);
>   }
>   
> +static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
> +
>   static void
>   netdev_dpdk_destruct(struct netdev *netdev)
>   {
> @@ -1494,6 +1519,9 @@ netdev_dpdk_destruct(struct netdev *netdev)
>   
>       ovs_mutex_lock(&dpdk_mutex);
>   
> +    /* Destroy any rte flows to allow RXQs to be removed. */
> +    dpdk_cp_prot_unconfigure(dev);
> +
>       rte_eth_dev_stop(dev->port_id);
>       dev->started = false;
>   
> @@ -1908,8 +1936,8 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
>       int new_n_rxq;
>   
>       new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> -    if (new_n_rxq != dev->requested_n_rxq) {
> -        dev->requested_n_rxq = new_n_rxq;
> +    if (new_n_rxq != dev->user_n_rxq) {
> +        dev->user_n_rxq = new_n_rxq;
>           netdev_request_reconfigure(&dev->up);
>       }
>   }
> @@ -1931,6 +1959,48 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
>       }
>   }
>   
> +static void
> +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
> +                        const struct smap *args, char **errp)
> +{
> +    const char *arg = smap_get_def(args, "cp-protection", "");
> +    char *token, *saveptr, *buf;
> +    uint64_t flags = 0;
> +
> +    buf = xstrdup(arg);
> +    token = strtok_r(buf, ",", &saveptr);
> +    while (token) {
> +        if (strcmp(token, "lacp") == 0) {
> +            flags |= DPDK_CP_PROT_LACP;
> +        } else {
> +            VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                          "unknown protocol '%s'",
> +                          netdev_get_name(netdev), token);
> +        }
> +        token = strtok_r(NULL, ",", &saveptr);
> +    }
> +    free(buf);
> +
> +    if (flags && dev->type != DPDK_DEV_ETH) {
> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                      "is only supported on ethernet ports",
> +                      netdev_get_name(netdev));
> +        flags = 0;
> +    }
> +
> +    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                      "is incompatible with hw-offload",
> +                      netdev_get_name(netdev));
> +        flags = 0;
> +    }
> +
> +    if (flags != dev->requested_cp_prot_flags) {
> +        dev->requested_cp_prot_flags = flags;
> +        netdev_request_reconfigure(netdev);
> +    }
> +}
> +
>   static int
>   netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>                          char **errp)
> @@ -1950,6 +2020,8 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>       ovs_mutex_lock(&dpdk_mutex);
>       ovs_mutex_lock(&dev->mutex);
>   
> +    dpdk_cp_prot_set_config(netdev, dev, args, errp);
> +
>       dpdk_set_rxq_config(dev, args);
>   
>       dpdk_process_queue_size(netdev, args, "n_rxq_desc",
> @@ -3819,9 +3891,12 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>   {
>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>       struct rte_eth_dev_info dev_info;
> +    size_t cp_prot_flows_num;
> +    uint64_t cp_prot_flags;
>       const char *bus_info;
>       uint32_t link_speed;
>       uint32_t dev_flags;
> +    int n_rxq;
>   
>       if (!rte_eth_dev_is_valid_port(dev->port_id)) {
>           return ENODEV;
> @@ -3833,6 +3908,9 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>       link_speed = dev->link.link_speed;
>       dev_flags = *dev_info.dev_flags;
>       bus_info = rte_dev_bus_info(dev_info.device);
> +    cp_prot_flags = dev->cp_prot_flags;
> +    cp_prot_flows_num = dev->cp_prot_flows_num;
> +    n_rxq = netdev->n_rxq;
>       ovs_mutex_unlock(&dev->mutex);
>       ovs_mutex_unlock(&dpdk_mutex);
>   
> @@ -3875,6 +3953,19 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>                           ETH_ADDR_ARGS(dev->hwaddr));
>       }
>   
> +    if (cp_prot_flags) {
> +        if (!cp_prot_flows_num) {
> +            smap_add(args, "cp_protection", "unsupported");
> +        } else {
> +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
> +            if (n_rxq > 2) {
> +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
> +            } else {
> +                smap_add(args, "rss_queues", "0");
> +            }
> +        }
> +    }
> +
>       return 0;
>   }
>   
> @@ -5178,16 +5269,203 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
>       .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
>   };
>   
> +static int
> +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
> +                      const struct rte_flow_item items[],
> +                      const char *desc)
> +{
> +    const struct rte_flow_attr attr = { .ingress = 1 };
> +    const struct rte_flow_action actions[] = {
> +        {
> +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
> +            .conf = &(const struct rte_flow_action_queue) {
> +                .index = dev->up.n_rxq - 1,
> +            },
> +        },
> +        { .type = RTE_FLOW_ACTION_TYPE_END },
> +    };
> +    struct rte_flow_error error;
> +    struct rte_flow *flow;
> +    size_t num;
> +    int err;
> +
> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
> +    if (err) {
> +        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
> +                  netdev_get_name(&dev->up), desc,
> +                  error.message ? error.message : "");
> +        goto out;
> +    }
> +
> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
> +    if (flow == NULL) {
> +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
> +                  netdev_get_name(&dev->up), desc,
> +                  error.message ? error.message : "");
> +        err = rte_errno;
> +        goto out;
> +    }
> +
> +    num = dev->cp_prot_flows_num + 1;
> +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
> +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
> +    dev->cp_prot_flows_num = num;
> +
> +    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
> +              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
> +out:
> +    return err;
> +}
> +
> +#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
> +
> +static int
> +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
> +{
> +    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
> +    struct rte_eth_dev_info info;
> +    int err;
> +
> +    rte_eth_dev_info_get(dev->port_id, &info);
> +
> +    if (info.reta_size % rss_n_rxq != 0 &&
> +        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
> +        /*
> +         * Some drivers set reta_size equal to the total number of rxqs that
> +         * are configured when it is a power of two. Since we are actually
> +         * reconfiguring the redirection table to exclude the last rxq, we may
> +         * end up with an imbalanced redirection table. For example, such
> +         * configuration:
> +         *
> +         *   options:n_rxq=3 options:cp-protection=lacp
> +         *
> +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
> +         *
> +         *   [0, 1, 2, 3]
> +         *
> +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
> +         *
> +         *   [0, 1, 2, 0]
> +         *
> +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
> +         *
> +         * Work around that corner case by forcing a bigger redirection table
> +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
> +         * and when reta_size is less than 128. This value seems to be
> +         * supported by most of the drivers that also support rte flow.
> +         */
> +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;

Thanks for following up on previous comments and adding this workaround. 
I think you didn't get much response about it, so might be worth adding 
a DPDK bugzilla to formally report/track it.
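The imbalance described in the code comment above can be checked with a
small standalone counting sketch (illustration only, not patch code):

```c
/* Count how many entries of a redirection table of the given size map
 * to 'queue' when entry i is assigned queue i % rss_n_rxq, as in the
 * patch's RETA fill loop. */
static int
reta_hits(int reta_size, int rss_n_rxq, int queue)
{
    int hits = 0;

    for (int i = 0; i < reta_size; i++) {
        if (i % rss_n_rxq == queue) {
            hits++;
        }
    }
    return hits;
}
```

With reta_size == 4 and 3 RSS queues, queue 0 gets two entries while
queues 1 and 2 get one each; at 128 entries the per-queue counts differ
by at most one, which is what the workaround relies on.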

> +    }
> +
> +    memset(reta_conf, 0, sizeof(reta_conf));
> +    for (uint16_t i = 0; i < info.reta_size; i++) {
> +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
> +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
> +        reta_conf[idx].mask |= 1ULL << shift;
> +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
> +    }
> +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
> +    if (err < 0) {
> +        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
> +                  netdev_get_name(&dev->up), err);
> +    }
> +
> +    return err;
> +}
> +
> +static int
> +dpdk_cp_prot_configure(struct netdev_dpdk *dev)
> +{
> +    int err = 0;
> +
> +    if (dev->up.n_rxq < 2) {
> +        err = ENOTSUP;
> +        VLOG_WARN("%s: cp-protection: not enough available rx queues",
> +                  netdev_get_name(&dev->up));
> +        goto out;
> +    }
> +
> +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
> +        const struct rte_flow_item items[] = {
> +            {
> +                .type = RTE_FLOW_ITEM_TYPE_ETH,
> +                .spec = &(const struct rte_flow_item_eth){
> +                    .type = htons(ETH_TYPE_LACP),
> +                },
> +                .mask = &(const struct rte_flow_item_eth){
> +                    .type = htons(0xffff),
> +                },
> +            },
> +            { .type = RTE_FLOW_ITEM_TYPE_END },
> +        };
> +        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
> +        if (err) {
> +            goto out;
> +        }
> +    }
> +
> +    if (dev->cp_prot_flows_num) {
> +        /* Reconfigure RSS reta in all but the cp protection queue. */
> +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
> +        if (!err) {
> +            if (dev->up.n_rxq == 2) {
> +                VLOG_INFO("%s: cp-protection: redirected other traffic to "
> +                          "rx queue 0", netdev_get_name(&dev->up));
> +            } else {
> +                VLOG_INFO("%s: cp-protection: applied rss on rx queue 0-%u",
> +                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
> +            }
> +        }
> +    }
> +
> +out:
> +    return err;
> +}
> +
> +static void
> +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
> +{
> +    struct rte_flow_error error;
> +
> +    if (!dev->cp_prot_flows_num) {
> +        return;
> +    }
> +
> +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
> +
> +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
> +        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
> +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
> +                     netdev_get_name(&dev->up),
> +                     error.message ? error.message : "");
> +        }
> +    }
> +    free(dev->cp_prot_flows);
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
> +}
> +
>   static int
>   netdev_dpdk_reconfigure(struct netdev *netdev)
>   {
>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    bool try_cp_prot;
>       int err = 0;
>   
>       ovs_mutex_lock(&dev->mutex);
>   

> +    try_cp_prot = dev->requested_cp_prot_flags != 0;
> +    dev->requested_n_rxq = dev->user_n_rxq;
> +    if (try_cp_prot) {
> +        dev->requested_n_rxq += 1;

Minor, but adjusting the requested_n_rxq's seems more suited to 
dpdk_cp_prot_set_config() (and move it after dpdk_set_rxq_config() in 
netdev_dpdk_set_config()) as this is the function that applies the 
configuration. I admit, it's partly aesthetic, so that this function 
still just checks for changes that require a reconfig and returns if 
there are none :-)

> +    }
> +
>       if (netdev->n_txq == dev->requested_n_txq
>           && netdev->n_rxq == dev->requested_n_rxq
> +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
>           && dev->mtu == dev->requested_mtu
>           && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
>           && dev->rxq_size == dev->requested_rxq_size
> @@ -5200,6 +5478,9 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>           goto out;
>       }
>   
> +retry:
> +    dpdk_cp_prot_unconfigure(dev);
> +
>       if (dev->reset_needed) {
>           rte_eth_dev_reset(dev->port_id);
>           if_notifier_manual_report();
> @@ -5224,6 +5505,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>       dev->txq_size = dev->requested_txq_size;
>   
>       rte_free(dev->tx_q);
> +    dev->tx_q = NULL;
>   
>       if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
>           err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
> @@ -5255,6 +5537,23 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>        */
>       dev->requested_hwaddr = dev->hwaddr;
>   
> +    if (try_cp_prot) {
> +        err = dpdk_cp_prot_configure(dev);
> +        if (err) {
> +            /* No hw support, disable & recover gracefully. */
> +            try_cp_prot = false;
> +            /*
> +             * The extra queue must be explicitly removed here to ensure that
> +             * it is unconfigured immediately.
> +             */
> +            dev->requested_n_rxq = dev->user_n_rxq;
> +            goto retry;
> +        }
> +    } else {
> +        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
> +    }
> +    dev->cp_prot_flags = dev->requested_cp_prot_flags;
> +
>       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>       if (!dev->tx_q) {
>           err = ENOMEM;
> @@ -5467,6 +5766,13 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
>       dev = netdev_dpdk_cast(netdev);
>       ovs_mutex_lock(&dev->mutex);
>       if (dev->type == DPDK_DEV_ETH) {
> +        if (dev->requested_cp_prot_flags) {
> +            VLOG_WARN("%s: disabling cp-protection as it is "
> +                      "mutually exclusive with hw-offload.",
> +                      netdev_get_name(netdev));
> +            dev->requested_cp_prot_flags = 0;
> +            netdev_request_reconfigure(netdev);
> +        }
>           /* TODO: Check if we able to offload some minimal flow. */
>           ret = true;
>       }
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index edb5eafa04c3..7ba8bf192d62 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3491,6 +3491,33 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>           <p>This option may only be used with dpdk VF representors.</p>
>         </column>
>   
> +      <column name="options" key="cp-protection"
> +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
> +        <p>
> +          Allocate an extra Rx queue for control plane packets of the specified
> +          protocol(s).
> +        </p>
> +        <p>
> +          If the user has already configured multiple
> +          <code>options:n_rxq</code> on the port, an additional one will be
> +          allocated for control plane packets. If the hardware cannot satisfy
> +          the requested number of requested Rx queues, the last Rx queue will
> +          be assigned for control plane. If only one Rx queue is available or
> +          if the hardware does not support the RTE flow matchers/actions
> +          required to redirect the selected protocols,
> +          <code>cp-protection</code> will be disabled.
> +        </p>
> +        <p>
> +          This feature is mutually exclusive with
> +          <code>other_options:hw-offload</code> as it may conflict with the
> +          offloaded RTE flows. If both are enabled, <code>cp-protection</code>
> +          will be forcibly disabled.
> +        </p>
> +        <p>
> +          Disabled by default.
> +        </p>
> +      </column>
> +
>         <column name="other_config" key="tx-steering"
>                 type='{"type": "string",
>                        "enum": ["set", ["thread", "hash"]]}'>
Robin Jarry May 24, 2023, 2:32 p.m. UTC | #9
Kevin Traynor, May 24, 2023 at 16:06:
> Hmm, not sure on this one. I have a feeling that having a 'hash' mode
> for tx-steering that only applies to vhost devices, and 'hash' mode
> for rx-steering that only applies to NICs means people will miss the
> subtlety and try to enable the wrong hash mode on the wrong device :-)
> So 'rss' might make it more distinct.

"rss" is fine then.

> Any reason for '+' ? I think commas are used more frequently. If it
> needed to be extended in future for some extra config, then adding
'key:value' pairs would be seamless, e.g. =rss:<reta size?>,
> lacp:<num of queues?>

The "rss" item should be mandatory anyway. There should be no way to
disable it. I assume that it is what Ilya meant with these "+" symbols.
That would leave us with these examples:

# lacp+bfd on separate queue, rss on other queues

    options:rx-steering=rss,lacp,bfd

# same

    options:rx-steering=lacp,bfd,rss

# only regular rss, same as no config at all

    options:rx-steering=rss

# invalid configurations

    options:rx-steering=lacp
    options:rx-steering=
    options:rx-steering=bfd,lacp

What do you think?
Kevin Traynor May 24, 2023, 2:54 p.m. UTC | #10
On 24/05/2023 15:32, Robin Jarry wrote:
> Kevin Traynor, May 24, 2023 at 16:06:
>> Hmm, not sure on this one. I have a feeling that having a 'hash' mode
>> for tx-steering that only applies to vhost devices, and 'hash' mode
>> for rx-steering that only applies to NICs means people will miss the
>> subtlety and try to enable the wrong hash mode on the wrong device :-)
>> So 'rss' might make it more distinct.
> 
> "rss" is fine then.
> 
>> Any reason for '+' ? I think commas are used more frequently. If it
>> needed to be extended in future for some extra config, then adding
>> 'key:value' pairs would be seamless, e.g. =rss:<reta size?>,
>> lacp:<num of queues?>
> 
> The "rss" item should be mandatory anyway. There should be no way to
> disable it. I assume that it is what Ilya meant with these "+" symbols.
> That would leave us with these examples:
> 
> # lacp+bfd on separate queue, rss on other queues
> 
>      options:rx-steering=rss,lacp,bfd
> 
> # same
> 
>      options:rx-steering=lacp,bfd,rss
> 

^ It looks a little odd to me that some values are for single queues and 
some are for the rest of the queues, with no way to distinguish.

So maybe Ilya had placed significance in the order with his suggestion ? 
i.e. [default]+[single queue proto]+[other single queue proto]+...

I don't want to bike shed too much on it, i guess with good docs, it can 
be clear anyway.

> # only regular rss, same as no config at all
> 
>      options:rx-steering=rss
> 
> # invalid configurations
> 
>      options:rx-steering=lacp

>      options:rx-steering=
(^ This will be ok and use the default rss)

>      options:rx-steering=bfd,lacp
> 
> What do you think?
>
Robin Jarry May 24, 2023, 3:03 p.m. UTC | #11
Hey Kevin,

Kevin Traynor, May 24, 2023 at 16:13:
> Hi Robin,
>
> I tested combinations of enabling/disabling cp-proto and enabling
> hwol. It is working as expected and hwol feature always has a clear
> priority, regardless of the order they are enabled in.
>
> I didn't test lacp traffic, but I was able to see that the rxq was
> added/removed for it and that rss was working on the other rxqs.
>
> I also tested combinations of different numbers of rxqs,
> adding/deleting rxqs, enabling/disabling cp-proto etc and working
> fine. I also tested enabling cp-proto and changing rxqs in the same
> command, working fine.

Thanks for testing!

> > +        /* User input for n_rxq + an optional control plane protection queue
> > +         * (see netdev_dpdk_reconfigure). This field is different from the
> > +         * other requested_* fields as it may contain a different value than
> > +         * the user input. */
>
> ^ I don't think this comment is really needed. requested_rxq_size can
> also be adjusted and they don't always come from user input, sometimes
> just the OVS default.
...
> > +        /* User input for n_rxq (see dpdk_set_rxq_config). */
> > +        int user_n_rxq;
>
> I also put in a similar 3rd stage status in recent patches for
> rx/tx-descriptors but managed to get rid of it. You might be able to
> remove this and just check the cp-proto flags and do a +1/-1 where
> needed?

I had already tried this and it was not possible. I will have another
fresh look. Maybe I was thinking about this the wrong way.

> > +        /*
> > +         * Some drivers set reta_size equal to the total number of rxqs that
> > +         * are configured when it is a power of two. Since we are actually
> > +         * reconfiguring the redirection table to exclude the last rxq, we may
> > +         * end up with an imbalanced redirection table. For example, such
> > +         * configuration:
> > +         *
> > +         *   options:n_rxq=3 options:cp-protection=lacp
> > +         *
> > +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
> > +         *
> > +         *   [0, 1, 2, 3]
> > +         *
> > +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
> > +         *
> > +         *   [0, 1, 2, 0]
> > +         *
> > +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
> > +         *
> > +         * Work around that corner case by forcing a bigger redirection table
> > +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
> > +         * and when reta_size is less than 128. This value seems to be
> > +         * supported by most of the drivers that also support rte flow.
> > +         */
> > +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
>
> Thanks for following up on previous comments and adding this
> workaround. I think you didn't get much response about it, so might be
> worth adding a DPDK bugzilla to formally report/track it.

I don't think it is a bug. If the default redirection table is not
modified, there will be proper balancing on all Rx queues. The issue we
have here is that we are reconfiguring the table and removing the last
queue from it which can lead to imbalances in some cases. Do you think
this is worth reporting anyway?

> > +    try_cp_prot = dev->requested_cp_prot_flags != 0;
> > +    dev->requested_n_rxq = dev->user_n_rxq;
> > +    if (try_cp_prot) {
> > +        dev->requested_n_rxq += 1;
>
> Minor, but adjusting the requested_n_rxq's seems more suited to
> dpdk_cp_prot_set_config() (and move it after dpdk_set_rxq_config() in
> netdev_dpdk_set_config()) as this is the function that applies the
> configuration. I admit, it's partly aesthetic, so that this function
> still just checks for changes that require a reconfig and returns if
> there are none :-)

This is related to the introduction of user_n_rxq. I had tried to make
it work and this was the only way. I'll have another fresh look.

Cheers,
Robin
Ilya Maximets May 24, 2023, 3:05 p.m. UTC | #12
On 5/24/23 16:54, Kevin Traynor wrote:
> On 24/05/2023 15:32, Robin Jarry wrote:
>> Kevin Traynor, May 24, 2023 at 16:06:
>>> Hmm, not sure on this one. I have a feeling that having a 'hash' mode
>>> for tx-steering that only applies to vhost devices, and 'hash' mode
>>> for rx-steering that only applies to NICs means people will miss the
>>> subtlety and try to enable the wrong hash mode on the wrong device :-)
>>> So 'rss' might make it more distinct.
>>
>> "rss" is fine then.
>>
>>> Any reason for '+' ? I think commas are used more frequently. If it
>>> needed to be extended in future for some extra config, then adding
>>> 'key:value' pairs would be seamless, e.g. =rss:<reta size?>,
>>> lacp:<num of queues?>
>>
>> The "rss" item should be mandatory anyway. There should be no way to
>> disable it. I assume that it is what Ilya meant with these "+" symbols.
>> That would leave us with these examples:
>>
>> # lacp+bfd on separate queue, rss on other queues
>>
>>      options:rx-steering=rss,lacp,bfd
>>
>> # same
>>
>>      options:rx-steering=lacp,bfd,rss
>>
> 
> ^ It looks a little odd to me that some values are for single queues and 
> some are for the rest of the queues, with no way to distinguish.
> 
> So maybe Ilya had placed significance in the order with his suggestion ? 
> i.e. [default]+[single queue proto]+[other single queue proto]+...

I had a '+' because rss and lacp are two different entities and I looked
at it as a mode of operation. i.e. RSS plus special handling for LACP.
RSS looks strange in a comma-separated list, IMO.

> 
> I don't want to bike shed too much on it, i guess with good docs, it can 
> be clear anyway.
> 
>> # only regular rss, same as no config at all
>>
>>      options:rx-steering=rss
>>
>> # invalid configurations
>>
>>      options:rx-steering=lacp
> 
>>      options:rx-steering=
> (^ This will be ok and use the default rss)
> 
>>      options:rx-steering=bfd,lacp
>>
>> What do you think?
>>
>
Robin Jarry May 24, 2023, 3:05 p.m. UTC | #13
Kevin Traynor, May 24, 2023 at 16:54:
> > The "rss" item should be mandatory anyway. There should be no way to
> > disable it. I assume that it is what Ilya meant with these "+" symbols.
> > That would leave us with these examples:
> > 
> > # lacp+bfd on separate queue, rss on other queues
> > 
> >      options:rx-steering=rss,lacp,bfd
> > 
> > # same
> > 
> >      options:rx-steering=lacp,bfd,rss
>
> ^ It looks a little odd to me that some values are for single queues and 
> some are for the rest of the queues, with no way to distinguish.
>
> So maybe Ilya had placed significance in the order with his suggestion ? 
> i.e. [default]+[single queue proto]+[other single queue proto]+...
>
> I don't want to bike shed too much on it, i guess with good docs, it can 
> be clear anyway.
>
> > # only regular rss, same as no config at all
> > 
> >      options:rx-steering=rss
> > 
> > # invalid configurations
> > 
> >      options:rx-steering=lacp
>
> >      options:rx-steering=
> (^ This will be ok and use the default rss)

In that case, maybe we should exclude "rss" from the available items to
avoid confusion and only require users to specify the protocols that
need to be put in a separate queue. The remaining queues being rss by
default.

Would that be ok?
Kevin Traynor May 25, 2023, 8:47 a.m. UTC | #14
On 24/05/2023 16:05, Robin Jarry wrote:
> Kevin Traynor, May 24, 2023 at 16:54:
>>> The "rss" item should be mandatory anyway. There should be no way to
>>> disable it. I assume that it is what Ilya meant with these "+" symbols.
>>> That would leave us with these examples:
>>>
>>> # lacp+bfd on separate queue, rss on other queues
>>>
>>>       options:rx-steering=rss,lacp,bfd
>>>
>>> # same
>>>
>>>       options:rx-steering=lacp,bfd,rss
>>
>> ^ It looks a little odd to me that some values are for single queues and
>> some are for the rest of the queues, with no way to distinguish.
>>
>> So maybe Ilya had placed significance in the order with his suggestion ?
>> i.e. [default]+[single queue proto]+[other single queue proto]+...
>>
>> I don't want to bike shed too much on it, i guess with good docs, it can
>> be clear anyway.
>>
>>> # only regular rss, same as no config at all
>>>
>>>       options:rx-steering=rss
>>>
>>> # invalid configurations
>>>
>>>       options:rx-steering=lacp
>>
>>>       options:rx-steering=
>> (^ This will be ok and use the default rss)
> 
> In that case, maybe we should exclude "rss" from the available items to
> avoid confusion and only require users to specify the protocols that
> need to be put in a separate queue. The remaining queues being rss by
> default.
> 
> Would that be ok?
> 

Ilya explained the reasoning behind his idea in the other mail and it seems 
fine, so for the sake of progress, I'd say to go with his idea if 
you don't object to it.
Robin Jarry May 25, 2023, 9:25 a.m. UTC | #15
Hey Ilya,

Ilya Maximets, May 24, 2023 at 17:05:
> I had a '+' because rss and lacp are two different entities and I looked
> at it as a mode of operation. i.e. RSS plus special handling for LACP.
> RSS looks strange in a comma-separated list, IMO.

For now, there is only LACP but if other protocols are added (e.g. BFD),
wouldn't it be weird to have them separated as well?

    options:rx-steering=rss+lacp+bfd

Since lacp and bfd will most likely be put in the additional rxq, it
would make sense to identify them as a group.

I also have other reserves about specifying rss here after thinking some
more about it:

- rss shouldn't be disabled anyway, this forces users to always specify
  it. This is not great from a usability point of view.

- When there is a single rxq configured by the user, there is no RSS
  happening per-se since all other traffic will be put in a single
  queue. The additional rxq being reserved for lacp and/or other special
  traffic.

What do you think about removing "rss" altogether from the items?

    options:rx-steering=lacp,bfd,...

Cheers.
Ilya Maximets May 29, 2023, 3:39 p.m. UTC | #16
On 5/25/23 11:25, Robin Jarry wrote:
> Hey Ilya,
> 
> Ilya Maximets, May 24, 2023 at 17:05:
>> I had a '+' because rss and lacp are two different entities and I looked
>> at it as a mode of operation. i.e. RSS plus special handling for LACP.
>> RSS looks strange in a comma-separated list, IMO.
> 
> For now, there is only LACP but if other protocols are added (e.g. BFD),
> wouldn't it be weird to have them separated as well?
> 
>     options:rx-steering=rss+lacp+bfd
> 
> Since lacp and bfd will most likely be put in the additional rxq, it
> would make sense to identify them as a group.
> 
> I also have other reserves about specifying rss here after thinking some
> more about it:
> 
> - rss shouldn't be disabled anyway, this forces users to always specify
>   it. This is not great from a usability point of view.
> 
> - When there is a single rxq configured by the user, there is no RSS
>   happening per-se since all other traffic will be put in a single
>   queue. The additional rxq being reserved for lacp and/or other special
>   traffic.
> 
> What do you think about removing "rss" altogether from the items?
> 
>     options:rx-steering=lacp,bfd,...


IMO, 'rx-steering=lacp' is not descriptive.  By looking at it users can't
tell what is going on without reading the documentation.  So, I think,
the 'rss' part is necessary.

Do you anticipate protocols other than lacp and bfd?  I'd actually just
treat them as enum ('rss', 'rss+lacp', 'rss+bfd' and 'rss+lacp+bfd') and
not a list, and document them as an enum.  We can reconsider in the future
if we'll need more than 2 protocols.  It's experimental for now anyway.

Best regards, Ilya Maximets.
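The enum treatment suggested above could look something like this (a sketch; the mode names are hypothetical):

```c
#include <string.h>

enum rx_steering_mode {
    RX_STEERING_RSS,
    RX_STEERING_RSS_LACP,
    RX_STEERING_RSS_BFD,
    RX_STEERING_RSS_LACP_BFD,
    RX_STEERING_INVALID,
};

/* Map the four fixed values onto an enum instead of parsing a
 * free-form list.  Illustrative only, not code from the patch. */
enum rx_steering_mode rx_steering_parse(const char *value)
{
    if (!strcmp(value, "rss")) {
        return RX_STEERING_RSS;
    } else if (!strcmp(value, "rss+lacp")) {
        return RX_STEERING_RSS_LACP;
    } else if (!strcmp(value, "rss+bfd")) {
        return RX_STEERING_RSS_BFD;
    } else if (!strcmp(value, "rss+lacp+bfd")) {
        return RX_STEERING_RSS_LACP_BFD;
    }
    return RX_STEERING_INVALID;
}
```

Supporting a third protocol would mean adding enum values rather than extending a parser, which fits the "reconsider later" framing while the feature is experimental.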

Patch

diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
index 4b0fe8dded3a..733c4251bca9 100644
--- a/Documentation/topics/dpdk/phy.rst
+++ b/Documentation/topics/dpdk/phy.rst
@@ -131,6 +131,89 @@  possible with DPDK acceleration. It is possible to configure multiple Rx queues
 for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
 information on configuring PMD threads, refer to :doc:`pmd`.
 
+Control Plane Protection
+------------------------
+
+.. warning:: This feature is experimental.
+
+Some control protocols are used to maintain link status between forwarding
+engines. In SDN environments, these packets share the same physical network
+as the user data traffic.
+
+When the system is not sized properly, the PMD threads may not be able to
+process all incoming traffic from the configured Rx queues. When a signaling
+packet of such protocols is dropped, it can cause link flapping, worsening the
+situation.
+
+Some physical NICs can be programmed to put these protocols in a dedicated
+hardware Rx queue using the rte_flow__ API.
+
+__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
+
+The currently supported control plane protocols are:
+
+``lacp``
+   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
+
+   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
+
+.. warning::
+
+   This feature is not compatible with all NICs. Refer to the DPDK
+   `compatibility matrix`__ and vendor documentation for more details.
+
+   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
+
+Control plane protection must be enabled on specific protocols per port. The
+``cp-protection`` option requires a comma-separated list of protocol names::
+
+   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
+        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
+        options:cp-protection=lacp
+
+.. note::
+
+   If multiple Rx queues are already configured, regular RSS (Receive Side
+   Scaling) queue balancing is done on all but the extra control plane
+   protection queue.
+
+.. tip::
+
+   You can check if control plane protection is supported on a port with the
+   following command::
+
+      $ ovs-vsctl get interface dpdk-p0 status
+      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
+
+   This will also show in ``ovs-vswitchd.log``::
+
+      INFO|dpdk-p0: cp-protection: redirecting lacp traffic to queue 2
+      INFO|dpdk-p0: cp-protection: applying rss on queues 0-1
+
+   If the hardware does not support redirecting control plane traffic to
+   a dedicated queue, it will be explicit::
+
+      $ ovs-vsctl get interface dpdk-p0 status
+      {cp_protection=unsupported, driver_name=...}
+
+   More details can often be found in ``ovs-vswitchd.log``::
+
+      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
+
+To disable control plane protection on a port, use the following command::
+
+   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
+
+You can see that it has been disabled in ``ovs-vswitchd.log``::
+
+   INFO|dpdk-p0: cp-protection: disabled
+
+.. warning::
+
+   This feature is mutually exclusive with ``other_options:hw-offload`` as it
+   may conflict with the offloaded RTE flows. If both are enabled,
+   ``cp-protection`` will be forcibly disabled.
+
 .. _dpdk-phy-flow-control:
 
 Flow Control
diff --git a/NEWS b/NEWS
index b6418c36e95e..43338edaf5aa 100644
--- a/NEWS
+++ b/NEWS
@@ -21,6 +21,9 @@  Post-v3.1.0
      * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
        with the --hw-rawio-access command line option.  This allows the
        process extra privileges when mapping physical interconnect memory.
+     * New experimental "cp-protection=<protocol>" option to redirect certain
+       protocols (for now, only LACP) to a dedicated hardware queue using
+       RTE flow.
    - SRv6 Tunnel Protocol
      * Added support for userspace datapath (only).
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index fb0dd43f75c5..f33c6613adea 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -415,6 +415,10 @@  enum dpdk_hw_ol_features {
     NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
 };
 
+enum dpdk_cp_prot_flags {
+    DPDK_CP_PROT_LACP = 1 << 0,
+};
+
 /*
  * In order to avoid confusion in variables names, following naming convention
  * should be used, if possible:
@@ -504,11 +508,18 @@  struct netdev_dpdk {
          * so we remember the request and update them next time
          * netdev_dpdk*_reconfigure() is called */
         int requested_mtu;
+        /* User input for n_rxq + an optional control plane protection queue
+         * (see netdev_dpdk_reconfigure). This field is different from the
+         * other requested_* fields as it may contain a different value than
+         * the user input. */
         int requested_n_txq;
         int requested_n_rxq;
         int requested_rxq_size;
         int requested_txq_size;
 
+        /* User input for n_rxq (see dpdk_set_rxq_config). */
+        int user_n_rxq;
+
         /* Number of rx/tx descriptors for physical devices */
         int rxq_size;
         int txq_size;
@@ -534,6 +545,13 @@  struct netdev_dpdk {
 
         /* VF configuration. */
         struct eth_addr requested_hwaddr;
+
+        /* Requested control plane protection flags,
+         * from the enum set 'dpdk_cp_prot_flags'. */
+        uint64_t requested_cp_prot_flags;
+        uint64_t cp_prot_flags;
+        size_t cp_prot_flows_num;
+        struct rte_flow **cp_prot_flows;
     );
 
     PADDED_MEMBERS(CACHE_LINE_SIZE,
@@ -1310,9 +1328,14 @@  common_construct(struct netdev *netdev, dpdk_port_t port_no,
     netdev->n_rxq = 0;
     netdev->n_txq = 0;
     dev->requested_n_rxq = NR_QUEUE;
+    dev->user_n_rxq = NR_QUEUE;
     dev->requested_n_txq = NR_QUEUE;
     dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
     dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
+    dev->requested_cp_prot_flags = 0;
+    dev->cp_prot_flags = 0;
+    dev->cp_prot_flows_num = 0;
+    dev->cp_prot_flows = NULL;
 
     /* Initialize the flow control to NULL */
     memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
@@ -1487,6 +1510,8 @@  common_destruct(struct netdev_dpdk *dev)
     ovs_mutex_destroy(&dev->mutex);
 }
 
+static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
+
 static void
 netdev_dpdk_destruct(struct netdev *netdev)
 {
@@ -1494,6 +1519,9 @@  netdev_dpdk_destruct(struct netdev *netdev)
 
     ovs_mutex_lock(&dpdk_mutex);
 
+    /* Destroy any rte flows to allow RXQs to be removed. */
+    dpdk_cp_prot_unconfigure(dev);
+
     rte_eth_dev_stop(dev->port_id);
     dev->started = false;
 
@@ -1908,8 +1936,8 @@  dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
     int new_n_rxq;
 
     new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
-    if (new_n_rxq != dev->requested_n_rxq) {
-        dev->requested_n_rxq = new_n_rxq;
+    if (new_n_rxq != dev->user_n_rxq) {
+        dev->user_n_rxq = new_n_rxq;
         netdev_request_reconfigure(&dev->up);
     }
 }
@@ -1931,6 +1959,48 @@  dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
     }
 }
 
+static void
+dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
+                        const struct smap *args, char **errp)
+{
+    const char *arg = smap_get_def(args, "cp-protection", "");
+    char *token, *saveptr, *buf;
+    uint64_t flags = 0;
+
+    buf = xstrdup(arg);
+    token = strtok_r(buf, ",", &saveptr);
+    while (token) {
+        if (strcmp(token, "lacp") == 0) {
+            flags |= DPDK_CP_PROT_LACP;
+        } else {
+            VLOG_WARN_BUF(errp, "%s options:cp-protection "
+                          "unknown protocol '%s'",
+                          netdev_get_name(netdev), token);
+        }
+        token = strtok_r(NULL, ",", &saveptr);
+    }
+    free(buf);
+
+    if (flags && dev->type != DPDK_DEV_ETH) {
+        VLOG_WARN_BUF(errp, "%s options:cp-protection "
+                      "is only supported on ethernet ports",
+                      netdev_get_name(netdev));
+        flags = 0;
+    }
+
+    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
+        VLOG_WARN_BUF(errp, "%s options:cp-protection "
+                      "is incompatible with hw-offload",
+                      netdev_get_name(netdev));
+        flags = 0;
+    }
+
+    if (flags != dev->requested_cp_prot_flags) {
+        dev->requested_cp_prot_flags = flags;
+        netdev_request_reconfigure(netdev);
+    }
+}
+
 static int
 netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
                        char **errp)
@@ -1950,6 +2020,8 @@  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
     ovs_mutex_lock(&dpdk_mutex);
     ovs_mutex_lock(&dev->mutex);
 
+    dpdk_cp_prot_set_config(netdev, dev, args, errp);
+
     dpdk_set_rxq_config(dev, args);
 
     dpdk_process_queue_size(netdev, args, "n_rxq_desc",
@@ -3819,9 +3891,12 @@  netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     struct rte_eth_dev_info dev_info;
+    size_t cp_prot_flows_num;
+    uint64_t cp_prot_flags;
     const char *bus_info;
     uint32_t link_speed;
     uint32_t dev_flags;
+    int n_rxq;
 
     if (!rte_eth_dev_is_valid_port(dev->port_id)) {
         return ENODEV;
@@ -3833,6 +3908,9 @@  netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
     link_speed = dev->link.link_speed;
     dev_flags = *dev_info.dev_flags;
     bus_info = rte_dev_bus_info(dev_info.device);
+    cp_prot_flags = dev->cp_prot_flags;
+    cp_prot_flows_num = dev->cp_prot_flows_num;
+    n_rxq = netdev->n_rxq;
     ovs_mutex_unlock(&dev->mutex);
     ovs_mutex_unlock(&dpdk_mutex);
 
@@ -3875,6 +3953,19 @@  netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
                         ETH_ADDR_ARGS(dev->hwaddr));
     }
 
+    if (cp_prot_flags) {
+        if (!cp_prot_flows_num) {
+            smap_add(args, "cp_protection", "unsupported");
+        } else {
+            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
+            if (n_rxq > 2) {
+                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
+            } else {
+                smap_add(args, "rss_queues", "0");
+            }
+        }
+    }
+
     return 0;
 }
 
@@ -5178,16 +5269,203 @@  static const struct dpdk_qos_ops trtcm_policer_ops = {
     .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
 };
 
+static int
+dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
+                      const struct rte_flow_item items[],
+                      const char *desc)
+{
+    const struct rte_flow_attr attr = { .ingress = 1 };
+    const struct rte_flow_action actions[] = {
+        {
+            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
+            .conf = &(const struct rte_flow_action_queue) {
+                .index = dev->up.n_rxq - 1,
+            },
+        },
+        { .type = RTE_FLOW_ACTION_TYPE_END },
+    };
+    struct rte_flow_error error;
+    struct rte_flow *flow;
+    size_t num;
+    int err;
+
+    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
+    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
+    if (err) {
+        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
+                  netdev_get_name(&dev->up), desc,
+                  error.message ? error.message : "");
+        goto out;
+    }
+
+    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
+    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
+    if (flow == NULL) {
+        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
+                  netdev_get_name(&dev->up), desc,
+                  error.message ? error.message : "");
+        err = rte_errno;
+        goto out;
+    }
+
+    num = dev->cp_prot_flows_num + 1;
+    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
+    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
+    dev->cp_prot_flows_num = num;
+
+    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
+              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
+out:
+    return err;
+}
+
+#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
+
+static int
+dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
+{
+    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
+    struct rte_eth_dev_info info;
+    int err;
+
+    rte_eth_dev_info_get(dev->port_id, &info);
+
+    if (info.reta_size % rss_n_rxq != 0 &&
+        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
+        /*
+         * Some drivers set reta_size equal to the total number of rxqs that
+         * are configured when it is a power of two. Since we are actually
+         * reconfiguring the redirection table to exclude the last rxq, we may
+         * end up with an imbalanced redirection table. For example, such
+         * configuration:
+         *
+         *   options:n_rxq=3 options:cp-protection=lacp
+         *
+         * Will actually configure 4 rxqs on the NIC, and the default reta to:
+         *
+         *   [0, 1, 2, 3]
+         *
+         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
+         *
+         *   [0, 1, 2, 0]
+         *
+         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
+         *
+         * Work around this corner case by forcing the redirection table
+         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
+         * and reta_size is less than 128. This size seems to be supported
+         * by most of the drivers that also support rte_flow.
+         */
+        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
+    }
+
+    memset(reta_conf, 0, sizeof(reta_conf));
+    for (uint16_t i = 0; i < info.reta_size; i++) {
+        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
+        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
+        reta_conf[idx].mask |= 1ULL << shift;
+        reta_conf[idx].reta[shift] = i % rss_n_rxq;
+    }
+    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
+    if (err < 0) {
+        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
+                  netdev_get_name(&dev->up), err);
+    }
+
+    return err;
+}
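(The imbalance described in the comment above can be checked without DPDK: entry i of the table maps to queue `i % rss_n_rxq`, so a 4-entry table spread over 3 queues gives queue 0 two entries, while widening to 128 entries leaves at most a one-entry difference between queues. A standalone check of that fill logic:)

```c
/* Standalone model of the reta fill in dpdk_cp_prot_rss_configure():
 * table entry i is assigned to queue i % rss_n_rxq. */
#include <assert.h>
#include <stdint.h>

static void
fill_reta(uint16_t reta_size, int rss_n_rxq, int counts[])
{
    for (uint16_t i = 0; i < reta_size; i++) {
        counts[i % rss_n_rxq]++;
    }
}
```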
+
+static int
+dpdk_cp_prot_configure(struct netdev_dpdk *dev)
+{
+    int err = 0;
+
+    if (dev->up.n_rxq < 2) {
+        err = ENOTSUP;
+        VLOG_WARN("%s: cp-protection: not enough available rx queues",
+                  netdev_get_name(&dev->up));
+        goto out;
+    }
+
+    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
+        const struct rte_flow_item items[] = {
+            {
+                .type = RTE_FLOW_ITEM_TYPE_ETH,
+                .spec = &(const struct rte_flow_item_eth){
+                    .type = htons(ETH_TYPE_LACP),
+                },
+                .mask = &(const struct rte_flow_item_eth){
+                    .type = htons(0xffff),
+                },
+            },
+            { .type = RTE_FLOW_ITEM_TYPE_END },
+        };
+        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
+        if (err) {
+            goto out;
+        }
+    }
+
+    if (dev->cp_prot_flows_num) {
+        /* Reconfigure RSS reta in all but the cp protection queue. */
+        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
+        if (!err) {
+            if (dev->up.n_rxq == 2) {
+                VLOG_INFO("%s: cp-protection: redirected other traffic to "
+                          "rx queue 0", netdev_get_name(&dev->up));
+            } else {
+                VLOG_INFO("%s: cp-protection: applied rss on rx queue 0-%d",
+                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
+            }
+        }
+    }
+
+out:
+    return err;
+}
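(The LACP flow item above keys on the IEEE 802.3 Slow Protocols EtherType, 0x8809, which is the value behind OVS's ETH_TYPE_LACP, in network byte order with a full 16-bit mask so only exact matches are redirected. A minimal standalone illustration of that match key; `is_lacp_ethertype()` is illustrative, not part of the patch:)

```c
/* The flow item matches the Slow Protocols EtherType (0x8809) in
 * network byte order, against a full 0xffff mask. */
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

static int
is_lacp_ethertype(uint16_t wire_type_be)
{
    const uint16_t lacp_be = htons(0x8809); /* ETH_TYPE_LACP */
    const uint16_t mask_be = htons(0xffff);

    return (wire_type_be & mask_be) == lacp_be;
}
```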
+
+static void
+dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
+{
+    struct rte_flow_error error;
+
+    if (!dev->cp_prot_flows_num) {
+        return;
+    }
+
+    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
+
+    for (size_t i = 0; i < dev->cp_prot_flows_num; i++) {
+        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
+        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
+            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
+                     netdev_get_name(&dev->up),
+                     error.message ? error.message : "");
+        }
+    }
+    free(dev->cp_prot_flows);
+    dev->cp_prot_flows_num = 0;
+    dev->cp_prot_flows = NULL;
+}
+
 static int
 netdev_dpdk_reconfigure(struct netdev *netdev)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    bool try_cp_prot;
     int err = 0;
 
     ovs_mutex_lock(&dev->mutex);
 
+    try_cp_prot = dev->requested_cp_prot_flags != 0;
+    dev->requested_n_rxq = dev->user_n_rxq;
+    if (try_cp_prot) {
+        dev->requested_n_rxq += 1;
+    }
+
     if (netdev->n_txq == dev->requested_n_txq
         && netdev->n_rxq == dev->requested_n_rxq
+        && dev->cp_prot_flags == dev->requested_cp_prot_flags
         && dev->mtu == dev->requested_mtu
         && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
         && dev->rxq_size == dev->requested_rxq_size
@@ -5200,6 +5478,9 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
         goto out;
     }
 
+retry:
+    dpdk_cp_prot_unconfigure(dev);
+
     if (dev->reset_needed) {
         rte_eth_dev_reset(dev->port_id);
         if_notifier_manual_report();
@@ -5224,6 +5505,7 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
     dev->txq_size = dev->requested_txq_size;
 
     rte_free(dev->tx_q);
+    dev->tx_q = NULL;
 
     if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
         err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
@@ -5255,6 +5537,23 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
      */
     dev->requested_hwaddr = dev->hwaddr;
 
+    if (try_cp_prot) {
+        err = dpdk_cp_prot_configure(dev);
+        if (err) {
+            /* No hw support, disable & recover gracefully. */
+            try_cp_prot = false;
+            /*
+             * The extra queue must be explicitly removed here to ensure that
+             * it is unconfigured immediately.
+             */
+            dev->requested_n_rxq = dev->user_n_rxq;
+            goto retry;
+        }
+    } else {
+        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
+    }
+    dev->cp_prot_flags = dev->requested_cp_prot_flags;
+
     dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
     if (!dev->tx_q) {
         err = ENOMEM;
@@ -5467,6 +5766,13 @@  netdev_dpdk_flow_api_supported(struct netdev *netdev)
     dev = netdev_dpdk_cast(netdev);
     ovs_mutex_lock(&dev->mutex);
     if (dev->type == DPDK_DEV_ETH) {
+        if (dev->requested_cp_prot_flags) {
+            VLOG_WARN("%s: disabling cp-protection as it is "
+                      "mutually exclusive with hw-offload.",
+                      netdev_get_name(netdev));
+            dev->requested_cp_prot_flags = 0;
+            netdev_request_reconfigure(netdev);
+        }
         /* TODO: Check if we able to offload some minimal flow. */
         ret = true;
     }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index edb5eafa04c3..7ba8bf192d62 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3491,6 +3491,33 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         <p>This option may only be used with dpdk VF representors.</p>
       </column>
 
+      <column name="options" key="cp-protection"
+              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
+        <p>
+          Allocate an extra Rx queue for control plane packets of the specified
+          protocol(s).
+        </p>
+        <p>
+          If the user has already configured multiple Rx queues with
+          <code>options:n_rxq</code> on the port, an additional one will be
+          allocated for control plane packets. If the hardware cannot satisfy
+          the requested number of Rx queues, the last Rx queue will be
+          assigned for control plane. If only one Rx queue is available or
+          if the hardware does not support the RTE flow matchers/actions
+          required to redirect the selected protocols,
+          <code>cp-protection</code> will be disabled.
+        </p>
+        <p>
+          This feature is mutually exclusive with
+          <code>other_config:hw-offload</code> as it may conflict with the
+          offloaded RTE flows. If both are enabled, <code>cp-protection</code>
+          will be forcibly disabled.
+        </p>
+        <p>
+          Disabled by default.
+        </p>
+      </column>
+
       <column name="other_config" key="tx-steering"
               type='{"type": "string",
                      "enum": ["set", ["thread", "hash"]]}'>
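
(For reference, enabling the feature would look like the following; the interface name dpdk0 is illustrative, not from the patch:)

```shell
# Illustrative usage: redirect LACP traffic on a DPDK port to a
# dedicated control plane Rx queue.
ovs-vsctl set interface dpdk0 options:cp-protection=lacp
# Inspect the status fields added by this patch
# (cp_protection_queue / rss_queues, or "unsupported"):
ovs-vsctl get interface dpdk0 status
# Disable again by clearing the option:
ovs-vsctl remove interface dpdk0 options cp-protection
```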