
[ovs-dev,v4] dpif-netdev: Allow cross-NUMA polling on selected ports

Message ID 20220608102830.882-1-anurag2k@gmail.com
State Superseded
Headers show
Series [ovs-dev,v4] dpif-netdev: Allow cross-NUMA polling on selected ports

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Anurag Agarwal June 8, 2022, 10:28 a.m. UTC
From: Jan Scheurich <jan.scheurich@ericsson.com>

Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
assignment of the rxqs of a port only if there are no local, non-isolated PMDs.

On typical servers with both physical ports on one NUMA node, this often
leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources.
The alternative, to manually pin the rxqs to PMDs on remote NUMA nodes, also
has drawbacks as it limits OVS' ability to auto load-balance the rxqs.

This patch introduces a new interface configuration option to allow ports to
be automatically polled by PMDs on any NUMA node:

ovs-vsctl set interface <Name> other_config:cross-numa-polling=true

The group assignment algorithm can now select the lowest-loaded PMD on any
NUMA node, not just the local NUMA node on which the rxq of the port resides.

If this option is not present or set to false, legacy behaviour applies.

Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
---

Changes in this patch:
- Addressed comments from Kevin Traynor

Please refer to this thread for an earlier discussion on this topic:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
---
 Documentation/topics/dpdk/pmd.rst |  23 ++++++
 lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
 tests/pmd.at                      |  38 +++++++++
 vswitchd/vswitch.xml              |  20 +++++
 4 files changed, 177 insertions(+), 34 deletions(-)
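The selection logic described in the commit message can be sketched as a minimal Python model (illustrative names and data shapes, not the actual OVS structures; the real implementation is `sched_pmd_all_numa_get_lowest()` plus the extended tie-break in `sched_pmd_new_lowest()` in `lib/dpif-netdev.c`):

```python
# Toy model of cross-NUMA PMD selection: scan the non-isolated PMDs of
# every NUMA node, keep the least-loaded one, and break load ties in
# favour of a PMD on the port's local NUMA node.

def pick_pmd(pmds, port_numa_id):
    """pmds: list of dicts with 'numa_id', 'load' and 'isolated' keys."""
    best = None
    for pmd in pmds:
        if pmd["isolated"]:
            # Isolated PMDs are never considered for automatic assignment.
            continue
        if best is None or pmd["load"] < best["load"]:
            best = pmd
        elif (pmd["load"] == best["load"]
                and best["numa_id"] != port_numa_id
                and pmd["numa_id"] == port_numa_id):
            # Tie: switch only if the new PMD is NUMA-local to the port,
            # mirroring the tie-break added to sched_pmd_new_lowest().
            best = pmd
    return best
```

For example, with two equally idle PMDs on NUMA 0 and NUMA 1, an rxq of a port on NUMA 0 lands on the NUMA-0 PMD; with unequal loads, the lower-loaded PMD wins regardless of locality.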

Comments

Anurag Agarwal June 17, 2022, 4:57 a.m. UTC | #1
Soft reminder! Kindly share your feedback on this patch. 

Regards,
Anurag

> -----Original Message-----
> From: Anurag Agarwal <anurag2k@gmail.com>
> Sent: Wednesday, June 8, 2022 3:59 PM
> To: ovs-dev@openvswitch.org
> Cc: Jan Scheurich <jan.scheurich@ericsson.com>; Anurag Agarwal
> <anurag.agarwal@ericsson.com>
> Subject: [PATCH v4] dpif-netdev: Allow cross-NUMA polling on selected ports
> 
> [...]
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..387f962d1 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> 
>      $ ovs-appctl dpif-netdev/pmd-rxq-show
> 
> +Normally, Rx queues are assigned to PMD threads automatically.  By default
> +OVS only assigns Rx queues to PMD threads executing on the same NUMA
> +node in order to avoid unnecessary latency for accessing packet buffers
> +across the NUMA boundary.  Typically this overhead is higher for vhostuser
> +ports than for physical ports due to the packet copy that is done for all
> +rx packets.
> +
> +On NUMA servers with physical ports only on one NUMA node, the NUMA-local
> +polling policy can lead to an under-utilization of the PMD threads on the
> +remote NUMA node.  For the overall OVS performance it may in such cases be
> +beneficial to utilize the spare capacity and allow polling of a physical
> +port's rxqs across NUMA nodes despite the overhead involved.
> +The policy can be set per port with the following configuration option::
> +
> +    $ ovs-vsctl set Interface <iface> \
> +        other_config:cross-numa-polling=true|false
> +
> +The default value is false.
> +
>  .. note::
> 
>     A history of one minute is recorded and shown for each Rx queue to allow for
> @@ -115,6 +134,10 @@ core cycles for each Rx queue::
>     A ``overhead`` statistics is shown per PMD: it represents the number of
>     cycles inherently consumed by the OVS PMD processing loop.
> 
> +.. versionchanged:: 2.18.0
> +
> +   Added the interface parameter ``other_config:cross-numa-polling``
> +
>  Rx queue to PMD assignment takes place whenever there are configuration changes
>  or can be triggered by using::
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index ff57b3961..86f88964b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -467,6 +467,7 @@ struct dp_netdev_port {
>      char *type;                 /* Port type as requested by user. */
>      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
>      enum txq_req_mode txq_requested_mode;
> +    bool cross_numa_polling;    /* If true cross polling will be enabled. */
>  };
> 
>  static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
>      port->sf = NULL;
>      port->emc_enabled = true;
>      port->need_reconfigure = true;
> +    port->cross_numa_polling = false;
>      ovs_mutex_init(&port->txq_used_mutex);
> 
>      *portp = port;
> @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
>      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
>      enum txq_req_mode txq_mode;
> +    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
> 
>      ovs_rwlock_wrlock(&dp->port_rwlock);
>      error = get_port_by_number(dp, port_no, &port);
> @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>          goto unlock;
>      }
> 
> +    if (cross_numa_polling != port->cross_numa_polling) {
> +        port->cross_numa_polling = cross_numa_polling;
> +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> +                  netdev_get_name(port->netdev),
> +                  cross_numa_polling? "enabled" : "disabled");
> +        dp_netdev_request_reconfigure(dp);
> +    }
> +
>      if (emc_enabled != port->emc_enabled) {
>          struct dp_netdev_pmd_thread *pmd;
>          struct ds ds = DS_EMPTY_INITIALIZER;
> @@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
> 
>  static bool
>  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> -                     bool has_proc)
> +                     int port_numa_id, bool has_proc)
>  {
>      uint64_t current_num, pmd_num;
> 
> @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
>          pmd_num = pmd->n_rxq;
>      }
> 
> -    if (pmd_num < current_num) {
> -        return true;
> +    if (pmd_num != current_num) {
> +        return (pmd_num < current_num) ? true : false;
>      }
> -    return false;
> +
> +    /* In case of a tie, select the new PMD only if the existing PMD
> +     * assigned is from non-local NUMA
> +     */
> +    return (current_lowest->numa->numa_id != port_numa_id &&
> +            pmd->numa->numa_id == port_numa_id) ? true : false;
> +
>  }
> 
>  static struct sched_pmd *
>  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>  {
> -    struct sched_pmd *lowest_sched_pmd = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;
> 
>      for (unsigned i = 0; i < numa->n_pmds; i++) {
>          struct sched_pmd *sched_pmd;
> @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>          if (sched_pmd->isolated) {
>              continue;
>          }
> -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> -            lowest_sched_pmd = sched_pmd;
> +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
> +            lowest_pmd = sched_pmd;
>          }
>      }
> -    return lowest_sched_pmd;
> +    return lowest_pmd;
>  }
> 
>  /*
> @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
>      return a;
>  }
> 
> +static struct sched_pmd *
> +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> +                              int port_numa_id, bool has_proc) {
> +    int n_numa;
> +    struct sched_numa *numa = NULL;
> +    struct sched_numa *last_numa = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;
> +    struct sched_pmd *pmd;
> +
> +    n_numa = sched_numa_list_count(numa_list);
> +    /* For all numas. */
> +    for (int i = 0; i < n_numa; i++) {
> +        last_numa = numa;
> +        numa = sched_numa_list_next(numa_list, last_numa);
> +
> +        /* Get the lowest pmd per numa. */
> +        pmd = sched_pmd_get_lowest(numa, has_proc);
> +
> +        /* Check if it's the lowest pmd for all numas. */
> +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> +            lowest_pmd = pmd;
> +        }
> +    }
> +    return lowest_pmd;
> +}
> +
>  static void
>  sched_numa_list_schedule(struct sched_numa_list *numa_list,
>                           struct dp_netdev *dp,
> @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>          struct sched_pmd *sched_pmd = NULL;
>          struct sched_numa *numa;
>          int port_numa_id;
> +        bool cross_numa = rxq->port->cross_numa_polling;
>          uint64_t proc_cycles;
>          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
> 
> @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              start_logged = true;
>          }
> 
> -        /* Store the cycles for this rxq as we will log these later. */
> +        /* Store the cycles for this rxq as we will need these later. */
>          proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
> 
>          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
> 
> -        /* Select numa. */
> -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> -
> -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> -            /* Unable to use this numa to find a PMD. */
> -            numa = NULL;
> -            /* Find any numa with available PMDs. */
> -            for (int j = 0; j < n_numa; j++) {
> -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> -                last_cross_numa = numa;
> -                if (sched_numa_noniso_pmd_count(numa)) {
> -                    break;
> -                }
> +        if (cross_numa && algo == SCHED_GROUP) {
> +            /* cross_numa polling enabled so find lowest loaded pmd across
> +             * all numas. */
> +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> +                                                      proc_cycles);
> +        } else {
> +            /* Select numa. */
> +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> +
> +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> +                /* Unable to use this numa to find a PMD. */
>                  numa = NULL;
> +                /* Find any numa with available PMDs. */
> +                for (int j = 0; j < n_numa; j++) {
> +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> +                    last_cross_numa = numa;
> +                    if (sched_numa_noniso_pmd_count(numa)) {
> +                        break;
> +                    }
> +                    numa = NULL;
> +                }
>              }
> -        }
> 
> -        if (numa) {
> -            /* Select the PMD that should be used for this rxq. */
> -            sched_pmd = sched_pmd_next(numa, algo,
> -                                       proc_cycles ? true : false);
> +            if (numa) {
> +                /* Select the PMD that should be used for this rxq. */
> +                sched_pmd = sched_pmd_next(numa, algo,
> +                                           proc_cycles ? true : false);
> +            }
>          }
> 
>          /* Check that a pmd has been selected. */
> @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              pmd_numa_id = sched_pmd->numa->numa_id;
>              /* Check if selected pmd numa matches port numa. */
>              if (pmd_numa_id != port_numa_id) {
> -                VLOG(level, "There's no available (non-isolated) pmd thread "
> -                            "on numa node %d. Port \'%s\' rx queue %d will "
> -                            "be assigned to a pmd on numa node %d. "
> -                            "This may lead to reduced performance.",
> -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                if (cross_numa) {
> +                    VLOG(level, "Cross-numa polling has been selected for "
> +                                "Port \'%s\' rx queue %d on numa node %d. "
> +                                "It will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> +                                pmd_numa_id);
> +
> +                } else {
> +                    VLOG(level, "There's no available (non-isolated) pmd "
> +                                "thread on numa node %d. Port \'%s\' rx queue "
> +                                "%d will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                }
>              }
>              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
>                          "rx queue %d%s.",
> diff --git a/tests/pmd.at b/tests/pmd.at
> index e6b173dab..5d2ec1382 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
> 
> +AT_SETUP([PMD - enable cross numa polling])
> +OVS_VSWITCHD_START([add-port br0 p0 \
> +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> +                    -- set Interface p0 options:numa_id=0 \
> +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> +                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
> +                   [], [], [--dummy-numa 0,1])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> +0
> +])
> +
> +dnl Enable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been enabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> +0
> +1
> +])
> +
> +dnl Disable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been disabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> +0
> +])
> +
> +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> +AT_CLEANUP
> +
> +
>  AT_SETUP([PMD - change numa node])
>  OVS_VSWITCHD_START(
>    [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index cc1dd77ec..c2e948bd9 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
> 
> +      <column name="other_config" key="cross-numa-polling"
> +                                  type='{"type": "boolean"}'>
> +        <p>
> +          Specifies if the RX queues of the port can be automatically assigned
> +          to PMD threads on any NUMA node or only on the local NUMA node of
> +          the port.
> +        </p>
> +        <p>
> +          Polling of physical ports from a non-local PMD thread incurs some
> +          performance penalty due to the access to packet data across the NUMA
> +          barrier. This option can still increase the overall performance if
> +          it allows better utilization of those non-local PMDs threads.
> +          It is most useful together with the auto load-balancing of RX queues
> +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>
> +        </p>
> +        <p>
> +          Defaults to false.
> +        </p>
> +      </column>
> +
>        <column name="options" key="xdp-mode"
>                type='{"type": "string",
>                       "enum": ["set", ["best-effort", "native-with-zerocopy",
> --
> 2.25.1
Cheng Li June 17, 2022, 8:58 a.m. UTC | #2
On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> [...]
> @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>          struct sched_pmd *sched_pmd = NULL;
>          struct sched_numa *numa;
>          int port_numa_id;
> +        bool cross_numa = rxq->port->cross_numa_polling;

It may be better to schedule non-cross-numa rxqs first, and then cross-numa rxqs.
Otherwise, one numa may hold much more load because of later scheduled
non-cross-numa rxqs.
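Cheng Li's concern can be illustrated with a toy greedy model (hypothetical cycle counts and one non-isolated PMD per NUMA node, not OVS code): when a cross-NUMA rxq is placed before a NUMA-pinned one, the pinned rxq can pile onto the same node, whereas scheduling the pinned rxqs first leaves the spare node for the flexible one.

```python
# Toy greedy assignment sketch: rxqs is a list of (cycles, allowed_numas)
# processed in order; each rxq goes to the least-loaded allowed node.

def assign(rxqs, pmd_load):
    """pmd_load maps numa_id -> accumulated cycles on that node's PMD."""
    for cycles, allowed in rxqs:
        target = min(allowed, key=lambda n: pmd_load[n])
        pmd_load[target] += cycles
    return pmd_load

# Cross-NUMA rxq (50 cycles, port on numa 0, may go anywhere) scheduled
# before a pinned rxq (100 cycles, numa 0 only): numa 0 takes all the load.
first = assign([(50, [0, 1]), (100, [0])], {0: 0, 1: 0})   # {0: 150, 1: 0}
# Pinned rxq first, cross-NUMA rxq second: the load is spread out.
second = assign([(100, [0]), (50, [0, 1])], {0: 0, 1: 0})  # {0: 100, 1: 50}
```

In the first ordering the cross-NUMA rxq ties at zero load and stays NUMA-local, so the later pinned rxq lands on the same node; reversing the order halves the imbalance.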

>          uint64_t proc_cycles;
>          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
>  
> @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              start_logged = true;
>          }
>  
> -        /* Store the cycles for this rxq as we will log these later. */
> +        /* Store the cycles for this rxq as we will need these later. */
>          proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
>  
>          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
>  
> -        /* Select numa. */
> -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> -
> -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> -            /* Unable to use this numa to find a PMD. */
> -            numa = NULL;
> -            /* Find any numa with available PMDs. */
> -            for (int j = 0; j < n_numa; j++) {
> -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> -                last_cross_numa = numa;
> -                if (sched_numa_noniso_pmd_count(numa)) {
> -                    break;
> -                }
> +        if (cross_numa && algo == SCHED_GROUP) {
> +            /* cross_numa polling enabled so find lowest loaded pmd across
> +             * all numas. */
> +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> +                                                      proc_cycles);
> +        } else {
> +            /* Select numa. */
> +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> +
> +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> +                /* Unable to use this numa to find a PMD. */
>                  numa = NULL;
> +                /* Find any numa with available PMDs. */
> +                for (int j = 0; j < n_numa; j++) {
> +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> +                    last_cross_numa = numa;
> +                    if (sched_numa_noniso_pmd_count(numa)) {
> +                        break;
> +                    }
> +                    numa = NULL;
> +                }
>              }
> -        }
>  
> -        if (numa) {
> -            /* Select the PMD that should be used for this rxq. */
> -            sched_pmd = sched_pmd_next(numa, algo,
> -                                       proc_cycles ? true : false);
> +            if (numa) {
> +                /* Select the PMD that should be used for this rxq. */
> +                sched_pmd = sched_pmd_next(numa, algo,
> +                                           proc_cycles ? true : false);
> +            }
>          }
>  
>          /* Check that a pmd has been selected. */
> @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              pmd_numa_id = sched_pmd->numa->numa_id;
>              /* Check if selected pmd numa matches port numa. */
>              if (pmd_numa_id != port_numa_id) {
> -                VLOG(level, "There's no available (non-isolated) pmd thread "
> -                            "on numa node %d. Port \'%s\' rx queue %d will "
> -                            "be assigned to a pmd on numa node %d. "
> -                            "This may lead to reduced performance.",
> -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                if (cross_numa) {
> +                    VLOG(level, "Cross-numa polling has been selected for "
> +                                "Port \'%s\' rx queue %d on numa node %d. "
> +                                "It will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> +                                pmd_numa_id);
> +
> +                } else {
> +                    VLOG(level, "There's no available (non-isolated) pmd "
> +                                "thread on numa node %d. Port \'%s\' rx queue "
> +                                "%d will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                }
>              }
>              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
>                          "rx queue %d%s.",
> diff --git a/tests/pmd.at b/tests/pmd.at
> index e6b173dab..5d2ec1382 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
>  
> +AT_SETUP([PMD - enable cross numa polling])
> +OVS_VSWITCHD_START([add-port br0 p0 \
> +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> +                    -- set Interface p0 options:numa_id=0 \
> +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> +                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
> +                   [], [], [--dummy-numa 0,1])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +])
> +
> +dnl Enable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been enabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +1
> +])
> +
> +dnl Disable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been disabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +])
> +
> +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> +AT_CLEANUP
> +
> +
>  AT_SETUP([PMD - change numa node])
>  OVS_VSWITCHD_START(
>    [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index cc1dd77ec..c2e948bd9 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="other_config" key="cross-numa-polling"
> +                                  type='{"type": "boolean"}'>
> +        <p>
> +          Specifies if the RX queues of the port can be automatically assigned
> +          to PMD threads on any NUMA node or only on the local NUMA node of
> +          the port.
> +        </p>
> +        <p>
> +          Polling of physical ports from a non-local PMD thread incurs some
> +          performance penalty due to the access to packet data across the NUMA
> +          barrier. This option can still increase the overall performance if
> +          it allows better utilization of those non-local PMD threads.
> +          It is most useful together with the auto load-balancing of RX queues
> +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>.
> +        </p>
> +        <p>
> +          Defaults to false.
> +        </p>
> +      </column>
> +
>        <column name="options" key="xdp-mode"
>                type='{"type": "string",
>                       "enum": ["set", ["best-effort", "native-with-zerocopy",
> -- 
> 2.25.1
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Anurag Agarwal June 22, 2022, 9:32 a.m. UTC | #3
Thanks for your feedback. Please find my comments inline. 

Regards,
Anurag

> -----Original Message-----
> From: lic121@chinatelecom.cn <lic121@chinatelecom.cn>
> Sent: Friday, June 17, 2022 2:29 PM
> To: Anurag Agarwal <anurag2k@gmail.com>
> Cc: ovs-dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on
> selected ports
> 
> On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> > From: Jan Scheurich <jan.scheurich@ericsson.com>
> >
> > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > automatic assignment of the rxqs of a port only if there are no local,non-
> isolated PMDs.
> >
> > On typical servers with both physical ports on one NUMA node, this
> > often leaves the PMDs on the other NUMA node under-utilized, wasting CPU
> resources.
> > The alternative, to manually pin the rxqs to PMDs on remote NUMA
> > nodes, also has drawbacks as it limits OVS' ability to auto load-balance the
> rxqs.
> >
> > This patch introduces a new interface configuration option to allow
> > ports to be automatically polled by PMDs on any NUMA node:
> >
> > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> >
> > The group assignment algorithm now has the ability to select lowest
> > loaded PMD on any NUMA, and not just the local NUMA on which the rxq
> > of the port resides
> >
> > If this option is not present or set to false, legacy behaviour applies.
> >
> > Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> > Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > ---
> >
> > Changes in this patch:
> > - Addressed comments from Kevin Traynor
> >
> > Please refer this thread for an earlier discussion on this topic:
> > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> > ---
> >  Documentation/topics/dpdk/pmd.rst |  23 ++++++
> >  lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
> >  tests/pmd.at                      |  38 +++++++++
> >  vswitchd/vswitch.xml              |  20 +++++
> >  4 files changed, 177 insertions(+), 34 deletions(-)
> >
> > diff --git a/Documentation/topics/dpdk/pmd.rst
> > b/Documentation/topics/dpdk/pmd.rst
> > index b259cc8b3..387f962d1 100644
> > --- a/Documentation/topics/dpdk/pmd.rst
> > +++ b/Documentation/topics/dpdk/pmd.rst
> > @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> >
> >      $ ovs-appctl dpif-netdev/pmd-rxq-show
> >
> > +Normally, Rx queues are assigned to PMD threads automatically.  By
> > +default OVS only assigns Rx queues to PMD threads executing on the
> > +same NUMA node in order to avoid unnecessary latency for accessing
> > +packet buffers across the NUMA boundary.  Typically this overhead is
> > +higher for vhostuser ports than for physical ports due to the packet
> > +copy that is done for all rx packets.
> > +
> > +On NUMA servers with physical ports only on one NUMA node, the
> > +NUMA-local polling policy can lead to an under-utilization of the PMD
> > +threads on the remote NUMA node.  For the overall OVS performance it
> > +may in such cases be beneficial to utilize the spare capacity and
> > +allow polling of a physical port's rxqs across NUMA nodes despite the
> overhead involved.
> > +The policy can be set per port with the following configuration option::
> > +
> > +    $ ovs-vsctl set Interface <iface> \
> > +        other_config:cross-numa-polling=true|false
> > +
> > +The default value is false.
> > +
> >  .. note::
> >
> >     A history of one minute is recorded and shown for each Rx queue to
> > allow for @@ -115,6 +134,10 @@ core cycles for each Rx queue::
> >     A ``overhead`` statistics is shown per PMD: it represents the number of
> >     cycles inherently consumed by the OVS PMD processing loop.
> >
> > +.. versionchanged:: 2.18.0
> > +
> > +   Added the interface parameter ``other_config:cross-numa-polling``
> > +
> >  Rx queue to PMD assignment takes place whenever there are
> > configuration changes  or can be triggered by using::
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index ff57b3961..86f88964b 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -467,6 +467,7 @@ struct dp_netdev_port {
> >      char *type;                 /* Port type as requested by user. */
> >      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
> >      enum txq_req_mode txq_requested_mode;
> > +    bool cross_numa_polling;    /* If true cross polling will be enabled. */
> >  };
> >
> >  static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> > @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
> >      port->sf = NULL;
> >      port->emc_enabled = true;
> >      port->need_reconfigure = true;
> > +    port->cross_numa_polling = false;
> >      ovs_mutex_init(&port->txq_used_mutex);
> >
> >      *portp = port;
> > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
> >      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> >      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> >      enum txq_req_mode txq_mode;
> > +    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
> >
> >      ovs_rwlock_wrlock(&dp->port_rwlock);
> >      error = get_port_by_number(dp, port_no, &port);
> > @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
> >          goto unlock;
> >      }
> >
> > +    if (cross_numa_polling != port->cross_numa_polling) {
> > +        port->cross_numa_polling = cross_numa_polling;
> > +        VLOG_INFO("%s: cross-numa-polling has been %s.",
> > +                  netdev_get_name(port->netdev),
> > +                  cross_numa_polling ? "enabled" : "disabled");
> > +        dp_netdev_request_reconfigure(dp);
> > +    }
> > +
> >      if (emc_enabled != port->emc_enabled) {
> >          struct dp_netdev_pmd_thread *pmd;
> >          struct ds ds = DS_EMPTY_INITIALIZER;
> > @@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
> >
> >  static bool
> >  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> > -                     bool has_proc)
> > +                     int port_numa_id, bool has_proc)
> >  {
> >      uint64_t current_num, pmd_num;
> >
> > @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> >          pmd_num = pmd->n_rxq;
> >      }
> >
> > -    if (pmd_num < current_num) {
> > -        return true;
> > +    if (pmd_num != current_num) {
> > +        return (pmd_num < current_num) ? true : false;
> >      }
> > -    return false;
> > +
> > +    /* In case of a tie, select the new PMD only if the existing PMD
> > +     * assigned is from non-local NUMA
> > +     */
> > +    return (current_lowest->numa->numa_id != port_numa_id &&
> > +            pmd->numa->numa_id == port_numa_id) ? true : false;
> > +
> >  }
> >
> >  static struct sched_pmd *
> >  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
> >  {
> > -    struct sched_pmd *lowest_sched_pmd = NULL;
> > +    struct sched_pmd *lowest_pmd = NULL;
> >
> >      for (unsigned i = 0; i < numa->n_pmds; i++) {
> >          struct sched_pmd *sched_pmd;
> > @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
> >          if (sched_pmd->isolated) {
> >              continue;
> >          }
> > -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> > -            lowest_sched_pmd = sched_pmd;
> > +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
> > +            lowest_pmd = sched_pmd;
> >          }
> >      }
> > -    return lowest_sched_pmd;
> > +    return lowest_pmd;
> >  }
> >
> >  /*
> > @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
> >      return a;
> >  }
> >
> > +static struct sched_pmd *
> > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> > +                              int port_numa_id, bool has_proc)
> > +{
> > +    int n_numa;
> > +    struct sched_numa *numa = NULL;
> > +    struct sched_numa *last_numa = NULL;
> > +    struct sched_pmd *lowest_pmd = NULL;
> > +    struct sched_pmd *pmd;
> > +
> > +    n_numa = sched_numa_list_count(numa_list);
> > +    /* For all numas. */
> > +    for (int i = 0; i < n_numa; i++) {
> > +        last_numa = numa;
> > +        numa = sched_numa_list_next(numa_list, last_numa);
> > +
> > +        /* Get the lowest pmd per numa. */
> > +        pmd = sched_pmd_get_lowest(numa, has_proc);
> > +
> > +        /* Check if it's the lowest pmd for all numas. */
> > +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> > +            lowest_pmd = pmd;
> > +        }
> > +    }
> > +    return lowest_pmd;
> > +}
> > +
> >  static void
> >  sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >                           struct dp_netdev *dp,
> > @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >          struct sched_pmd *sched_pmd = NULL;
> >          struct sched_numa *numa;
> >          int port_numa_id;
> > +        bool cross_numa = rxq->port->cross_numa_polling;
> 
> It may be better to schedule non-cross-numa rxqs first, and then cross-numa
> rxqs. Otherwise, one numa may hold much more load because of later scheduled
> non-cross-numa rxqs.


For an RxQ assignment, the least loaded PMD is chosen; it could be from NUMA0 or NUMA1, depending on where the least loaded PMD is present.
The RxQs themselves are sorted by measured cycles and assigned one after the other. They are not assigned in any specific NUMA order.

> 
> >          uint64_t proc_cycles;
> >          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
> >
> > @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >              start_logged = true;
> >          }
> >
> > -        /* Store the cycles for this rxq as we will log these later. */
> > +        /* Store the cycles for this rxq as we will need these later. */
> >          proc_cycles = dp_netdev_rxq_get_cycles(rxq,
> >                                                 RXQ_CYCLES_PROC_HIST);
> >
> >          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
> >
> > -        /* Select numa. */
> > -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > -
> > -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> > -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > -            /* Unable to use this numa to find a PMD. */
> > -            numa = NULL;
> > -            /* Find any numa with available PMDs. */
> > -            for (int j = 0; j < n_numa; j++) {
> > -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> > -                last_cross_numa = numa;
> > -                if (sched_numa_noniso_pmd_count(numa)) {
> > -                    break;
> > -                }
> > +        if (cross_numa && algo == SCHED_GROUP) {
> > +            /* cross_numa polling enabled so find lowest loaded pmd across
> > +             * all numas. */
> > +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> > +                                                      proc_cycles);
> > +        } else {
> > +            /* Select numa. */
> > +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > +
> > +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> > +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > +                /* Unable to use this numa to find a PMD. */
> >                  numa = NULL;
> > +                /* Find any numa with available PMDs. */
> > +                for (int j = 0; j < n_numa; j++) {
> > +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> > +                    last_cross_numa = numa;
> > +                    if (sched_numa_noniso_pmd_count(numa)) {
> > +                        break;
> > +                    }
> > +                    numa = NULL;
> > +                }
> >              }
> > -        }
> >
> > -        if (numa) {
> > -            /* Select the PMD that should be used for this rxq. */
> > -            sched_pmd = sched_pmd_next(numa, algo,
> > -                                       proc_cycles ? true : false);
> > +            if (numa) {
> > +                /* Select the PMD that should be used for this rxq. */
> > +                sched_pmd = sched_pmd_next(numa, algo,
> > +                                           proc_cycles ? true : false);
> > +            }
> >          }
> >
> >          /* Check that a pmd has been selected. */
> > @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >              pmd_numa_id = sched_pmd->numa->numa_id;
> >              /* Check if selected pmd numa matches port numa. */
> >              if (pmd_numa_id != port_numa_id) {
> > -                VLOG(level, "There's no available (non-isolated) pmd thread "
> > -                            "on numa node %d. Port \'%s\' rx queue %d will "
> > -                            "be assigned to a pmd on numa node %d. "
> > -                            "This may lead to reduced performance.",
> > -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> > -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> > +                if (cross_numa) {
> > +                    VLOG(level, "Cross-numa polling has been selected for "
> > +                                "Port \'%s\' rx queue %d on numa node %d. "
> > +                                "It will be assigned to a pmd on numa node "
> > +                                "%d. This may lead to reduced performance.",
> > +                                netdev_rxq_get_name(rxq->rx),
> > +                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> > +                                pmd_numa_id);
> > +
> > +                } else {
> > +                    VLOG(level, "There's no available (non-isolated) pmd "
> > +                                "thread on numa node %d. Port \'%s\' rx queue "
> > +                                "%d will be assigned to a pmd on numa node "
> > +                                "%d. This may lead to reduced performance.",
> > +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> > +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> > +                }
> >              }
> >              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
> >                          "rx queue %d%s.",
> > diff --git a/tests/pmd.at b/tests/pmd.at
> > index e6b173dab..5d2ec1382 100644
> > --- a/tests/pmd.at
> > +++ b/tests/pmd.at
> > @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
> >  OVS_VSWITCHD_STOP
> >  AT_CLEANUP
> >
> > +AT_SETUP([PMD - enable cross numa polling])
> > +OVS_VSWITCHD_START([add-port br0 p0 \
> > +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> > +                    -- set Interface p0 options:numa_id=0 \
> > +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> > +                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
> > +                   [], [], [--dummy-numa 0,1])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > +0
> > +])
> > +
> > +dnl Enable cross numa polling and check numa ids
> > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > +PATTERN="cross-numa-polling has been enabled"
> > +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > +
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > +0
> > +1
> > +])
> > +
> > +dnl Disable cross numa polling and check numa ids
> > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > +PATTERN="cross-numa-polling has been disabled"
> > +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > +
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > +0
> > +])
> > +
> > +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> > +AT_CLEANUP
> > +
> > +
> >  AT_SETUP([PMD - change numa node])
> >  OVS_VSWITCHD_START(
> >    [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index cc1dd77ec..c2e948bd9 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> type=patch options:peer=p1 \
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="cross-numa-polling"
> > +                                  type='{"type": "boolean"}'>
> > +        <p>
> > +          Specifies if the RX queues of the port can be automatically assigned
> > +          to PMD threads on any NUMA node or only on the local NUMA node of
> > +          the port.
> > +        </p>
> > +        <p>
> > +          Polling of physical ports from a non-local PMD thread incurs some
> > +          performance penalty due to the access to packet data across the NUMA
> > +          barrier. This option can still increase the overall performance if
> > +          it allows better utilization of those non-local PMD threads.
> > +          It is most useful together with the auto load-balancing of RX queues
> > +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>.
> > +        </p>
> > +        <p>
> > +          Defaults to false.
> > +        </p>
> > +      </column>
> > +
> >        <column name="options" key="xdp-mode"
> >                type='{"type": "string",
> >                       "enum": ["set", ["best-effort",
> > "native-with-zerocopy",
> > --
> > 2.25.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Cheng Li June 22, 2022, 10:49 a.m. UTC | #4
On Wed, Jun 22, 2022 at 09:32:29AM +0000, Anurag Agarwal wrote:
> Thanks for your feedback. Please find my comments inline. 
> 
> Regards,
> Anurag
> 
> > -----Original Message-----
> > From: lic121@chinatelecom.cn <lic121@chinatelecom.cn>
> > Sent: Friday, June 17, 2022 2:29 PM
> > To: Anurag Agarwal <anurag2k@gmail.com>
> > Cc: ovs-dev@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on
> > selected ports
> > 
> > On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> > > From: Jan Scheurich <jan.scheurich@ericsson.com>
> > >
> > > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > > automatic assignment of the rxqs of a port only if there are no local,non-
> > isolated PMDs.
> > >
> > > On typical servers with both physical ports on one NUMA node, this
> > > often leaves the PMDs on the other NUMA node under-utilized, wasting CPU
> > resources.
> > > The alternative, to manually pin the rxqs to PMDs on remote NUMA
> > > nodes, also has drawbacks as it limits OVS' ability to auto load-balance the
> > rxqs.
> > >
> > > This patch introduces a new interface configuration option to allow
> > > ports to be automatically polled by PMDs on any NUMA node:
> > >
> > > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> > >
> > > The group assignment algorithm now has the ability to select lowest
> > > loaded PMD on any NUMA, and not just the local NUMA on which the rxq
> > > of the port resides
> > >
> > > If this option is not present or set to false, legacy behaviour applies.
> > >
> > > Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > > Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> > > Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > > ---
> > >
> > > Changes in this patch:
> > > - Addressed comments from Kevin Traynor
> > >
> > > Please refer this thread for an earlier discussion on this topic:
> > > https://protect2.fireeye.com/v1/url?k=31323334-501cfaf3-313273af-45444
> > > 5554331-165fefd0f093b353&q=1&e=d638a27f-b902-46d8-883e-
> > ed0c226be33c&u=
> > > https%3A%2F%2Fmail.openvswitch.org%2Fpipermail%2Fovs-dev%2F2022-
> > March%
> > > 2F392310.html
> > > ---
> > >  Documentation/topics/dpdk/pmd.rst |  23 ++++++
> > >  lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
> > >  tests/pmd.at                      |  38 +++++++++
> > >  vswitchd/vswitch.xml              |  20 +++++
> > >  4 files changed, 177 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/Documentation/topics/dpdk/pmd.rst
> > > b/Documentation/topics/dpdk/pmd.rst
> > > index b259cc8b3..387f962d1 100644
> > > --- a/Documentation/topics/dpdk/pmd.rst
> > > +++ b/Documentation/topics/dpdk/pmd.rst
> > > @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> > >
> > >      $ ovs-appctl dpif-netdev/pmd-rxq-show
> > >
> > > +Normally, Rx queues are assigned to PMD threads automatically.  By
> > > +default OVS only assigns Rx queues to PMD threads executing on the
> > > +same NUMA node in order to avoid unnecessary latency for accessing
> > > +packet buffers across the NUMA boundary.  Typically this overhead is
> > > +higher for vhostuser ports than for physical ports due to the packet
> > > +copy that is done for all rx packets.
> > > +
> > > +On NUMA servers with physical ports only on one NUMA node, the
> > > +NUMA-local polling policy can lead to an under-utilization of the PMD
> > > +threads on the remote NUMA node.  For the overall OVS performance it
> > > +may in such cases be beneficial to utilize the spare capacity and
> > > +allow polling of a physical port's rxqs across NUMA nodes despite the
> > overhead involved.
> > > +The policy can be set per port with the following configuration option::
> > > +
> > > +    $ ovs-vsctl set Interface <iface> \
> > > +        other_config:cross-numa-polling=true|false
> > > +
> > > +The default value is false.
> > > +
> > >  .. note::
> > >
> > >     A history of one minute is recorded and shown for each Rx queue to
> > > allow for @@ -115,6 +134,10 @@ core cycles for each Rx queue::
> > >     A ``overhead`` statistics is shown per PMD: it represents the number of
> > >     cycles inherently consumed by the OVS PMD processing loop.
> > >
> > > +.. versionchanged:: 2.18.0
> > > +
> > > +   Added the interface parameter ``other_config:cross-numa-polling``
> > > +
> > >  Rx queue to PMD assignment takes place whenever there are
> > > configuration changes  or can be triggered by using::
> > >
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > > ff57b3961..86f88964b 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -467,6 +467,7 @@ struct dp_netdev_port {
> > >      char *type;                 /* Port type as requested by user. */
> > >      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
> > >      enum txq_req_mode txq_requested_mode;
> > > +    bool cross_numa_polling;    /* If true cross polling will be enabled. */
> > >  };
> > >
> > >  static bool dp_netdev_flow_ref(struct dp_netdev_flow *); @@ -2101,6
> > > +2102,7 @@ port_create(const char *devname, const char *type,
> > >      port->sf = NULL;
> > >      port->emc_enabled = true;
> > >      port->need_reconfigure = true;
> > > +    port->cross_numa_polling = false;
> > >      ovs_mutex_init(&port->txq_used_mutex);
> > >
> > >      *portp = port;
> > > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif,
> > odp_port_t port_no,
> > >      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> > >      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> > >      enum txq_req_mode txq_mode;
> > > +    bool cross_numa_polling = smap_get_bool(cfg,
> > > + "cross-numa-polling", false);
> > >
> > >      ovs_rwlock_wrlock(&dp->port_rwlock);
> > >      error = get_port_by_number(dp, port_no, &port); @@ -5020,6
> > > +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t
> > port_no,
> > >          goto unlock;
> > >      }
> > >
> > > +    if (cross_numa_polling != port->cross_numa_polling) {
> > > +        port->cross_numa_polling = cross_numa_polling;
> > > +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> > > +                  netdev_get_name(port->netdev),
> > > +                  cross_numa_polling? "enabled" : "disabled");
> > > +        dp_netdev_request_reconfigure(dp);
> > > +    }
> > > +
> > >      if (emc_enabled != port->emc_enabled) {
> > >          struct dp_netdev_pmd_thread *pmd;
> > >          struct ds ds = DS_EMPTY_INITIALIZER; @@ -5751,7 +5762,7 @@
> > > compare_rxq_cycles(const void *a, const void *b)
> > >
> > >  static bool
> > >  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct
> > sched_pmd *pmd,
> > > -                     bool has_proc)
> > > +                     int port_numa_id, bool has_proc)
> > >  {
> > >      uint64_t current_num, pmd_num;
> > >
> > > @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd
> > *current_lowest, struct sched_pmd *pmd,
> > >          pmd_num = pmd->n_rxq;
> > >      }
> > >
> > > -    if (pmd_num < current_num) {
> > > -        return true;
> > > +    if (pmd_num != current_num) {
> > > +        return (pmd_num < current_num) ? true : false;
> > >      }
> > > -    return false;
> > > +
> > > +    /* In case of a tie, select the new PMD only if the existing PMD
> > > +     * assigned is from non-local NUMA
> > > +     */
> > > +    return (current_lowest->numa->numa_id != port_numa_id &&
> > > +            pmd->numa->numa_id == port_numa_id) ? true : false;
> > > +
> > >  }
> > >
> > >  static struct sched_pmd *
> > >  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)  {
> > > -    struct sched_pmd *lowest_sched_pmd = NULL;
> > > +    struct sched_pmd *lowest_pmd = NULL;
> > >
> > >      for (unsigned i = 0; i < numa->n_pmds; i++) {
> > >          struct sched_pmd *sched_pmd;
> > > @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa
> > *numa, bool has_cyc)
> > >          if (sched_pmd->isolated) {
> > >              continue;
> > >          }
> > > -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> > > -            lowest_sched_pmd = sched_pmd;
> > > +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX,
> > has_cyc)) {
> > > +            lowest_pmd = sched_pmd;
> > >          }
> > >      }
> > > -    return lowest_sched_pmd;
> > > +    return lowest_pmd;
> > >  }
> > >
> > >  /*
> > > @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
> > >      return a;
> > >  }
> > >
> > > +static struct sched_pmd *
> > > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> > > +                              int port_numa_id, bool has_proc)
> > > +{
> > > +    int n_numa;
> > > +    struct sched_numa *numa = NULL;
> > > +    struct sched_numa *last_numa = NULL;
> > > +    struct sched_pmd *lowest_pmd = NULL;
> > > +    struct sched_pmd *pmd;
> > > +
> > > +    n_numa = sched_numa_list_count(numa_list);
> > > +    /* For all numas. */
> > > +    for (int i = 0; i < n_numa; i++) {
> > > +        last_numa = numa;
> > > +        numa = sched_numa_list_next(numa_list, last_numa);
> > > +
> > > +        /* Get the lowest pmd per numa. */
> > > +        pmd = sched_pmd_get_lowest(numa, has_proc);
> > > +
> > > +        /* Check if it's the lowest pmd for all numas. */
> > > +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> > > +            lowest_pmd = pmd;
> > > +        }
> > > +    }
> > > +    return lowest_pmd;
> > > +}
> > > +
> > >  static void
> > >  sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >                           struct dp_netdev *dp,
> > > @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >          struct sched_pmd *sched_pmd = NULL;
> > >          struct sched_numa *numa;
> > >          int port_numa_id;
> > > +        bool cross_numa = rxq->port->cross_numa_polling;
> > 
> > It may be better to schedule non-cross-numa rxqs first, and then cross-numa
> > rxqs.
> > Otherwise, one numa may hold much more load because of the later-scheduled
> > non-cross-numa rxqs.
> 
> 
> For an RxQ assignment, the least loaded PMD is chosen; it could be from NUMA0
> or NUMA1, depending on where the least loaded PMD is present.
> The RxQs themselves are sorted and assigned one after the other. They are not
> assigned in any specific NUMA order.

Consider the following situation:

We have 2 numa nodes, each numa node has 1 pmd.
And we have 2 ports (p0, p1), each with 2 queues.
p0 is configured as cross_numa, p1 is not.

Each queue's workload:

rxq   p0q0 p0q1 p1q0 p1q1 
load  30   30   20   20

Based on your current implementation, the assignment will be:

p0q0 -> numa0 (30)
p0q1 -> numa1 (30)
p1q0 -> numa0 (30+20=50)
p1q1 -> numa0 (50+20=70)

As a result, numa0 holds 70% of the workload while numa1 holds only 30%,
because the later-assigned queues have NUMA affinity.

> 
> > 
> > >          uint64_t proc_cycles;
> > >          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
> > >
> > > @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >              start_logged = true;
> > >          }
> > >
> > > -        /* Store the cycles for this rxq as we will log these later. */
> > > +        /* Store the cycles for this rxq as we will need these later. */
> > >          proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
> > >
> > >          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
> > >
> > > -        /* Select numa. */
> > > -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > > -
> > > -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> > > -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > > -            /* Unable to use this numa to find a PMD. */
> > > -            numa = NULL;
> > > -            /* Find any numa with available PMDs. */
> > > -            for (int j = 0; j < n_numa; j++) {
> > > -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> > > -                last_cross_numa = numa;
> > > -                if (sched_numa_noniso_pmd_count(numa)) {
> > > -                    break;
> > > -                }
> > > +        if (cross_numa && algo == SCHED_GROUP) {
> > > +            /* cross_numa polling enabled so find lowest loaded pmd across
> > > +             * all numas. */
> > > +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> > > +                                                      proc_cycles);
> > > +        } else {
> > > +            /* Select numa. */
> > > +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > > +
> > > +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> > > +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > > +                /* Unable to use this numa to find a PMD. */
> > >                  numa = NULL;
> > > +                /* Find any numa with available PMDs. */
> > > +                for (int j = 0; j < n_numa; j++) {
> > > +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> > > +                    last_cross_numa = numa;
> > > +                    if (sched_numa_noniso_pmd_count(numa)) {
> > > +                        break;
> > > +                    }
> > > +                    numa = NULL;
> > > +                }
> > >              }
> > > -        }
> > >
> > > -        if (numa) {
> > > -            /* Select the PMD that should be used for this rxq. */
> > > -            sched_pmd = sched_pmd_next(numa, algo,
> > > -                                       proc_cycles ? true : false);
> > > +            if (numa) {
> > > +                /* Select the PMD that should be used for this rxq. */
> > > +                sched_pmd = sched_pmd_next(numa, algo,
> > > +                                           proc_cycles ? true : false);
> > > +            }
> > >          }
> > >
> > >          /* Check that a pmd has been selected. */
> > > @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> > >              pmd_numa_id = sched_pmd->numa->numa_id;
> > >              /* Check if selected pmd numa matches port numa. */
> > >              if (pmd_numa_id != port_numa_id) {
> > > -                VLOG(level, "There's no available (non-isolated) pmd thread "
> > > -                            "on numa node %d. Port \'%s\' rx queue %d will "
> > > -                            "be assigned to a pmd on numa node %d. "
> > > -                            "This may lead to reduced performance.",
> > > -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> > > -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> > > +                if (cross_numa) {
> > > +                    VLOG(level, "Cross-numa polling has been selected for "
> > > +                                "Port \'%s\' rx queue %d on numa node %d. "
> > > +                                "It will be assigned to a pmd on numa node "
> > > +                                "%d. This may lead to reduced performance.",
> > > +                                netdev_rxq_get_name(rxq->rx),
> > > +                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> > > +                                pmd_numa_id);
> > > +
> > > +                } else {
> > > +                    VLOG(level, "There's no available (non-isolated) pmd "
> > > +                                "thread on numa node %d. Port \'%s\' rx queue "
> > > +                                "%d will be assigned to a pmd on numa node "
> > > +                                "%d. This may lead to reduced performance.",
> > > +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> > > +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> > > +                }
> > >              }
> > >              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
> > >                          "rx queue %d%s.",
> > > diff --git a/tests/pmd.at b/tests/pmd.at
> > > index e6b173dab..5d2ec1382 100644
> > > --- a/tests/pmd.at
> > > +++ b/tests/pmd.at
> > > @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
> > >  OVS_VSWITCHD_STOP
> > >  AT_CLEANUP
> > >
> > > +AT_SETUP([PMD - enable cross numa polling])
> > > +OVS_VSWITCHD_START([add-port br0 p0 \
> > > +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> > > +                    -- set Interface p0 options:numa_id=0 \
> > > +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> > > +                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
> > > +                   [], [], [--dummy-numa 0,1])
> > > +
> > > +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > > +0
> > > +])
> > > +
> > > +dnl Enable cross numa polling and check numa ids
> > > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > > +PATTERN="cross-numa-polling has been enabled"
> > > +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > > +
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > > +0
> > > +1
> > > +])
> > > +
> > > +dnl Disable cross numa polling and check numa ids
> > > +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> > > +PATTERN="cross-numa-polling has been disabled"
> > > +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> > > +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> > > +
> > > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > > +0
> > > +])
> > > +
> > > +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> > > +AT_CLEANUP
> > > +
> > > +
> > >  AT_SETUP([PMD - change numa node])
> > >  OVS_VSWITCHD_START(
> > >    [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> > > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > > index cc1dd77ec..c2e948bd9 100644
> > > --- a/vswitchd/vswitch.xml
> > > +++ b/vswitchd/vswitch.xml
> > > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> > >          </p>
> > >        </column>
> > >
> > > +      <column name="other_config" key="cross-numa-polling"
> > > +                                  type='{"type": "boolean"}'>
> > > +        <p>
> > > +          Specifies if the RX queues of the port can be automatically assigned
> > > +          to PMD threads on any NUMA node or only on the local NUMA node of
> > > +          the port.
> > > +        </p>
> > > +        <p>
> > > +          Polling of physical ports from a non-local PMD thread incurs some
> > > +          performance penalty due to the access to packet data across the NUMA
> > > +          barrier. This option can still increase the overall performance if
> > > +          it allows better utilization of those non-local PMD threads.
> > > +          It is most useful together with the auto load-balancing of RX queues
> > > +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>.
> > > +        </p>
> > > +        <p>
> > > +          Defaults to false.
> > > +        </p>
> > > +      </column>
> > > +
> > >        <column name="options" key="xdp-mode"
> > >                type='{"type": "string",
> > >                       "enum": ["set", ["best-effort",
> > > "native-with-zerocopy",
> > > --
> > > 2.25.1
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Anurag Agarwal June 23, 2022, 10:17 a.m. UTC | #5
Thanks for your feedback. I have addressed your comments and shall post the updated patch shortly. 

Regards,
Anurag

> -----Original Message-----
> From: lic121@chinatelecom.cn <lic121@chinatelecom.cn>
> Sent: Wednesday, June 22, 2022 4:19 PM
> To: Anurag Agarwal <anurag.agarwal@ericsson.com>
> Cc: Anurag Agarwal <anurag2k@gmail.com>; ovs-dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA polling on
> selected ports
> 
> On Wed, Jun 22, 2022 at 09:32:29AM +0000, Anurag Agarwal wrote:
> > Thanks for your feedback. Please find my comments inline.
> >
> > Regards,
> > Anurag
> >
> > > -----Original Message-----
> > > From: lic121@chinatelecom.cn <lic121@chinatelecom.cn>
> > > Sent: Friday, June 17, 2022 2:29 PM
> > > To: Anurag Agarwal <anurag2k@gmail.com>
> > > Cc: ovs-dev@openvswitch.org
> > > Subject: Re: [ovs-dev] [PATCH v4] dpif-netdev: Allow cross-NUMA
> > > polling on selected ports
> > >
> > > On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> > > > From: Jan Scheurich <jan.scheurich@ericsson.com>
> > > >
> > > > Today dpif-netdev considers PMD threads on a non-local NUMA node
> > > > for automatic assignment of the rxqs of a port only if there are
> > > > no local, non-isolated PMDs.
> > > >
> > > > On typical servers with both physical ports on one NUMA node, this
> > > > often leaves the PMDs on the other NUMA node under-utilized,
> > > > wasting CPU resources.
> > > > The alternative, to manually pin the rxqs to PMDs on remote NUMA
> > > > nodes, also has drawbacks as it limits OVS' ability to auto
> > > > load-balance the rxqs.
> > > >
> > > > This patch introduces a new interface configuration option to
> > > > allow ports to be automatically polled by PMDs on any NUMA node:
> > > >
> > > > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> > > >
> > > > The group assignment algorithm now has the ability to select
> > > > lowest loaded PMD on any NUMA, and not just the local NUMA on
> > > > which the rxq of the port resides
> > > >
> > > > If this option is not present or set to false, legacy behaviour applies.
> > > >
> > > > Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > > > Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> > > > Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > > > ---
> > > >
> > > > Changes in this patch:
> > > > - Addressed comments from Kevin Traynor
> > > >
> > > > Please refer this thread for an earlier discussion on this topic:
> > > > https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> > > > ---
> > > >  Documentation/topics/dpdk/pmd.rst |  23 ++++++
> > > >  lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
> > > >  tests/pmd.at                      |  38 +++++++++
> > > >  vswitchd/vswitch.xml              |  20 +++++
> > > >  4 files changed, 177 insertions(+), 34 deletions(-)
> > > >
> > > > diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> > > > index b259cc8b3..387f962d1 100644
> > > > --- a/Documentation/topics/dpdk/pmd.rst
> > > > +++ b/Documentation/topics/dpdk/pmd.rst
> > > > @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> > > >
> > > >      $ ovs-appctl dpif-netdev/pmd-rxq-show
> > > >
> > > > +Normally, Rx queues are assigned to PMD threads automatically.
> > > > +By default OVS only assigns Rx queues to PMD threads executing on
> > > > +the same NUMA node in order to avoid unnecessary latency for
> > > > +accessing packet buffers across the NUMA boundary.  Typically
> > > > +this overhead is higher for vhostuser ports than for physical
> > > > +ports due to the packet copy that is done for all rx packets.
> > > > +
> > > > +On NUMA servers with physical ports only on one NUMA node, the
> > > > +NUMA-local polling policy can lead to an under-utilization of the
> > > > +PMD threads on the remote NUMA node.  For the overall OVS
> > > > +performance it may in such cases be beneficial to utilize the
> > > > +spare capacity and allow polling of a physical port's rxqs across
> > > > +NUMA nodes despite the overhead involved.
> > > > +The policy can be set per port with the following configuration option::
> > > > +
> > > > +    $ ovs-vsctl set Interface <iface> \
> > > > +        other_config:cross-numa-polling=true|false
> > > > +
> > > > +The default value is false.
> > > > +
> > > >  .. note::
> > > >
> > > >     A history of one minute is recorded and shown for each Rx queue to allow for
> > > > @@ -115,6 +134,10 @@ core cycles for each Rx queue::
> > > >     A ``overhead`` statistics is shown per PMD: it represents the number of
> > > >     cycles inherently consumed by the OVS PMD processing loop.
> > > >
> > > > +.. versionchanged:: 2.18.0
> > > > +
> > > > +   Added the interface parameter ``other_config:cross-numa-polling``
> > > > +
> > > >  Rx queue to PMD assignment takes place whenever there are
> > > > configuration changes or can be triggered by using::
> > > >
> > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > > index ff57b3961..86f88964b 100644
> > > > --- a/lib/dpif-netdev.c
> > > > +++ b/lib/dpif-netdev.c
> > > > @@ -467,6 +467,7 @@ struct dp_netdev_port {
> > > >      char *type;                 /* Port type as requested by user. */
> > > >      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
> > > >      enum txq_req_mode txq_requested_mode;
> > > > +    bool cross_numa_polling;    /* If true cross polling will be enabled. */
> > > >  };
> > > >
> > > >  static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> > > > @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
> > > >      port->sf = NULL;
> > > >      port->emc_enabled = true;
> > > >      port->need_reconfigure = true;
> > > > +    port->cross_numa_polling = false;
> > > >      ovs_mutex_init(&port->txq_used_mutex);
> > > >
> > > >      *portp = port;
> > > > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
> > > >      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> > > >      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> > > >      enum txq_req_mode txq_mode;
> > > > +    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
> > > >
> > > >      ovs_rwlock_wrlock(&dp->port_rwlock);
> > > >      error = get_port_by_number(dp, port_no, &port);
> > > > @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
> > > >          goto unlock;
> > > >      }
> > > >
> > > > +    if (cross_numa_polling != port->cross_numa_polling) {
> > > > +        port->cross_numa_polling = cross_numa_polling;
> > > > +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> > > > +                  netdev_get_name(port->netdev),
> > > > +                  cross_numa_polling? "enabled" : "disabled");
> > > > +        dp_netdev_request_reconfigure(dp);
> > > > +    }
> > > > +
diff mbox series

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..387f962d1 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -99,6 +99,25 @@  core cycles for each Rx queue::
 
     $ ovs-appctl dpif-netdev/pmd-rxq-show
 
+Normally, Rx queues are assigned to PMD threads automatically.  By default,
+OVS only assigns Rx queues to PMD threads executing on the same NUMA
+node in order to avoid the extra latency of accessing packet buffers
+across the NUMA boundary.  Typically this overhead is higher for vhostuser
+ports than for physical ports due to the packet copy that is done for all
+Rx packets.
+
+On NUMA servers with physical ports only on one NUMA node, the NUMA-local
+polling policy can leave the PMD threads on the remote NUMA node
+under-utilized.  In such cases it may be beneficial for overall OVS
+performance to use that spare capacity and allow polling of a physical
+port's Rx queues across NUMA nodes despite the overhead involved.
+The policy can be set per port with the following configuration option::
+
+    $ ovs-vsctl set Interface <iface> \
+        other_config:cross-numa-polling=true|false
+
+The default value is false.
+
 .. note::
 
    A history of one minute is recorded and shown for each Rx queue to allow for
@@ -115,6 +134,10 @@  core cycles for each Rx queue::
    A ``overhead`` statistics is shown per PMD: it represents the number of
    cycles inherently consumed by the OVS PMD processing loop.
 
+.. versionchanged:: 2.18.0
+
+   Added the interface parameter ``other_config:cross-numa-polling``.
+
 Rx queue to PMD assignment takes place whenever there are configuration changes
 or can be triggered by using::
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index ff57b3961..86f88964b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -467,6 +467,7 @@  struct dp_netdev_port {
     char *type;                 /* Port type as requested by user. */
     char *rxq_affinity_list;    /* Requested affinity of rx queues. */
     enum txq_req_mode txq_requested_mode;
+    bool cross_numa_polling;    /* If true cross polling will be enabled. */
 };
 
 static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
@@ -2101,6 +2102,7 @@  port_create(const char *devname, const char *type,
     port->sf = NULL;
     port->emc_enabled = true;
     port->need_reconfigure = true;
+    port->cross_numa_polling = false;
     ovs_mutex_init(&port->txq_used_mutex);
 
     *portp = port;
@@ -5013,6 +5015,7 @@  dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
     bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
     const char *tx_steering_mode = smap_get(cfg, "tx-steering");
     enum txq_req_mode txq_mode;
+    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
 
     ovs_rwlock_wrlock(&dp->port_rwlock);
     error = get_port_by_number(dp, port_no, &port);
@@ -5020,6 +5023,14 @@  dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
         goto unlock;
     }
 
+    if (cross_numa_polling != port->cross_numa_polling) {
+        port->cross_numa_polling = cross_numa_polling;
+        VLOG_INFO("%s: cross-numa-polling has been %s.",
+                  netdev_get_name(port->netdev),
+                  cross_numa_polling ? "enabled" : "disabled");
+        dp_netdev_request_reconfigure(dp);
+    }
+
     if (emc_enabled != port->emc_enabled) {
         struct dp_netdev_pmd_thread *pmd;
         struct ds ds = DS_EMPTY_INITIALIZER;
@@ -5751,7 +5762,7 @@  compare_rxq_cycles(const void *a, const void *b)
 
 static bool
 sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
-                     bool has_proc)
+                     int port_numa_id, bool has_proc)
 {
     uint64_t current_num, pmd_num;
 
@@ -5767,16 +5778,22 @@  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
         pmd_num = pmd->n_rxq;
     }
 
-    if (pmd_num < current_num) {
-        return true;
+    if (pmd_num != current_num) {
+        return pmd_num < current_num;
     }
-    return false;
+
+    /* In case of a tie, prefer a PMD on the port's local NUMA node
+     * over a currently selected PMD on a remote NUMA node.
+     */
+    return current_lowest->numa->numa_id != port_numa_id &&
+           pmd->numa->numa_id == port_numa_id;
+
 }
 
 static struct sched_pmd *
 sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
 {
-    struct sched_pmd *lowest_sched_pmd = NULL;
+    struct sched_pmd *lowest_pmd = NULL;
 
     for (unsigned i = 0; i < numa->n_pmds; i++) {
         struct sched_pmd *sched_pmd;
@@ -5785,11 +5802,11 @@  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
         if (sched_pmd->isolated) {
             continue;
         }
-        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
-            lowest_sched_pmd = sched_pmd;
+        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
+            lowest_pmd = sched_pmd;
         }
     }
-    return lowest_sched_pmd;
+    return lowest_pmd;
 }
 
 /*
@@ -5891,6 +5908,32 @@  get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
     return a;
 }
 
+static struct sched_pmd *
+sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
+                              int port_numa_id, bool has_proc) {
+    int n_numa;
+    struct sched_numa *numa = NULL;
+    struct sched_numa *last_numa = NULL;
+    struct sched_pmd *lowest_pmd = NULL;
+    struct sched_pmd *pmd;
+
+    n_numa = sched_numa_list_count(numa_list);
+    /* For all numas. */
+    for (int i = 0; i < n_numa; i++) {
+        last_numa = numa;
+        numa = sched_numa_list_next(numa_list, last_numa);
+
+        /* Get the lowest pmd per numa. */
+        pmd = sched_pmd_get_lowest(numa, has_proc);
+
+        /* Check if it's the lowest pmd for all numas. */
+        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
+            lowest_pmd = pmd;
+        }
+    }
+    return lowest_pmd;
+}
+
 static void
 sched_numa_list_schedule(struct sched_numa_list *numa_list,
                          struct dp_netdev *dp,
@@ -5991,6 +6034,7 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
         struct sched_pmd *sched_pmd = NULL;
         struct sched_numa *numa;
         int port_numa_id;
+        bool cross_numa = rxq->port->cross_numa_polling;
         uint64_t proc_cycles;
         char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
 
@@ -6000,33 +6044,40 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
             start_logged = true;
         }
 
-        /* Store the cycles for this rxq as we will log these later. */
+        /* Store the cycles for this rxq as we will need these later. */
         proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
 
         port_numa_id = netdev_get_numa_id(rxq->port->netdev);
 
-        /* Select numa. */
-        numa = sched_numa_list_lookup(numa_list, port_numa_id);
-
-        /* Check if numa has no PMDs or no non-isolated PMDs. */
-        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
-            /* Unable to use this numa to find a PMD. */
-            numa = NULL;
-            /* Find any numa with available PMDs. */
-            for (int j = 0; j < n_numa; j++) {
-                numa = sched_numa_list_next(numa_list, last_cross_numa);
-                last_cross_numa = numa;
-                if (sched_numa_noniso_pmd_count(numa)) {
-                    break;
-                }
+        if (cross_numa && algo == SCHED_GROUP) {
+            /* Cross-NUMA polling is enabled, so find the lowest loaded
+             * pmd across all NUMA nodes. */
+            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
+                                                 proc_cycles ? true : false);
+        } else {
+            /* Select numa. */
+            numa = sched_numa_list_lookup(numa_list, port_numa_id);
+
+            /* Check if numa has no PMDs or no non-isolated PMDs. */
+            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
+                /* Unable to use this numa to find a PMD. */
                 numa = NULL;
+                /* Find any numa with available PMDs. */
+                for (int j = 0; j < n_numa; j++) {
+                    numa = sched_numa_list_next(numa_list, last_cross_numa);
+                    last_cross_numa = numa;
+                    if (sched_numa_noniso_pmd_count(numa)) {
+                        break;
+                    }
+                    numa = NULL;
+                }
             }
-        }
 
-        if (numa) {
-            /* Select the PMD that should be used for this rxq. */
-            sched_pmd = sched_pmd_next(numa, algo,
-                                       proc_cycles ? true : false);
+            if (numa) {
+                /* Select the PMD that should be used for this rxq. */
+                sched_pmd = sched_pmd_next(numa, algo,
+                                           proc_cycles ? true : false);
+            }
         }
 
         /* Check that a pmd has been selected. */
@@ -6036,12 +6087,23 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
             pmd_numa_id = sched_pmd->numa->numa_id;
             /* Check if selected pmd numa matches port numa. */
             if (pmd_numa_id != port_numa_id) {
-                VLOG(level, "There's no available (non-isolated) pmd thread "
-                            "on numa node %d. Port \'%s\' rx queue %d will "
-                            "be assigned to a pmd on numa node %d. "
-                            "This may lead to reduced performance.",
-                            port_numa_id, netdev_rxq_get_name(rxq->rx),
-                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
+                if (cross_numa) {
+                    VLOG(level, "Cross-numa polling has been selected for "
+                                "port \'%s\' rx queue %d on numa node %d. "
+                                "It will be assigned to a pmd on numa node "
+                                "%d. This may lead to reduced performance.",
+                                netdev_rxq_get_name(rxq->rx),
+                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
+                                pmd_numa_id);
+
+                } else {
+                    VLOG(level, "There's no available (non-isolated) pmd "
+                                "thread on numa node %d. Port \'%s\' rx queue "
+                                "%d will be assigned to a pmd on numa node "
+                                "%d. This may lead to reduced performance.",
+                                port_numa_id, netdev_rxq_get_name(rxq->rx),
+                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
+                }
             }
             VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
                         "rx queue %d%s.",
diff --git a/tests/pmd.at b/tests/pmd.at
index e6b173dab..5d2ec1382 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -579,6 +579,44 @@  icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
 OVS_VSWITCHD_STOP
 AT_CLEANUP
 
+AT_SETUP([PMD - enable cross numa polling])
+OVS_VSWITCHD_START([add-port br0 p0 \
+                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
+                    -- set Interface p0 options:numa_id=0 \
+                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
+                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
+                   [], [], [--dummy-numa 0,1])
+
+AT_CHECK([ovs-ofctl add-flow br0 action=controller])
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+])
+
+dnl Enable cross numa polling and check numa ids
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+PATTERN="cross-numa-polling has been enabled"
+AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+1
+])
+
+dnl Disable cross numa polling and check numa ids
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+PATTERN="cross-numa-polling has been disabled"
+AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+])
+
+OVS_VSWITCHD_STOP(["/|WARN|/d"])
+AT_CLEANUP
+
+
 AT_SETUP([PMD - change numa node])
 OVS_VSWITCHD_START(
   [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index cc1dd77ec..c2e948bd9 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3286,6 +3286,26 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="cross-numa-polling"
+                                  type='{"type": "boolean"}'>
+        <p>
+          Specifies if the RX queues of the port can be automatically assigned
+          to PMD threads on any NUMA node or only on the local NUMA node of
+          the port.
+        </p>
+        <p>
+          Polling of physical ports from a non-local PMD thread incurs some
+          performance penalty due to accessing packet data across the NUMA
+          boundary. This option can still increase the overall performance
+          if it allows better utilization of non-local PMD threads. It is
+          most useful together with the auto load-balancing of RX queues
+          (<ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>).
+        </p>
+        <p>
+          Defaults to false.
+        </p>
+      </column>
+
       <column name="options" key="xdp-mode"
               type='{"type": "string",
                      "enum": ["set", ["best-effort", "native-with-zerocopy",