
[ovs-dev,v3] dpif-netdev: Allow cross-NUMA polling on selected ports

Message ID 20220603042524.1350-1-anurag2k@gmail.com
State Superseded

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Anurag Agarwal June 3, 2022, 4:25 a.m. UTC
From: Jan Scheurich <jan.scheurich@ericsson.com>

Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
assignment of a port's rxqs only if there are no local, non-isolated PMDs.

On typical servers with both physical ports on one NUMA node, this often
leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources.
The alternative, to manually pin the rxqs to PMDs on remote NUMA nodes, also
has drawbacks as it limits OVS' ability to auto load-balance the rxqs.

This patch introduces a new interface configuration option to allow ports to
be automatically polled by PMDs on any NUMA node:

ovs-vsctl set interface <Name> other_config:cross-numa-polling=true

The group assignment algorithm can now select the lowest-loaded PMD on any
NUMA node, not just the local NUMA node on which the port's rxq resides.
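A minimal sketch of this selection step (illustrative C only, with made-up
names and a simplified load model, not the actual dpif-netdev code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, pared-down model of a PMD thread: a NUMA id and an
 * accumulated load (e.g. measured processing cycles). */
struct pmd {
    int numa_id;
    uint64_t load;
};

/* With cross-numa polling enabled the search spans every PMD; otherwise it
 * is restricted to PMDs local to the port's NUMA node (legacy behaviour). */
static const struct pmd *
lowest_loaded(const struct pmd *pmds, size_t n, int port_numa,
              bool cross_numa)
{
    const struct pmd *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!cross_numa && pmds[i].numa_id != port_numa) {
            continue;   /* Legacy behaviour: local NUMA only. */
        }
        if (!best || pmds[i].load < best->load) {
            best = &pmds[i];
        }
    }
    return best;
}
```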

If this option is not present or set to false, legacy behaviour applies.

Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
---
 Documentation/topics/dpdk/pmd.rst |  23 ++++++
 lib/dpif-netdev.c                 | 123 ++++++++++++++++++++++--------
 tests/pmd.at                      |  33 ++++++++
 vswitchd/vswitch.xml              |  20 +++++
 4 files changed, 166 insertions(+), 33 deletions(-)

Comments

Kevin Traynor June 3, 2022, 4:06 p.m. UTC | #1
Hi Anurag,

Thanks for submitting this. Some initial comments on the code below.

On 03/06/2022 05:25, Anurag Agarwal wrote:
> From: Jan Scheurich <jan.scheurich@ericsson.com>
> 
> Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
> assignment of the rxqs of a port only if there are no local,non-isolated PMDs.
> 
> On typical servers with both physical ports on one NUMA node, this often
> leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources.
> The alternative, to manually pin the rxqs to PMDs on remote NUMA nodes, also
> has drawbacks as it limits OVS' ability to auto load-balance the rxqs.
> 
> This patch introduces a new interface configuration option to allow ports to
> be automatically polled by PMDs on any NUMA node:
> 
> ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> 
> The group assignment algorithm now has the ability to select lowest loaded PMD
> on any NUMA, and not just the local NUMA on which the rxq of the port resides
> 
> If this option is not present or set to false, legacy behaviour applies.
> 
> Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> ---

It would be good if you could include some links below the cut line here 
or in a cover-letter to the previous discussion on this feature and the 
possible side-effect, so reviewers can be aware of that.

Also, fine so far as it was minor revisions sent close together, but for 
future revisions, please add a note about what has changed. That will 
help reviewers to know what they need to focus on in the new revision.

>   Documentation/topics/dpdk/pmd.rst |  23 ++++++
>   lib/dpif-netdev.c                 | 123 ++++++++++++++++++++++--------
>   tests/pmd.at                      |  33 ++++++++
>   vswitchd/vswitch.xml              |  20 +++++
>   4 files changed, 166 insertions(+), 33 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..387f962d1 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -99,6 +99,25 @@ core cycles for each Rx queue::
>   
>       $ ovs-appctl dpif-netdev/pmd-rxq-show
>   
> +Normally, Rx queues are assigned to PMD threads automatically.  By default
> +OVS only assigns Rx queues to PMD threads executing on the same NUMA
> +node in order to avoid unnecessary latency for accessing packet buffers
> +across the NUMA boundary.  Typically this overhead is higher for vhostuser
> +ports than for physical ports due to the packet copy that is done for all
> +rx packets.
> +

I don't think it needs double space for the start of each sentence, but 
that could be because I'm comparing with other parts of the 
documentation that I wrote incorrectly without that o_O

> +On NUMA servers with physical ports only on one NUMA node, the NUMA-local
> +polling policy can lead to an under-utilization of the PMD threads on the
> +remote NUMA node.  For the overall OVS performance it may in such cases be
> +beneficial to utilize the spare capacity and allow polling of a physical
> +port's rxqs across NUMA nodes despite the overhead involved.
> +The policy can be set per port with the following configuration option::
> +
> +    $ ovs-vsctl set Interface <iface> \
> +        other_config:cross-numa-polling=true|false
> +
> +The default value is false.
> +
>   .. note::
>   
>      A history of one minute is recorded and shown for each Rx queue to allow for
> @@ -115,6 +134,10 @@ core cycles for each Rx queue::
>      A ``overhead`` statistics is shown per PMD: it represents the number of
>      cycles inherently consumed by the OVS PMD processing loop.
>   
> +.. versionchanged:: 2.18.0
> +
> +   Added the interface parameter ``other_config:cross-numa-polling``
> +
>   Rx queue to PMD assignment takes place whenever there are configuration changes
>   or can be triggered by using::
>   
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index ff57b3961..ace5c1920 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -467,6 +467,7 @@ struct dp_netdev_port {
>       char *type;                 /* Port type as requested by user. */
>       char *rxq_affinity_list;    /* Requested affinity of rx queues. */
>       enum txq_req_mode txq_requested_mode;
> +    bool cross_numa_polling;    /* If true cross polling will be enabled */

Full stop to be consistent with other comments e.g.'...enabled.'

>   };
>   
>   static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
>       port->sf = NULL;
>       port->emc_enabled = true;
>       port->need_reconfigure = true;
> +    port->cross_numa_polling = false;
>       ovs_mutex_init(&port->txq_used_mutex);
>   
>       *portp = port;
> @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>       bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
>       const char *tx_steering_mode = smap_get(cfg, "tx-steering");
>       enum txq_req_mode txq_mode;
> +    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
>   
>       ovs_rwlock_wrlock(&dp->port_rwlock);
>       error = get_port_by_number(dp, port_no, &port);
> @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>           goto unlock;
>       }
>   
> +    if (cross_numa_polling != port->cross_numa_polling) {
> +        port->cross_numa_polling = cross_numa_polling;
> +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> +                  netdev_get_name(port->netdev),
> +                  (cross_numa_polling)? "enabled" : "disabled");

you can remove unneeded brackets here

> +        dp_netdev_request_reconfigure(dp);
> +    }
> +
>       if (emc_enabled != port->emc_enabled) {
>           struct dp_netdev_pmd_thread *pmd;
>           struct ds ds = DS_EMPTY_INITIALIZER;
> @@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
>   
>   static bool
>   sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> -                     bool has_proc)
> +                     int port_numa_id, bool has_proc)
>   {
>       uint64_t current_num, pmd_num;
>   
> @@ -5767,16 +5778,17 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
>           pmd_num = pmd->n_rxq;
>       }
>   
> -    if (pmd_num < current_num) {
> -        return true;
> +    if (pmd_num != current_num) {
> +        return (pmd_num < current_num) ? true : false;
>       }
> -    return false;
> +
> +    return (pmd->numa->numa_id == port_numa_id) ? true : false;

Before this in the event of a tie, the current_lowest was selected (ret 
false). Now, if in this additional test there is a further tie (because 
both are local NUMA to the port) the new one will be selected (ret true).

Either scheme should be fine, but perhaps better to keep it consistent 
with what it was, so check would be something like:

return (current_lowest->numa->numa_id != port_numa_id &&
        pmd->numa->numa_id == port_numa_id) ? true : false;
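To illustrate the tie-break under discussion, a standalone sketch with
pared-down structs (hypothetical names, not the patch itself): on equal load,
current_lowest is kept unless only the candidate is local to the port's NUMA
node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct numa { int numa_id; };
struct sched_pmd { struct numa *numa; uint64_t load; };

/* Return true if 'cand' should replace 'cur' as the lowest-loaded PMD. */
static bool
new_lowest(const struct sched_pmd *cur, const struct sched_pmd *cand,
           int port_numa_id)
{
    if (!cur) {
        return true;
    }
    if (cand->load != cur->load) {
        return cand->load < cur->load;
    }
    /* Tie on load: switch only when the candidate is port-local and the
     * current choice is not, so a local/local tie keeps the current pick,
     * consistent with the legacy behaviour. */
    return cur->numa->numa_id != port_numa_id
           && cand->numa->numa_id == port_numa_id;
}
```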

>   }
>   
>   static struct sched_pmd *
>   sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>   {
> -    struct sched_pmd *lowest_sched_pmd = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;

no real objection to the name, but just to note it is an unrelated 
cosmetic change, so I'm not sure it's worth adding to this patchset.

>   
>       for (unsigned i = 0; i < numa->n_pmds; i++) {
>           struct sched_pmd *sched_pmd;
> @@ -5785,11 +5797,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>           if (sched_pmd->isolated) {
>               continue;
>           }
> -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> -            lowest_sched_pmd = sched_pmd;
> +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
> +            lowest_pmd = sched_pmd;
>           }
>       }
> -    return lowest_sched_pmd;
> +    return lowest_pmd;
>   }
>   
>   /*
> @@ -5891,6 +5903,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
>       return a;
>   }
>   
> +static struct sched_pmd *
> +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> +                              int port_numa_id, bool has_proc) {
> +    int n_numa;
> +    struct sched_numa *numa = NULL;
> +    struct sched_numa *last_numa = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;

> +    struct sched_pmd *pmd;

I guess this one ^^^ can be moved inside the for loop.

> +
> +    n_numa = sched_numa_list_count(numa_list);
> +    /* For all numas. */
> +    for (int i = 0; i < n_numa; i++) {
> +        last_numa = numa;
> +        numa = sched_numa_list_next(numa_list, last_numa);
> +
> +        /* Get the lowest pmd per numa. */
> +        pmd = sched_pmd_get_lowest(numa, has_proc);
> +
> +        /* Check if it's the lowest pmd for all numas. */
> +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> +            lowest_pmd = pmd;
> +        }
> +    }
> +    return lowest_pmd;
> +}
> +
>   static void
>   sched_numa_list_schedule(struct sched_numa_list *numa_list,
>                            struct dp_netdev *dp,
> @@ -5991,6 +6029,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>           struct sched_pmd *sched_pmd = NULL;
>           struct sched_numa *numa;
>           int port_numa_id;
> +        bool cross_numa = rxq->port->cross_numa_polling;
>           uint64_t proc_cycles;
>           char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
>   
> @@ -6005,28 +6044,35 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>   
>           port_numa_id = netdev_get_numa_id(rxq->port->netdev);
>   
> -        /* Select numa. */
> -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> -
> -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> -            /* Unable to use this numa to find a PMD. */
> -            numa = NULL;
> -            /* Find any numa with available PMDs. */
> -            for (int j = 0; j < n_numa; j++) {
> -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> -                last_cross_numa = numa;
> -                if (sched_numa_noniso_pmd_count(numa)) {
> -                    break;
> -                }
> +        if (cross_numa && algo == SCHED_GROUP) {
> +            /* cross_numa polling enabled so find lowest loaded pmd across
> +             * all numas. */

A few lines above here is:

/* Store the cycles for this rxq as we will log these later. */
proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);

the comment can be removed or changed, as the cycles are not just for the 
logs now - we need them here too

> +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> +                                                      proc_cycles);
> +        } else {
> +            /* Select numa. */
> +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> +
> +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> +                /* Unable to use this numa to find a PMD. */
>                   numa = NULL;
> +                /* Find any numa with available PMDs. */
> +                for (int j = 0; j < n_numa; j++) {
> +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> +                    last_cross_numa = numa;
> +                    if (sched_numa_noniso_pmd_count(numa)) {
> +                        break;
> +                    }
> +                    numa = NULL;
> +                }
>               }
> -        }
>   
> -        if (numa) {
> -            /* Select the PMD that should be used for this rxq. */
> -            sched_pmd = sched_pmd_next(numa, algo,
> -                                       proc_cycles ? true : false);
> +            if (numa) {
> +                /* Select the PMD that should be used for this rxq. */
> +                sched_pmd = sched_pmd_next(numa, algo,
> +                                           proc_cycles ? true : false);
> +            }
>           }
>   
>           /* Check that a pmd has been selected. */
> @@ -6036,12 +6082,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>               pmd_numa_id = sched_pmd->numa->numa_id;
>               /* Check if selected pmd numa matches port numa. */
>               if (pmd_numa_id != port_numa_id) {
> -                VLOG(level, "There's no available (non-isolated) pmd thread "
> -                            "on numa node %d. Port \'%s\' rx queue %d will "
> -                            "be assigned to a pmd on numa node %d. "
> -                            "This may lead to reduced performance.",
> -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                if (cross_numa) {
> +                    VLOG(level, "Cross-numa polling has been selected for "
> +                                "Port \'%s\' rx queue %d on numa node %d. "
> +                                "It will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> +                                pmd_numa_id);
> +
> +                } else {
> +                    VLOG(level, "There's no available (non-isolated) pmd "
> +                                "thread on numa node %d. Port \'%s\' rx queue "
> +                                "%d will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                }
>               }
>               VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
>                           "rx queue %d%s.",
> diff --git a/tests/pmd.at b/tests/pmd.at
> index e6b173dab..a27fa17e8 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -579,6 +579,39 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
>   OVS_VSWITCHD_STOP
>   AT_CLEANUP
>   
> +AT_SETUP([PMD - Enable cross numa polling])

lower case 'enable' is more consistent with the other tests

> +OVS_VSWITCHD_START(
> +  [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=4 -- \

You can explicitly set the numa of the port with "-- set Interface p1 
options:numa_id=0". See the NUMA tests above for an example.

> +   set Open_vSwitch . other_config:pmd-cpu-mask=3
> +], [], [], [--dummy-numa 0,1])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> +
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> +0
> +])
> +
> +dnl Enable cross numa polling and check numa ids
> +AT_CHECK([ovs-vsctl set Interface p1 other_config:cross-numa-polling=true])
> +

You could also check vswitchd log for "cross-numa-polling has been enabled."

> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> +0
> +1
> +])
> +
> +dnl Disable cross numa polling and check numa ids
> +AT_CHECK([ovs-vsctl set Interface p1 other_config:cross-numa-polling=false])
> +

You could also check vswitchd log for "cross-numa-polling has been 
disabled."

> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort |      uniq], [0], [dnl
> +0
> +])
> +
> +OVS_VSWITCHD_STOP(["/|WARN|/d"])

It would be better to check for an explicit warning, so any future 
unwanted warnings which may occur would be caught by the test.

> +AT_CLEANUP
> +
> +
>   AT_SETUP([PMD - change numa node])
>   OVS_VSWITCHD_START(
>     [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index cc1dd77ec..7e51c85ef 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>           </p>
>         </column>
>   
> +      <column name="other_config" key="cross-numa-polling"
> +                                  type='{"type": "boolean"}'>
> +        <p>
> +          Specifies if the RX queues of the port can be automatically assigned
> +          to PMD threads on any NUMA node or only on the local NUMA node of
> +          the port.
> +        </p>
> +        <p>
> +          Polling of physical ports from a non-local PMD thread incurs some
> +          performance penalty due to the access to packet data across the NUMA
> +          barrier. This option can still increase the overall performance if
> +          it allows better utilization of those non-local PMDs threads.
> +          It is most useful together with the auto load-balancing of RX queues
> +          (see other_config:auto_lb in table Open_vSwitch).

I think the link should be:
<ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>

thanks,
Kevin.

> +        </p>
> +        <p>
> +          Defaults to false.
> +        </p>
> +      </column>
> +
>         <column name="options" key="xdp-mode"
>                 type='{"type": "string",
>                        "enum": ["set", ["best-effort", "native-with-zerocopy",
Anurag Agarwal June 8, 2022, 10:15 a.m. UTC | #2
Hi Kevin,
     Thanks for your feedback. Please find my response inline. 

I will be uploading a new version v4 of the patch with comments addressed. 

Regards,
Anurag

> -----Original Message-----
> From: Kevin Traynor <ktraynor@redhat.com>
> Sent: Friday, June 3, 2022 9:36 PM
> To: Anurag Agarwal <anurag2k@gmail.com>; ovs-dev@openvswitch.org
> Cc: Jan Scheurich <jan.scheurich@ericsson.com>; Anurag Agarwal
> <anurag.agarwal@ericsson.com>
> Subject: Re: [PATCH v3] dpif-netdev: Allow cross-NUMA polling on selected
> ports
> 
> Hi Anurag,
> 
> Thanks for submitting this. Some initial comments on the code below.
> 
> On 03/06/2022 05:25, Anurag Agarwal wrote:
> > From: Jan Scheurich <jan.scheurich@ericsson.com>
> >
> > Today dpif-netdev considers PMD threads on a non-local NUMA node for
> > automatic assignment of the rxqs of a port only if there are no local,non-
> isolated PMDs.
> >
> > On typical servers with both physical ports on one NUMA node, this
> > often leaves the PMDs on the other NUMA node under-utilized, wasting CPU
> resources.
> > The alternative, to manually pin the rxqs to PMDs on remote NUMA
> > nodes, also has drawbacks as it limits OVS' ability to auto load-balance the
> rxqs.
> >
> > This patch introduces a new interface configuration option to allow
> > ports to be automatically polled by PMDs on any NUMA node:
> >
> > ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> >
> > The group assignment algorithm now has the ability to select lowest
> > loaded PMD on any NUMA, and not just the local NUMA on which the rxq
> > of the port resides
> >
> > If this option is not present or set to false, legacy behaviour applies.
> >
> > Co-authored-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> > Signed-off-by: Anurag Agarwal <anurag.agarwal@ericsson.com>
> > ---
> 
> It would be good if you could include some links below the cut line here or in a
> cover-letter to the previous discussion on this feature and the possible side-
> effect, so reviewers can be aware of that.
Done

> 
> Also, fine so far as it was minor revisions sent close together, but for future
> revisions, please add a note about what has changed. That will help reviewers to
> know what they need to focus on in the new revision.
> 
Sure, will do.

> >   Documentation/topics/dpdk/pmd.rst |  23 ++++++
> >   lib/dpif-netdev.c                 | 123 ++++++++++++++++++++++--------
> >   tests/pmd.at                      |  33 ++++++++
> >   vswitchd/vswitch.xml              |  20 +++++
> >   4 files changed, 166 insertions(+), 33 deletions(-)
> >
> > diff --git a/Documentation/topics/dpdk/pmd.rst
> > b/Documentation/topics/dpdk/pmd.rst
> > index b259cc8b3..387f962d1 100644
> > --- a/Documentation/topics/dpdk/pmd.rst
> > +++ b/Documentation/topics/dpdk/pmd.rst
> > @@ -99,6 +99,25 @@ core cycles for each Rx queue::
> >
> >       $ ovs-appctl dpif-netdev/pmd-rxq-show
> >
> > +Normally, Rx queues are assigned to PMD threads automatically.  By
> > +default OVS only assigns Rx queues to PMD threads executing on the
> > +same NUMA node in order to avoid unnecessary latency for accessing
> > +packet buffers across the NUMA boundary.  Typically this overhead is
> > +higher for vhostuser ports than for physical ports due to the packet
> > +copy that is done for all rx packets.
> > +
> 
> I don't think it needs double space for the start of each sentence, but that could
> be because I'm comparing with other parts of the documentation that I wrote
> incorrectly without that o_O

Checked the documentation, no extra space added at the start of each line. 

> 
> > +On NUMA servers with physical ports only on one NUMA node, the
> > +NUMA-local polling policy can lead to an under-utilization of the PMD
> > +threads on the remote NUMA node.  For the overall OVS performance it
> > +may in such cases be beneficial to utilize the spare capacity and
> > +allow polling of a physical port's rxqs across NUMA nodes despite the
> overhead involved.
> > +The policy can be set per port with the following configuration option::
> > +
> > +    $ ovs-vsctl set Interface <iface> \
> > +        other_config:cross-numa-polling=true|false
> > +
> > +The default value is false.
> > +
> >   .. note::
> >
> >      A history of one minute is recorded and shown for each Rx queue
> > to allow for @@ -115,6 +134,10 @@ core cycles for each Rx queue::
> >      A ``overhead`` statistics is shown per PMD: it represents the number of
> >      cycles inherently consumed by the OVS PMD processing loop.
> >
> > +.. versionchanged:: 2.18.0
> > +
> > +   Added the interface parameter ``other_config:cross-numa-polling``
> > +
> >   Rx queue to PMD assignment takes place whenever there are configuration
> changes
> >   or can be triggered by using::
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > ff57b3961..ace5c1920 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -467,6 +467,7 @@ struct dp_netdev_port {
> >       char *type;                 /* Port type as requested by user. */
> >       char *rxq_affinity_list;    /* Requested affinity of rx queues. */
> >       enum txq_req_mode txq_requested_mode;
> > +    bool cross_numa_polling;    /* If true cross polling will be enabled */
> 
> Full stop to be consistent with other comments e.g.'...enabled.'
Done

> 
> >   };
> >
> >   static bool dp_netdev_flow_ref(struct dp_netdev_flow *); @@ -2101,6
> > +2102,7 @@ port_create(const char *devname, const char *type,
> >       port->sf = NULL;
> >       port->emc_enabled = true;
> >       port->need_reconfigure = true;
> > +    port->cross_numa_polling = false;
> >       ovs_mutex_init(&port->txq_used_mutex);
> >
> >       *portp = port;
> > @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif,
> odp_port_t port_no,
> >       bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> >       const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> >       enum txq_req_mode txq_mode;
> > +    bool cross_numa_polling = smap_get_bool(cfg,
> > + "cross-numa-polling", false);
> >
> >       ovs_rwlock_wrlock(&dp->port_rwlock);
> >       error = get_port_by_number(dp, port_no, &port); @@ -5020,6
> > +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t
> port_no,
> >           goto unlock;
> >       }
> >
> > +    if (cross_numa_polling != port->cross_numa_polling) {
> > +        port->cross_numa_polling = cross_numa_polling;
> > +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> > +                  netdev_get_name(port->netdev),
> > +                  (cross_numa_polling)? "enabled" : "disabled");
> 
> you can remove unneeded brackets here
Done

> 
> > +        dp_netdev_request_reconfigure(dp);
> > +    }
> > +
> >       if (emc_enabled != port->emc_enabled) {
> >           struct dp_netdev_pmd_thread *pmd;
> >           struct ds ds = DS_EMPTY_INITIALIZER; @@ -5751,7 +5762,7 @@
> > compare_rxq_cycles(const void *a, const void *b)
> >
> >   static bool
> >   sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct
> sched_pmd *pmd,
> > -                     bool has_proc)
> > +                     int port_numa_id, bool has_proc)
> >   {
> >       uint64_t current_num, pmd_num;
> >
> > @@ -5767,16 +5778,17 @@ sched_pmd_new_lowest(struct sched_pmd
> *current_lowest, struct sched_pmd *pmd,
> >           pmd_num = pmd->n_rxq;
> >       }
> >
> > -    if (pmd_num < current_num) {
> > -        return true;
> > +    if (pmd_num != current_num) {
> > +        return (pmd_num < current_num) ? true : false;
> >       }
> > -    return false;
> > +
> > +    return (pmd->numa->numa_id == port_numa_id) ? true : false;
> 
> Before this in the event of a tie, the current_lowest was selected (ret false).
> Now, if in this additional test there is a further tie (because both are local NUMA
> to the port) the new one will be selected (ret true).
> 
> Either scheme should be fine, but perhaps better to keep it consistent with what
> it was, so check would be something like:
> 
> return (current_lowest->numa->numa_id != port_numa_id &&
> pmd->numa->numa_id == port_numa_id) ? true : false;
> 
Done

> >   }
> >
> >   static struct sched_pmd *
> >   sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
> >   {
> > -    struct sched_pmd *lowest_sched_pmd = NULL;
> > +    struct sched_pmd *lowest_pmd = NULL;
> 
> no real objection to the name, but just to note it is an unrelated cosmetic
> change, so I'm not sure it's worth to add to this patchset.
I had to shorten the name to fit some of the lines within 80 characters, as reported by checkpatch.
> 
> >
> >       for (unsigned i = 0; i < numa->n_pmds; i++) {
> >           struct sched_pmd *sched_pmd; @@ -5785,11 +5797,11 @@
> > sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
> >           if (sched_pmd->isolated) {
> >               continue;
> >           }
> > -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> > -            lowest_sched_pmd = sched_pmd;
> > +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX,
> has_cyc)) {
> > +            lowest_pmd = sched_pmd;
> >           }
> >       }
> > -    return lowest_sched_pmd;
> > +    return lowest_pmd;
> >   }
> >
> >   /*
> > @@ -5891,6 +5903,32 @@ get_rxq_cyc_log(char *a, enum
> sched_assignment_type algo, uint64_t cycles)
> >       return a;
> >   }
> >
> > +static struct sched_pmd *
> > +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> > +                              int port_numa_id, bool has_proc) {
> > +    int n_numa;
> > +    struct sched_numa *numa = NULL;
> > +    struct sched_numa *last_numa = NULL;
> > +    struct sched_pmd *lowest_pmd = NULL;
> 
> > +    struct sched_pmd *pmd;
> 
> I guess this one ^^^ can be moved inside the for loop.
I'll keep this one as is. 

> 
> > +
> > +    n_numa = sched_numa_list_count(numa_list);
> > +    /* For all numas. */
> > +    for (int i = 0; i < n_numa; i++) {
> > +        last_numa = numa;
> > +        numa = sched_numa_list_next(numa_list, last_numa);
> > +
> > +        /* Get the lowest pmd per numa. */
> > +        pmd = sched_pmd_get_lowest(numa, has_proc);
> > +
> > +        /* Check if it's the lowest pmd for all numas. */
> > +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc))
> {
> > +            lowest_pmd = pmd;
> > +        }
> > +    }
> > +    return lowest_pmd;
> > +}
> > +
> >   static void
> >   sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >                            struct dp_netdev *dp, @@ -5991,6 +6029,7 @@
> > sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >           struct sched_pmd *sched_pmd = NULL;
> >           struct sched_numa *numa;
> >           int port_numa_id;
> > +        bool cross_numa = rxq->port->cross_numa_polling;
> >           uint64_t proc_cycles;
> >           char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
> >
> > @@ -6005,28 +6044,35 @@ sched_numa_list_schedule(struct
> > sched_numa_list *numa_list,
> >
> >           port_numa_id = netdev_get_numa_id(rxq->port->netdev);
> >
> > -        /* Select numa. */
> > -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > -
> > -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> > -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > -            /* Unable to use this numa to find a PMD. */
> > -            numa = NULL;
> > -            /* Find any numa with available PMDs. */
> > -            for (int j = 0; j < n_numa; j++) {
> > -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> > -                last_cross_numa = numa;
> > -                if (sched_numa_noniso_pmd_count(numa)) {
> > -                    break;
> > -                }
> > +        if (cross_numa && algo == SCHED_GROUP) {
> > +            /* cross_numa polling enabled so find lowest loaded pmd across
> > +             * all numas. */
> 
> A few lines above here is:
> 
> /* Store the cycles for this rxq as we will log these later. */
> proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
> 
> the comment can be removed or changed as it's not just for the logs now
> - we need it here
Done

> 
> > +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> > +                                                      proc_cycles);
> > +        } else {
> > +            /* Select numa. */
> > +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> > +
> > +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> > +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> > +                /* Unable to use this numa to find a PMD. */
> >                   numa = NULL;
> > +                /* Find any numa with available PMDs. */
> > +                for (int j = 0; j < n_numa; j++) {
> > +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> > +                    last_cross_numa = numa;
> > +                    if (sched_numa_noniso_pmd_count(numa)) {
> > +                        break;
> > +                    }
> > +                    numa = NULL;
> > +                }
> >               }
> > -        }
> >
> > -        if (numa) {
> > -            /* Select the PMD that should be used for this rxq. */
> > -            sched_pmd = sched_pmd_next(numa, algo,
> > -                                       proc_cycles ? true : false);
> > +            if (numa) {
> > +                /* Select the PMD that should be used for this rxq. */
> > +                sched_pmd = sched_pmd_next(numa, algo,
> > +                                           proc_cycles ? true : false);
> > +            }
> >           }
> >
> >           /* Check that a pmd has been selected. */
> > @@ -6036,12 +6082,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
> >               pmd_numa_id = sched_pmd->numa->numa_id;
> >               /* Check if selected pmd numa matches port numa. */
> >               if (pmd_numa_id != port_numa_id) {
> > -                VLOG(level, "There's no available (non-isolated) pmd thread "
> > -                            "on numa node %d. Port \'%s\' rx queue %d will "
> > -                            "be assigned to a pmd on numa node %d. "
> > -                            "This may lead to reduced performance.",
> > -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> > -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> > +                if (cross_numa) {
> > +                    VLOG(level, "Cross-numa polling has been selected for "
> > +                                "Port \'%s\' rx queue %d on numa node %d. "
> > +                                "It will be assigned to a pmd on numa node "
> > +                                "%d. This may lead to reduced performance.",
> > +                                netdev_rxq_get_name(rxq->rx),
> > +                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> > +                                pmd_numa_id);
> > +
> > +                } else {
> > +                    VLOG(level, "There's no available (non-isolated) pmd "
> > +                                "thread on numa node %d. Port \'%s\' rx queue "
> > +                                "%d will be assigned to a pmd on numa node "
> > +                                "%d. This may lead to reduced performance.",
> > +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> > +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> > +                }
> >               }
> >               VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
> >                           "rx queue %d%s.",
> > diff --git a/tests/pmd.at b/tests/pmd.at
> > index e6b173dab..a27fa17e8 100644
> > --- a/tests/pmd.at
> > +++ b/tests/pmd.at
> > @@ -579,6 +579,39 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
> >   OVS_VSWITCHD_STOP
> >   AT_CLEANUP
> >
> > +AT_SETUP([PMD - Enable cross numa polling])
> 
> lower case 'enable' is more consistent with the other tests
> 
> > +OVS_VSWITCHD_START(
> > +  [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=4 -- \
> 
> You can explicitly set the NUMA node of the port with "-- set Interface p1
> options:numa_id=0". See the NUMA tests above for an example.
> 
> > +   set Open_vSwitch . other_config:pmd-cpu-mask=3
> > +], [], [], [--dummy-numa 0,1])
> > +
> > +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> > +
> > +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group])
> > +
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > +0
> > +])
> > +
> > +dnl Enable cross numa polling and check numa ids
> > +AT_CHECK([ovs-vsctl set Interface p1 other_config:cross-numa-polling=true])
> > +
> 
> You could also check vswitchd log for "cross-numa-polling has been enabled."
> 
> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > +0
> > +1
> > +])
> > +
> > +dnl Disable cross numa polling and check numa ids
> > +AT_CHECK([ovs-vsctl set Interface p1 other_config:cross-numa-polling=false])
> > +
> 
> You could also check vswitchd log for "cross-numa-polling has been disabled."
> 
Done
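
A sketch of those log checks (assuming the exact INFO message emitted by dpif_netdev_port_set_config() in this patch, "<port>:cross-numa-polling has been enabled/disabled."):

```
AT_CHECK([grep "p1:cross-numa-polling has been enabled." ovs-vswitchd.log], [0], [ignore])
dnl ... and after switching the option off:
AT_CHECK([grep "p1:cross-numa-polling has been disabled." ovs-vswitchd.log], [0], [ignore])
```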

> > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> > +0
> > +])
> > +
> > +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> 
> It would be better to check for an explicit warning, so any future unwanted
> warnings which may occur would be caught by the test.
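
One way to do that (a sketch only — the pattern must match the final wording of the cross-NUMA log message in this patch):

```
OVS_VSWITCHD_STOP(["/Cross-numa polling has been selected/d"])
```

This whitelists just the expected cross-NUMA message, so any other WARN line would still fail the test.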
> 
> > +AT_CLEANUP
> > +
> > +
> >   AT_SETUP([PMD - change numa node])
> >   OVS_VSWITCHD_START(
> >     [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index cc1dd77ec..7e51c85ef 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> >           </p>
> >         </column>
> >
> > +      <column name="other_config" key="cross-numa-polling"
> > +                                  type='{"type": "boolean"}'>
> > +        <p>
> > +          Specifies if the RX queues of the port can be automatically assigned
> > +          to PMD threads on any NUMA node or only on the local NUMA node of
> > +          the port.
> > +        </p>
> > +        <p>
> > +          Polling of physical ports from a non-local PMD thread incurs some
> > +          performance penalty due to the access to packet data across the NUMA
> > +          barrier. This option can still increase the overall performance if
> > +          it allows better utilization of those non-local PMDs threads.
> > +          It is most useful together with the auto load-balancing of RX queues
> > +          (see other_config:auto_lb in table Open_vSwitch).
> 
> I think the link should be:
> <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>
> 
Done

> thanks,
> Kevin.
> 
> > +        </p>
> > +        <p>
> > +          Defaults to false.
> > +        </p>
> > +      </column>
> > +
> >         <column name="options" key="xdp-mode"
> >                 type='{"type": "string",
> >                        "enum": ["set", ["best-effort",
> > "native-with-zerocopy",
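
For readers following the thread, the selection logic under discussion can be sketched in Python as follows (illustrative names only — the real code operates on struct sched_pmd / struct sched_numa; INT_MAX in sched_pmd_get_lowest() is modeled here as a NUMA id of -1 that never matches):

```python
from dataclasses import dataclass

@dataclass
class Pmd:
    core: int
    numa_id: int
    cycles: int = 0    # proc cycles already assigned (group algorithm)
    n_rxq: int = 0     # rxqs already assigned (fallback metric)
    isolated: bool = False

def is_new_lowest(current, cand, port_numa_id, has_proc):
    """Mirror of sched_pmd_new_lowest(): prefer the lower load; on a tie,
    prefer a PMD on the port-local NUMA node."""
    if cand is None:
        return False
    if current is None:
        return True
    cur = current.cycles if has_proc else current.n_rxq
    new = cand.cycles if has_proc else cand.n_rxq
    if new != cur:
        return new < cur
    return cand.numa_id == port_numa_id

def lowest_pmd_all_numas(numas, port_numa_id, has_proc=True):
    """Mirror of sched_pmd_all_numa_get_lowest(): find the per-NUMA lowest,
    then keep the lowest across all NUMAs with the local-NUMA tie-break."""
    lowest = None
    for pmds in numas.values():
        per_numa = None
        for p in pmds:
            if p.isolated:
                continue
            # -1 disables the local-NUMA preference within a single NUMA.
            if is_new_lowest(per_numa, p, -1, has_proc):
                per_numa = p
        if is_new_lowest(lowest, per_numa, port_numa_id, has_proc):
            lowest = per_numa
    return lowest
```

With two equally loaded candidates on different NUMAs, the tie-break keeps the rxq on its port-local NUMA, which avoids needless cross-NUMA assignments when capacity is equal.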

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..387f962d1 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -99,6 +99,25 @@  core cycles for each Rx queue::
 
     $ ovs-appctl dpif-netdev/pmd-rxq-show
 
+Normally, Rx queues are assigned to PMD threads automatically.  By default,
+OVS only assigns Rx queues to PMD threads executing on the same NUMA
+node in order to avoid unnecessary latency when accessing packet buffers
+across the NUMA boundary.  Typically this overhead is higher for vhostuser
+ports than for physical ports due to the packet copy that is done for every
+received packet.
+
+On NUMA servers with physical ports only on one NUMA node, the NUMA-local
+polling policy can lead to under-utilization of the PMD threads on the
+remote NUMA node.  In such cases it can be beneficial for overall OVS
+performance to use the spare capacity and allow polling of a physical
+port's rxqs across NUMA nodes despite the overhead involved.
+The policy can be set per port with the following configuration option::
+
+    $ ovs-vsctl set Interface <iface> \
+        other_config:cross-numa-polling=true|false
+
+The default value is false.
+
 .. note::
 
    A history of one minute is recorded and shown for each Rx queue to allow for
@@ -115,6 +134,10 @@  core cycles for each Rx queue::
    A ``overhead`` statistics is shown per PMD: it represents the number of
    cycles inherently consumed by the OVS PMD processing loop.
 
+.. versionchanged:: 2.18.0
+
+   Added the interface parameter ``other_config:cross-numa-polling``
+
 Rx queue to PMD assignment takes place whenever there are configuration changes
 or can be triggered by using::
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index ff57b3961..ace5c1920 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -467,6 +467,7 @@  struct dp_netdev_port {
     char *type;                 /* Port type as requested by user. */
     char *rxq_affinity_list;    /* Requested affinity of rx queues. */
     enum txq_req_mode txq_requested_mode;
+    bool cross_numa_polling;    /* If true, cross-NUMA polling is allowed. */
 };
 
 static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
@@ -2101,6 +2102,7 @@  port_create(const char *devname, const char *type,
     port->sf = NULL;
     port->emc_enabled = true;
     port->need_reconfigure = true;
+    port->cross_numa_polling = false;
     ovs_mutex_init(&port->txq_used_mutex);
 
     *portp = port;
@@ -5013,6 +5015,7 @@  dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
     bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
     const char *tx_steering_mode = smap_get(cfg, "tx-steering");
     enum txq_req_mode txq_mode;
+    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
 
     ovs_rwlock_wrlock(&dp->port_rwlock);
     error = get_port_by_number(dp, port_no, &port);
@@ -5020,6 +5023,14 @@  dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
         goto unlock;
     }
 
+    if (cross_numa_polling != port->cross_numa_polling) {
+        port->cross_numa_polling = cross_numa_polling;
+        VLOG_INFO("%s:cross-numa-polling has been %s.",
+                  netdev_get_name(port->netdev),
+                  cross_numa_polling ? "enabled" : "disabled");
+        dp_netdev_request_reconfigure(dp);
+    }
+
     if (emc_enabled != port->emc_enabled) {
         struct dp_netdev_pmd_thread *pmd;
         struct ds ds = DS_EMPTY_INITIALIZER;
@@ -5751,7 +5762,7 @@  compare_rxq_cycles(const void *a, const void *b)
 
 static bool
 sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
-                     bool has_proc)
+                     int port_numa_id, bool has_proc)
 {
     uint64_t current_num, pmd_num;
 
@@ -5767,16 +5778,17 @@  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
         pmd_num = pmd->n_rxq;
     }
 
-    if (pmd_num < current_num) {
-        return true;
+    if (pmd_num != current_num) {
+        return pmd_num < current_num;
     }
-    return false;
+
+    return pmd->numa->numa_id == port_numa_id;
 }
 
 static struct sched_pmd *
 sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
 {
-    struct sched_pmd *lowest_sched_pmd = NULL;
+    struct sched_pmd *lowest_pmd = NULL;
 
     for (unsigned i = 0; i < numa->n_pmds; i++) {
         struct sched_pmd *sched_pmd;
@@ -5785,11 +5797,11 @@  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
         if (sched_pmd->isolated) {
             continue;
         }
-        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
-            lowest_sched_pmd = sched_pmd;
+        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
+            lowest_pmd = sched_pmd;
         }
     }
-    return lowest_sched_pmd;
+    return lowest_pmd;
 }
 
 /*
@@ -5891,6 +5903,32 @@  get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
     return a;
 }
 
+static struct sched_pmd *
+sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
+                              int port_numa_id, bool has_proc) {
+    int n_numa;
+    struct sched_numa *numa = NULL;
+    struct sched_numa *last_numa = NULL;
+    struct sched_pmd *lowest_pmd = NULL;
+    struct sched_pmd *pmd;
+
+    n_numa = sched_numa_list_count(numa_list);
+    /* For all numas. */
+    for (int i = 0; i < n_numa; i++) {
+        last_numa = numa;
+        numa = sched_numa_list_next(numa_list, last_numa);
+
+        /* Get the lowest pmd per numa. */
+        pmd = sched_pmd_get_lowest(numa, has_proc);
+
+        /* Check if it's the lowest pmd for all numas. */
+        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
+            lowest_pmd = pmd;
+        }
+    }
+    return lowest_pmd;
+}
+
 static void
 sched_numa_list_schedule(struct sched_numa_list *numa_list,
                          struct dp_netdev *dp,
@@ -5991,6 +6029,7 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
         struct sched_pmd *sched_pmd = NULL;
         struct sched_numa *numa;
         int port_numa_id;
+        bool cross_numa = rxq->port->cross_numa_polling;
         uint64_t proc_cycles;
         char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
 
@@ -6005,28 +6044,35 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
 
         port_numa_id = netdev_get_numa_id(rxq->port->netdev);
 
-        /* Select numa. */
-        numa = sched_numa_list_lookup(numa_list, port_numa_id);
-
-        /* Check if numa has no PMDs or no non-isolated PMDs. */
-        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
-            /* Unable to use this numa to find a PMD. */
-            numa = NULL;
-            /* Find any numa with available PMDs. */
-            for (int j = 0; j < n_numa; j++) {
-                numa = sched_numa_list_next(numa_list, last_cross_numa);
-                last_cross_numa = numa;
-                if (sched_numa_noniso_pmd_count(numa)) {
-                    break;
-                }
+        if (cross_numa && algo == SCHED_GROUP) {
+            /* Cross-numa polling is enabled, so find the lowest loaded
+             * pmd across all numas. */
+            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
+                                                      proc_cycles);
+        } else {
+            /* Select numa. */
+            numa = sched_numa_list_lookup(numa_list, port_numa_id);
+
+            /* Check if numa has no PMDs or no non-isolated PMDs. */
+            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
+                /* Unable to use this numa to find a PMD. */
                 numa = NULL;
+                /* Find any numa with available PMDs. */
+                for (int j = 0; j < n_numa; j++) {
+                    numa = sched_numa_list_next(numa_list, last_cross_numa);
+                    last_cross_numa = numa;
+                    if (sched_numa_noniso_pmd_count(numa)) {
+                        break;
+                    }
+                    numa = NULL;
+                }
             }
-        }
 
-        if (numa) {
-            /* Select the PMD that should be used for this rxq. */
-            sched_pmd = sched_pmd_next(numa, algo,
-                                       proc_cycles ? true : false);
+            if (numa) {
+                /* Select the PMD that should be used for this rxq. */
+                sched_pmd = sched_pmd_next(numa, algo,
+                                           proc_cycles ? true : false);
+            }
         }
 
         /* Check that a pmd has been selected. */
@@ -6036,12 +6082,23 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
             pmd_numa_id = sched_pmd->numa->numa_id;
             /* Check if selected pmd numa matches port numa. */
             if (pmd_numa_id != port_numa_id) {
-                VLOG(level, "There's no available (non-isolated) pmd thread "
-                            "on numa node %d. Port \'%s\' rx queue %d will "
-                            "be assigned to a pmd on numa node %d. "
-                            "This may lead to reduced performance.",
-                            port_numa_id, netdev_rxq_get_name(rxq->rx),
-                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
+                if (cross_numa) {
+                    VLOG(level, "Cross-numa polling has been selected for "
+                                "Port \'%s\' rx queue %d on numa node %d. "
+                                "It will be assigned to a pmd on numa node "
+                                "%d. This may lead to reduced performance.",
+                                netdev_rxq_get_name(rxq->rx),
+                                netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
+                                pmd_numa_id);
+
+                } else {
+                    VLOG(level, "There's no available (non-isolated) pmd "
+                                "thread on numa node %d. Port \'%s\' rx queue "
+                                "%d will be assigned to a pmd on numa node "
+                                "%d. This may lead to reduced performance.",
+                                port_numa_id, netdev_rxq_get_name(rxq->rx),
+                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
+                }
             }
             VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
                         "rx queue %d%s.",
diff --git a/tests/pmd.at b/tests/pmd.at
index e6b173dab..a27fa17e8 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -579,6 +579,39 @@  icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
 OVS_VSWITCHD_STOP
 AT_CLEANUP
 
+AT_SETUP([PMD - enable cross numa polling])
+OVS_VSWITCHD_START(
+  [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=4 options:numa_id=0 -- \
+   set Open_vSwitch . other_config:pmd-cpu-mask=3
+], [], [], [--dummy-numa 0,1])
+
+AT_CHECK([ovs-ofctl add-flow br0 action=controller])
+
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+])
+
+dnl Enable cross numa polling and check numa ids
+AT_CHECK([ovs-vsctl set Interface p1 other_config:cross-numa-polling=true])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+1
+])
+
+dnl Disable cross numa polling and check numa ids
+AT_CHECK([ovs-vsctl set Interface p1 other_config:cross-numa-polling=false])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
+0
+])
+
+OVS_VSWITCHD_STOP(["/Cross-numa polling has been selected/d"])
+AT_CLEANUP
+
+
 AT_SETUP([PMD - change numa node])
 OVS_VSWITCHD_START(
   [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index cc1dd77ec..7e51c85ef 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3286,6 +3286,26 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="cross-numa-polling"
+                                  type='{"type": "boolean"}'>
+        <p>
+          Specifies if the RX queues of the port can be automatically assigned
+          to PMD threads on any NUMA node or only on the local NUMA node of
+          the port.
+        </p>
+        <p>
+          Polling of physical ports from a non-local PMD thread incurs some
+          performance penalty due to accessing packet data across the NUMA
+          boundary.  This option can still increase the overall performance
+          if it allows better utilization of non-local PMD threads.  It is
+          most useful together with the auto load-balancing of RX queues (see
+          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>).
+        </p>
+        <p>
+          Defaults to false.
+        </p>
+      </column>
+
       <column name="options" key="xdp-mode"
               type='{"type": "string",
                      "enum": ["set", ["best-effort", "native-with-zerocopy",