
[ovs-dev,3/5] dpif-netdev: Add group rxq scheduling assignment type.

Message ID 20210604211856.915563-4-ktraynor@redhat.com
State New
Series Rxq scheduling updates.

Commit Message

Kevin Traynor June 4, 2021, 9:18 p.m. UTC
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.

The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to PMDs in round-robin order.
This helps to keep the rxqs that require the most processing on
different cores, but because it selects the PMDs in round-robin order,
it distributes rxqs equally across the PMDs.

'cycles' assignment has the advantage that it keeps the most loaded
rxqs off the same core, while still spreading the rxqs across a broad
range of PMDs to mitigate against changes in traffic pattern.

'cycles' assignment has the disadvantage that, in order to trade off
optimising for the current traffic load against mitigating future
changes, it tries to assign an equal number of rxqs per PMD in a
round-robin manner, and this can lead to a less than optimal balance
of the processing load.

Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running load total of each PMD.

In this case, there is no restriction on keeping an equal number of
rxqs per PMD, as assignment is purely load based.

This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
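For illustration, the load-based grouping described above can be sketched as a greedy loop: sort the rxqs by measured load in descending order, then place each one on the PMD with the lowest estimated running total. This is a simplified, hypothetical model (names like `group_assign` are illustrative, not taken from dpif-netdev.c); with the five-queue example from the documentation update in this patch (10%, 80%, 50%, 70%, 10% across three cores) it reproduces the 80/70/70 split shown there.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sort comparator: descending order of load. */
static int
cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;

    return (x < y) ? 1 : (x > y) ? -1 : 0;
}

/* Hypothetical sketch of 'group' assignment: sorts 'rxq_load' in
 * place, then adds each rxq's load to the PMD accumulator with the
 * lowest running total.  Writes final per-PMD totals to 'pmd_load'. */
static void
group_assign(uint64_t *rxq_load, size_t n_rxqs,
             uint64_t *pmd_load, size_t n_pmds)
{
    qsort(rxq_load, n_rxqs, sizeof rxq_load[0], cmp_desc);
    for (size_t i = 0; i < n_pmds; i++) {
        pmd_load[i] = 0;
    }
    for (size_t i = 0; i < n_rxqs; i++) {
        size_t lowest = 0;

        /* Pick the PMD with the lowest estimated running total. */
        for (size_t j = 1; j < n_pmds; j++) {
            if (pmd_load[j] < pmd_load[lowest]) {
                lowest = j;
            }
        }
        pmd_load[lowest] += rxq_load[i];
    }
}
```

Note how, unlike round-robin selection, nothing forces an equal rxq count per PMD: the two 10% queues both join the 50% queue's PMD because that PMD stays the least loaded.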

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
---
 Documentation/topics/dpdk/pmd.rst |  26 ++++++
 lib/dpif-netdev.c                 | 141 +++++++++++++++++++++++++-----
 vswitchd/vswitch.xml              |   5 +-
 3 files changed, 148 insertions(+), 24 deletions(-)

Comments

Pai G, Sunil June 23, 2021, 3:29 p.m. UTC | #1
Hey Kevin,

Patch looks good to me.
Builds fine; all test cases at http://patchwork.ozlabs.org/project/openvswitch/patch/20210316154532.127858-1-ktraynor@redhat.com/ pass as well.
The group assignment works fine too.

From vswitchd logs:

dpif_netdev|INFO|PMD auto load balance load threshold set to 50%
dpif_netdev|INFO|PMD auto load balance is disabled
dpif_netdev|INFO|PMD auto load balance improvement threshold set to 5%
dpif_netdev|INFO|PMD auto load balance is disabled
dpif_netdev|INFO|PMD auto load balance is enabled interval 1 mins, pmd load threshold 50%, improvement threshold 5%
dpif_netdev|INFO|Rxq to PMD assignment mode changed to: 'group'.
dpif_netdev|INFO|Performing pmd to rx queue assignment using group algorithm.
...
...
dpif_netdev|DBG|PMD auto load balance performing dry run.
dpif_netdev|DBG|There's no available (non-isolated) pmd thread on numa node 0. Port 'dpdk1' rx queue 0 will be assigned to a pmd on numa node 1. This may lead to reduced performance.
dpif_netdev|DBG|Core 26 on numa node 1 assigned port 'dpdk1' rx queue 0. (measured processing cycles 117514888500).
dpif_netdev|DBG|There's no available (non-isolated) pmd thread on numa node 0. Port 'dpdk1' rx queue 1 will be assigned to a pmd on numa node 1. This may lead to reduced performance.
dpif_netdev|DBG|Core 27 on numa node 1 assigned port 'dpdk1' rx queue 1. (measured processing cycles 115517336048).
dpif_netdev|DBG|There's no available (non-isolated) pmd thread on numa node 0. Port 'dpdk0' rx queue 1 will be assigned to a pmd on numa node 1. This may lead to reduced performance.
dpif_netdev|DBG|Core 27 on numa node 1 assigned port 'dpdk0' rx queue 1. (measured processing cycles 799822228).
dpif_netdev|DBG|There's no available (non-isolated) pmd thread on numa node 0. Port 'dpdk0' rx queue 0 will be assigned to a pmd on numa node 1. This may lead to reduced performance.
dpif_netdev|DBG|Core 27 on numa node 1 assigned port 'dpdk0' rx queue 0. (measured processing cycles 539222868).
dpif_netdev|DBG|There's no available (non-isolated) pmd thread on numa node 0. Port 'dpdk0' rx queue 2 will be assigned to a pmd on numa node 1. This may lead to reduced performance.
dpif_netdev|DBG|Core 27 on numa node 1 assigned port 'dpdk0' rx queue 2. (measured processing cycles 538244586).
dpif_netdev|DBG|Current variance 1 Estimated variance 0
dpif_netdev|DBG|Variance improvement 100%
dpif_netdev|DBG|PMD load variance improvement threshold 5% is met
dpif_netdev|INFO|PMD auto load balance dry run. Requesting datapath reconfigure.
dpif_netdev|INFO|Performing pmd to rx queue assignment using group algorithm.
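The dry-run decision in the log above (variance 1 → 0, "improvement 100%", threshold 5% met) can be sketched as below. The formula is inferred from the logged output and is an assumption, not copied from the dpif-netdev implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer variance of per-PMD loads, as an approximation of how the
 * auto load balance dry run scores an assignment. */
static uint64_t
pmd_load_variance(const uint64_t *load, size_t n)
{
    uint64_t mean = 0, var = 0;

    for (size_t i = 0; i < n; i++) {
        mean += load[i];
    }
    mean /= n;
    for (size_t i = 0; i < n; i++) {
        uint64_t d = load[i] > mean ? load[i] - mean : mean - load[i];
        var += d * d;
    }
    return var / n;
}

/* Percentage improvement of the estimated variance over the current
 * one; 0 when there is no improvement (guards against underflow). */
static uint64_t
variance_improvement(uint64_t cur, uint64_t est)
{
    if (!cur || est >= cur) {
        return 0;
    }
    return (cur - est) * 100 / cur;
}
```

With the logged values, `variance_improvement(1, 0)` gives 100, which exceeds the 5% improvement threshold and triggers the datapath reconfigure.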



Some minor nits inline:

<snipped>

> +enum sched_assignment_type {
> +    SCHED_ROUNDROBIN,
> +    SCHED_CYCLES, /* Default.*/
> +    SCHED_GROUP,
> +    SCHED_MAX
> +};
> +
>  /* Datapath based on the network device interface from netdev.h.
>   *
> @@ -367,5 +374,5 @@ struct dp_netdev {
>      struct ovs_mutex tx_qid_pool_mutex;
>      /* Use measured cycles for rxq to pmd assignment. */
> -    bool pmd_rxq_assign_cyc;
> +    enum sched_assignment_type pmd_rxq_assign_cyc;

Perhaps sched_type would be a better name?


<snipped>


> +static struct sched_pmd *
> +get_lowest_num_rxq_pmd(struct sched_numa *numa) {
> +    struct sched_pmd *lowest_rxqs_sched_pmd = NULL;
> +    unsigned lowest_rxqs = UINT_MAX;

n_lowest_rxqs would be a bit clearer perhaps?

> +
> +    /* find the pmd with lowest number of rxqs */
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        unsigned num_rxqs;
> +
> +        sched_pmd = &numa->pmds[i];
> +        num_rxqs = sched_pmd->n_rxq;
> +        if (sched_pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* If this current load is higher we can go to the next one */

The full stop at the end of the comment is missing. Maybe check this once for the entire patch?

> +        if (num_rxqs > lowest_rxqs) {
> +            continue;
> +        }
> +       if (num_rxqs < lowest_rxqs) {
> +           lowest_rxqs = num_rxqs;
> +           lowest_rxqs_sched_pmd = sched_pmd;
> +        }
> +    }
> +    return lowest_rxqs_sched_pmd;
> +}
> +
> +static struct sched_pmd *
> +get_lowest_proc_pmd(struct sched_numa *numa) {
> +    struct sched_pmd *lowest_loaded_sched_pmd = NULL;
> +    uint64_t lowest_load = UINT64_MAX;
> +
> +    /* find the pmd with the lowest load */
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        uint64_t pmd_load;
> +
> +        sched_pmd = &numa->pmds[i];
> +        if (sched_pmd->isolated) {
> +            continue;
> +        }
> +        pmd_load = sched_pmd->pmd_proc_cycles;
> +        /* If this current load is higher we can go to the next one */
> +        if (pmd_load > lowest_load) {
> +            continue;
> +        }
> +       if (pmd_load < lowest_load) {
> +           lowest_load = pmd_load;
> +           lowest_loaded_sched_pmd = sched_pmd;
> +        }
> +    }
> +    return lowest_loaded_sched_pmd;
> +}

get_lowest_proc_pmd() and get_lowest_num_rxq_pmd() look nearly identical.
I wonder if we could combine them into a single function somehow?
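One possible shape for such a combined helper, with a flag selecting the metric, is sketched below. The struct definitions here are simplified stand-ins for the patch's types (the real ones live in lib/dpif-netdev.c); this is untested against the tree:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the patch's types. */
struct sched_pmd {
    bool isolated;
    unsigned n_rxq;
    uint64_t pmd_proc_cycles;
};

struct sched_numa {
    unsigned n_pmds;
    struct sched_pmd *pmds;
};

/* Returns the non-isolated pmd with the lowest metric: processing
 * cycles when 'use_cycles' is set, otherwise the rxq count.  Ties
 * keep the first pmd found, matching the patch's '<' comparison. */
static struct sched_pmd *
get_lowest_pmd(struct sched_numa *numa, bool use_cycles)
{
    struct sched_pmd *lowest_sched_pmd = NULL;
    uint64_t lowest = UINT64_MAX;

    for (unsigned i = 0; i < numa->n_pmds; i++) {
        struct sched_pmd *sched_pmd = &numa->pmds[i];
        uint64_t val;

        if (sched_pmd->isolated) {
            continue;
        }
        val = use_cycles ? sched_pmd->pmd_proc_cycles : sched_pmd->n_rxq;
        if (val < lowest) {
            lowest = val;
            lowest_sched_pmd = sched_pmd;
        }
    }
    return lowest_sched_pmd;
}
```

This also sidesteps the redundant `>` / `<` double test that both originals carry.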

<snipped>
David Marchand June 24, 2021, 2:52 p.m. UTC | #2
On Fri, Jun 4, 2021 at 11:19 PM Kevin Traynor <ktraynor@redhat.com> wrote:
>
> Add an rxq scheduling option that allows rxqs to be grouped
> on a pmd based purely on their load.
>
> The current default 'cycles' assignment sorts rxqs by measured
> processing load and then assigns them to a list of round robin PMDs.
> This helps to keep the rxqs that require most processing on different
> cores but as it selects the PMDs in round robin order, it equally
> distributes rxqs to PMDs.
>
> 'cycles' assignment has the advantage in that it separates the most
> loaded rxqs from being on the same core but maintains the rxqs being
> spread across a broad range of PMDs to mitigate against changes to
> traffic pattern.
>
> 'cycles' assignment has the disadvantage that in order to make the
> trade off between optimising for current traffic load and mitigating
> against future changes, it tries to assign and equal amount of rxqs
> per PMD in a round robin manner and this can lead to less than optimal
> balance of the processing load.
>
> Now that PMD auto load balance can help mitigate with future changes in
> traffic patterns, a 'group' assignment can be used to assign rxqs based
> on their measured cycles and the estimated running total of the PMDs.
>
> In this case, there is no restriction about keeping equal number of
> rxqs per PMD as it is purely load based.
>
> This means that one PMD may have a group of low load rxqs assigned to it
> while another PMD has one high load rxq assigned to it, as that is the
> best balance of their measured loads across the PMDs.
>
> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> ---
>  Documentation/topics/dpdk/pmd.rst |  26 ++++++
>  lib/dpif-netdev.c                 | 141 +++++++++++++++++++++++++-----
>  vswitchd/vswitch.xml              |   5 +-
>  3 files changed, 148 insertions(+), 24 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index e481e7941..d1c45cdfb 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -137,4 +137,30 @@ The Rx queues will be assigned to the cores in the following order::
>      Core 8: Q3 (60%) | Q0 (30%)
>
> +``group`` assignment is similar to ``cycles`` in that the Rxqs will be
> +ordered by their measured processing cycles before being assigned to PMDs.
> +It differs from ``cycles`` in that it uses a running estimate of the cycles
> +that will be on each PMD to select the PMD with the lowest load for each Rxq.
> +
> +This means that there can be a group of low traffic Rxqs on one PMD, while a
> +high traffic Rxq may have a PMD to itself. Where ``cycles`` kept as close to
> +the same number of Rxqs per PMD as possible, with ``group`` this restriction is
> +removed for a better balance of the workload across PMDs.
> +
> +For example, where there are five Rx queues and three cores - 3, 7, and 8 -
> +available and the measured usage of core cycles per Rx queue over the last
> +interval is seen to be:
> +
> +- Queue #0: 10%
> +- Queue #1: 80%
> +- Queue #3: 50%
> +- Queue #4: 70%
> +- Queue #5: 10%
> +
> +The Rx queues will be assigned to the cores in the following order::
> +
> +    Core 3: Q1 (80%) |
> +    Core 7: Q4 (70%) |
> +    Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%)
> +
>  Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
>  assigned to PMDs in a round-robined fashion. This algorithm was used by
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index eaa4e9733..61e0a516f 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -306,4 +306,11 @@ struct pmd_auto_lb {
>  };
>
> +enum sched_assignment_type {
> +    SCHED_ROUNDROBIN,
> +    SCHED_CYCLES, /* Default.*/
> +    SCHED_GROUP,
> +    SCHED_MAX
> +};
> +
>  /* Datapath based on the network device interface from netdev.h.
>   *
> @@ -367,5 +374,5 @@ struct dp_netdev {
>      struct ovs_mutex tx_qid_pool_mutex;
>      /* Use measured cycles for rxq to pmd assignment. */
> -    bool pmd_rxq_assign_cyc;
> +    enum sched_assignment_type pmd_rxq_assign_cyc;
>
>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
> @@ -1799,5 +1806,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>
>      cmap_init(&dp->poll_threads);
> -    dp->pmd_rxq_assign_cyc = true;
> +    dp->pmd_rxq_assign_cyc = SCHED_CYCLES;
>
>      ovs_mutex_init(&dp->tx_qid_pool_mutex);
> @@ -4223,5 +4230,5 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>      bool enable_alb = false;
>      bool multi_rxq = false;
> -    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> +    enum sched_assignment_type pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
>
>      /* Ensure that there is at least 2 non-isolated PMDs and
> @@ -4242,6 +4249,6 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>      }
>
> -    /* Enable auto LB if it is requested and cycle based assignment is true. */
> -    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
> +    /* Enable auto LB if requested and not using roundrobin assignment. */
> +    enable_alb = enable_alb && pmd_rxq_assign_cyc != SCHED_ROUNDROBIN &&
>                      pmd_alb->auto_lb_requested;
>
> @@ -4284,4 +4291,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      uint8_t rebalance_improve;
>      bool log_autolb = false;
> +    enum sched_assignment_type pmd_rxq_assign_cyc;
>
>      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
> @@ -4342,9 +4350,15 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      }
>
> -    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
> -    if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) {
> -        VLOG_WARN("Unsupported Rxq to PMD assignment mode in pmd-rxq-assign. "
> -                      "Defaulting to 'cycles'.");
> -        pmd_rxq_assign_cyc = true;
> +    if (!strcmp(pmd_rxq_assign, "roundrobin")) {
> +        pmd_rxq_assign_cyc = SCHED_ROUNDROBIN;
> +    } else if (!strcmp(pmd_rxq_assign, "cycles")) {
> +        pmd_rxq_assign_cyc = SCHED_CYCLES;
> +    } else if (!strcmp(pmd_rxq_assign, "group")) {
> +        pmd_rxq_assign_cyc = SCHED_GROUP;
> +    } else {
> +        /* default */
> +        VLOG_WARN("Unsupported rx queue to PMD assignment mode in "
> +                  "pmd-rxq-assign. Defaulting to 'cycles'.");
> +        pmd_rxq_assign_cyc = SCHED_CYCLES;
>          pmd_rxq_assign = "cycles";
>      }
> @@ -5171,4 +5185,61 @@ compare_rxq_cycles(const void *a, const void *b)
>  }
>
> +static struct sched_pmd *
> +get_lowest_num_rxq_pmd(struct sched_numa *numa)
> +{
> +    struct sched_pmd *lowest_rxqs_sched_pmd = NULL;
> +    unsigned lowest_rxqs = UINT_MAX;
> +
> +    /* find the pmd with lowest number of rxqs */
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        unsigned num_rxqs;
> +
> +        sched_pmd = &numa->pmds[i];
> +        num_rxqs = sched_pmd->n_rxq;
> +        if (sched_pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* If this current load is higher we can go to the next one */
> +        if (num_rxqs > lowest_rxqs) {

Testing for >= avoids the (wrongly indented) test below.


> +            continue;
> +        }
> +       if (num_rxqs < lowest_rxqs) {
> +           lowest_rxqs = num_rxqs;
> +           lowest_rxqs_sched_pmd = sched_pmd;
> +        }
> +    }
> +    return lowest_rxqs_sched_pmd;
> +}
> +
> +static struct sched_pmd *
> +get_lowest_proc_pmd(struct sched_numa *numa)
> +{
> +    struct sched_pmd *lowest_loaded_sched_pmd = NULL;
> +    uint64_t lowest_load = UINT64_MAX;
> +
> +    /* find the pmd with the lowest load */
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        uint64_t pmd_load;
> +
> +        sched_pmd = &numa->pmds[i];
> +        if (sched_pmd->isolated) {
> +            continue;
> +        }
> +        pmd_load = sched_pmd->pmd_proc_cycles;
> +        /* If this current load is higher we can go to the next one */
> +        if (pmd_load > lowest_load) {
> +            continue;
> +        }
> +       if (pmd_load < lowest_load) {
> +           lowest_load = pmd_load;
> +           lowest_loaded_sched_pmd = sched_pmd;
> +        }
> +    }
> +    return lowest_loaded_sched_pmd;
> +}
> +
>  /*
>   * Returns the next pmd from the numa node.
> @@ -5229,16 +5300,40 @@ get_available_rr_pmd(struct sched_numa *numa, bool updown)
>
>  static struct sched_pmd *
> -get_next_pmd(struct sched_numa *numa, bool algo)
> +get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
> +             bool has_proc)
>  {
> -    return get_available_rr_pmd(numa, algo);
> +    if (algo == SCHED_GROUP) {
> +        struct sched_pmd *sched_pmd = NULL;
> +
> +        /* Check if the rxq has associated cycles. This is handled differently
> +         * as adding an zero cycles rxq to a PMD will mean that the lowest
> +         * core would not change on a subsequent call and all zero rxqs would
> +         * be assigned to the same PMD. */
> +        if (has_proc) {
> +            sched_pmd = get_lowest_proc_pmd(numa);
> +        } else {
> +            sched_pmd = get_lowest_num_rxq_pmd(numa);
> +        }
> +        /* If there is a pmd selected, return it now. */
> +        if (sched_pmd) {
> +            return sched_pmd;
> +        }

The only case where sched_pmd == NULL is when n_pmds == 0 in which
case the rr stuff would also lead to no pmd available.
And it seems unintuitive that in non rr mode we would still call the
rr pmd selector.

I would simply return sched_pmd here.



> +    }
> +
> +    /* By default or as a last resort, just RR the PMDs. */
> +    return get_available_rr_pmd(numa, algo == SCHED_CYCLES ? true : false);

This ternary is unneeded.


>  }
>
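For reference, applying both suggestions (returning directly from the SCHED_GROUP branch and dropping the ternary) could look like the sketch below. The enum, structs, and helpers here are minimal stubs so the example stands alone; they are not the real definitions from lib/dpif-netdev.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins so the sketch compiles on its own. */
enum sched_assignment_type { SCHED_ROUNDROBIN, SCHED_CYCLES, SCHED_GROUP };
struct sched_numa { int unused; };
struct sched_pmd { int id; };

static struct sched_pmd pmd_by_load = { 1 };
static struct sched_pmd pmd_by_nrxq = { 2 };
static struct sched_pmd pmd_rr      = { 3 };

static struct sched_pmd *
get_lowest_proc_pmd(struct sched_numa *numa)
{
    (void) numa;
    return &pmd_by_load;   /* Stub: pretend a lowest-load pmd exists. */
}

static struct sched_pmd *
get_lowest_num_rxq_pmd(struct sched_numa *numa)
{
    (void) numa;
    return &pmd_by_nrxq;   /* Stub: pretend a lowest-rxq-count pmd exists. */
}

static struct sched_pmd *
get_available_rr_pmd(struct sched_numa *numa, bool updown)
{
    (void) numa;
    (void) updown;
    return &pmd_rr;        /* Stub: round-robin selection. */
}

/* get_next_pmd() with both review suggestions applied. */
static struct sched_pmd *
get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
             bool has_proc)
{
    if (algo == SCHED_GROUP) {
        /* Zero-cycle rxqs are spread by rxq count rather than load, so
         * they do not all land on the same lowest-load PMD. */
        return has_proc ? get_lowest_proc_pmd(numa)
                        : get_lowest_num_rxq_pmd(numa);
    }
    return get_available_rr_pmd(numa, algo == SCHED_CYCLES);
}
```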

[snip]
David Marchand June 24, 2021, 3:21 p.m. UTC | #3
On Thu, Jun 24, 2021 at 4:52 PM David Marchand
<david.marchand@redhat.com> wrote:
> > @@ -5229,16 +5300,40 @@ get_available_rr_pmd(struct sched_numa *numa, bool updown)
> >
> >  static struct sched_pmd *
> > -get_next_pmd(struct sched_numa *numa, bool algo)
> > +get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
> > +             bool has_proc)
> >  {
> > -    return get_available_rr_pmd(numa, algo);
> > +    if (algo == SCHED_GROUP) {
> > +        struct sched_pmd *sched_pmd = NULL;
> > +
> > +        /* Check if the rxq has associated cycles. This is handled differently
> > +         * as adding an zero cycles rxq to a PMD will mean that the lowest
> > +         * core would not change on a subsequent call and all zero rxqs would
> > +         * be assigned to the same PMD. */
> > +        if (has_proc) {
> > +            sched_pmd = get_lowest_proc_pmd(numa);
> > +        } else {
> > +            sched_pmd = get_lowest_num_rxq_pmd(numa);
> > +        }
> > +        /* If there is a pmd selected, return it now. */
> > +        if (sched_pmd) {
> > +            return sched_pmd;
> > +        }
>
> The only case where sched_pmd == NULL is when n_pmds == 0 in which
> case the rr stuff would also lead to no pmd available.

Sorry, rather than n_pmds == 0, it is more that no non-isolated pmd is
available.
But the end result is the same.


> And it seems unintuitive that in non rr mode we would still call the
> rr pmd selector.
>
> I would simply return sched_pmd here.

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index e481e7941..d1c45cdfb 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -137,4 +137,30 @@  The Rx queues will be assigned to the cores in the following order::
     Core 8: Q3 (60%) | Q0 (30%)
 
+``group`` assignment is similar to ``cycles`` in that the Rxqs will be
+ordered by their measured processing cycles before being assigned to PMDs.
+It differs from ``cycles`` in that it uses a running estimate of the cycles
+that will be on each PMD to select the PMD with the lowest load for each Rxq.
+
+This means that there can be a group of low traffic Rxqs on one PMD, while a
+high traffic Rxq may have a PMD to itself. Where ``cycles`` kept as close to
+the same number of Rxqs per PMD as possible, with ``group`` this restriction is
+removed for a better balance of the workload across PMDs.
+
+For example, where there are five Rx queues and three cores - 3, 7, and 8 -
+available and the measured usage of core cycles per Rx queue over the last
+interval is seen to be:
+
+- Queue #0: 10%
+- Queue #1: 80%
+- Queue #3: 50%
+- Queue #4: 70%
+- Queue #5: 10%
+
+The Rx queues will be assigned to the cores in the following order::
+
+    Core 3: Q1 (80%) |
+    Core 7: Q4 (70%) |
+    Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%)
+
 Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
 assigned to PMDs in a round-robined fashion. This algorithm was used by
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index eaa4e9733..61e0a516f 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -306,4 +306,11 @@  struct pmd_auto_lb {
 };
 
+enum sched_assignment_type {
+    SCHED_ROUNDROBIN,
+    SCHED_CYCLES, /* Default.*/
+    SCHED_GROUP,
+    SCHED_MAX
+};
+
 /* Datapath based on the network device interface from netdev.h.
  *
@@ -367,5 +374,5 @@  struct dp_netdev {
     struct ovs_mutex tx_qid_pool_mutex;
     /* Use measured cycles for rxq to pmd assignment. */
-    bool pmd_rxq_assign_cyc;
+    enum sched_assignment_type pmd_rxq_assign_cyc;
 
     /* Protects the access of the 'struct dp_netdev_pmd_thread'
@@ -1799,5 +1806,5 @@  create_dp_netdev(const char *name, const struct dpif_class *class,
 
     cmap_init(&dp->poll_threads);
-    dp->pmd_rxq_assign_cyc = true;
+    dp->pmd_rxq_assign_cyc = SCHED_CYCLES;
 
     ovs_mutex_init(&dp->tx_qid_pool_mutex);
@@ -4223,5 +4230,5 @@  set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
     bool enable_alb = false;
     bool multi_rxq = false;
-    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
+    enum sched_assignment_type pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
 
     /* Ensure that there is at least 2 non-isolated PMDs and
@@ -4242,6 +4249,6 @@  set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
     }
 
-    /* Enable auto LB if it is requested and cycle based assignment is true. */
-    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
+    /* Enable auto LB if requested and not using roundrobin assignment. */
+    enable_alb = enable_alb && pmd_rxq_assign_cyc != SCHED_ROUNDROBIN &&
                     pmd_alb->auto_lb_requested;
 
@@ -4284,4 +4291,5 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     uint8_t rebalance_improve;
     bool log_autolb = false;
+    enum sched_assignment_type pmd_rxq_assign_cyc;
 
     tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
@@ -4342,9 +4350,15 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     }
 
-    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
-    if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) {
-        VLOG_WARN("Unsupported Rxq to PMD assignment mode in pmd-rxq-assign. "
-                      "Defaulting to 'cycles'.");
-        pmd_rxq_assign_cyc = true;
+    if (!strcmp(pmd_rxq_assign, "roundrobin")) {
+        pmd_rxq_assign_cyc = SCHED_ROUNDROBIN;
+    } else if (!strcmp(pmd_rxq_assign, "cycles")) {
+        pmd_rxq_assign_cyc = SCHED_CYCLES;
+    } else if (!strcmp(pmd_rxq_assign, "group")) {
+        pmd_rxq_assign_cyc = SCHED_GROUP;
+    } else {
+        /* default */
+        VLOG_WARN("Unsupported rx queue to PMD assignment mode in "
+                  "pmd-rxq-assign. Defaulting to 'cycles'.");
+        pmd_rxq_assign_cyc = SCHED_CYCLES;
         pmd_rxq_assign = "cycles";
     }
@@ -5171,4 +5185,61 @@  compare_rxq_cycles(const void *a, const void *b)
 }
 
+static struct sched_pmd *
+get_lowest_num_rxq_pmd(struct sched_numa *numa)
+{
+    struct sched_pmd *lowest_rxqs_sched_pmd = NULL;
+    unsigned lowest_rxqs = UINT_MAX;
+
+    /* find the pmd with lowest number of rxqs */
+    for (unsigned i = 0; i < numa->n_pmds; i++) {
+        struct sched_pmd *sched_pmd;
+        unsigned num_rxqs;
+
+        sched_pmd = &numa->pmds[i];
+        num_rxqs = sched_pmd->n_rxq;
+        if (sched_pmd->isolated) {
+            continue;
+        }
+
+        /* If this current load is higher we can go to the next one */
+        if (num_rxqs > lowest_rxqs) {
+            continue;
+        }
+       if (num_rxqs < lowest_rxqs) {
+           lowest_rxqs = num_rxqs;
+           lowest_rxqs_sched_pmd = sched_pmd;
+        }
+    }
+    return lowest_rxqs_sched_pmd;
+}
+
+static struct sched_pmd *
+get_lowest_proc_pmd(struct sched_numa *numa)
+{
+    struct sched_pmd *lowest_loaded_sched_pmd = NULL;
+    uint64_t lowest_load = UINT64_MAX;
+
+    /* find the pmd with the lowest load */
+    for (unsigned i = 0; i < numa->n_pmds; i++) {
+        struct sched_pmd *sched_pmd;
+        uint64_t pmd_load;
+
+        sched_pmd = &numa->pmds[i];
+        if (sched_pmd->isolated) {
+            continue;
+        }
+        pmd_load = sched_pmd->pmd_proc_cycles;
+        /* If this current load is higher we can go to the next one */
+        if (pmd_load > lowest_load) {
+            continue;
+        }
+       if (pmd_load < lowest_load) {
+           lowest_load = pmd_load;
+           lowest_loaded_sched_pmd = sched_pmd;
+        }
+    }
+    return lowest_loaded_sched_pmd;
+}
+
 /*
  * Returns the next pmd from the numa node.
@@ -5229,16 +5300,40 @@  get_available_rr_pmd(struct sched_numa *numa, bool updown)
 
 static struct sched_pmd *
-get_next_pmd(struct sched_numa *numa, bool algo)
+get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
+             bool has_proc)
 {
-    return get_available_rr_pmd(numa, algo);
+    if (algo == SCHED_GROUP) {
+        struct sched_pmd *sched_pmd = NULL;
+
+        /* Check if the rxq has associated cycles. This is handled differently
+         * as adding an zero cycles rxq to a PMD will mean that the lowest
+         * core would not change on a subsequent call and all zero rxqs would
+         * be assigned to the same PMD. */
+        if (has_proc) {
+            sched_pmd = get_lowest_proc_pmd(numa);
+        } else {
+            sched_pmd = get_lowest_num_rxq_pmd(numa);
+        }
+        /* If there is a pmd selected, return it now. */
+        if (sched_pmd) {
+            return sched_pmd;
+        }
+    }
+
+    /* By default or as a last resort, just RR the PMDs. */
+    return get_available_rr_pmd(numa, algo == SCHED_CYCLES ? true : false);
 }
 
 static const char *
-get_assignment_type_string(bool algo)
+get_assignment_type_string(enum sched_assignment_type algo)
 {
-    if (algo == false) {
-        return "roundrobin";
+    switch (algo) {
+    case SCHED_ROUNDROBIN: return "roundrobin";
+    case SCHED_CYCLES: return "cycles";
+    case SCHED_GROUP: return "group";
+    case SCHED_MAX:
+    /* fall through */
+    default: return "Unknown";
     }
-    return "cycles";
 }
 
@@ -5246,9 +5341,9 @@  get_assignment_type_string(bool algo)
 
 static bool
-get_rxq_cyc_log(char *a, bool algo, uint64_t cycles)
+get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
 {
     int ret = 0;
 
-    if (algo) {
+    if (algo != SCHED_ROUNDROBIN) {
         ret = snprintf(a, MAX_RXQ_CYC_STRLEN,
                  " (measured processing cycles %"PRIu64").",
@@ -5261,5 +5356,5 @@  static void
 sched_numa_list_schedule(struct sched_numa_list *numa_list,
                          struct dp_netdev *dp,
-                         bool algo,
+                         enum sched_assignment_type algo,
                          enum vlog_level level)
     OVS_REQUIRES(dp->port_mutex)
@@ -5285,5 +5380,5 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
             rxqs[n_rxqs++] = rxq;
 
-            if (algo == true) {
+            if (algo != SCHED_ROUNDROBIN) {
                 uint64_t cycle_hist = 0;
 
@@ -5341,5 +5436,5 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
     }
 
-    if (n_rxqs > 1 && algo) {
+    if (n_rxqs > 1 && algo != SCHED_ROUNDROBIN) {
         /* Sort the queues in order of the processing cycles
          * they consumed during their last pmd interval. */
@@ -5401,5 +5496,5 @@  sched_numa_list_schedule(struct sched_numa_list *numa_list,
         if (numa) {
             /* Select the PMD that should be used for this rxq. */
-            sched_pmd = get_next_pmd(numa, algo);
+            sched_pmd = get_next_pmd(numa, algo, proc_cycles ? true : false);
             if (sched_pmd) {
                 VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
@@ -5431,5 +5526,5 @@  rxq_scheduling(struct dp_netdev *dp) OVS_REQUIRES(dp->port_mutex)
 {
     struct sched_numa_list *numa_list;
-    bool algo = dp->pmd_rxq_assign_cyc;
+    enum sched_assignment_type algo = dp->pmd_rxq_assign_cyc;
 
     numa_list = xzalloc(sizeof *numa_list);
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 4597a215d..14cb8a2c6 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -520,5 +520,5 @@ 
       <column name="other_config" key="pmd-rxq-assign"
               type='{"type": "string",
-                     "enum": ["set", ["cycles", "roundrobin"]]}'>
+                     "enum": ["set", ["cycles", "roundrobin", "group"]]}'>
         <p>
           Specifies how RX queues will be automatically assigned to CPU cores.
@@ -530,4 +530,7 @@ 
             <dt><code>roundrobin</code></dt>
             <dd>Rxqs will be round-robined across CPU cores.</dd>
+            <dt><code>group</code></dt>
+            <dd>Rxqs will be sorted by order of measured processing cycles
+            before being assigned to CPU cores with lowest estimated load.</dd>
           </dl>
         </p>