
[ovs-dev] netdev-dpdk: Add port/queue based rxq to pmd assignment.

Message ID 1534871735-11422-1-git-send-email-ktraynor@redhat.com
State Superseded
Delegated to: Ian Stokes
Series [ovs-dev] netdev-dpdk: Add port/queue based rxq to pmd assignment.

Commit Message

Kevin Traynor Aug. 21, 2018, 5:15 p.m. UTC
Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
(i.e. CPUs) was done by assigning Rxqs in an ascending
port/queue order, round robined across the available
PMDs.

That was changed in OVS 2.9 to order the Rxqs by the
measured processing cycles for each, in order to try
and keep the busiest Rxqs on different PMDs.

For the most part the new scheme should be better, but
there could be situations where a user prefers a
port/queue scheme because Rxqs from a single port are
more likely to be spread across multiple cores, and/or
traffic is very bursty/unpredictable.

Add the option to have a port/queue based assignment.
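
For example, the new mode can then be selected with:

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue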

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
---
 Documentation/topics/dpdk/pmd.rst |  34 +++++++++--
 NEWS                              |   2 +
 lib/dpif-netdev.c                 | 115 +++++++++++++++++++++++++++-----------
 tests/pmd.at                      |  15 ++++-
 vswitchd/vswitch.xml              |  23 ++++++++
 5 files changed, 147 insertions(+), 42 deletions(-)

Comments

Eelco Chaudron Aug. 22, 2018, 7:32 a.m. UTC | #1
On 21 Aug 2018, at 19:15, Kevin Traynor wrote:

> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
> (i.e. CPUs) was done by assigning Rxqs in an ascending
> port/queue order, round robined across the available
> PMDs.
>
> That was changed in OVS 2.9 to order the Rxqs by the
> measured processing cycles for each, in order to try
> and keep the busiest Rxqs on different PMDs.
>
> For the most part the new scheme should be better, but
> there could be situations where a user prefers a
> port/queue scheme because Rxqs from a single port are
> more likely to be spread across multiple cores, and/or
> traffic is very bursty/unpredictable.
>
> Add the option to have a port/queue based assignment.

Some small comments below… Rest looks good.

//Eelco

> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> ---
>  Documentation/topics/dpdk/pmd.rst |  34 +++++++++--
>  NEWS                              |   2 +
>  lib/dpif-netdev.c                 | 115 +++++++++++++++++++++++++++-----------
>  tests/pmd.at                      |  15 ++++-
>  vswitchd/vswitch.xml              |  23 ++++++++
>  5 files changed, 147 insertions(+), 42 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 5f0671e..e8628cc 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -113,10 +113,15 @@ means that this thread will only poll the *pinned* Rx queues.
> 
>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
> -(cores) automatically. Where known, the processing cycles that have been stored
> -for each Rx queue will be used to assign Rx queue to PMDs based on a round
> -robin of the sorted Rx queues. For example, take the following example, where
> -there are five Rx queues and three cores - 3, 7, and 8 - available and the
> -measured usage of core cycles per Rx queue over the last interval is seen to
> -be:
> +(cores) automatically.
> +
> +The algorithm used to automatically assign Rxqs to PMDs can be set by::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
> +
> +By default, ``cycles`` assignment is used where the Rxqs will be ordered by
> +their measured processing cycles, and then be assigned in descending order to
> +PMDs based on a round robin up/down walk of the PMDs. For example, where there
> +are five Rx queues and three cores - 3, 7, and 8 - available and the measured
> +usage of core cycles per Rx queue over the last interval is seen to be:
> 
>  - Queue #0: 30%
> @@ -132,4 +137,21 @@ The Rx queues will be assigned to the cores in the following order::
>      Core 8: Q3 (60%) | Q0 (30%)
> 
> +Alternatively, ``portqueue`` assignment can be used, where the Rxqs are
> +ordered by their port/queue number, and then assigned in ascending order to
> +PMDs based on a round robin of the PMDs. This algorithm was used by default
> +prior to OVS 2.9. For example, given the following ports and queues:
> +
> +- Port #0 Queue #0 (P0Q0)
> +- Port #0 Queue #1 (P0Q1)
> +- Port #1 Queue #0 (P1Q0)
> +- Port #1 Queue #1 (P1Q1)
> +- Port #1 Queue #2 (P1Q2)
> +
> +The Rx queues will be assigned to the cores in the following order::
> +
> +    Core 3: P0Q0 | P1Q1
> +    Core 7: P0Q1 | P1Q2
> +    Core 8: P1Q0 |
> +
>  To see the current measured usage history of PMD core cycles for each Rx
>  queue::
> diff --git a/NEWS b/NEWS
> index 8987f9a..9e809d1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -5,4 +5,6 @@ Post-v2.10.0
>     - The environment variable OVS_CTL_TIMEOUT, if set, is now used
>       as the default timeout for control utilities.
> +   - DPDK:
> +     * Add option for port/queue based Rxq to PMD assignment.
> 
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7f836bb..507906c 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -342,4 +342,7 @@ struct dp_netdev {
>      struct id_pool *tx_qid_pool;
>      struct ovs_mutex tx_qid_pool_mutex;
> +    /* Stores whether an rxq's used cycles should be
> +     * used to influence assignment to pmds. */
> +    atomic_bool pmd_rxq_assign_cyc;
> 
>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
> @@ -1491,4 +1494,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>      atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
>      atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
> +    atomic_init(&dp->pmd_rxq_assign_cyc, true);
> 
>      cmap_init(&dp->poll_threads);
> @@ -3717,4 +3721,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      struct dp_netdev *dp = get_dp_netdev(dpif);
>      const char *cmask = smap_get(other_config, "pmd-cpu-mask");
> +    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
> +                                             "cycles");
>      unsigned long long insert_prob =
>          smap_get_ullong(other_config, "emc-insert-inv-prob",
> @@ -3779,4 +3785,20 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
> +
> +    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
> +    if (pmd_rxq_assign_cyc || !strcmp(pmd_rxq_assign, "portqueue")) {
> +        bool cur_pmd_rxq_assign_cyc;
> +        atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &cur_pmd_rxq_assign_cyc);
> +        if (pmd_rxq_assign_cyc != cur_pmd_rxq_assign_cyc) {
> +            atomic_store_relaxed(&dp->pmd_rxq_assign_cyc, pmd_rxq_assign_cyc);
> +            if (pmd_rxq_assign_cyc) {
> +                VLOG_INFO("Rxq to PMD assignment mode: "
> +                          "Sort queues by processing cycles.");
> +            } else {
> +                VLOG_INFO("Rxq to PMD assignment mode: "
> +                          "Sort queues by port/queue number.");
> +            }
> +        }
> +    }
>      return 0;
>  }
> @@ -4249,27 +4271,40 @@ rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>  }
> 
> -/* Returns the next pmd from the numa node in
> - * incrementing or decrementing order. */
> +/* Returns the next pmd from the numa node.
> + *
> + * If updown is 'true' it will alternate between
> + * selecting the next pmd in either an up or down
> + * walk, switching between up/down each time the
> + * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
> + *
> + * If updown is 'false' it will select the next pmd
> + * in ascending order, wrapping around when max core
> + * is reached. e.g. 1,2,3,1,2,3,1,2... */
>  static struct dp_netdev_pmd_thread *
> -rr_numa_get_pmd(struct rr_numa *numa)
> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>  {
> -    int numa_idx = numa->cur_index;
> +    int numa_idx;
> 
> -    if (numa->idx_inc == true) {
> -        /* Incrementing through list of pmds. */
> -        if (numa->cur_index == numa->n_pmds-1) {
> -            /* Reached the last pmd. */
> -            numa->idx_inc = false;
> +    if (updown) {
> +        numa_idx = numa->cur_index;
> +        if (numa->idx_inc == true) {
> +            /* Incrementing through list of pmds. */
> +            if (numa->cur_index == numa->n_pmds-1) {
> +                /* Reached the last pmd. */
> +                numa->idx_inc = false;
> +            } else {
> +                numa->cur_index++;
> +            }
>          } else {
> -            numa->cur_index++;
> +            /* Decrementing through list of pmds. */
> +            if (numa->cur_index == 0) {
> +                /* Reached the first pmd. */
> +                numa->idx_inc = true;
> +            } else {
> +                numa->cur_index--;
> +            }
>          }
>      } else {
> -        /* Decrementing through list of pmds. */
> -        if (numa->cur_index == 0) {
> -            /* Reached the first pmd. */
> -            numa->idx_inc = true;
> -        } else {
> -            numa->cur_index--;
> -        }
> +        numa_idx = numa->cur_index++ % numa->n_pmds;
>      }

This looks like a big change, how about the following:

@@ -4259,7 +4259,11 @@ rr_numa_get_pmd(struct rr_numa *numa)
          /* Incrementing through list of pmds. */
          if (numa->cur_index == numa->n_pmds-1) {
              /* Reached the last pmd. */
-            numa->idx_inc = false;
+            if (updown) {
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index = 0;
+            }
          } else {
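
For reference, a minimal sketch (not the committed code) of how
rr_numa_get_pmd() could look with this suggestion folded in, assuming
the rr_numa fields shown in the patch context:

    static struct dp_netdev_pmd_thread *
    rr_numa_get_pmd(struct rr_numa *numa, bool updown)
    {
        int numa_idx = numa->cur_index;

        if (numa->idx_inc == true) {
            /* Incrementing through list of pmds. */
            if (numa->cur_index == numa->n_pmds - 1) {
                /* Reached the last pmd. */
                if (updown) {
                    numa->idx_inc = false;  /* Walk back down next time. */
                } else {
                    numa->cur_index = 0;    /* Plain round robin: wrap. */
                }
            } else {
                numa->cur_index++;
            }
        } else {
            /* Decrementing through list of pmds (updown mode only). */
            if (numa->cur_index == 0) {
                /* Reached the first pmd. */
                numa->idx_inc = true;
            } else {
                numa->cur_index--;
            }
        }
        return numa->pmds[numa_idx];
    }

With updown == false, idx_inc never becomes false, so the walk is a
simple ascending wrap (0,1,...,n-1,0,...); with updown == true the
original bounce between the first and last pmd is kept.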


>      return numa->pmds[numa_idx];
> @@ -4323,7 +4358,4 @@ compare_rxq_cycles(const void *a, const void *b)
>   * pmds to unpinned queues.
>   *
> - * If 'pinned' is false queues will be sorted by processing cycles they are
> - * consuming and then assigned to pmds in round robin order.
> - *
>   * The function doesn't touch the pmd threads, it just stores the assignment
>   * in the 'pmd' member of each rxq. */
> @@ -4338,5 +4370,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      struct rr_numa *numa = NULL;
>      int numa_id;
> +    bool assign_cyc;
> 
> +    atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &assign_cyc);
>      HMAP_FOR_EACH (port, node, &dp->ports) {
>          if (!netdev_is_pmd(port->netdev)) {
> @@ -4368,10 +4402,13 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                      rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>                  }
> -                /* Sum the queue intervals and store the cycle history. */
> -                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> -                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> -                }
> -                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
> 
> +                if (assign_cyc) {
> +                    /* Sum the queue intervals and store the cycle history. */
> +                    for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                        cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> +                    }
> +                    dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> +                                             cycle_hist);
> +                }
>                  /* Store the queue. */
>                  rxqs[n_rxqs++] = q;
> @@ -4380,5 +4417,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      }
> 
> -    if (n_rxqs > 1) {
> +    if (n_rxqs > 1 && assign_cyc) {
>          /* Sort the queues in order of the processing cycles
>           * they consumed during their last pmd interval. */
> @@ -4404,5 +4441,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                  continue;
>              }
> -            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
> +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
>              VLOG_WARN("There's no available (non-isolated) pmd thread "
>                        "on numa node %d. Queue %d on port \'%s\' will "
> @@ -4413,11 +4450,21 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                        rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>          } else {
> -        rxqs[i]->pmd = rr_numa_get_pmd(numa);
> -        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> -                  "rx queue %d (measured processing cycles %"PRIu64").",
> -                  rxqs[i]->pmd->core_id, numa_id,
> -                  netdev_rxq_get_name(rxqs[i]->rx),
> -                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> -                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
> +            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
> +            if (assign_cyc) {
> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> +                          "rx queue %d "
> +                          "(measured processing cycles %"PRIu64").",
> +                          rxqs[i]->pmd->core_id, numa_id,
> +                          netdev_rxq_get_name(rxqs[i]->rx),
> +                          netdev_rxq_get_queue_id(rxqs[i]->rx),
> +                          dp_netdev_rxq_get_cycles(rxqs[i],
> +                                                   RXQ_CYCLES_PROC_HIST));
> +            } else {
> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> +                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
> +                          netdev_rxq_get_name(rxqs[i]->rx),
> +                          netdev_rxq_get_queue_id(rxqs[i]->rx));
> +            }
> +
>          }
>      }
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 4cae6c8..7cbaf41 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -62,5 +62,6 @@ m4_define([CHECK_PMD_THREADS_CREATED], [
> 
>  m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
> -m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
> +m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
> +m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
>  m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
> 
> @@ -146,9 +147,19 @@ pmd thread numa_id <cleared> core_id <cleared>:
>  ])
> 
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])

Should you not do a rebalance here? You are “theoretically” changing
the config here.

>  TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
>  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
>  CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
> +port: p0 queue-id: <group>
> +port: p0 queue-id: <group>
> +])
> +
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-rebalance], [0], [dnl
> +pmd rxq rebalance requested.
> +])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
>  port: p0 queue-id: <group>
>  port: p0 queue-id: <group>
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0cd8520..b5789a5 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2810,4 +2810,27 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>        </column>
> 
> +      <column name="other_config" key="pmd-rxq-assign"
> +              type='{"type": "string"}'>
> +        <p>
> +            Specifies how RX queues are automatically assigned to CPU cores.
> +            Options:
> +                <code>cycles</code> - Sort queues in descending order of
> +                measured processing cycles before assigning round robin
> +                to CPUs.
> +                <code>portqueue</code>  - Sort queues in ascending order
> +                of port/queue number before assign round robin to CPUs.
> +        </p>
> +        <p>
> +          The default value is <code>cycles</code>.
> +        </p>
> +        <p>
> +          Note this this will not automatically perform a reschedule.
> +          When the mode is set to <code>portqueue</code> the
> +          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
> +          not change the Rxq to PMD assignment, as the Port/Queue/PMDs config
> +          will be unchanged from the last reconfiguration.

The last part about the rebalance in portqueue mode is a bit confusing.
I initially thought you meant when setting it to portqueue mode, but
it's the case where it's already set to portqueue mode and rebalance is
run again.

> +        </p>
> +      </column>
> +
>        <column name="options" key="vhost-server-path"
>                type='{"type": "string"}'>
> -- 
> 1.8.3.1
Ilya Maximets Aug. 22, 2018, 11:51 a.m. UTC | #2
Hi, Kevin.
Thanks for the patch.
I didn't look at the code closely, but have a few high level comments.

First of all, please change the patch prefix to 'dpif-netdev', as this
patch affects only this module and has no changes for 'netdev-dpdk'.

Next, I have a concern about the "portqueue" option naming and the docs.
The documentation says that "Rxqs are ordered by their port/queue number,
and then assigned in ascending order to PMDs". I think that's a bit
misleading, because ports are scheduled in the order in which they appear
in a hash map, and this order can differ across compiler flags and CPU
architectures because of different hash functions. So ports will be
assigned in round-robin fashion, but we cannot actually predict which
port will be assigned to which CPU core. Maybe it would be better to
rename it to something like "round-robin" and rewrite the docs to avoid
the false impression of a predictable port-to-CPU mapping.
What do you think?

The vswitch.xml changes are misplaced. They should go in the
"Configuration" group of the "Open_vSwitch" table, not the "Interface"
section.

The last thing is that you need to request a reconfiguration if this
option changes. IMHO, users will expect ports to be re-assigned when the
algorithm is changed in the database. This should not be a frequent
event, and you can avoid using atomics in this case; just use a new
field in the structure.
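
A minimal sketch of that approach (illustrative only, assuming the
existing dp_netdev_request_reconfigure() helper and a plain bool field
in place of the atomic):

    /* In dpif_netdev_set_config(). */
    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
    if (pmd_rxq_assign_cyc != dp->pmd_rxq_assign_cyc) {
        dp->pmd_rxq_assign_cyc = pmd_rxq_assign_cyc;
        dp_netdev_request_reconfigure(dp);
    }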

Best regards, Ilya Maximets.

Kevin Traynor Aug. 23, 2018, 4:18 p.m. UTC | #3
On 08/22/2018 08:32 AM, Eelco Chaudron wrote:
> 
> 
> On 21 Aug 2018, at 19:15, Kevin Traynor wrote:
> 
>> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
>> (i.e. CPUs) was done by assigning Rxqs in an ascending
>> port/queue order, round robined across the available
>> PMDs.
>>
>> That was changed in OVS 2.9 to order the Rxqs by the
>> measured processing cycles for each, in order to try
>> and keep the busiest Rxqs on different PMDs.
>>
>> For the most part the new scheme should be better, but
>> there could be situations where a user prefers a
>> port/queue scheme because Rxqs from a single port are
>> more likely to be spread across multiple cores, and/or
>> traffic is very bursty/unpredictable.
>>
>> Add the option to have a port/queue based assignment.
> 
> Some small comments below… Rest looks good.
> 

Thanks for reviewing

> //Eelco
> 
>> -/* Returns the next pmd from the numa node in
>> - * incrementing or decrementing order. */
>> +/* Returns the next pmd from the numa node.
>> + *
>> + * If updown is 'true' it will alternate between
>> + * selecting the next pmd in either an up or down
>> + * walk, switching between up/down each time the
>> + * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
>> + *
>> + * If updown is 'false' it will select the next pmd
>> + * in ascending order, wrapping around when max core
>> + * is reached. e.g. 1,2,3,1,2,3,1,2... */
>>  static struct dp_netdev_pmd_thread *
>> -rr_numa_get_pmd(struct rr_numa *numa)
>> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>>  {
>> -    int numa_idx = numa->cur_index;
>> +    int numa_idx;
>>
>> -    if (numa->idx_inc == true) {
>> -        /* Incrementing through list of pmds. */
>> -        if (numa->cur_index == numa->n_pmds-1) {
>> -            /* Reached the last pmd. */
>> -            numa->idx_inc = false;
>> +    if (updown) {
>> +        numa_idx = numa->cur_index;
>> +        if (numa->idx_inc == true) {
>> +            /* Incrementing through list of pmds. */
>> +            if (numa->cur_index == numa->n_pmds-1) {
>> +                /* Reached the last pmd. */
>> +                numa->idx_inc = false;
>> +            } else {
>> +                numa->cur_index++;
>> +            }
>>          } else {
>> -            numa->cur_index++;
>> +            /* Decrementing through list of pmds. */
>> +            if (numa->cur_index == 0) {
>> +                /* Reached the first pmd. */
>> +                numa->idx_inc = true;
>> +            } else {
>> +                numa->cur_index--;
>> +            }
>>          }
>>      } else {
>> -        /* Decrementing through list of pmds. */
>> -        if (numa->cur_index == 0) {
>> -            /* Reached the first pmd. */
>> -            numa->idx_inc = true;
>> -        } else {
>> -            numa->cur_index--;
>> -        }
>> +        numa_idx = numa->cur_index++ % numa->n_pmds;
>>      }
> 
> This looks like a big change, how about the following:
> 
> @@ -4259,7 +4259,11 @@ rr_numa_get_pmd(struct rr_numa *numa)
>          /* Incrementing through list of pmds. */
>          if (numa->cur_index == numa->n_pmds-1) {
>              /* Reached the last pmd. */
> -            numa->idx_inc = false;
> +            if (updown) {
> +                numa->idx_inc = false;
> +            } else {
> +                numa->cur_index = 0;
> +            }
>          } else {
> 

I had thought it could be simpler by reusing the previous lines of code
for that case, but perhaps it is better to do as suggested and not have
a mix of whether the mask is required or not. Changed to the suggested
approach.

>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
> 
> Should you not do a rebalance here? You are “theoretically” changing
> the config here.
> 

It won't be needed here or below now that re-assignment will
automatically happen when the assignment mode changes.

>> +        <p>
>> +          Note this this will not automatically perform a reschedule.
>> +          When the mode is set to <code>portqueue</code> the
>> +          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
>> +          not change the Rxq to PMD assignment, as the Port/Queue/PMDs config
>> +          will be unchanged from the last reconfiguration.
> 
> The last part about the rebalance in portqueue mode is a bit confusing.
> I initially thought you meant when setting it to portqueue mode, but
> it's the case where it's already set to portqueue mode and rebalance is
> run again.
> 

You're right and the last sentence was not really needed. Anyway, the
operation has changed in v2 and I've replaced this paragraph.

thanks,
Kevin.

Kevin Traynor Aug. 23, 2018, 4:23 p.m. UTC | #4
On 08/22/2018 12:51 PM, Ilya Maximets wrote:
> Hi, Kevin.
> Thanks for the patch.
> I didn't look at the code closely, but have a few high level comments.
> 

Hi Ilya,

Thanks for the feedback,

> First of all, please change the patch prefix to 'dpif-netdev', as this
> patch affects only this module and has no changes for 'netdev-dpdk'.
> 

oops, fixed.

> Next, I have a concern about the "portqueue" option naming and the docs.
> The documentation says that "Rxqs are ordered by their port/queue number,
> and then assigned in ascending order to PMDs". I think that's a bit
> misleading, because ports are scheduled in the order in which they appear
> in a hash map, and this order can differ across compiler flags and CPU
> architectures because of different hash functions. So ports will be
> assigned in round-robin fashion, but we cannot actually predict which
> port will be assigned to which CPU core. Maybe it would be better to
> rename it to something like "round-robin" and rewrite the docs to avoid
> the false impression of a predictable port-to-CPU mapping.
> What do you think?
> 

ok, sure. Naming is hard. I went for 'roundrobin' and re-wrote the docs.
It's also tricky to find the balance in docs to give some info but not
have confusing wordy explanations. I tried to keep it simple enough for
a user.
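
i.e. in the v2 the mode would be selected with something like:

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin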

> The vswitch.xml changes are misplaced. They should go in the
> "Configuration" group of the "Open_vSwitch" table, not the "Interface"
> section.
> 

Fixed.

> The last thing is that you need to request a reconfiguration if this
> option changes. IMHO, users will expect ports to be re-assigned when the
> algorithm is changed in the database. This should not be a frequent
> event, and you can avoid using atomics in this case; just use a new
> field in the structure.
> 

Makes sense, I've changed it to do that in v2 to follow...

thanks,
Kevin.

> Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index 5f0671e..e8628cc 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -113,10 +113,15 @@  means that this thread will only poll the *pinned* Rx queues.
 
 If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
-(cores) automatically. Where known, the processing cycles that have been stored
-for each Rx queue will be used to assign Rx queue to PMDs based on a round
-robin of the sorted Rx queues. For example, take the following example, where
-there are five Rx queues and three cores - 3, 7, and 8 - available and the
-measured usage of core cycles per Rx queue over the last interval is seen to
-be:
+(cores) automatically.
+
+The algorithm used to automatically assign Rxqs to PMDs can be set by::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
+
+By default, ``cycles`` assignment is used, where the Rxqs are ordered by
+their measured processing cycles and then assigned in descending order to
+PMDs based on a round robin up/down walk of the PMDs. For example, where there
+are five Rx queues and three cores - 3, 7, and 8 - available and the measured
+usage of core cycles per Rx queue over the last interval is seen to be:
 
 - Queue #0: 30%
@@ -132,4 +137,21 @@  The Rx queues will be assigned to the cores in the following order::
     Core 8: Q3 (60%) | Q0 (30%)
 
+Alternatively, ``portqueue`` assignment can be used, where the Rxqs are
+ordered by their port/queue number, and then assigned in ascending order to
+PMDs based on a round robin of the PMDs. This algorithm was used by default
+prior to OVS 2.9. For example, given the following ports and queues:
+
+- Port #0 Queue #0 (P0Q0)
+- Port #0 Queue #1 (P0Q1)
+- Port #1 Queue #0 (P1Q0)
+- Port #1 Queue #1 (P1Q1)
+- Port #1 Queue #2 (P1Q2)
+
+The Rx queues will be assigned to the cores in the following order::
+
+    Core 3: P0Q0 | P1Q1
+    Core 7: P0Q1 | P1Q2
+    Core 8: P1Q0 |
+
 To see the current measured usage history of PMD core cycles for each Rx
 queue::
diff --git a/NEWS b/NEWS
index 8987f9a..9e809d1 100644
--- a/NEWS
+++ b/NEWS
@@ -5,4 +5,6 @@  Post-v2.10.0
    - The environment variable OVS_CTL_TIMEOUT, if set, is now used
      as the default timeout for control utilities.
+   - DPDK:
+     * Add option for port/queue based Rxq to PMD assignment.
 
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 7f836bb..507906c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -342,4 +342,7 @@  struct dp_netdev {
     struct id_pool *tx_qid_pool;
     struct ovs_mutex tx_qid_pool_mutex;
+    /* Stores whether an rxq's measured cycles should be
+     * used to influence its assignment to pmds. */
+    atomic_bool pmd_rxq_assign_cyc;
 
     /* Protects the access of the 'struct dp_netdev_pmd_thread'
@@ -1491,4 +1494,5 @@  create_dp_netdev(const char *name, const struct dpif_class *class,
     atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
     atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
+    atomic_init(&dp->pmd_rxq_assign_cyc, true);
 
     cmap_init(&dp->poll_threads);
@@ -3717,4 +3721,6 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     struct dp_netdev *dp = get_dp_netdev(dpif);
     const char *cmask = smap_get(other_config, "pmd-cpu-mask");
+    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
+                                             "cycles");
     unsigned long long insert_prob =
         smap_get_ullong(other_config, "emc-insert-inv-prob",
@@ -3779,4 +3785,20 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         }
     }
+
+    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
+    if (pmd_rxq_assign_cyc || !strcmp(pmd_rxq_assign, "portqueue")) {
+        bool cur_pmd_rxq_assign_cyc;
+        atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &cur_pmd_rxq_assign_cyc);
+        if (pmd_rxq_assign_cyc != cur_pmd_rxq_assign_cyc) {
+            atomic_store_relaxed(&dp->pmd_rxq_assign_cyc, pmd_rxq_assign_cyc);
+            if (pmd_rxq_assign_cyc) {
+                VLOG_INFO("Rxq to PMD assignment mode: "
+                          "Sort queues by processing cycles.");
+            } else {
+                VLOG_INFO("Rxq to PMD assignment mode: "
+                          "Sort queues by port/queue number.");
+            }
+        }
+    }
     return 0;
 }
@@ -4249,27 +4271,40 @@  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
 }
 
-/* Returns the next pmd from the numa node in
- * incrementing or decrementing order. */
+/* Returns the next pmd from the numa node.
+ *
+ * If updown is 'true' it will alternate between
+ * selecting the next pmd in either an up or down
+ * walk, switching between up/down each time the
+ * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
+ *
+ * If updown is 'false' it will select the next pmd
+ * in ascending order, wrapping around when max core
+ * is reached. e.g. 1,2,3,1,2,3,1,2... */
 static struct dp_netdev_pmd_thread *
-rr_numa_get_pmd(struct rr_numa *numa)
+rr_numa_get_pmd(struct rr_numa *numa, bool updown)
 {
-    int numa_idx = numa->cur_index;
+    int numa_idx;
 
-    if (numa->idx_inc == true) {
-        /* Incrementing through list of pmds. */
-        if (numa->cur_index == numa->n_pmds-1) {
-            /* Reached the last pmd. */
-            numa->idx_inc = false;
+    if (updown) {
+        numa_idx = numa->cur_index;
+        if (numa->idx_inc == true) {
+            /* Incrementing through list of pmds. */
+            if (numa->cur_index == numa->n_pmds-1) {
+                /* Reached the last pmd. */
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index++;
+            }
         } else {
-            numa->cur_index++;
+            /* Decrementing through list of pmds. */
+            if (numa->cur_index == 0) {
+                /* Reached the first pmd. */
+                numa->idx_inc = true;
+            } else {
+                numa->cur_index--;
+            }
         }
     } else {
-        /* Decrementing through list of pmds. */
-        if (numa->cur_index == 0) {
-            /* Reached the first pmd. */
-            numa->idx_inc = true;
-        } else {
-            numa->cur_index--;
-        }
+        numa_idx = numa->cur_index++ % numa->n_pmds;
     }
     return numa->pmds[numa_idx];
@@ -4323,7 +4358,4 @@  compare_rxq_cycles(const void *a, const void *b)
  * pmds to unpinned queues.
  *
- * If 'pinned' is false queues will be sorted by processing cycles they are
- * consuming and then assigned to pmds in round robin order.
- *
  * The function doesn't touch the pmd threads, it just stores the assignment
  * in the 'pmd' member of each rxq. */
@@ -4338,5 +4370,7 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
     struct rr_numa *numa = NULL;
     int numa_id;
+    bool assign_cyc;
 
+    atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &assign_cyc);
     HMAP_FOR_EACH (port, node, &dp->ports) {
         if (!netdev_is_pmd(port->netdev)) {
@@ -4368,10 +4402,13 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                     rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
                 }
-                /* Sum the queue intervals and store the cycle history. */
-                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
-                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
-                }
-                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
 
+                if (assign_cyc) {
+                    /* Sum the queue intervals and store the cycle history. */
+                    for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
+                        cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
+                    }
+                    dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
+                                             cycle_hist);
+                }
                 /* Store the queue. */
                 rxqs[n_rxqs++] = q;
@@ -4380,5 +4417,5 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
     }
 
-    if (n_rxqs > 1) {
+    if (n_rxqs > 1 && assign_cyc) {
         /* Sort the queues in order of the processing cycles
          * they consumed during their last pmd interval. */
@@ -4404,5 +4441,5 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                 continue;
             }
-            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
             VLOG_WARN("There's no available (non-isolated) pmd thread "
                       "on numa node %d. Queue %d on port \'%s\' will "
@@ -4413,11 +4450,21 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                       rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
         } else {
-        rxqs[i]->pmd = rr_numa_get_pmd(numa);
-        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
-                  "rx queue %d (measured processing cycles %"PRIu64").",
-                  rxqs[i]->pmd->core_id, numa_id,
-                  netdev_rxq_get_name(rxqs[i]->rx),
-                  netdev_rxq_get_queue_id(rxqs[i]->rx),
-                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
+            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
+            if (assign_cyc) {
+                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
+                          "rx queue %d "
+                          "(measured processing cycles %"PRIu64").",
+                          rxqs[i]->pmd->core_id, numa_id,
+                          netdev_rxq_get_name(rxqs[i]->rx),
+                          netdev_rxq_get_queue_id(rxqs[i]->rx),
+                          dp_netdev_rxq_get_cycles(rxqs[i],
+                                                   RXQ_CYCLES_PROC_HIST));
+            } else {
+                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
+                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
+                          netdev_rxq_get_name(rxqs[i]->rx),
+                          netdev_rxq_get_queue_id(rxqs[i]->rx));
+            }
+
         }
     }
diff --git a/tests/pmd.at b/tests/pmd.at
index 4cae6c8..7cbaf41 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -62,5 +62,6 @@  m4_define([CHECK_PMD_THREADS_CREATED], [
 
 m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
-m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
+m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
+m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
 m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
 
@@ -146,9 +147,19 @@  pmd thread numa_id <cleared> core_id <cleared>:
 ])
 
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
 TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
 AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
 CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
 
-AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
+port: p0 queue-id: <group>
+port: p0 queue-id: <group>
+])
+
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue])
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-rebalance], [0], [dnl
+pmd rxq rebalance requested.
+])
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
 port: p0 queue-id: <group>
 port: p0 queue-id: <group>
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 0cd8520..b5789a5 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -2810,4 +2810,27 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
       </column>
 
+      <column name="other_config" key="pmd-rxq-assign"
+              type='{"type": "string"}'>
+        <p>
+            Specifies how RX queues are automatically assigned to CPU cores.
+            Options:
+                <code>cycles</code> - Sort queues in descending order of
+                measured processing cycles before assigning them round
+                robin to CPUs.
+                <code>portqueue</code> - Sort queues in ascending order of
+                port/queue number before assigning them round robin to CPUs.
+        </p>
+        <p>
+          The default value is <code>cycles</code>.
+        </p>
+        <p>
+          Note that changing this value will not automatically perform a
+          reschedule. When the mode is set to <code>portqueue</code>, the
+          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
+          not change the Rxq to PMD assignment, as the port/queue/PMD
+          configuration will be unchanged since the last reconfiguration.
+        </p>
+      </column>
+
       <column name="options" key="vhost-server-path"
               type='{"type": "string"}'>
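
For readers comparing the two modes, a self-contained sketch of the two
PMD walk orders (illustrative only, simplified from rr_numa_get_pmd() in
the patch; the core ids 3, 7 and 8 are taken from the pmd.rst example):

/* Demonstrates the up/down walk used by 'cycles' assignment versus the
 * plain wrap-around walk used by 'portqueue' assignment. */
#include <stdbool.h>
#include <stdio.h>

struct rr_numa {
    int n_pmds;
    int cur_index;
    bool idx_inc;     /* Direction of the up/down walk. */
    const int *pmds;  /* Core ids, stand-in for the pmd threads. */
};

static int
rr_next_pmd(struct rr_numa *numa, bool updown)
{
    int idx;

    if (updown) {
        /* Up/down walk: repeats the end points, e.g. 3,7,8,8,7,3,3,7... */
        idx = numa->cur_index;
        if (numa->idx_inc) {
            if (numa->cur_index == numa->n_pmds - 1) {
                numa->idx_inc = false;
            } else {
                numa->cur_index++;
            }
        } else {
            if (numa->cur_index == 0) {
                numa->idx_inc = true;
            } else {
                numa->cur_index--;
            }
        }
    } else {
        /* Plain wrap-around walk, e.g. 3,7,8,3,7,8... */
        idx = numa->cur_index++ % numa->n_pmds;
    }
    return numa->pmds[idx];
}

int
main(void)
{
    const int cores[] = { 3, 7, 8 };
    struct rr_numa cyc = { 3, 0, true, cores };
    struct rr_numa pq = { 3, 0, true, cores };

    printf("cycles (up/down):");
    for (int i = 0; i < 8; i++) {
        printf(" %d", rr_next_pmd(&cyc, true));
    }
    printf("\nportqueue (wrap):");
    for (int i = 0; i < 8; i++) {
        printf(" %d", rr_next_pmd(&pq, false));
    }
    printf("\n");
    return 0;
}

Running it prints "3 7 8 8 7 3 3 7" for the cycles walk and
"3 7 8 3 7 8 3 7" for the portqueue walk, matching the orderings
described in the rr_numa_get_pmd() comment above.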