[ovs-dev,v5] Adding support for PMD auto load balancing

Message ID: 1547491561-12131-1-git-send-email-nitin.katiyar@ericsson.com
State: Changes Requested
Delegated to: Ian Stokes
Series: [ovs-dev,v5] Adding support for PMD auto load balancing

Commit Message

Nitin Katiyar Jan. 14, 2019, 10:44 a.m. UTC
Port Rx queues that have not been statically assigned to PMDs are currently
assigned based on periodically sampled load measurements. The assignment is
performed only at specific instances: port addition, port deletion, upon a
reassignment request via CLI, etc.

Because traffic patterns change over time, this can lead to uneven load
among the PMDs, resulting in lower overall throughput.

This patch adds support for automatic load balancing of PMDs based on the
measured load of their Rx queues. Each PMD measures the processing load for
each of its associated queues every 10 seconds. If the aggregated PMD load
reaches 95% for 6 consecutive intervals, the PMD considers itself to be
overloaded.
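
As a rough illustration of that check (the numbers are made up for the
example): if, during one 10-second interval, a PMD spends 9.6e9 cycles
busy and 0.4e9 cycles idle, its load is

    busy * 100 / (busy + idle) = 9.6e9 * 100 / 10.0e9 = 96%

which is above the 95% threshold. Only after six such intervals in a row
is the PMD counted as overloaded.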

If any PMD is overloaded, a dry run of the PMD assignment algorithm is
performed by the OVS main thread. The dry run does NOT change the existing
queue-to-PMD assignments.

If the resulting mapping from the dry run indicates an improved distribution
of the load, the actual reassignment is performed.
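
As a sketch of the acceptance criterion (example figures only): the per-PMD
load percentages are compared using their variance, and the dry run is only
accepted if the variance improves by at least 25%. For instance, if the
current loads are {96, 40} (variance 784) and the dry run predicts {70, 66}
(variance 4), the improvement is

    (784 - 4) * 100 / 784 = 99%

so the reassignment would go ahead; an improvement below 25% is ignored.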

Automatic rebalancing is disabled by default and has to be enabled via a
configuration option. The interval (in minutes) between two consecutive
rebalancing rounds can also be configured via CLI; the default is 1 minute.

The following example commands can be used to set the auto-lb parameters:
ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebal-interval="5"
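
To confirm what was picked up, the values can be read back (assuming the
usual ovs-vsctl "get" syntax):
ovs-vsctl get open_vswitch . other_config:pmd-auto-lb
ovs-vsctl get open_vswitch . other_config:pmd-auto-lb-rebal-interval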

Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>
---
 Documentation/topics/dpdk/pmd.rst |  41 +++++
 NEWS                              |   1 +
 lib/dpif-netdev.c                 | 379 ++++++++++++++++++++++++++++++++++++++
 vswitchd/vswitch.xml              |  41 +++++
 4 files changed, 462 insertions(+)

Comments

Federico Iezzi Jan. 14, 2019, 3:24 p.m. UTC | #1
Maybe it's a bit late for this series, but would it be possible in a
future enhancement to have a user parameter to set a different value
for ALB_PMD_LOAD_THRESHOLD?

Regards,
Federico

FEDERICO IEZZI

SR. TELCO ARCHITECT

Red Hat EMEA

fiezzi@redhat.com    M: +31-6-5152-9709



On Mon, 14 Jan 2019 at 11:56, Nitin Katiyar <nitin.katiyar@ericsson.com> wrote:
>
> Port rx queues that have not been statically assigned to PMDs are currently
> assigned based on periodically sampled load measurements.
> The assignment is performed at specific instances – port addition, port
> deletion, upon reassignment request via CLI etc.
>
> Due to change in traffic pattern over time it can cause uneven load among
> the PMDs and thus resulting in lower overall throughout.
>
> This patch enables the support of auto load balancing of PMDs based on
> measured load of RX queues. Each PMD measures the processing load for each
> of its associated queues every 10 seconds. If the aggregated PMD load reaches
> 95% for 6 consecutive intervals then PMD considers itself to be overloaded.
>
> If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> performed by OVS main thread. The dry-run does NOT change the existing
> queue to PMD assignments.
>
> If the resultant mapping of dry-run indicates an improved distribution
> of the load then the actual reassignment will be performed.
>
> The automatic rebalancing will be disabled by default and has to be
> enabled via configuration option. The interval (in minutes) between
> two consecutive rebalancing can also be configured via CLI, default
> is 1 min.
>
> Following example commands can be used to set the auto-lb params:
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5"
>
> Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>
> ---
>  Documentation/topics/dpdk/pmd.rst |  41 +++++
>  NEWS                              |   1 +
>  lib/dpif-netdev.c                 | 379 ++++++++++++++++++++++++++++++++++++++
>  vswitchd/vswitch.xml              |  41 +++++
>  4 files changed, 462 insertions(+)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index dd9172d..c273b40 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -183,3 +183,44 @@ or can be triggered by using::
>     In addition, the output of ``pmd-rxq-show`` was modified to include
>     Rx queue utilization of the PMD as a percentage. Prior to this, tracking of
>     stats was not available.
> +
> +Automatic assignment of Port/Rx Queue to PMD Threads (experimental)
> +-------------------------------------------------------------------
> +
> +Cycle or utilization based allocation of Rx queues to PMDs gives efficient
> +load distribution but it is not adaptive to change in traffic pattern occuring
> +over the time. This causes uneven load among the PMDs which results in overall
> +lower throughput.
> +
> +To address this automatic load balancing of PMDs can be set by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> +
> +If pmd-auto-lb is set to true AND cycle based assignment is enabled then auto
> +load balancing of PMDs is enabled provided there are 2 or more non-isolated
> +PMDs and at least one of these PMDs is polling more than one RX queue.
> +
> +Once auto load balancing is set, each non-isolated PMD measures the processing
> +load for each of its associated queues every 10 seconds. If the aggregated PMD
> +load reaches 95% for 6 consecutive intervals then PMD considers itself to be
> +overloaded.
> +
> +If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> +performed by OVS main thread. The dry-run does NOT change the existing queue
> +to PMD assignments.
> +
> +If the resultant mapping of dry-run indicates an improved distribution of the
> +load then the actual reassignment will be performed.
> +
> +The minimum time between 2 consecutive PMD auto load balancing iterations can
> +also be configured by::
> +
> +    $ ovs-vsctl set open_vswitch .\
> +        other_config:pmd-auto-lb-rebal-interval="<interval>"
> +
> +where ``<interval>`` is a value in minutes. The default interval is 1 minute
> +and setting it to 0 will also result in default value i.e. 1 min.
> +
> +A user can use this option to avoid frequent trigger of auto load balancing of
> +PMDs. For e.g. set this (in min) such that it occurs once in few hours or a day
> +or a week.
> diff --git a/NEWS b/NEWS
> index 2de844f..0e9fcb1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -23,6 +23,7 @@ Post-v2.10.0
>       * Add option for simple round-robin based Rxq to PMD assignment.
>         It can be set with pmd-rxq-assign.
>       * Add support for DPDK 18.11
> +     * Add support for Auto load balancing of PMDs (experimental)
>     - Add 'symmetric_l3' hash function.
>     - OVS now honors 'updelay' and 'downdelay' for bonds with LACP configured.
>     - ovs-vswitchd:
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 1564db9..c1757ab 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -80,6 +80,12 @@
>
>  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>
> +/* Auto Load Balancing Defaults */
> +#define ALB_ACCEPTABLE_IMPROVEMENT       25
> +#define ALB_PMD_LOAD_THRESHOLD           95
> +#define ALB_PMD_REBALANCE_POLL_INTERVAL  1 /* 1 Min */
> +#define MIN_TO_MSEC                  60000
> +
>  #define FLOW_DUMP_MAX_BATCH 50
>  /* Use per thread recirc_depth to prevent recirculation loop. */
>  #define MAX_RECIRC_DEPTH 6
> @@ -288,6 +294,13 @@ struct dp_meter {
>      struct dp_meter_band bands[];
>  };
>
> +struct pmd_auto_lb {
> +    bool auto_lb_requested;     /* Auto load balancing requested by user. */
> +    bool is_enabled;            /* Current status of Auto load balancing. */
> +    uint64_t rebalance_intvl;
> +    uint64_t rebalance_poll_timer;
> +};
> +
>  /* Datapath based on the network device interface from netdev.h.
>   *
>   *
> @@ -368,6 +381,7 @@ struct dp_netdev {
>      uint64_t last_tnl_conf_seq;
>
>      struct conntrack conntrack;
> +    struct pmd_auto_lb pmd_alb;
>  };
>
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -702,6 +716,11 @@ struct dp_netdev_pmd_thread {
>      /* Keep track of detailed PMD performance statistics. */
>      struct pmd_perf_stats perf_stats;
>
> +    /* Stats from previous iteration used by automatic pmd
> +     * load balance logic. */
> +    uint64_t prev_stats[PMD_N_STATS];
> +    atomic_count pmd_overloaded;
> +
>      /* Set to true if the pmd thread needs to be reloaded. */
>      bool need_reload;
>  };
> @@ -3734,6 +3753,53 @@ dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
>      }
>  }
>
> +/* Enable or Disable PMD auto load balancing. */
> +static void
> +set_pmd_auto_lb(struct dp_netdev *dp)
> +{
> +    unsigned int cnt = 0;
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> +
> +    bool enable_alb = false;
> +    bool multi_rxq = false;
> +    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> +
> +    /* Ensure that there is at least 2 non-isolated PMDs and
> +     * one of them is polling more than one rxq. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> +            continue;
> +        }
> +
> +        if (hmap_count(&pmd->poll_list) > 1) {
> +            multi_rxq = true;
> +        }
> +        if (cnt && multi_rxq) {
> +                enable_alb = true;
> +                break;
> +        }
> +        cnt++;
> +    }
> +
> +    /* Enable auto LB if it is requested and cycle based assignment is true. */
> +    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
> +                    pmd_alb->auto_lb_requested;
> +
> +    if (pmd_alb->is_enabled != enable_alb) {
> +        pmd_alb->is_enabled = enable_alb;
> +        if (pmd_alb->is_enabled) {
> +            VLOG_INFO("PMD auto load balance is enabled "
> +                      "(with rebalance interval:%"PRIu64" msec)",
> +                       pmd_alb->rebalance_intvl);
> +        } else {
> +            pmd_alb->rebalance_poll_timer = 0;
> +            VLOG_INFO("PMD auto load balance is disabled");
> +        }
> +    }
> +
> +}
> +
>  /* Applies datapath configuration from the database. Some of the changes are
>   * actually applied in dpif_netdev_run(). */
>  static int
> @@ -3748,6 +3814,7 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                          DEFAULT_EM_FLOW_INSERT_INV_PROB);
>      uint32_t insert_min, cur_min;
>      uint32_t tx_flush_interval, cur_tx_flush_interval;
> +    uint64_t rebalance_intvl;
>
>      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
>                                       DEFAULT_TX_FLUSH_INTERVAL);
> @@ -3819,6 +3886,23 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                    pmd_rxq_assign);
>          dp_netdev_request_reconfigure(dp);
>      }
> +
> +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> +    pmd_alb->auto_lb_requested = smap_get_bool(other_config, "pmd-auto-lb",
> +                              false);
> +
> +    rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebal-interval",
> +                              ALB_PMD_REBALANCE_POLL_INTERVAL);
> +
> +    /* Input is in min, convert it to msec. */
> +    rebalance_intvl =
> +        rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC : MIN_TO_MSEC;
> +
> +    if (pmd_alb->rebalance_intvl != rebalance_intvl) {
> +        pmd_alb->rebalance_intvl = rebalance_intvl;
> +    }
> +
> +    set_pmd_auto_lb(dp);
>      return 0;
>  }
>
> @@ -4762,6 +4846,9 @@ reconfigure_datapath(struct dp_netdev *dp)
>
>      /* Reload affected pmd threads. */
>      reload_affected_pmds(dp);
> +
> +    /* Check if PMD Auto LB is to be enabled */
> +    set_pmd_auto_lb(dp);
>  }
>
>  /* Returns true if one of the netdevs in 'dp' requires a reconfiguration */
> @@ -4780,6 +4867,237 @@ ports_require_restart(const struct dp_netdev *dp)
>      return false;
>  }
>
> +/* Function for calculating variance. */
> +static uint64_t
> +variance(uint64_t a[], int n)
> +{
> +    /* Compute mean (average of elements). */
> +    uint64_t sum = 0;
> +    uint64_t mean = 0;
> +    uint64_t sqDiff = 0;
> +
> +    if (!n) {
> +        return 0;
> +    }
> +
> +    for (int i = 0; i < n; i++) {
> +        sum += a[i];
> +    }
> +
> +    if (sum) {
> +        mean = sum / n;
> +
> +        /* Compute sum squared differences with mean. */
> +        for (int i = 0; i < n; i++) {
> +            sqDiff += (a[i] - mean)*(a[i] - mean);
> +        }
> +    }
> +    return (sqDiff ? (sqDiff / n) : 0);
> +}
> +
> +
> +/* Returns the variance in the PMDs usage as part of dry run of rxqs
> + * assignment to PMDs. */
> +static bool
> +get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
> +                     uint32_t num, uint64_t *predicted_variance)
> +    OVS_REQUIRES(dp->port_mutex)
> +{
> +    struct dp_netdev_port *port;
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct dp_netdev_rxq ** rxqs = NULL;
> +    struct rr_numa *numa = NULL;
> +    struct rr_numa_list rr;
> +    int n_rxqs = 0;
> +    bool ret = false;
> +    uint64_t *pmd_usage;
> +
> +    if (!predicted_variance) {
> +        return ret;
> +    }
> +
> +    pmd_usage = xcalloc(num, sizeof(uint64_t));
> +
> +    HMAP_FOR_EACH (port, node, &dp->ports) {
> +        if (!netdev_is_pmd(port->netdev)) {
> +            continue;
> +        }
> +
> +        for (int qid = 0; qid < port->n_rxq; qid++) {
> +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> +            uint64_t cycle_hist = 0;
> +
> +            if (q->pmd->isolated) {
> +                continue;
> +            }
> +
> +            if (n_rxqs == 0) {
> +                rxqs = xmalloc(sizeof *rxqs);
> +            } else {
> +                rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
> +            }
> +
> +            /* Sum the queue intervals and store the cycle history. */
> +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> +            }
> +            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> +                                         cycle_hist);
> +            /* Store the queue. */
> +            rxqs[n_rxqs++] = q;
> +        }
> +    }
> +    if (n_rxqs > 1) {
> +        /* Sort the queues in order of the processing cycles
> +         * they consumed during their last pmd interval. */
> +        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
> +    }
> +    rr_numa_list_populate(dp, &rr);
> +
> +    for (int i = 0; i < n_rxqs; i++) {
> +        int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
> +        numa = rr_numa_list_lookup(&rr, numa_id);
> +        if (!numa) {
> +            /* Abort if cross NUMA polling. */
> +            VLOG_DBG("PMD auto lb dry run."
> +                     " Aborting due to cross-numa polling.");
> +            goto cleanup;
> +        }
> +
> +        pmd = rr_numa_get_pmd(numa, true);
> +        VLOG_DBG("PMD auto lb dry run. Predicted: Core %d on numa node %d "
> +                  "to be assigned port \'%s\' rx queue %d "
> +                  "(measured processing cycles %"PRIu64").",
> +                  pmd->core_id, numa_id,
> +                  netdev_rxq_get_name(rxqs[i]->rx),
> +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> +                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
> +
> +        for (int id = 0; id < num; id++) {
> +            if (pmd->core_id == core_list[id]) {
> +                /* Add the processing cycles of rxq to pmd polling it. */
> +                pmd_usage[id] += dp_netdev_rxq_get_cycles(rxqs[i],
> +                                        RXQ_CYCLES_PROC_HIST);
> +            }
> +        }
> +    }
> +
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        uint64_t total_cycles = 0;
> +
> +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* Get the total pmd cycles for an interval. */
> +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> +        /* Estimate the cycles to cover all intervals. */
> +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> +        for (int id = 0; id < num; id++) {
> +            if (pmd->core_id == core_list[id]) {
> +                if (pmd_usage[id]) {
> +                    pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles;
> +                }
> +                VLOG_DBG("PMD auto lb dry run. Predicted: Core %d, "
> +                         "usage %"PRIu64"", pmd->core_id, pmd_usage[id]);
> +            }
> +        }
> +    }
> +    *predicted_variance = variance(pmd_usage, num);
> +    ret = true;
> +
> +cleanup:
> +    rr_numa_list_destroy(&rr);
> +    free(rxqs);
> +    free(pmd_usage);
> +    return ret;
> +}
> +
> +/* Does the dry run of Rxq assignment to PMDs and returns true if it gives
> + * better distribution of load on PMDs. */
> +static bool
> +pmd_rebalance_dry_run(struct dp_netdev *dp)
> +    OVS_REQUIRES(dp->port_mutex)
> +{
> +    struct dp_netdev_pmd_thread *pmd;
> +    uint64_t *curr_pmd_usage;
> +
> +    uint64_t curr_variance;
> +    uint64_t new_variance;
> +    uint64_t improvement = 0;
> +    uint32_t num_pmds;
> +    uint32_t *pmd_corelist;
> +    struct rxq_poll *poll, *poll_next;
> +    bool ret;
> +
> +    num_pmds = cmap_count(&dp->poll_threads);
> +
> +    if (num_pmds > 1) {
> +        curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> +        pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t));
> +    } else {
> +        return false;
> +    }
> +
> +    num_pmds = 0;
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        uint64_t total_cycles = 0;
> +        uint64_t total_proc = 0;
> +
> +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* Get the total pmd cycles for an interval. */
> +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> +        /* Estimate the cycles to cover all intervals. */
> +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> +
> +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
> +            uint64_t proc_cycles = 0;
> +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                proc_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
> +            }
> +            total_proc += proc_cycles;
> +        }
> +        if (total_proc) {
> +            curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles;
> +        }
> +
> +        VLOG_DBG("PMD auto lb dry run. Current: Core %d, usage %"PRIu64"",
> +                  pmd->core_id, curr_pmd_usage[num_pmds]);
> +
> +        if (atomic_count_get(&pmd->pmd_overloaded)) {
> +            atomic_count_set(&pmd->pmd_overloaded, 0);
> +        }
> +
> +        pmd_corelist[num_pmds] = pmd->core_id;
> +        num_pmds++;
> +    }
> +
> +    curr_variance = variance(curr_pmd_usage, num_pmds);
> +    ret = get_dry_run_variance(dp, pmd_corelist, num_pmds, &new_variance);
> +
> +    if (ret) {
> +        VLOG_DBG("PMD auto lb dry run. Current PMD variance: %"PRIu64","
> +                  " Predicted PMD variance: %"PRIu64"",
> +                  curr_variance, new_variance);
> +
> +        if (new_variance < curr_variance) {
> +            improvement =
> +                ((curr_variance - new_variance) * 100) / curr_variance;
> +        }
> +        if (improvement < ALB_ACCEPTABLE_IMPROVEMENT) {
> +            ret = false;
> +        }
> +    }
> +
> +    free(curr_pmd_usage);
> +    free(pmd_corelist);
> +    return ret;
> +}
> +
> +
>  /* Return true if needs to revalidate datapath flows. */
>  static bool
>  dpif_netdev_run(struct dpif *dpif)
> @@ -4789,6 +5107,9 @@ dpif_netdev_run(struct dpif *dpif)
>      struct dp_netdev_pmd_thread *non_pmd;
>      uint64_t new_tnl_seq;
>      bool need_to_flush = true;
> +    bool pmd_rebalance = false;
> +    long long int now = time_msec();
> +    struct dp_netdev_pmd_thread *pmd;
>
>      ovs_mutex_lock(&dp->port_mutex);
>      non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> @@ -4821,6 +5142,32 @@ dpif_netdev_run(struct dpif *dpif)
>          dp_netdev_pmd_unref(non_pmd);
>      }
>
> +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> +    if (pmd_alb->is_enabled) {
> +        if (!pmd_alb->rebalance_poll_timer) {
> +            pmd_alb->rebalance_poll_timer = now;
> +        } else if ((pmd_alb->rebalance_poll_timer +
> +                   pmd_alb->rebalance_intvl) < now) {
> +            pmd_alb->rebalance_poll_timer = now;
> +            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +                if (atomic_count_get(&pmd->pmd_overloaded) >=
> +                                    PMD_RXQ_INTERVAL_MAX) {
> +                    pmd_rebalance = true;
> +                    break;
> +                }
> +            }
> +
> +            if (pmd_rebalance &&
> +                !dp_netdev_is_reconf_required(dp) &&
> +                !ports_require_restart(dp) &&
> +                pmd_rebalance_dry_run(dp)) {
> +                VLOG_INFO("PMD auto lb dry run."
> +                          " requesting datapath reconfigure.");
> +                dp_netdev_request_reconfigure(dp);
> +            }
> +        }
> +    }
> +
>      if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
>          reconfigure_datapath(dp);
>      }
> @@ -4979,6 +5326,8 @@ pmd_thread_main(void *f_)
>  reload:
>      pmd_alloc_static_tx_qid(pmd);
>
> +    atomic_count_init(&pmd->pmd_overloaded, 0);
> +
>      /* List port/core affinity */
>      for (i = 0; i < poll_cnt; i++) {
>         VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n",
> @@ -7188,9 +7537,39 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>                             struct polled_queue *poll_list, int poll_cnt)
>  {
>      struct dpcls *cls;
> +    uint64_t tot_idle = 0, tot_proc = 0;
> +    unsigned int pmd_load = 0;
>
>      if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
>          uint64_t curr_tsc;
> +        struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
> +        if (pmd_alb->is_enabled && !pmd->isolated
> +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
> +                                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
> +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
> +                                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY]))
> +            {
> +            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
> +                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> +            tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> +                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> +
> +            if (tot_proc) {
> +                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> +            }
> +
> +            if (pmd_load >= ALB_PMD_LOAD_THRESHOLD) {
> +                atomic_count_inc(&pmd->pmd_overloaded);
> +            } else {
> +                atomic_count_set(&pmd->pmd_overloaded, 0);
> +            }
> +        }
> +
> +        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
> +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
> +        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
> +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> +
>          /* Get the cycles that were used to process each queue and store. */
>          for (unsigned i = 0; i < poll_cnt; i++) {
>              uint64_t rxq_cyc_curr = dp_netdev_rxq_get_cycles(poll_list[i].rxq,
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 2160910..72f5283 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -574,6 +574,47 @@
>              be set to 'skip_sw'.
>          </p>
>        </column>
> +      <column name="other_config" key="pmd-auto-lb"
> +              type='{"type": "boolean"}'>
> +        <p>
> +         Configures PMD Auto Load Balancing that allows automatic assignment of
> +         RX queues to PMDs if any of PMDs is overloaded (i.e. processing cycles
> +         > 95%).
> +        </p>
> +        <p>
> +         It uses current scheme of cycle based assignment of RX queues that
> +         are not statically pinned to PMDs.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option. It is
> +          currently disabled by default and an experimental feature.
> +        </p>
> +        <p>
> +         This only comes in effect if cycle based assignment is enabled and
> +         there are more than one non-isolated PMDs present and atleast one of
> +         it polls more than one queue.
> +        </p>
> +      </column>
> +      <column name="other_config" key="pmd-auto-lb-rebal-interval"
> +              type='{"type": "integer",
> +                     "minInteger": 0, "maxInteger": 20000}'>
> +        <p>
> +         The minimum time (in minutes) 2 consecutive PMD Auto Load Balancing
> +         iterations.
> +        </p>
> +        <p>
> +         The defaul value is 1 min. If configured to 0 then it would be
> +         converted to default value i.e. 1 min
> +        </p>
> +        <p>
> +         This option can be configured to avoid frequent trigger of auto load
> +         balancing of PMDs. For e.g. set the value (in min) such that it occurs
> +         once in few hours or a day or a week.
> +        </p>
> +      </column>
>      </group>
>      <group title="Status">
>        <column name="next_cfg">
> --
> 1.9.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Stokes, Ian Jan. 14, 2019, 11:35 p.m. UTC | #2
On 1/14/2019 10:44 AM, Nitin Katiyar wrote:
> Port rx queues that have not been statically assigned to PMDs are currently
> assigned based on periodically sampled load measurements.
> The assignment is performed at specific instances – port addition, port
> deletion, upon reassignment request via CLI etc.
> 
> Due to change in traffic pattern over time it can cause uneven load among
> the PMDs and thus resulting in lower overall throughout.
> 
> This patch enables the support of auto load balancing of PMDs based on
> measured load of RX queues. Each PMD measures the processing load for each
> of its associated queues every 10 seconds. If the aggregated PMD load reaches
> 95% for 6 consecutive intervals then PMD considers itself to be overloaded.
> 
> If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> performed by OVS main thread. The dry-run does NOT change the existing
> queue to PMD assignments.
> 
> If the resultant mapping of dry-run indicates an improved distribution
> of the load then the actual reassignment will be performed.
> 
> The automatic rebalancing will be disabled by default and has to be
> enabled via configuration option. The interval (in minutes) between
> two consecutive rebalancing can also be configured via CLI, default
> is 1 min.
> 
> Following example commands can be used to set the auto-lb params:
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5"
>
Thanks for the patch, Nitin. A few comments below.

As an aside, there was discussion on last week's community call about
whether this could be part of OVS 2.11. Although this is a v5, I believe it
has been under review and testing by the folks at Red Hat; however, I don't
see any acks to date.

What are people's thoughts?

This change seems quite contained and doesn't interfere with the default
cases where rxq isolation, round-robin or cycle-based assignment is used.

In testing, the previous balancing still works fine and the new load
balancing works well too, although I have queries on default values which
are specific to the use cases discussed below.

Do people feel there is any reason to hold off merging if the issues 
below are addressed and there are no other concerns?

> Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>
> ---
>   Documentation/topics/dpdk/pmd.rst |  41 +++++
>   NEWS                              |   1 +
>   lib/dpif-netdev.c                 | 379 ++++++++++++++++++++++++++++++++++++++
>   vswitchd/vswitch.xml              |  41 +++++
>   4 files changed, 462 insertions(+)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index dd9172d..c273b40 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -183,3 +183,44 @@ or can be triggered by using::
>      In addition, the output of ``pmd-rxq-show`` was modified to include
>      Rx queue utilization of the PMD as a percentage. Prior to this, tracking of
>      stats was not available.
> +
> +Automatic assignment of Port/Rx Queue to PMD Threads (experimental)
> +-------------------------------------------------------------------
> +
> +Cycle or utilization based allocation of Rx queues to PMDs gives efficient
> +load distribution but it is not adaptive to change in traffic pattern occuring
Minor typo above, 'occuring' -> 'occurring'
> +over the time. This causes uneven load among the PMDs which results in overall
> +lower throughput.
> +
> +To address this automatic load balancing of PMDs can be set by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> +
> +If pmd-auto-lb is set to true AND cycle based assignment is enabled then auto
> +load balancing of PMDs is enabled provided there are 2 or more non-isolated
> +PMDs and at least one of these PMDs is polling more than one RX queue.

It would be good to give examples of the behavior when enabling this.

I've spent some time playing with this behavior in particular, as it
wasn't clear how it would be triggered and disabled as the queues and
PMDs are manipulated, e.g. when the number of non-isolated PMDs is
reduced below 2, pmd-auto-lb is automatically disabled. Changing to
round-robin based assignment will also disable it, etc.

> +
> +Once auto load balancing is set, each non-isolated PMD measures the processing
> +load for each of its associated queues every 10 seconds. If the aggregated PMD
> +load reaches 95% for 6 consecutive intervals then PMD considers itself to be
> +overloaded.
> +
> +If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> +performed by OVS main thread. The dry-run does NOT change the existing queue
> +to PMD assignments.
> +
> +If the resultant mapping of dry-run indicates an improved distribution of the
> +load then the actual reassignment will be performed.
> +
> +The minimum time between 2 consecutive PMD auto load balancing iterations can
> +also be configured by::
> +
> +    $ ovs-vsctl set open_vswitch .\
> +        other_config:pmd-auto-lb-rebal-interval="<interval>"
> +
> +where ``<interval>`` is a value in minutes. The default interval is 1 minute
> +and setting it to 0 will also result in default value i.e. 1 min.
> +
> +A user can use this option to avoid frequent trigger of auto load balancing of
> +PMDs. For e.g. set this (in min) such that it occurs once in few hours or a day
> +or a week.

Are there limitations to this work?

From inspecting the code, cross-NUMA polling is not supported. Could you
provide detail here explaining when/why that is the case? Are there
intentions to implement cross-NUMA support?

I think it would be good to call out where this feature may not work
well. For instance, if traffic profiles for specific rx queues were
changing dramatically within the 1 minute interval (via manipulation or
due to randomness in the profile), would it be possible to thrash the CPU
cache due to the changes required in the EMC and DPCLS for new flows?

This is an extreme corner case and I would expect traffic profiles to be
more uniform, but it could be worth mentioning. One way to avoid this
would be to set the interval higher, which again gives the user an idea of
how to deploy this and avoid such issues, along with the advantages and
disadvantages of different interval lengths.
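
For example (purely as an illustration, and remembering that the value is
in minutes), a deployment that only wants a rebalance to be considered
every six hours could use:

ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebal-interval="360"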

> diff --git a/NEWS b/NEWS
> index 2de844f..0e9fcb1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -23,6 +23,7 @@ Post-v2.10.0
>        * Add option for simple round-robin based Rxq to PMD assignment.
>          It can be set with pmd-rxq-assign.
>        * Add support for DPDK 18.11
> +     * Add support for Auto load balancing of PMDs (experimental)
>      - Add 'symmetric_l3' hash function.
>      - OVS now honors 'updelay' and 'downdelay' for bonds with LACP configured.
>      - ovs-vswitchd:
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 1564db9..c1757ab 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -80,6 +80,12 @@
>   
>   VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>   
> +/* Auto Load Balancing Defaults */
> +#define ALB_ACCEPTABLE_IMPROVEMENT       25

Would it be an option in the future to be able to set the measured
improvement level as well? Perhaps that's too specific for users at the
moment, but I'd be interested to hear people's opinions as the feature
matures.

> +#define ALB_PMD_LOAD_THRESHOLD           95
> +#define ALB_PMD_REBALANCE_POLL_INTERVAL  1 /* 1 Min */
> +#define MIN_TO_MSEC                  60000
> +
>   #define FLOW_DUMP_MAX_BATCH 50
>   /* Use per thread recirc_depth to prevent recirculation loop. */
>   #define MAX_RECIRC_DEPTH 6
> @@ -288,6 +294,13 @@ struct dp_meter {
>       struct dp_meter_band bands[];
>   };
>   
> +struct pmd_auto_lb {
> +    bool auto_lb_requested;     /* Auto load balancing requested by user. */
> +    bool is_enabled;            /* Current status of Auto load balancing. */
> +    uint64_t rebalance_intvl;
> +    uint64_t rebalance_poll_timer;
> +};
> +
>   /* Datapath based on the network device interface from netdev.h.
>    *
>    *
> @@ -368,6 +381,7 @@ struct dp_netdev {
>       uint64_t last_tnl_conf_seq;
>   
>       struct conntrack conntrack;
> +    struct pmd_auto_lb pmd_alb;
>   };
>   
>   static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -702,6 +716,11 @@ struct dp_netdev_pmd_thread {
>       /* Keep track of detailed PMD performance statistics. */
>       struct pmd_perf_stats perf_stats;
>   
> +    /* Stats from previous iteration used by automatic pmd
> +     * load balance logic. */
> +    uint64_t prev_stats[PMD_N_STATS];
> +    atomic_count pmd_overloaded;
> +
>       /* Set to true if the pmd thread needs to be reloaded. */
>       bool need_reload;
>   };
> @@ -3734,6 +3753,53 @@ dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
>       }
>   }
>   
> +/* Enable or Disable PMD auto load balancing. */
> +static void
> +set_pmd_auto_lb(struct dp_netdev *dp)
> +{
> +    unsigned int cnt = 0;
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> +
> +    bool enable_alb = false;
> +    bool multi_rxq = false;
> +    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> +
> +    /* Ensure that there is at least 2 non-isolated PMDs and
> +     * one of them is polling more than one rxq. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> +            continue;
> +        }
> +
> +        if (hmap_count(&pmd->poll_list) > 1) {
> +            multi_rxq = true;
> +        }
> +        if (cnt && multi_rxq) {
> +                enable_alb = true;
> +                break;
> +        }
> +        cnt++;
> +    }
> +
> +    /* Enable auto LB if it is requested and cycle based assignment is true. */
> +    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
> +                    pmd_alb->auto_lb_requested;
> +
> +    if (pmd_alb->is_enabled != enable_alb) {
> +        pmd_alb->is_enabled = enable_alb;
> +        if (pmd_alb->is_enabled) {
> +            VLOG_INFO("PMD auto load balance is enabled "
> +                      "(with rebalance interval:%"PRIu64" msec)",
> +                       pmd_alb->rebalance_intvl);
> +        } else {
> +            pmd_alb->rebalance_poll_timer = 0;
> +            VLOG_INFO("PMD auto load balance is disabled");
> +        }
> +    }
> +
> +}
> +
>   /* Applies datapath configuration from the database. Some of the changes are
>    * actually applied in dpif_netdev_run(). */
>   static int
> @@ -3748,6 +3814,7 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                           DEFAULT_EM_FLOW_INSERT_INV_PROB);
>       uint32_t insert_min, cur_min;
>       uint32_t tx_flush_interval, cur_tx_flush_interval;
> +    uint64_t rebalance_intvl;
>   
>       tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
>                                        DEFAULT_TX_FLUSH_INTERVAL);
> @@ -3819,6 +3886,23 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                     pmd_rxq_assign);
>           dp_netdev_request_reconfigure(dp);
>       }
> +
> +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> +    pmd_alb->auto_lb_requested = smap_get_bool(other_config, "pmd-auto-lb",
> +                              false);
> +
> +    rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebal-interval",
> +                              ALB_PMD_REBALANCE_POLL_INTERVAL);
> +
> +    /* Input is in min, convert it to msec. */
> +    rebalance_intvl =
> +        rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC : MIN_TO_MSEC;
> +
> +    if (pmd_alb->rebalance_intvl != rebalance_intvl) {
> +        pmd_alb->rebalance_intvl = rebalance_intvl;
> +    }
> +
> +    set_pmd_auto_lb(dp);
>       return 0;
>   }
>   
> @@ -4762,6 +4846,9 @@ reconfigure_datapath(struct dp_netdev *dp)
>   
>       /* Reload affected pmd threads. */
>       reload_affected_pmds(dp);
> +
> +    /* Check if PMD Auto LB is to be enabled */
> +    set_pmd_auto_lb(dp);
>   }
>   
>   /* Returns true if one of the netdevs in 'dp' requires a reconfiguration */
> @@ -4780,6 +4867,237 @@ ports_require_restart(const struct dp_netdev *dp)
>       return false;
>   }
>   
> +/* Function for calculating variance. */
> +static uint64_t
> +variance(uint64_t a[], int n)
The argument names above seem quite generic. Could you improve the comment
to explain the arguments? Even relating the terms to how they are used
(n is the number of PMDs and a[] is the array containing the usage of
each PMD) would help clarify.
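
Something along these lines (just a sketch of the kind of comment I have
in mind) would already help:

/* Returns the variance of the values in 'pmd_usage', which holds the
 * current usage (in percent) of each of the 'num_pmds' PMDs. */
static uint64_t
variance(uint64_t pmd_usage[], int num_pmds);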

> +{
> +    /* Compute mean (average of elements). */
> +    uint64_t sum = 0;
> +    uint64_t mean = 0;
> +    uint64_t sqDiff = 0;
> +
> +    if (!n) {
> +        return 0;
> +    }
> +
> +    for (int i = 0; i < n; i++) {
> +        sum += a[i];
> +    }
> +
> +    if (sum) {
> +        mean = sum / n;
> +
> +        /* Compute sum squared differences with mean. */
> +        for (int i = 0; i < n; i++) {
> +            sqDiff += (a[i] - mean)*(a[i] - mean);
> +        }
> +    }
> +    return (sqDiff ? (sqDiff / n) : 0);
> +}
> +
> +
> +/* Returns the variance in the PMDs usage as part of dry run of rxqs
> + * assignment to PMDs. */
> +static bool
> +get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
> +                     uint32_t num, uint64_t *predicted_variance)
Can you change 'num' to 'num_pmds' above so that the argument purpose is 
clearer?

> +    OVS_REQUIRES(dp->port_mutex)
> +{
> +    struct dp_netdev_port *port;
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct dp_netdev_rxq ** rxqs = NULL;
Please remove the whitespace above: '** rxqs' -> '**rxqs'.
> +    struct rr_numa *numa = NULL;
> +    struct rr_numa_list rr;
> +    int n_rxqs = 0;
> +    bool ret = false;
> +    uint64_t *pmd_usage;
> +
> +    if (!predicted_variance) {
> +        return ret;
> +    }

Above you are checking that predicted_variance is not NULL. However, it is
allocated on the stack as 'uint64_t new_variance;' in the preceding
calling function 'pmd_rebalance_dry_run()'. Is the worry that
predicted_variance may not have been allocated correctly, i.e. is NULL?

If so, would it not be better to error check when new_variance is first
allocated (and possibly initialize it to a meaningful value) in
pmd_rebalance_dry_run()?

> +
> +    pmd_usage = xcalloc(num, sizeof(uint64_t));
> +
> +    HMAP_FOR_EACH (port, node, &dp->ports) {
> +        if (!netdev_is_pmd(port->netdev)) {
> +            continue;
> +        }
> +
> +        for (int qid = 0; qid < port->n_rxq; qid++) {
> +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> +            uint64_t cycle_hist = 0;
> +
> +            if (q->pmd->isolated) {
> +                continue;
> +            }
> +
> +            if (n_rxqs == 0) {
> +                rxqs = xmalloc(sizeof *rxqs);
> +            } else {
> +                rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
> +            }
> +
> +            /* Sum the queue intervals and store the cycle history. */
> +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> +            }
> +            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> +                                         cycle_hist);
> +            /* Store the queue. */
> +            rxqs[n_rxqs++] = q;
> +        }
> +    }
> +    if (n_rxqs > 1) {
> +        /* Sort the queues in order of the processing cycles
> +         * they consumed during their last pmd interval. */
> +        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
> +    }
> +    rr_numa_list_populate(dp, &rr);
> +
> +    for (int i = 0; i < n_rxqs; i++) {
> +        int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
> +        numa = rr_numa_list_lookup(&rr, numa_id);
> +        if (!numa) {
> +            /* Abort if cross NUMA polling. */
> +            VLOG_DBG("PMD auto lb dry run."
> +                     " Aborting due to cross-numa polling.");
> +            goto cleanup;
> +        }
> +
> +        pmd = rr_numa_get_pmd(numa, true);
> +        VLOG_DBG("PMD auto lb dry run. Predicted: Core %d on numa node %d "
> +                  "to be assigned port \'%s\' rx queue %d "
> +                  "(measured processing cycles %"PRIu64").",
> +                  pmd->core_id, numa_id,
> +                  netdev_rxq_get_name(rxqs[i]->rx),
> +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> +                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
> +
> +        for (int id = 0; id < num; id++) {
> +            if (pmd->core_id == core_list[id]) {
> +                /* Add the processing cycles of rxq to pmd polling it. */
> +                pmd_usage[id] += dp_netdev_rxq_get_cycles(rxqs[i],
> +                                        RXQ_CYCLES_PROC_HIST);
> +            }
> +        }
> +    }
> +
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        uint64_t total_cycles = 0;
> +
> +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* Get the total pmd cycles for an interval. */
> +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> +        /* Estimate the cycles to cover all intervals. */
> +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> +        for (int id = 0; id < num; id++) {
> +            if (pmd->core_id == core_list[id]) {
> +                if (pmd_usage[id]) {
> +                    pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles;
> +                }
> +                VLOG_DBG("PMD auto lb dry run. Predicted: Core %d, "
> +                         "usage %"PRIu64"", pmd->core_id, pmd_usage[id]);
> +            }
> +        }
> +    }
> +    *predicted_variance = variance(pmd_usage, num);
> +    ret = true;
> +
> +cleanup:
> +    rr_numa_list_destroy(&rr);
> +    free(rxqs);
> +    free(pmd_usage);
> +    return ret;
> +}
> +
> +/* Does the dry run of Rxq assignment to PMDs and returns true if it gives
> + * better distribution of load on PMDs. */
> +static bool
> +pmd_rebalance_dry_run(struct dp_netdev *dp)
> +    OVS_REQUIRES(dp->port_mutex)
> +{
> +    struct dp_netdev_pmd_thread *pmd;
> +    uint64_t *curr_pmd_usage;
> +
> +    uint64_t curr_variance;
> +    uint64_t new_variance;
> +    uint64_t improvement = 0;
> +    uint32_t num_pmds;
> +    uint32_t *pmd_corelist;
> +    struct rxq_poll *poll, *poll_next;
> +    bool ret;
> +
> +    num_pmds = cmap_count(&dp->poll_threads);
> +
> +    if (num_pmds > 1) {
> +        curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> +        pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t));
> +    } else {
> +        return false;
> +    }
> +
> +    num_pmds = 0;
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        uint64_t total_cycles = 0;
> +        uint64_t total_proc = 0;
> +
> +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* Get the total pmd cycles for an interval. */
> +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> +        /* Estimate the cycles to cover all intervals. */
> +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> +
> +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
> +            uint64_t proc_cycles = 0;
> +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                proc_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
> +            }
> +            total_proc += proc_cycles;
> +        }
> +        if (total_proc) {
> +            curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles;
> +        }
> +
> +        VLOG_DBG("PMD auto lb dry run. Current: Core %d, usage %"PRIu64"",
> +                  pmd->core_id, curr_pmd_usage[num_pmds]);
> +
> +        if (atomic_count_get(&pmd->pmd_overloaded)) {
> +            atomic_count_set(&pmd->pmd_overloaded, 0);
> +        }
> +
> +        pmd_corelist[num_pmds] = pmd->core_id;
> +        num_pmds++;
> +    }
> +
> +    curr_variance = variance(curr_pmd_usage, num_pmds);
> +    ret = get_dry_run_variance(dp, pmd_corelist, num_pmds, &new_variance);
> +
> +    if (ret) {
> +        VLOG_DBG("PMD auto lb dry run. Current PMD variance: %"PRIu64","
> +                  " Predicted PMD variance: %"PRIu64"",
> +                  curr_variance, new_variance);
> +
> +        if (new_variance < curr_variance) {
> +            improvement =
> +                ((curr_variance - new_variance) * 100) / curr_variance;
> +        }
> +        if (improvement < ALB_ACCEPTABLE_IMPROVEMENT) {
> +            ret = false;
> +        }
> +    }
> +
> +    free(curr_pmd_usage);
> +    free(pmd_corelist);
> +    return ret;
> +}
> +
> +
>   /* Return true if needs to revalidate datapath flows. */
>   static bool
>   dpif_netdev_run(struct dpif *dpif)
> @@ -4789,6 +5107,9 @@ dpif_netdev_run(struct dpif *dpif)
>       struct dp_netdev_pmd_thread *non_pmd;
>       uint64_t new_tnl_seq;
>       bool need_to_flush = true;
> +    bool pmd_rebalance = false;
> +    long long int now = time_msec();
> +    struct dp_netdev_pmd_thread *pmd;
>   
>       ovs_mutex_lock(&dp->port_mutex);
>       non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> @@ -4821,6 +5142,32 @@ dpif_netdev_run(struct dpif *dpif)
>           dp_netdev_pmd_unref(non_pmd);
>       }
>   
> +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> +    if (pmd_alb->is_enabled) {
> +        if (!pmd_alb->rebalance_poll_timer) {
> +            pmd_alb->rebalance_poll_timer = now;
> +        } else if ((pmd_alb->rebalance_poll_timer +
> +                   pmd_alb->rebalance_intvl) < now) {
> +            pmd_alb->rebalance_poll_timer = now;
> +            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +                if (atomic_count_get(&pmd->pmd_overloaded) >=
> +                                    PMD_RXQ_INTERVAL_MAX) {
> +                    pmd_rebalance = true;
> +                    break;
> +                }
> +            }
> +
> +            if (pmd_rebalance &&
> +                !dp_netdev_is_reconf_required(dp) &&
> +                !ports_require_restart(dp) &&
> +                pmd_rebalance_dry_run(dp)) {
> +                VLOG_INFO("PMD auto lb dry run."
> +                          " requesting datapath reconfigure.");
> +                dp_netdev_request_reconfigure(dp);
> +            }
> +        }
> +    }
> +
>       if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
>           reconfigure_datapath(dp);
>       }
> @@ -4979,6 +5326,8 @@ pmd_thread_main(void *f_)
>   reload:
>       pmd_alloc_static_tx_qid(pmd);
>   
> +    atomic_count_init(&pmd->pmd_overloaded, 0);
> +
>       /* List port/core affinity */
>       for (i = 0; i < poll_cnt; i++) {
>          VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n",
> @@ -7188,9 +7537,39 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>                              struct polled_queue *poll_list, int poll_cnt)
>   {
>       struct dpcls *cls;
> +    uint64_t tot_idle = 0, tot_proc = 0;
> +    unsigned int pmd_load = 0;
>   
>       if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
>           uint64_t curr_tsc;
> +        struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
> +        if (pmd_alb->is_enabled && !pmd->isolated
> +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
> +                                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
> +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
> +                                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY]))
> +            {
> +            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
> +                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> +            tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> +                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> +
> +            if (tot_proc) {
> +                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> +            }
> +
> +            if (pmd_load >= ALB_PMD_LOAD_THRESHOLD) {
> +                atomic_count_inc(&pmd->pmd_overloaded);
> +            } else {
> +                atomic_count_set(&pmd->pmd_overloaded, 0);
> +            }
> +        }
> +
> +        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
> +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
> +        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
> +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> +
>           /* Get the cycles that were used to process each queue and store. */
>           for (unsigned i = 0; i < poll_cnt; i++) {
>               uint64_t rxq_cyc_curr = dp_netdev_rxq_get_cycles(poll_list[i].rxq,
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 2160910..72f5283 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -574,6 +574,47 @@
>               be set to 'skip_sw'.
>           </p>
>         </column>
> +      <column name="other_config" key="pmd-auto-lb"
> +              type='{"type": "boolean"}'>
> +        <p>
> +         Configures PMD Auto Load Balancing that allows automatic assignment of
> +         RX queues to PMDs if any of PMDs is overloaded (i.e. processing cycles
> +         > 95%).
> +        </p>
> +        <p>
> +         It uses current scheme of cycle based assignment of RX queues that
> +         are not statically pinned to PMDs.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option. It is
> +          currently disabled by default and an experimental feature.
> +        </p>
> +        <p>
> +         This only comes in effect if cycle based assignment is enabled and
> +         there are more than one non-isolated PMDs present and atleast one of
Typo above, 'atleast' -> 'at least'

Ian
> +         it polls more than one queue.
> +        </p>
> +      </column>
> +      <column name="other_config" key="pmd-auto-lb-rebal-interval"
> +              type='{"type": "integer",
> +                     "minInteger": 0, "maxInteger": 20000}'>
> +        <p>
> +         The minimum time (in minutes) 2 consecutive PMD Auto Load Balancing
> +         iterations.
> +        </p>
> +        <p>
> +         The defaul value is 1 min. If configured to 0 then it would be
> +         converted to default value i.e. 1 min
> +        </p>
> +        <p>
> +         This option can be configured to avoid frequent trigger of auto load
> +         balancing of PMDs. For e.g. set the value (in min) such that it occurs
> +         once in few hours or a day or a week.
> +        </p>
> +      </column>
>       </group>
>       <group title="Status">
>         <column name="next_cfg">
>
Nitin Katiyar Jan. 15, 2019, 9:59 a.m. UTC | #3
> -----Original Message-----
> From: Federico Iezzi [mailto:fiezzi@redhat.com]
> Sent: Monday, January 14, 2019 8:54 PM
> To: Nitin Katiyar <nitin.katiyar@ericsson.com>
> Cc: ovs-dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v5] Adding support for PMD auto load
> balancing
> 
> Maybe it's a bit late for this series, but would be possible in a future
> enhancement to have a user parameter to set a different value for
> ALB_PMD_LOAD_THRESHOLD?
> 
Hi,
Yes, it can be done as an enhancement in the future.

Regards,
Nitin
> Regards,
> Federico
> 
> FEDERICO IEZZI
> 
> SR. TELCO ARCHITECT
> 
> Red Hat EMEA
> 
> fiezzi@redhat.com    M: +31-6-5152-9709
> 
> TRIED. TESTED. TRUSTED.
> @RedHat   Red Hat   Red Hat
> 
> 
> On Mon, 14 Jan 2019 at 11:56, Nitin Katiyar <nitin.katiyar@ericsson.com>
> wrote:
> >
> > Port rx queues that have not been statically assigned to PMDs are
> > currently assigned based on periodically sampled load measurements.
> > The assignment is performed at specific instances – port addition,
> > port deletion, upon reassignment request via CLI etc.
> >
> > Due to change in traffic pattern over time it can cause uneven load
> > among the PMDs and thus resulting in lower overall throughout.
> >
> > This patch enables the support of auto load balancing of PMDs based on
> > measured load of RX queues. Each PMD measures the processing load for
> > each of its associated queues every 10 seconds. If the aggregated PMD
> > load reaches 95% for 6 consecutive intervals then PMD considers itself to
> be overloaded.
> >
> > If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> > performed by OVS main thread. The dry-run does NOT change the existing
> > queue to PMD assignments.
> >
> > If the resultant mapping of dry-run indicates an improved distribution
> > of the load then the actual reassignment will be performed.
> >
> > The automatic rebalancing will be disabled by default and has to be
> > enabled via configuration option. The interval (in minutes) between
> > two consecutive rebalancing can also be configured via CLI, default is
> > 1 min.
> >
> > Following example commands can be used to set the auto-lb params:
> > ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> > ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5"
> >
> > Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> > Co-authored-by: Venkatesan Pradeep
> <venkatesan.pradeep@ericsson.com>
> > Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> > Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> > Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>
> > ---
> >  Documentation/topics/dpdk/pmd.rst |  41 +++++
> >  NEWS                              |   1 +
> >  lib/dpif-netdev.c                 | 379
> ++++++++++++++++++++++++++++++++++++++
> >  vswitchd/vswitch.xml              |  41 +++++
> >  4 files changed, 462 insertions(+)
> >
> > diff --git a/Documentation/topics/dpdk/pmd.rst
> > b/Documentation/topics/dpdk/pmd.rst
> > index dd9172d..c273b40 100644
> > --- a/Documentation/topics/dpdk/pmd.rst
> > +++ b/Documentation/topics/dpdk/pmd.rst
> > @@ -183,3 +183,44 @@ or can be triggered by using::
> >     In addition, the output of ``pmd-rxq-show`` was modified to include
> >     Rx queue utilization of the PMD as a percentage. Prior to this, tracking of
> >     stats was not available.
> > +
> > +Automatic assignment of Port/Rx Queue to PMD Threads (experimental)
> > +-------------------------------------------------------------------
> > +
> > +Cycle or utilization based allocation of Rx queues to PMDs gives
> > +efficient load distribution but it is not adaptive to change in
> > +traffic pattern occuring over the time. This causes uneven load among
> > +the PMDs which results in overall lower throughput.
> > +
> > +To address this automatic load balancing of PMDs can be set by::
> > +
> > +    $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> > +
> > +If pmd-auto-lb is set to true AND cycle based assignment is enabled
> > +then auto load balancing of PMDs is enabled provided there are 2 or
> > +more non-isolated PMDs and at least one of these PMDs is polling more
> than one RX queue.
> > +
> > +Once auto load balancing is set, each non-isolated PMD measures the
> > +processing load for each of its associated queues every 10 seconds.
> > +If the aggregated PMD load reaches 95% for 6 consecutive intervals
> > +then PMD considers itself to be overloaded.
> > +
> > +If any PMD is overloaded, a dry-run of the PMD assignment algorithm
> > +is performed by OVS main thread. The dry-run does NOT change the
> > +existing queue to PMD assignments.
> > +
> > +If the resultant mapping of dry-run indicates an improved
> > +distribution of the load then the actual reassignment will be performed.
> > +
> > +The minimum time between 2 consecutive PMD auto load balancing
> > +iterations can also be configured by::
> > +
> > +    $ ovs-vsctl set open_vswitch .\
> > +        other_config:pmd-auto-lb-rebal-interval="<interval>"
> > +
> > +where ``<interval>`` is a value in minutes. The default interval is 1
> > +minute and setting it to 0 will also result in default value i.e. 1 min.
> > +
> > +A user can use this option to avoid frequent trigger of auto load
> > +balancing of PMDs. For e.g. set this (in min) such that it occurs
> > +once in few hours or a day or a week.
> > diff --git a/NEWS b/NEWS
> > index 2de844f..0e9fcb1 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -23,6 +23,7 @@ Post-v2.10.0
> >       * Add option for simple round-robin based Rxq to PMD assignment.
> >         It can be set with pmd-rxq-assign.
> >       * Add support for DPDK 18.11
> > +     * Add support for Auto load balancing of PMDs (experimental)
> >     - Add 'symmetric_l3' hash function.
> >     - OVS now honors 'updelay' and 'downdelay' for bonds with LACP
> configured.
> >     - ovs-vswitchd:
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > 1564db9..c1757ab 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -80,6 +80,12 @@
> >
> >  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
> >
> > +/* Auto Load Balancing Defaults */
> > +#define ALB_ACCEPTABLE_IMPROVEMENT       25
> > +#define ALB_PMD_LOAD_THRESHOLD           95
> > +#define ALB_PMD_REBALANCE_POLL_INTERVAL  1 /* 1 Min */
> > +#define MIN_TO_MSEC                  60000
> > +
> >  #define FLOW_DUMP_MAX_BATCH 50
> >  /* Use per thread recirc_depth to prevent recirculation loop. */
> > #define MAX_RECIRC_DEPTH 6 @@ -288,6 +294,13 @@ struct dp_meter {
> >      struct dp_meter_band bands[];
> >  };
> >
> > +struct pmd_auto_lb {
> > +    bool auto_lb_requested;     /* Auto load balancing requested by user. */
> > +    bool is_enabled;            /* Current status of Auto load balancing. */
> > +    uint64_t rebalance_intvl;
> > +    uint64_t rebalance_poll_timer;
> > +};
> > +
> >  /* Datapath based on the network device interface from netdev.h.
> >   *
> >   *
> > @@ -368,6 +381,7 @@ struct dp_netdev {
> >      uint64_t last_tnl_conf_seq;
> >
> >      struct conntrack conntrack;
> > +    struct pmd_auto_lb pmd_alb;
> >  };
> >
> >  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> > @@ -702,6 +716,11 @@ struct dp_netdev_pmd_thread {
> >      /* Keep track of detailed PMD performance statistics. */
> >      struct pmd_perf_stats perf_stats;
> >
> > +    /* Stats from previous iteration used by automatic pmd
> > +     * load balance logic. */
> > +    uint64_t prev_stats[PMD_N_STATS];
> > +    atomic_count pmd_overloaded;
> > +
> >      /* Set to true if the pmd thread needs to be reloaded. */
> >      bool need_reload;
> >  };
> > @@ -3734,6 +3753,53 @@ dpif_netdev_operate(struct dpif *dpif, struct
> dpif_op **ops, size_t n_ops,
> >      }
> >  }
> >
> > +/* Enable or Disable PMD auto load balancing. */ static void
> > +set_pmd_auto_lb(struct dp_netdev *dp) {
> > +    unsigned int cnt = 0;
> > +    struct dp_netdev_pmd_thread *pmd;
> > +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> > +
> > +    bool enable_alb = false;
> > +    bool multi_rxq = false;
> > +    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> > +
> > +    /* Ensure that there is at least 2 non-isolated PMDs and
> > +     * one of them is polling more than one rxq. */
> > +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> > +            continue;
> > +        }
> > +
> > +        if (hmap_count(&pmd->poll_list) > 1) {
> > +            multi_rxq = true;
> > +        }
> > +        if (cnt && multi_rxq) {
> > +                enable_alb = true;
> > +                break;
> > +        }
> > +        cnt++;
> > +    }
> > +
> > +    /* Enable auto LB if it is requested and cycle based assignment is true. */
> > +    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
> > +                    pmd_alb->auto_lb_requested;
> > +
> > +    if (pmd_alb->is_enabled != enable_alb) {
> > +        pmd_alb->is_enabled = enable_alb;
> > +        if (pmd_alb->is_enabled) {
> > +            VLOG_INFO("PMD auto load balance is enabled "
> > +                      "(with rebalance interval:%"PRIu64" msec)",
> > +                       pmd_alb->rebalance_intvl);
> > +        } else {
> > +            pmd_alb->rebalance_poll_timer = 0;
> > +            VLOG_INFO("PMD auto load balance is disabled");
> > +        }
> > +    }
> > +
> > +}
> > +
> >  /* Applies datapath configuration from the database. Some of the changes
> are
> >   * actually applied in dpif_netdev_run(). */  static int @@ -3748,6
> > +3814,7 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap
> *other_config)
> >                          DEFAULT_EM_FLOW_INSERT_INV_PROB);
> >      uint32_t insert_min, cur_min;
> >      uint32_t tx_flush_interval, cur_tx_flush_interval;
> > +    uint64_t rebalance_intvl;
> >
> >      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
> >                                       DEFAULT_TX_FLUSH_INTERVAL); @@
> > -3819,6 +3886,23 @@ dpif_netdev_set_config(struct dpif *dpif, const
> struct smap *other_config)
> >                    pmd_rxq_assign);
> >          dp_netdev_request_reconfigure(dp);
> >      }
> > +
> > +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> > +    pmd_alb->auto_lb_requested = smap_get_bool(other_config, "pmd-
> auto-lb",
> > +                              false);
> > +
> > +    rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebal-
> interval",
> > +                              ALB_PMD_REBALANCE_POLL_INTERVAL);
> > +
> > +    /* Input is in min, convert it to msec. */
> > +    rebalance_intvl =
> > +        rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC :
> > + MIN_TO_MSEC;
> > +
> > +    if (pmd_alb->rebalance_intvl != rebalance_intvl) {
> > +        pmd_alb->rebalance_intvl = rebalance_intvl;
> > +    }
> > +
> > +    set_pmd_auto_lb(dp);
> >      return 0;
> >  }
> >
> > @@ -4762,6 +4846,9 @@ reconfigure_datapath(struct dp_netdev *dp)
> >
> >      /* Reload affected pmd threads. */
> >      reload_affected_pmds(dp);
> > +
> > +    /* Check if PMD Auto LB is to be enabled */
> > +    set_pmd_auto_lb(dp);
> >  }
> >
> >  /* Returns true if one of the netdevs in 'dp' requires a
> > reconfiguration */ @@ -4780,6 +4867,237 @@
> ports_require_restart(const struct dp_netdev *dp)
> >      return false;
> >  }
> >
> > +/* Function for calculating variance. */ static uint64_t
> > +variance(uint64_t a[], int n) {
> > +    /* Compute mean (average of elements). */
> > +    uint64_t sum = 0;
> > +    uint64_t mean = 0;
> > +    uint64_t sqDiff = 0;
> > +
> > +    if (!n) {
> > +        return 0;
> > +    }
> > +
> > +    for (int i = 0; i < n; i++) {
> > +        sum += a[i];
> > +    }
> > +
> > +    if (sum) {
> > +        mean = sum / n;
> > +
> > +        /* Compute sum squared differences with mean. */
> > +        for (int i = 0; i < n; i++) {
> > +            sqDiff += (a[i] - mean)*(a[i] - mean);
> > +        }
> > +    }
> > +    return (sqDiff ? (sqDiff / n) : 0); }
> > +
> > +
> > +/* Returns the variance in the PMDs usage as part of dry run of rxqs
> > + * assignment to PMDs. */
> > +static bool
> > +get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
> > +                     uint32_t num, uint64_t *predicted_variance)
> > +    OVS_REQUIRES(dp->port_mutex)
> > +{
> > +    struct dp_netdev_port *port;
> > +    struct dp_netdev_pmd_thread *pmd;
> > +    struct dp_netdev_rxq ** rxqs = NULL;
> > +    struct rr_numa *numa = NULL;
> > +    struct rr_numa_list rr;
> > +    int n_rxqs = 0;
> > +    bool ret = false;
> > +    uint64_t *pmd_usage;
> > +
> > +    if (!predicted_variance) {
> > +        return ret;
> > +    }
> > +
> > +    pmd_usage = xcalloc(num, sizeof(uint64_t));
> > +
> > +    HMAP_FOR_EACH (port, node, &dp->ports) {
> > +        if (!netdev_is_pmd(port->netdev)) {
> > +            continue;
> > +        }
> > +
> > +        for (int qid = 0; qid < port->n_rxq; qid++) {
> > +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> > +            uint64_t cycle_hist = 0;
> > +
> > +            if (q->pmd->isolated) {
> > +                continue;
> > +            }
> > +
> > +            if (n_rxqs == 0) {
> > +                rxqs = xmalloc(sizeof *rxqs);
> > +            } else {
> > +                rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
> > +            }
> > +
> > +            /* Sum the queue intervals and store the cycle history. */
> > +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> > +                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> > +            }
> > +            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> > +                                         cycle_hist);
> > +            /* Store the queue. */
> > +            rxqs[n_rxqs++] = q;
> > +        }
> > +    }
> > +    if (n_rxqs > 1) {
> > +        /* Sort the queues in order of the processing cycles
> > +         * they consumed during their last pmd interval. */
> > +        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
> > +    }
> > +    rr_numa_list_populate(dp, &rr);
> > +
> > +    for (int i = 0; i < n_rxqs; i++) {
> > +        int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
> > +        numa = rr_numa_list_lookup(&rr, numa_id);
> > +        if (!numa) {
> > +            /* Abort if cross NUMA polling. */
> > +            VLOG_DBG("PMD auto lb dry run."
> > +                     " Aborting due to cross-numa polling.");
> > +            goto cleanup;
> > +        }
> > +
> > +        pmd = rr_numa_get_pmd(numa, true);
> > +        VLOG_DBG("PMD auto lb dry run. Predicted: Core %d on numa node
> %d "
> > +                  "to be assigned port \'%s\' rx queue %d "
> > +                  "(measured processing cycles %"PRIu64").",
> > +                  pmd->core_id, numa_id,
> > +                  netdev_rxq_get_name(rxqs[i]->rx),
> > +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> > +                  dp_netdev_rxq_get_cycles(rxqs[i],
> > + RXQ_CYCLES_PROC_HIST));
> > +
> > +        for (int id = 0; id < num; id++) {
> > +            if (pmd->core_id == core_list[id]) {
> > +                /* Add the processing cycles of rxq to pmd polling it. */
> > +                pmd_usage[id] += dp_netdev_rxq_get_cycles(rxqs[i],
> > +                                        RXQ_CYCLES_PROC_HIST);
> > +            }
> > +        }
> > +    }
> > +
> > +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +        uint64_t total_cycles = 0;
> > +
> > +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> > +            continue;
> > +        }
> > +
> > +        /* Get the total pmd cycles for an interval. */
> > +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> > +        /* Estimate the cycles to cover all intervals. */
> > +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> > +        for (int id = 0; id < num; id++) {
> > +            if (pmd->core_id == core_list[id]) {
> > +                if (pmd_usage[id]) {
> > +                    pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles;
> > +                }
> > +                VLOG_DBG("PMD auto lb dry run. Predicted: Core %d, "
> > +                         "usage %"PRIu64"", pmd->core_id, pmd_usage[id]);
> > +            }
> > +        }
> > +    }
> > +    *predicted_variance = variance(pmd_usage, num);
> > +    ret = true;
> > +
> > +cleanup:
> > +    rr_numa_list_destroy(&rr);
> > +    free(rxqs);
> > +    free(pmd_usage);
> > +    return ret;
> > +}
> > +
> > +/* Does the dry run of Rxq assignment to PMDs and returns true if it
> > +gives
> > + * better distribution of load on PMDs. */ static bool
> > +pmd_rebalance_dry_run(struct dp_netdev *dp)
> > +    OVS_REQUIRES(dp->port_mutex)
> > +{
> > +    struct dp_netdev_pmd_thread *pmd;
> > +    uint64_t *curr_pmd_usage;
> > +
> > +    uint64_t curr_variance;
> > +    uint64_t new_variance;
> > +    uint64_t improvement = 0;
> > +    uint32_t num_pmds;
> > +    uint32_t *pmd_corelist;
> > +    struct rxq_poll *poll, *poll_next;
> > +    bool ret;
> > +
> > +    num_pmds = cmap_count(&dp->poll_threads);
> > +
> > +    if (num_pmds > 1) {
> > +        curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> > +        pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t));
> > +    } else {
> > +        return false;
> > +    }
> > +
> > +    num_pmds = 0;
> > +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +        uint64_t total_cycles = 0;
> > +        uint64_t total_proc = 0;
> > +
> > +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> > +            continue;
> > +        }
> > +
> > +        /* Get the total pmd cycles for an interval. */
> > +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> > +        /* Estimate the cycles to cover all intervals. */
> > +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> > +
> > +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
> > +            uint64_t proc_cycles = 0;
> > +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> > +                proc_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
> > +            }
> > +            total_proc += proc_cycles;
> > +        }
> > +        if (total_proc) {
> > +            curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles;
> > +        }
> > +
> > +        VLOG_DBG("PMD auto lb dry run. Current: Core %d, usage
> %"PRIu64"",
> > +                  pmd->core_id, curr_pmd_usage[num_pmds]);
> > +
> > +        if (atomic_count_get(&pmd->pmd_overloaded)) {
> > +            atomic_count_set(&pmd->pmd_overloaded, 0);
> > +        }
> > +
> > +        pmd_corelist[num_pmds] = pmd->core_id;
> > +        num_pmds++;
> > +    }
> > +
> > +    curr_variance = variance(curr_pmd_usage, num_pmds);
> > +    ret = get_dry_run_variance(dp, pmd_corelist, num_pmds,
> > + &new_variance);
> > +
> > +    if (ret) {
> > +        VLOG_DBG("PMD auto lb dry run. Current PMD variance: %"PRIu64","
> > +                  " Predicted PMD variance: %"PRIu64"",
> > +                  curr_variance, new_variance);
> > +
> > +        if (new_variance < curr_variance) {
> > +            improvement =
> > +                ((curr_variance - new_variance) * 100) / curr_variance;
> > +        }
> > +        if (improvement < ALB_ACCEPTABLE_IMPROVEMENT) {
> > +            ret = false;
> > +        }
> > +    }
> > +
> > +    free(curr_pmd_usage);
> > +    free(pmd_corelist);
> > +    return ret;
> > +}
> > +
> > +
> >  /* Return true if needs to revalidate datapath flows. */  static bool
> > dpif_netdev_run(struct dpif *dpif) @@ -4789,6 +5107,9 @@
> > dpif_netdev_run(struct dpif *dpif)
> >      struct dp_netdev_pmd_thread *non_pmd;
> >      uint64_t new_tnl_seq;
> >      bool need_to_flush = true;
> > +    bool pmd_rebalance = false;
> > +    long long int now = time_msec();
> > +    struct dp_netdev_pmd_thread *pmd;
> >
> >      ovs_mutex_lock(&dp->port_mutex);
> >      non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID); @@ -4821,6
> > +5142,32 @@ dpif_netdev_run(struct dpif *dpif)
> >          dp_netdev_pmd_unref(non_pmd);
> >      }
> >
> > +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> > +    if (pmd_alb->is_enabled) {
> > +        if (!pmd_alb->rebalance_poll_timer) {
> > +            pmd_alb->rebalance_poll_timer = now;
> > +        } else if ((pmd_alb->rebalance_poll_timer +
> > +                   pmd_alb->rebalance_intvl) < now) {
> > +            pmd_alb->rebalance_poll_timer = now;
> > +            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +                if (atomic_count_get(&pmd->pmd_overloaded) >=
> > +                                    PMD_RXQ_INTERVAL_MAX) {
> > +                    pmd_rebalance = true;
> > +                    break;
> > +                }
> > +            }
> > +
> > +            if (pmd_rebalance &&
> > +                !dp_netdev_is_reconf_required(dp) &&
> > +                !ports_require_restart(dp) &&
> > +                pmd_rebalance_dry_run(dp)) {
> > +                VLOG_INFO("PMD auto lb dry run."
> > +                          " requesting datapath reconfigure.");
> > +                dp_netdev_request_reconfigure(dp);
> > +            }
> > +        }
> > +    }
> > +
> >      if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
> >          reconfigure_datapath(dp);
> >      }
> > @@ -4979,6 +5326,8 @@ pmd_thread_main(void *f_)
> >  reload:
> >      pmd_alloc_static_tx_qid(pmd);
> >
> > +    atomic_count_init(&pmd->pmd_overloaded, 0);
> > +
> >      /* List port/core affinity */
> >      for (i = 0; i < poll_cnt; i++) {
> >         VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n",
> > @@ -7188,9 +7537,39 @@ dp_netdev_pmd_try_optimize(struct
> dp_netdev_pmd_thread *pmd,
> >                             struct polled_queue *poll_list, int
> > poll_cnt)  {
> >      struct dpcls *cls;
> > +    uint64_t tot_idle = 0, tot_proc = 0;
> > +    unsigned int pmd_load = 0;
> >
> >      if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
> >          uint64_t curr_tsc;
> > +        struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
> > +        if (pmd_alb->is_enabled && !pmd->isolated
> > +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
> > +                                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
> > +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
> > +                                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY]))
> > +            {
> > +            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
> > +                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> > +            tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> > +                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> > +
> > +            if (tot_proc) {
> > +                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> > +            }
> > +
> > +            if (pmd_load >= ALB_PMD_LOAD_THRESHOLD) {
> > +                atomic_count_inc(&pmd->pmd_overloaded);
> > +            } else {
> > +                atomic_count_set(&pmd->pmd_overloaded, 0);
> > +            }
> > +        }
> > +
> > +        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
> > +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
> > +        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
> > +
> > + pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> > +
> >          /* Get the cycles that were used to process each queue and store. */
> >          for (unsigned i = 0; i < poll_cnt; i++) {
> >              uint64_t rxq_cyc_curr =
> > dp_netdev_rxq_get_cycles(poll_list[i].rxq,
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> > 2160910..72f5283 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -574,6 +574,47 @@
> >              be set to 'skip_sw'.
> >          </p>
> >        </column>
> > +      <column name="other_config" key="pmd-auto-lb"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +         Configures PMD Auto Load Balancing that allows automatic
> assignment of
> > +         RX queues to PMDs if any of PMDs is overloaded (i.e. processing
> cycles
> > +         > 95%).
> > +        </p>
> > +        <p>
> > +         It uses current scheme of cycle based assignment of RX queues that
> > +         are not statically pinned to PMDs.
> > +        </p>
> > +        <p>
> > +          The default value is <code>false</code>.
> > +        </p>
> > +        <p>
> > +          Set this value to <code>true</code> to enable this option. It is
> > +          currently disabled by default and an experimental feature.
> > +        </p>
> > +        <p>
> > +         This only comes in effect if cycle based assignment is enabled and
> > +         there are more than one non-isolated PMDs present and atleast one
> of
> > +         it polls more than one queue.
> > +        </p>
> > +      </column>
> > +      <column name="other_config" key="pmd-auto-lb-rebal-interval"
> > +              type='{"type": "integer",
> > +                     "minInteger": 0, "maxInteger": 20000}'>
> > +        <p>
> > +         The minimum time (in minutes) 2 consecutive PMD Auto Load
> Balancing
> > +         iterations.
> > +        </p>
> > +        <p>
> > +         The defaul value is 1 min. If configured to 0 then it would be
> > +         converted to default value i.e. 1 min
> > +        </p>
> > +        <p>
> > +         This option can be configured to avoid frequent trigger of auto load
> > +         balancing of PMDs. For e.g. set the value (in min) such that it occurs
> > +         once in few hours or a day or a week.
> > +        </p>
> > +      </column>
> >      </group>
> >      <group title="Status">
> >        <column name="next_cfg">
> > --
> > 1.9.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Nitin Katiyar Jan. 15, 2019, 10:44 a.m. UTC | #4
> -----Original Message-----
> From: Ian Stokes [mailto:ian.stokes@intel.com]
> Sent: Tuesday, January 15, 2019 5:06 AM
> To: Nitin Katiyar <nitin.katiyar@ericsson.com>; ovs-dev@openvswitch.org;
> Kevin Traynor <ktraynor@redhat.com>; Ilya Maximets
> <i.maximets@samsung.com>
> Subject: Re: [ovs-dev] [PATCH v5] Adding support for PMD auto load
> balancing
> 
> On 1/14/2019 10:44 AM, Nitin Katiyar wrote:
> > Port rx queues that have not been statically assigned to PMDs are
> > currently assigned based on periodically sampled load measurements.
> > The assignment is performed at specific instances – port addition,
> > port deletion, upon reassignment request via CLI etc.
> >
> > Due to change in traffic pattern over time it can cause uneven load
> > among the PMDs and thus resulting in lower overall throughout.
> >
> > This patch enables the support of auto load balancing of PMDs based on
> > measured load of RX queues. Each PMD measures the processing load for
> > each of its associated queues every 10 seconds. If the aggregated PMD
> > load reaches 95% for 6 consecutive intervals then PMD considers itself to
> be overloaded.
> >
> > If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> > performed by OVS main thread. The dry-run does NOT change the existing
> > queue to PMD assignments.
> >
> > If the resultant mapping of dry-run indicates an improved distribution
> > of the load then the actual reassignment will be performed.
> >
> > The automatic rebalancing will be disabled by default and has to be
> > enabled via configuration option. The interval (in minutes) between
> > two consecutive rebalancing can also be configured via CLI, default is
> > 1 min.
> >
> > Following example commands can be used to set the auto-lb params:
> > ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> > ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5"
> >
> Thanks for the patch Nitin. A few comments below.
> 
> On An aside, there was discussion if this could be part of OVS 2.11 from the
> community call last week. Although this is a v5 I believe it has been under
> review and testing from the folks at Red Hat however I don't see any acks to
> date.
> 
> What are peoples thoughts?
> 
> This change seems quite contained and doesn't interfere with default cases
> where rxq isolation, round robin or cycle based assignment is used.
> 
> In testing the previous balancing still work work fine and the new load
> balancing works well also although I have queries on default values which are
> specific to use cases discussed below.
> 
> Do people feel there is any reason to hold off merging if the issues below are
> addressed and there are no other concerns?
> 
> > Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> > Co-authored-by: Venkatesan Pradeep
> <venkatesan.pradeep@ericsson.com>
> > Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
> > Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> > Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>
> > ---
> >   Documentation/topics/dpdk/pmd.rst |  41 +++++
> >   NEWS                              |   1 +
> >   lib/dpif-netdev.c                 | 379
> ++++++++++++++++++++++++++++++++++++++
> >   vswitchd/vswitch.xml              |  41 +++++
> >   4 files changed, 462 insertions(+)
> >
> > diff --git a/Documentation/topics/dpdk/pmd.rst
> > b/Documentation/topics/dpdk/pmd.rst
> > index dd9172d..c273b40 100644
> > --- a/Documentation/topics/dpdk/pmd.rst
> > +++ b/Documentation/topics/dpdk/pmd.rst
> > @@ -183,3 +183,44 @@ or can be triggered by using::
> >      In addition, the output of ``pmd-rxq-show`` was modified to include
> >      Rx queue utilization of the PMD as a percentage. Prior to this, tracking of
> >      stats was not available.
> > +
> > +Automatic assignment of Port/Rx Queue to PMD Threads (experimental)
> > +-------------------------------------------------------------------
> > +
> > +Cycle or utilization based allocation of Rx queues to PMDs gives
> > +efficient load distribution but it is not adaptive to change in
> > +traffic pattern occuring
> Minor typo above, 'occuring' -> 'occurring'
I will update it.
> > +over the time. This causes uneven load among the PMDs which results
> > +in overall lower throughput.
> > +
> > +To address this automatic load balancing of PMDs can be set by::
> > +
> > +    $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> > +
> > +If pmd-auto-lb is set to true AND cycle based assignment is enabled
> > +then auto load balancing of PMDs is enabled provided there are 2 or
> > +more non-isolated PMDs and at least one of these PMDs is polling more
> than one RX queue.
> 
> It would be good to give examples of the behavior when enabling this.
> 
> I've spent some time playing with this behavior in particular as it wasn't clear
> how it would be triggered and disabled as the queues and PMDs are
> manipulated e.g. where the number of of non-isolated PMDs is reduced
> below 2, pmd-auto-lb is automatically disabled. Changing to round robin based
> assignment will also disable it etc.
> 
I will add some examples to the description.
> > +
> > +Once auto load balancing is set, each non-isolated PMD measures the
> processing
> > +load for each of its associated queues every 10 seconds. If the aggregated
> PMD
> > +load reaches 95% for 6 consecutive intervals then PMD considers itself to
> be
> > +overloaded.
> > +
> > +If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
> > +performed by OVS main thread. The dry-run does NOT change the existing
> queue
> > +to PMD assignments.
> > +
> > +If the resultant mapping of dry-run indicates an improved distribution of
> the
> > +load then the actual reassignment will be performed.
> > +
> > +The minimum time between 2 consecutive PMD auto load balancing
> iterations can
> > +also be configured by::
> > +
> > +    $ ovs-vsctl set open_vswitch .\
> > +        other_config:pmd-auto-lb-rebal-interval="<interval>"
> > +
> > +where ``<interval>`` is a value in minutes. The default interval is 1 minute
> > +and setting it to 0 will also result in default value i.e. 1 min.
> > +
> > +A user can use this option to avoid frequent trigger of auto load balancing
> of
> > +PMDs. For e.g. set this (in min) such that it occurs once in few hours or a
> day
> > +or a week.
> 
> Are there limitations to this work?
The only limitation (other than the one you mentioned below) I can think of is that there is no option to configure the load threshold, which can be added later.
> 
>  From inspecting the code, cross NUMA is not supported, could you
> provide detail here explainging when/why it would be the case? Are there
> intentions to to implement cross NUMA support?
> 
Yes, the idea was not to reassign queues across NUMA nodes, as the overall throughput may actually end up worse than predicted.

> I think it would be good to call out where this feature may not work
> well, for instance if traffic profiles for specific rx queues were
> changing dramatically within the 1 minute (via manipulation or due to
> randomness in the profile) would it be possible to thrash the cache of
> the CPU due to the changes required for EMC and DPCLS for new flows?
> 
> This is an extreme corner case and I would expect traffic profiles to be
> more uniform but it could be worth mentioning. One way to avoid this
> would be set the interval higher to mitigate this, which again gives the
> user an idea how how to deploy this and avoid such issues along with the
> advantages and disadvantages of interval lengths.
> 
Sure, I will add some more details to the documentation.
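To give a concrete feel for the numbers: since the interval is specified in
minutes, a value of 60 limits rebalancing to at most once per hour, 1440 to
at most once per day, and 10080 to at most once per week (the schema's upper
bound of 20000 minutes is roughly two weeks).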
> > diff --git a/NEWS b/NEWS
> > index 2de844f..0e9fcb1 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -23,6 +23,7 @@ Post-v2.10.0
> >        * Add option for simple round-robin based Rxq to PMD assignment.
> >          It can be set with pmd-rxq-assign.
> >        * Add support for DPDK 18.11
> > +     * Add support for Auto load balancing of PMDs (experimental)
> >      - Add 'symmetric_l3' hash function.
> >      - OVS now honors 'updelay' and 'downdelay' for bonds with LACP
> configured.
> >      - ovs-vswitchd:
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 1564db9..c1757ab 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -80,6 +80,12 @@
> >
> >   VLOG_DEFINE_THIS_MODULE(dpif_netdev);
> >
> > +/* Auto Load Balancing Defaults */
> > +#define ALB_ACCEPTABLE_IMPROVEMENT       25
> 
> Would it be an option in the future to be able to set the improvment
> level measered also? Perhaps that too specific for users at the moment
> but I'd be interested to hear peoples opinion as the feature matures.
> 
We initially had this and the load threshold as options in the RFC, but later dropped them to keep things simple and limit the number of options to begin with. Of course, they can be added later.
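For reference, the dry run only triggers a reassignment when the predicted
variance of the per-PMD load is at least ALB_ACCEPTABLE_IMPROVEMENT (25%)
lower than the current one. As a worked example: if the current variance is
400 and the dry run predicts 280, the improvement is
((400 - 280) * 100) / 400 = 30%, so the reassignment goes ahead; a predicted
variance of 320 would only be a 20% improvement and would be ignored.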
 
> > +#define ALB_PMD_LOAD_THRESHOLD           95
> > +#define ALB_PMD_REBALANCE_POLL_INTERVAL  1 /* 1 Min */
> > +#define MIN_TO_MSEC                  60000
> > +
> >   #define FLOW_DUMP_MAX_BATCH 50
> >   /* Use per thread recirc_depth to prevent recirculation loop. */
> >   #define MAX_RECIRC_DEPTH 6
> > @@ -288,6 +294,13 @@ struct dp_meter {
> >       struct dp_meter_band bands[];
> >   };
> >
> > +struct pmd_auto_lb {
> > +    bool auto_lb_requested;     /* Auto load balancing requested by user. */
> > +    bool is_enabled;            /* Current status of Auto load balancing. */
> > +    uint64_t rebalance_intvl;
> > +    uint64_t rebalance_poll_timer;
> > +};
> > +
> >   /* Datapath based on the network device interface from netdev.h.
> >    *
> >    *
> > @@ -368,6 +381,7 @@ struct dp_netdev {
> >       uint64_t last_tnl_conf_seq;
> >
> >       struct conntrack conntrack;
> > +    struct pmd_auto_lb pmd_alb;
> >   };
> >
> >   static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> > @@ -702,6 +716,11 @@ struct dp_netdev_pmd_thread {
> >       /* Keep track of detailed PMD performance statistics. */
> >       struct pmd_perf_stats perf_stats;
> >
> > +    /* Stats from previous iteration used by automatic pmd
> > +     * load balance logic. */
> > +    uint64_t prev_stats[PMD_N_STATS];
> > +    atomic_count pmd_overloaded;
> > +
> >       /* Set to true if the pmd thread needs to be reloaded. */
> >       bool need_reload;
> >   };
> > @@ -3734,6 +3753,53 @@ dpif_netdev_operate(struct dpif *dpif, struct
> dpif_op **ops, size_t n_ops,
> >       }
> >   }
> >
> > +/* Enable or Disable PMD auto load balancing. */
> > +static void
> > +set_pmd_auto_lb(struct dp_netdev *dp)
> > +{
> > +    unsigned int cnt = 0;
> > +    struct dp_netdev_pmd_thread *pmd;
> > +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> > +
> > +    bool enable_alb = false;
> > +    bool multi_rxq = false;
> > +    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> > +
> > +    /* Ensure that there is at least 2 non-isolated PMDs and
> > +     * one of them is polling more than one rxq. */
> > +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> > +            continue;
> > +        }
> > +
> > +        if (hmap_count(&pmd->poll_list) > 1) {
> > +            multi_rxq = true;
> > +        }
> > +        if (cnt && multi_rxq) {
> > +                enable_alb = true;
> > +                break;
> > +        }
> > +        cnt++;
> > +    }
> > +
> > +    /* Enable auto LB if it is requested and cycle based assignment is true. */
> > +    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
> > +                    pmd_alb->auto_lb_requested;
> > +
> > +    if (pmd_alb->is_enabled != enable_alb) {
> > +        pmd_alb->is_enabled = enable_alb;
> > +        if (pmd_alb->is_enabled) {
> > +            VLOG_INFO("PMD auto load balance is enabled "
> > +                      "(with rebalance interval:%"PRIu64" msec)",
> > +                       pmd_alb->rebalance_intvl);
> > +        } else {
> > +            pmd_alb->rebalance_poll_timer = 0;
> > +            VLOG_INFO("PMD auto load balance is disabled");
> > +        }
> > +    }
> > +
> > +}
> > +
> >   /* Applies datapath configuration from the database. Some of the changes
> are
> >    * actually applied in dpif_netdev_run(). */
> >   static int
> > @@ -3748,6 +3814,7 @@ dpif_netdev_set_config(struct dpif *dpif, const
> struct smap *other_config)
> >                           DEFAULT_EM_FLOW_INSERT_INV_PROB);
> >       uint32_t insert_min, cur_min;
> >       uint32_t tx_flush_interval, cur_tx_flush_interval;
> > +    uint64_t rebalance_intvl;
> >
> >       tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
> >                                        DEFAULT_TX_FLUSH_INTERVAL);
> > @@ -3819,6 +3886,23 @@ dpif_netdev_set_config(struct dpif *dpif, const
> struct smap *other_config)
> >                     pmd_rxq_assign);
> >           dp_netdev_request_reconfigure(dp);
> >       }
> > +
> > +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> > +    pmd_alb->auto_lb_requested = smap_get_bool(other_config, "pmd-
> auto-lb",
> > +                              false);
> > +
> > +    rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebal-
> interval",
> > +                              ALB_PMD_REBALANCE_POLL_INTERVAL);
> > +
> > +    /* Input is in min, convert it to msec. */
> > +    rebalance_intvl =
> > +        rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC : MIN_TO_MSEC;
> > +
> > +    if (pmd_alb->rebalance_intvl != rebalance_intvl) {
> > +        pmd_alb->rebalance_intvl = rebalance_intvl;
> > +    }
> > +
> > +    set_pmd_auto_lb(dp);
> >       return 0;
> >   }
> >
> > @@ -4762,6 +4846,9 @@ reconfigure_datapath(struct dp_netdev *dp)
> >
> >       /* Reload affected pmd threads. */
> >       reload_affected_pmds(dp);
> > +
> > +    /* Check if PMD Auto LB is to be enabled */
> > +    set_pmd_auto_lb(dp);
> >   }
> >
> >   /* Returns true if one of the netdevs in 'dp' requires a reconfiguration */
> > @@ -4780,6 +4867,237 @@ ports_require_restart(const struct dp_netdev
> *dp)
> >       return false;
> >   }
> >
> > +/* Function for calculating variance. */
> > +static uint64_t
> > +variance(uint64_t a[], int n)
> Argument names above seem quite generic, could you improve the comment
> to explain the arguments, even relating the terms to how they are used
> where n is the number of pmds and a[] is the data array for containing
> usage of each pmd would help clarify.
Sure.
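Something along these lines, for example (final wording up to review):

    /* Returns the variance of the values in 'a[]', where 'n' is the number
     * of non-isolated PMDs considered and each a[i] is the estimated
     * processing load of one of those PMDs, expressed as a percentage. */
    static uint64_t
    variance(uint64_t a[], int n);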
> 
> > +{
> > +    /* Compute mean (average of elements). */
> > +    uint64_t sum = 0;
> > +    uint64_t mean = 0;
> > +    uint64_t sqDiff = 0;
> > +
> > +    if (!n) {
> > +        return 0;
> > +    }
> > +
> > +    for (int i = 0; i < n; i++) {
> > +        sum += a[i];
> > +    }
> > +
> > +    if (sum) {
> > +        mean = sum / n;
> > +
> > +        /* Compute sum squared differences with mean. */
> > +        for (int i = 0; i < n; i++) {
> > +            sqDiff += (a[i] - mean)*(a[i] - mean);
> > +        }
> > +    }
> > +    return (sqDiff ? (sqDiff / n) : 0);
> > +}
> > +
> > +
> > +/* Returns the variance in the PMDs usage as part of dry run of rxqs
> > + * assignment to PMDs. */
> > +static bool
> > +get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
> > +                     uint32_t num, uint64_t *predicted_variance)
> Can you change 'num' to 'num_pmds' above so that the argument purpose is
> clearer?
okay
> 
> > +    OVS_REQUIRES(dp->port_mutex)
> > +{
> > +    struct dp_netdev_port *port;
> > +    struct dp_netdev_pmd_thread *pmd;
> > +    struct dp_netdev_rxq ** rxqs = NULL;
> Please remove whitespace above for, '** rxq' ->  '**rxq'
Sure
> > +    struct rr_numa *numa = NULL;
> > +    struct rr_numa_list rr;
> > +    int n_rxqs = 0;
> > +    bool ret = false;
> > +    uint64_t *pmd_usage;
> > +
> > +    if (!predicted_variance) {
> > +        return ret;
> > +    }
> 
> Above you are checking that predicted_variance is not NULL. However its
> is allocated from the stack as uint64_t new_variance; in the preceding
> calling function 'pmd_rebalance_dry_run()'. Is the worry that
> predicted_variance my not have been allocated correctly i.e. NULL.
> 
> If so, would it not be better to error check when new_variance is first
> allocated (and possibly initialize to a meaningful value) in
> pmd_rebalance_dry_run()?
> 
This is a safety check in case it is called from some other function in the future. But you are right for the current implementation.
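If it helps, the caller could also make the intent explicit, e.g. (sketch
only, in pmd_rebalance_dry_run()):

    /* Make it explicit that 'new_variance' is only meaningful when
     * get_dry_run_variance() returns true. */
    uint64_t new_variance = 0;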
> > +
> > +    pmd_usage = xcalloc(num, sizeof(uint64_t));
> > +
> > +    HMAP_FOR_EACH (port, node, &dp->ports) {
> > +        if (!netdev_is_pmd(port->netdev)) {
> > +            continue;
> > +        }
> > +
> > +        for (int qid = 0; qid < port->n_rxq; qid++) {
> > +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> > +            uint64_t cycle_hist = 0;
> > +
> > +            if (q->pmd->isolated) {
> > +                continue;
> > +            }
> > +
> > +            if (n_rxqs == 0) {
> > +                rxqs = xmalloc(sizeof *rxqs);
> > +            } else {
> > +                rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
> > +            }
> > +
> > +            /* Sum the queue intervals and store the cycle history. */
> > +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> > +                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> > +            }
> > +            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> > +                                         cycle_hist);
> > +            /* Store the queue. */
> > +            rxqs[n_rxqs++] = q;
> > +        }
> > +    }
> > +    if (n_rxqs > 1) {
> > +        /* Sort the queues in order of the processing cycles
> > +         * they consumed during their last pmd interval. */
> > +        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
> > +    }
> > +    rr_numa_list_populate(dp, &rr);
> > +
> > +    for (int i = 0; i < n_rxqs; i++) {
> > +        int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
> > +        numa = rr_numa_list_lookup(&rr, numa_id);
> > +        if (!numa) {
> > +            /* Abort if cross NUMA polling. */
> > +            VLOG_DBG("PMD auto lb dry run."
> > +                     " Aborting due to cross-numa polling.");
> > +            goto cleanup;
> > +        }
> > +
> > +        pmd = rr_numa_get_pmd(numa, true);
> > +        VLOG_DBG("PMD auto lb dry run. Predicted: Core %d on numa node
> %d "
> > +                  "to be assigned port \'%s\' rx queue %d "
> > +                  "(measured processing cycles %"PRIu64").",
> > +                  pmd->core_id, numa_id,
> > +                  netdev_rxq_get_name(rxqs[i]->rx),
> > +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> > +                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
> > +
> > +        for (int id = 0; id < num; id++) {
> > +            if (pmd->core_id == core_list[id]) {
> > +                /* Add the processing cycles of rxq to pmd polling it. */
> > +                pmd_usage[id] += dp_netdev_rxq_get_cycles(rxqs[i],
> > +                                        RXQ_CYCLES_PROC_HIST);
> > +            }
> > +        }
> > +    }
> > +
> > +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +        uint64_t total_cycles = 0;
> > +
> > +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> > +            continue;
> > +        }
> > +
> > +        /* Get the total pmd cycles for an interval. */
> > +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> > +        /* Estimate the cycles to cover all intervals. */
> > +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> > +        for (int id = 0; id < num; id++) {
> > +            if (pmd->core_id == core_list[id]) {
> > +                if (pmd_usage[id]) {
> > +                    pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles;
> > +                }
> > +                VLOG_DBG("PMD auto lb dry run. Predicted: Core %d, "
> > +                         "usage %"PRIu64"", pmd->core_id, pmd_usage[id]);
> > +            }
> > +        }
> > +    }
> > +    *predicted_variance = variance(pmd_usage, num);
> > +    ret = true;
> > +
> > +cleanup:
> > +    rr_numa_list_destroy(&rr);
> > +    free(rxqs);
> > +    free(pmd_usage);
> > +    return ret;
> > +}
> > +
> > +/* Does the dry run of Rxq assignment to PMDs and returns true if it gives
> > + * better distribution of load on PMDs. */
> > +static bool
> > +pmd_rebalance_dry_run(struct dp_netdev *dp)
> > +    OVS_REQUIRES(dp->port_mutex)
> > +{
> > +    struct dp_netdev_pmd_thread *pmd;
> > +    uint64_t *curr_pmd_usage;
> > +
> > +    uint64_t curr_variance;
> > +    uint64_t new_variance;
> > +    uint64_t improvement = 0;
> > +    uint32_t num_pmds;
> > +    uint32_t *pmd_corelist;
> > +    struct rxq_poll *poll, *poll_next;
> > +    bool ret;
> > +
> > +    num_pmds = cmap_count(&dp->poll_threads);
> > +
> > +    if (num_pmds > 1) {
> > +        curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> > +        pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t));
> > +    } else {
> > +        return false;
> > +    }
> > +
> > +    num_pmds = 0;
> > +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +        uint64_t total_cycles = 0;
> > +        uint64_t total_proc = 0;
> > +
> > +        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
> > +            continue;
> > +        }
> > +
> > +        /* Get the total pmd cycles for an interval. */
> > +        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
> > +        /* Estimate the cycles to cover all intervals. */
> > +        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> > +
> > +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
> > +            uint64_t proc_cycles = 0;
> > +            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> > +                proc_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
> > +            }
> > +            total_proc += proc_cycles;
> > +        }
> > +        if (total_proc) {
> > +            curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles;
> > +        }
> > +
> > +        VLOG_DBG("PMD auto lb dry run. Current: Core %d, usage
> %"PRIu64"",
> > +                  pmd->core_id, curr_pmd_usage[num_pmds]);
> > +
> > +        if (atomic_count_get(&pmd->pmd_overloaded)) {
> > +            atomic_count_set(&pmd->pmd_overloaded, 0);
> > +        }
> > +
> > +        pmd_corelist[num_pmds] = pmd->core_id;
> > +        num_pmds++;
> > +    }
> > +
> > +    curr_variance = variance(curr_pmd_usage, num_pmds);
> > +    ret = get_dry_run_variance(dp, pmd_corelist, num_pmds,
> &new_variance);
> > +
> > +    if (ret) {
> > +        VLOG_DBG("PMD auto lb dry run. Current PMD variance: %"PRIu64","
> > +                  " Predicted PMD variance: %"PRIu64"",
> > +                  curr_variance, new_variance);
> > +
> > +        if (new_variance < curr_variance) {
> > +            improvement =
> > +                ((curr_variance - new_variance) * 100) / curr_variance;
> > +        }
> > +        if (improvement < ALB_ACCEPTABLE_IMPROVEMENT) {
> > +            ret = false;
> > +        }
> > +    }
> > +
> > +    free(curr_pmd_usage);
> > +    free(pmd_corelist);
> > +    return ret;
> > +}
> > +
> > +
> >   /* Return true if needs to revalidate datapath flows. */
> >   static bool
> >   dpif_netdev_run(struct dpif *dpif)
> > @@ -4789,6 +5107,9 @@ dpif_netdev_run(struct dpif *dpif)
> >       struct dp_netdev_pmd_thread *non_pmd;
> >       uint64_t new_tnl_seq;
> >       bool need_to_flush = true;
> > +    bool pmd_rebalance = false;
> > +    long long int now = time_msec();
> > +    struct dp_netdev_pmd_thread *pmd;
> >
> >       ovs_mutex_lock(&dp->port_mutex);
> >       non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> > @@ -4821,6 +5142,32 @@ dpif_netdev_run(struct dpif *dpif)
> >           dp_netdev_pmd_unref(non_pmd);
> >       }
> >
> > +    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
> > +    if (pmd_alb->is_enabled) {
> > +        if (!pmd_alb->rebalance_poll_timer) {
> > +            pmd_alb->rebalance_poll_timer = now;
> > +        } else if ((pmd_alb->rebalance_poll_timer +
> > +                   pmd_alb->rebalance_intvl) < now) {
> > +            pmd_alb->rebalance_poll_timer = now;
> > +            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > +                if (atomic_count_get(&pmd->pmd_overloaded) >=
> > +                                    PMD_RXQ_INTERVAL_MAX) {
> > +                    pmd_rebalance = true;
> > +                    break;
> > +                }
> > +            }
> > +
> > +            if (pmd_rebalance &&
> > +                !dp_netdev_is_reconf_required(dp) &&
> > +                !ports_require_restart(dp) &&
> > +                pmd_rebalance_dry_run(dp)) {
> > +                VLOG_INFO("PMD auto lb dry run."
> > +                          " requesting datapath reconfigure.");
> > +                dp_netdev_request_reconfigure(dp);
> > +            }
> > +        }
> > +    }
> > +
> >       if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
> >           reconfigure_datapath(dp);
> >       }
> > @@ -4979,6 +5326,8 @@ pmd_thread_main(void *f_)
> >   reload:
> >       pmd_alloc_static_tx_qid(pmd);
> >
> > +    atomic_count_init(&pmd->pmd_overloaded, 0);
> > +
> >       /* List port/core affinity */
> >       for (i = 0; i < poll_cnt; i++) {
> >          VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n",
> > @@ -7188,9 +7537,39 @@ dp_netdev_pmd_try_optimize(struct
> dp_netdev_pmd_thread *pmd,
> >                              struct polled_queue *poll_list, int poll_cnt)
> >   {
> >       struct dpcls *cls;
> > +    uint64_t tot_idle = 0, tot_proc = 0;
> > +    unsigned int pmd_load = 0;
> >
> >       if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
> >           uint64_t curr_tsc;
> > +        struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
> > +        if (pmd_alb->is_enabled && !pmd->isolated
> > +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
> > +                                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
> > +            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
> > +                                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY]))
> > +            {
> > +            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
> > +                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> > +            tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> > +                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> > +
> > +            if (tot_proc) {
> > +                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> > +            }
> > +
> > +            if (pmd_load >= ALB_PMD_LOAD_THRESHOLD) {
> > +                atomic_count_inc(&pmd->pmd_overloaded);
> > +            } else {
> > +                atomic_count_set(&pmd->pmd_overloaded, 0);
> > +            }
> > +        }
> > +
> > +        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
> > +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
> > +        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
> > +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> > +
> >           /* Get the cycles that were used to process each queue and store. */
> >           for (unsigned i = 0; i < poll_cnt; i++) {
> >               uint64_t rxq_cyc_curr = dp_netdev_rxq_get_cycles(poll_list[i].rxq,
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 2160910..72f5283 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -574,6 +574,47 @@
> >               be set to 'skip_sw'.
> >           </p>
> >         </column>
> > +      <column name="other_config" key="pmd-auto-lb"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +         Configures PMD Auto Load Balancing that allows automatic
> assignment of
> > +         RX queues to PMDs if any of PMDs is overloaded (i.e. processing
> cycles
> > +         > 95%).
> > +        </p>
> > +        <p>
> > +         It uses current scheme of cycle based assignment of RX queues that
> > +         are not statically pinned to PMDs.
> > +        </p>
> > +        <p>
> > +          The default value is <code>false</code>.
> > +        </p>
> > +        <p>
> > +          Set this value to <code>true</code> to enable this option. It is
> > +          currently disabled by default and an experimental feature.
> > +        </p>
> > +        <p>
> > +         This only comes in effect if cycle based assignment is enabled and
> > +         there are more than one non-isolated PMDs present and atleast one
> of
> Typo above, 'atleast' -> 'at least'
I will correct it.

Thanks for reviewing it.
> 
> Ian
> > +         it polls more than one queue.
> > +        </p>
> > +      </column>
> > +      <column name="other_config" key="pmd-auto-lb-rebal-interval"
> > +              type='{"type": "integer",
> > +                     "minInteger": 0, "maxInteger": 20000}'>
> > +        <p>
> > +         The minimum time (in minutes) 2 consecutive PMD Auto Load
> Balancing
> > +         iterations.
> > +        </p>
> > +        <p>
> > +         The defaul value is 1 min. If configured to 0 then it would be
> > +         converted to default value i.e. 1 min
> > +        </p>
> > +        <p>
> > +         This option can be configured to avoid frequent trigger of auto load
> > +         balancing of PMDs. For e.g. set the value (in min) such that it occurs
> > +         once in few hours or a day or a week.
> > +        </p>
> > +      </column>
> >       </group>
> >       <group title="Status">
> >         <column name="next_cfg">
> >
Kevin Traynor Jan. 15, 2019, 11:04 a.m. UTC | #5
On 01/15/2019 10:44 AM, Nitin Katiyar wrote:
> 
> 
>> -----Original Message-----
>> From: Ian Stokes [mailto:ian.stokes@intel.com]
>> Sent: Tuesday, January 15, 2019 5:06 AM
>> To: Nitin Katiyar <nitin.katiyar@ericsson.com>; ovs-dev@openvswitch.org;
>> Kevin Traynor <ktraynor@redhat.com>; Ilya Maximets
>> <i.maximets@samsung.com>
>> Subject: Re: [ovs-dev] [PATCH v5] Adding support for PMD auto load
>> balancing
>>
>> On 1/14/2019 10:44 AM, Nitin Katiyar wrote:
>>> Port rx queues that have not been statically assigned to PMDs are
>>> currently assigned based on periodically sampled load measurements.
>>> The assignment is performed at specific instances – port addition,
>>> port deletion, upon reassignment request via CLI etc.
>>>
>>> Due to change in traffic pattern over time it can cause uneven load
>>> among the PMDs and thus resulting in lower overall throughout.
>>>
>>> This patch enables the support of auto load balancing of PMDs based on
>>> measured load of RX queues. Each PMD measures the processing load for
>>> each of its associated queues every 10 seconds. If the aggregated PMD
>>> load reaches 95% for 6 consecutive intervals then PMD considers itself to
>> be overloaded.
>>>
>>> If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
>>> performed by OVS main thread. The dry-run does NOT change the existing
>>> queue to PMD assignments.
>>>
>>> If the resultant mapping of dry-run indicates an improved distribution
>>> of the load then the actual reassignment will be performed.
>>>
>>> The automatic rebalancing will be disabled by default and has to be
>>> enabled via configuration option. The interval (in minutes) between
>>> two consecutive rebalancing can also be configured via CLI, default is
>>> 1 min.
>>>
>>> Following example commands can be used to set the auto-lb params:
>>> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
>>> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5"
>>>
>> Thanks for the patch Nitin. A few comments below.
>>
>> On An aside, there was discussion if this could be part of OVS 2.11 from the
>> community call last week. Although this is a v5 I believe it has been under
>> review and testing from the folks at Red Hat however I don't see any acks to
>> date.
>>

I've been reviewing and testing it since the RFC. At this stage it LGTM

Acked-by: Kevin Traynor <ktraynor@redhat.com>
Tested-by: Kevin Traynor <ktraynor@redhat.com>

>> What are peoples thoughts?
>>
>> This change seems quite contained and doesn't interfere with default cases
>> where rxq isolation, round robin or cycle based assignment is used.
>>
>> In testing the previous balancing still work work fine and the new load
>> balancing works well also although I have queries on default values which are
>> specific to use cases discussed below.
>>
>> Do people feel there is any reason to hold off merging if the issues below are
>> addressed and there are no other concerns?
>>
>>> Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
>>> Co-authored-by: Venkatesan Pradeep
>> <venkatesan.pradeep@ericsson.com>
>>> Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
>>> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
>>> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index dd9172d..c273b40 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -183,3 +183,44 @@  or can be triggered by using::
    In addition, the output of ``pmd-rxq-show`` was modified to include
    Rx queue utilization of the PMD as a percentage. Prior to this, tracking of
    stats was not available.
+
+Automatic assignment of Port/Rx Queue to PMD Threads (experimental)
+-------------------------------------------------------------------
+
+Cycle- or utilization-based allocation of Rx queues to PMDs gives efficient
+load distribution, but it is not adaptive to changes in traffic pattern
+occurring over time. This can cause uneven load among the PMDs, which results
+in lower overall throughput.
+
+To address this, automatic load balancing of PMDs can be enabled by::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
+
+If ``pmd-auto-lb`` is set to true and cycle-based assignment is enabled, auto
+load balancing of PMDs is enabled, provided there are two or more
+non-isolated PMDs and at least one of these PMDs is polling more than one Rx
+queue.
+
+Once auto load balancing is set, each non-isolated PMD measures the processing
+load for each of its associated queues every 10 seconds. If the aggregated PMD
+load reaches 95% for 6 consecutive intervals, then the PMD considers itself to
+be overloaded.
+
+If any PMD is overloaded, a dry run of the PMD assignment algorithm is
+performed by the OVS main thread. The dry run does NOT change the existing
+queue to PMD assignments.
+
+If the resulting mapping of the dry run indicates an improved distribution of
+the load, then the actual reassignment will be performed.
+
+The minimum time between two consecutive PMD auto load balancing iterations
+can also be configured by::
+
+    $ ovs-vsctl set open_vswitch . \
+        other_config:pmd-auto-lb-rebal-interval="<interval>"
+
+where ``<interval>`` is a value in minutes. The default interval is 1 minute,
+and setting it to 0 will also result in the default value, i.e. 1 minute.
+
+A user can use this option to avoid frequent triggering of auto load balancing
+of PMDs. For example, set this (in minutes) such that it occurs once in a few
+hours, a day, or a week.
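
For illustration, the dry-run decision described above boils down to a simple
percentage-improvement check on the variance of the per-PMD load. The
following standalone sketch (illustrative only, not part of the patch; the
function name and the sample variances are hypothetical) shows the comparison
using the patch's 25% acceptable-improvement threshold:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Mirrors ALB_ACCEPTABLE_IMPROVEMENT from the patch. */
    #define ACCEPTABLE_IMPROVEMENT 25

    /* Returns true if the dry-run result is good enough to trigger an
     * actual reassignment. */
    static bool
    rebalance_is_worthwhile(uint64_t curr_variance, uint64_t new_variance)
    {
        uint64_t improvement = 0;

        if (curr_variance && new_variance < curr_variance) {
            improvement = ((curr_variance - new_variance) * 100) / curr_variance;
        }
        return improvement >= ACCEPTABLE_IMPROVEMENT;
    }

    int
    main(void)
    {
        /* Hypothetical variances of PMD load: current 900, predicted 400.
         * Improvement is 55%, so a reassignment would be requested. */
        printf("%s\n", rebalance_is_worthwhile(900, 400) ? "rebalance" : "keep");
        return 0;
    }

Anything below the 25% threshold keeps the current assignments, which avoids
reassignment churn for marginal gains.
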
diff --git a/NEWS b/NEWS
index 2de844f..0e9fcb1 100644
--- a/NEWS
+++ b/NEWS
@@ -23,6 +23,7 @@  Post-v2.10.0
      * Add option for simple round-robin based Rxq to PMD assignment.
        It can be set with pmd-rxq-assign.
      * Add support for DPDK 18.11
+     * Add support for Auto load balancing of PMDs (experimental)
    - Add 'symmetric_l3' hash function.
    - OVS now honors 'updelay' and 'downdelay' for bonds with LACP configured.
    - ovs-vswitchd:
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 1564db9..c1757ab 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -80,6 +80,12 @@ 
 
 VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
+/* Auto Load Balancing Defaults */
+#define ALB_ACCEPTABLE_IMPROVEMENT       25
+#define ALB_PMD_LOAD_THRESHOLD           95
+#define ALB_PMD_REBALANCE_POLL_INTERVAL  1 /* 1 Min */
+#define MIN_TO_MSEC                  60000
+
 #define FLOW_DUMP_MAX_BATCH 50
 /* Use per thread recirc_depth to prevent recirculation loop. */
 #define MAX_RECIRC_DEPTH 6
@@ -288,6 +294,13 @@  struct dp_meter {
     struct dp_meter_band bands[];
 };
 
+struct pmd_auto_lb {
+    bool auto_lb_requested;     /* Auto load balancing requested by user. */
+    bool is_enabled;            /* Current status of Auto load balancing. */
+    uint64_t rebalance_intvl;       /* Minimum time (in msec) between two
+                                     * consecutive rebalance iterations. */
+    uint64_t rebalance_poll_timer;  /* Timestamp (msec) of the last
+                                     * rebalance check. */
+};
+
 /* Datapath based on the network device interface from netdev.h.
  *
  *
@@ -368,6 +381,7 @@  struct dp_netdev {
     uint64_t last_tnl_conf_seq;
 
     struct conntrack conntrack;
+    struct pmd_auto_lb pmd_alb;
 };
 
 static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
@@ -702,6 +716,11 @@  struct dp_netdev_pmd_thread {
     /* Keep track of detailed PMD performance statistics. */
     struct pmd_perf_stats perf_stats;
 
+    /* Stats from previous iteration used by automatic pmd
+     * load balance logic. */
+    uint64_t prev_stats[PMD_N_STATS];
+    atomic_count pmd_overloaded;
+
     /* Set to true if the pmd thread needs to be reloaded. */
     bool need_reload;
 };
@@ -3734,6 +3753,53 @@  dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
     }
 }
 
+/* Enable or Disable PMD auto load balancing. */
+static void
+set_pmd_auto_lb(struct dp_netdev *dp)
+{
+    unsigned int cnt = 0;
+    struct dp_netdev_pmd_thread *pmd;
+    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
+
+    bool enable_alb = false;
+    bool multi_rxq = false;
+    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
+
+    /* Ensure that there are at least 2 non-isolated PMDs and
+     * one of them is polling more than one rxq. */
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
+            continue;
+        }
+
+        if (hmap_count(&pmd->poll_list) > 1) {
+            multi_rxq = true;
+        }
+        if (cnt && multi_rxq) {
+            enable_alb = true;
+            break;
+        }
+        cnt++;
+    }
+
+    /* Enable auto LB if it is requested and cycle based assignment is true. */
+    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
+                    pmd_alb->auto_lb_requested;
+
+    if (pmd_alb->is_enabled != enable_alb) {
+        pmd_alb->is_enabled = enable_alb;
+        if (pmd_alb->is_enabled) {
+            VLOG_INFO("PMD auto load balance is enabled "
+                      "(with rebalance interval:%"PRIu64" msec)",
+                       pmd_alb->rebalance_intvl);
+        } else {
+            pmd_alb->rebalance_poll_timer = 0;
+            VLOG_INFO("PMD auto load balance is disabled");
+        }
+    }
+}
+
 /* Applies datapath configuration from the database. Some of the changes are
  * actually applied in dpif_netdev_run(). */
 static int
@@ -3748,6 +3814,7 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
                         DEFAULT_EM_FLOW_INSERT_INV_PROB);
     uint32_t insert_min, cur_min;
     uint32_t tx_flush_interval, cur_tx_flush_interval;
+    uint64_t rebalance_intvl;
 
     tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
                                      DEFAULT_TX_FLUSH_INTERVAL);
@@ -3819,6 +3886,23 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
                   pmd_rxq_assign);
         dp_netdev_request_reconfigure(dp);
     }
+
+    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
+    pmd_alb->auto_lb_requested = smap_get_bool(other_config, "pmd-auto-lb",
+                              false);
+
+    rebalance_intvl = smap_get_int(other_config, "pmd-auto-lb-rebal-interval",
+                              ALB_PMD_REBALANCE_POLL_INTERVAL);
+
+    /* Input is in min, convert it to msec. */
+    rebalance_intvl =
+        rebalance_intvl ? rebalance_intvl * MIN_TO_MSEC : MIN_TO_MSEC;
+
+    if (pmd_alb->rebalance_intvl != rebalance_intvl) {
+        pmd_alb->rebalance_intvl = rebalance_intvl;
+    }
+
+    set_pmd_auto_lb(dp);
     return 0;
 }
 
@@ -4762,6 +4846,9 @@  reconfigure_datapath(struct dp_netdev *dp)
 
     /* Reload affected pmd threads. */
     reload_affected_pmds(dp);
+
+    /* Check if PMD Auto LB is to be enabled. */
+    set_pmd_auto_lb(dp);
 }
 
 /* Returns true if one of the netdevs in 'dp' requires a reconfiguration */
@@ -4780,6 +4867,237 @@  ports_require_restart(const struct dp_netdev *dp)
     return false;
 }
 
+/* Function for calculating variance. */
+static uint64_t
+variance(uint64_t a[], int n)
+{
+    /* Compute mean (average of elements). */
+    uint64_t sum = 0;
+    uint64_t mean = 0;
+    uint64_t sqDiff = 0;
+
+    if (!n) {
+        return 0;
+    }
+
+    for (int i = 0; i < n; i++) {
+        sum += a[i];
+    }
+
+    if (sum) {
+        mean = sum / n;
+
+        /* Compute sum squared differences with mean. */
+        for (int i = 0; i < n; i++) {
+            sqDiff += (a[i] - mean) * (a[i] - mean);
+        }
+    }
+    return (sqDiff ? (sqDiff / n) : 0);
+}
+
+/* Returns the variance in the PMDs usage as part of dry run of rxqs
+ * assignment to PMDs. */
+static bool
+get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
+                     uint32_t num, uint64_t *predicted_variance)
+    OVS_REQUIRES(dp->port_mutex)
+{
+    struct dp_netdev_port *port;
+    struct dp_netdev_pmd_thread *pmd;
+    struct dp_netdev_rxq **rxqs = NULL;
+    struct rr_numa *numa = NULL;
+    struct rr_numa_list rr;
+    int n_rxqs = 0;
+    bool ret = false;
+    uint64_t *pmd_usage;
+
+    if (!predicted_variance) {
+        return ret;
+    }
+
+    pmd_usage = xcalloc(num, sizeof(uint64_t));
+
+    HMAP_FOR_EACH (port, node, &dp->ports) {
+        if (!netdev_is_pmd(port->netdev)) {
+            continue;
+        }
+
+        for (int qid = 0; qid < port->n_rxq; qid++) {
+            struct dp_netdev_rxq *q = &port->rxqs[qid];
+            uint64_t cycle_hist = 0;
+
+            if (q->pmd->isolated) {
+                continue;
+            }
+
+            if (n_rxqs == 0) {
+                rxqs = xmalloc(sizeof *rxqs);
+            } else {
+                rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
+            }
+
+            /* Sum the queue intervals and store the cycle history. */
+            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
+                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
+            }
+            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
+                                         cycle_hist);
+            /* Store the queue. */
+            rxqs[n_rxqs++] = q;
+        }
+    }
+    if (n_rxqs > 1) {
+        /* Sort the queues in order of the processing cycles
+         * they consumed during their last pmd interval. */
+        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
+    }
+    rr_numa_list_populate(dp, &rr);
+
+    for (int i = 0; i < n_rxqs; i++) {
+        int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
+        numa = rr_numa_list_lookup(&rr, numa_id);
+        if (!numa) {
+            /* Abort if cross NUMA polling. */
+            VLOG_DBG("PMD auto lb dry run."
+                     " Aborting due to cross-numa polling.");
+            goto cleanup;
+        }
+
+        pmd = rr_numa_get_pmd(numa, true);
+        VLOG_DBG("PMD auto lb dry run. Predicted: Core %d on numa node %d "
+                  "to be assigned port \'%s\' rx queue %d "
+                  "(measured processing cycles %"PRIu64").",
+                  pmd->core_id, numa_id,
+                  netdev_rxq_get_name(rxqs[i]->rx),
+                  netdev_rxq_get_queue_id(rxqs[i]->rx),
+                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
+
+        for (int id = 0; id < num; id++) {
+            if (pmd->core_id == core_list[id]) {
+                /* Add the processing cycles of rxq to pmd polling it. */
+                pmd_usage[id] += dp_netdev_rxq_get_cycles(rxqs[i],
+                                        RXQ_CYCLES_PROC_HIST);
+            }
+        }
+    }
+
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        uint64_t total_cycles = 0;
+
+        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
+            continue;
+        }
+
+        /* Get the total pmd cycles for an interval. */
+        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
+        /* Estimate the cycles to cover all intervals. */
+        total_cycles *= PMD_RXQ_INTERVAL_MAX;
+        for (int id = 0; id < num; id++) {
+            if (pmd->core_id == core_list[id]) {
+                if (pmd_usage[id]) {
+                    pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles;
+                }
+                VLOG_DBG("PMD auto lb dry run. Predicted: Core %d, "
+                         "usage %"PRIu64"", pmd->core_id, pmd_usage[id]);
+            }
+        }
+    }
+    *predicted_variance = variance(pmd_usage, num);
+    ret = true;
+
+cleanup:
+    rr_numa_list_destroy(&rr);
+    free(rxqs);
+    free(pmd_usage);
+    return ret;
+}
+
+/* Does the dry run of Rxq assignment to PMDs and returns true if it gives
+ * better distribution of load on PMDs. */
+static bool
+pmd_rebalance_dry_run(struct dp_netdev *dp)
+    OVS_REQUIRES(dp->port_mutex)
+{
+    struct dp_netdev_pmd_thread *pmd;
+    uint64_t *curr_pmd_usage;
+
+    uint64_t curr_variance;
+    uint64_t new_variance;
+    uint64_t improvement = 0;
+    uint32_t num_pmds;
+    uint32_t *pmd_corelist;
+    struct rxq_poll *poll, *poll_next;
+    bool ret;
+
+    num_pmds = cmap_count(&dp->poll_threads);
+
+    if (num_pmds > 1) {
+        curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
+        pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t));
+    } else {
+        return false;
+    }
+
+    num_pmds = 0;
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        uint64_t total_cycles = 0;
+        uint64_t total_proc = 0;
+
+        if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) {
+            continue;
+        }
+
+        /* Get the total pmd cycles for an interval. */
+        atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
+        /* Estimate the cycles to cover all intervals. */
+        total_cycles *= PMD_RXQ_INTERVAL_MAX;
+
+        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
+            uint64_t proc_cycles = 0;
+            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
+                proc_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
+            }
+            total_proc += proc_cycles;
+        }
+        if (total_proc) {
+            curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles;
+        }
+
+        VLOG_DBG("PMD auto lb dry run. Current: Core %d, usage %"PRIu64"",
+                  pmd->core_id, curr_pmd_usage[num_pmds]);
+
+        if (atomic_count_get(&pmd->pmd_overloaded)) {
+            atomic_count_set(&pmd->pmd_overloaded, 0);
+        }
+
+        pmd_corelist[num_pmds] = pmd->core_id;
+        num_pmds++;
+    }
+
+    curr_variance = variance(curr_pmd_usage, num_pmds);
+    ret = get_dry_run_variance(dp, pmd_corelist, num_pmds, &new_variance);
+
+    if (ret) {
+        VLOG_DBG("PMD auto lb dry run. Current PMD variance: %"PRIu64","
+                  " Predicted PMD variance: %"PRIu64"",
+                  curr_variance, new_variance);
+
+        if (new_variance < curr_variance) {
+            improvement =
+                ((curr_variance - new_variance) * 100) / curr_variance;
+        }
+        if (improvement < ALB_ACCEPTABLE_IMPROVEMENT) {
+            ret = false;
+        }
+    }
+
+    free(curr_pmd_usage);
+    free(pmd_corelist);
+    return ret;
+}
+
 /* Return true if needs to revalidate datapath flows. */
 static bool
 dpif_netdev_run(struct dpif *dpif)
@@ -4789,6 +5107,9 @@  dpif_netdev_run(struct dpif *dpif)
     struct dp_netdev_pmd_thread *non_pmd;
     uint64_t new_tnl_seq;
     bool need_to_flush = true;
+    bool pmd_rebalance = false;
+    long long int now = time_msec();
+    struct dp_netdev_pmd_thread *pmd;
 
     ovs_mutex_lock(&dp->port_mutex);
     non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
@@ -4821,6 +5142,32 @@  dpif_netdev_run(struct dpif *dpif)
         dp_netdev_pmd_unref(non_pmd);
     }
 
+    struct pmd_auto_lb *pmd_alb = &dp->pmd_alb;
+    if (pmd_alb->is_enabled) {
+        if (!pmd_alb->rebalance_poll_timer) {
+            pmd_alb->rebalance_poll_timer = now;
+        } else if ((pmd_alb->rebalance_poll_timer +
+                   pmd_alb->rebalance_intvl) < now) {
+            pmd_alb->rebalance_poll_timer = now;
+            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+                if (atomic_count_get(&pmd->pmd_overloaded) >=
+                                    PMD_RXQ_INTERVAL_MAX) {
+                    pmd_rebalance = true;
+                    break;
+                }
+            }
+
+            if (pmd_rebalance &&
+                !dp_netdev_is_reconf_required(dp) &&
+                !ports_require_restart(dp) &&
+                pmd_rebalance_dry_run(dp)) {
+                VLOG_INFO("PMD auto lb dry run."
+                          " requesting datapath reconfigure.");
+                dp_netdev_request_reconfigure(dp);
+            }
+        }
+    }
+
     if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
         reconfigure_datapath(dp);
     }
@@ -4979,6 +5326,8 @@  pmd_thread_main(void *f_)
 reload:
     pmd_alloc_static_tx_qid(pmd);
 
+    atomic_count_init(&pmd->pmd_overloaded, 0);
+
     /* List port/core affinity */
     for (i = 0; i < poll_cnt; i++) {
        VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n",
@@ -7188,9 +7537,39 @@  dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
                            struct polled_queue *poll_list, int poll_cnt)
 {
     struct dpcls *cls;
+    uint64_t tot_idle = 0, tot_proc = 0;
+    unsigned int pmd_load = 0;
 
     if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
         uint64_t curr_tsc;
+        struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
+        if (pmd_alb->is_enabled && !pmd->isolated
+            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
+                pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
+            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
+                pmd->prev_stats[PMD_CYCLES_ITER_BUSY])) {
+            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
+                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
+            tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
+                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
+
+            if (tot_proc) {
+                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
+            }
+
+            if (pmd_load >= ALB_PMD_LOAD_THRESHOLD) {
+                atomic_count_inc(&pmd->pmd_overloaded);
+            } else {
+                atomic_count_set(&pmd->pmd_overloaded, 0);
+            }
+        }
+
+        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
+                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
+        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
+                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
+
         /* Get the cycles that were used to process each queue and store. */
         for (unsigned i = 0; i < poll_cnt; i++) {
             uint64_t rxq_cyc_curr = dp_netdev_rxq_get_cycles(poll_list[i].rxq,
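
The overload detection added to dp_netdev_pmd_try_optimize() above can be
summarised by the following standalone sketch (illustrative only, not the
patch's code; the struct, the function names, and the sample counter values
are simplified stand-ins for the real PMD perf counters):

    #include <stdbool.h>
    #include <stdint.h>

    #define PMD_LOAD_THRESHOLD 95   /* Mirrors ALB_PMD_LOAD_THRESHOLD. */
    #define OVERLOAD_INTERVALS  6   /* Mirrors PMD_RXQ_INTERVAL_MAX. */

    /* Simplified per-PMD state: cycle counters sampled once per interval. */
    struct pmd_sample {
        uint64_t idle_cycles;   /* Cycles spent in idle iterations. */
        uint64_t busy_cycles;   /* Cycles spent processing packets. */
    };

    /* Consumes one interval's worth of counters and returns true once the
     * PMD has been at least 95% busy for 6 consecutive intervals. */
    static bool
    pmd_is_overloaded(const struct pmd_sample *prev,
                      const struct pmd_sample *cur, unsigned int *consecutive)
    {
        uint64_t idle = cur->idle_cycles - prev->idle_cycles;
        uint64_t busy = cur->busy_cycles - prev->busy_cycles;
        unsigned int load = 0;

        if (busy) {
            load = (busy * 100) / (idle + busy);
        }
        if (load >= PMD_LOAD_THRESHOLD) {
            (*consecutive)++;
        } else {
            *consecutive = 0;
        }
        return *consecutive >= OVERLOAD_INTERVALS;
    }

    int
    main(void)
    {
        struct pmd_sample prev = { 0, 0 };
        struct pmd_sample cur = { 4, 96 };   /* 96% busy in each interval. */
        unsigned int consecutive = 0;
        bool overloaded = false;

        for (int i = 0; i < OVERLOAD_INTERVALS; i++) {
            overloaded = pmd_is_overloaded(&prev, &cur, &consecutive);
            prev = cur;
            cur.idle_cycles += 4;
            cur.busy_cycles += 96;
        }
        /* After 6 consecutive >= 95% intervals the PMD reports overload. */
        return overloaded ? 0 : 1;
    }

A single quiet interval resets the counter, so short bursts of load do not by
themselves trigger a rebalance.
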
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 2160910..72f5283 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -574,6 +574,47 @@ 
             be set to 'skip_sw'.
         </p>
       </column>
+      <column name="other_config" key="pmd-auto-lb"
+              type='{"type": "boolean"}'>
+        <p>
+         Configures PMD Auto Load Balancing, which allows automatic
+         assignment of RX queues to PMDs if any of the PMDs is overloaded
+         (i.e. processing cycles > 95%).
+        </p>
+        <p>
+         It uses the current scheme of cycle-based assignment of RX queues
+         that are not statically pinned to PMDs.
+        </p>
+        <p>
+          The default value is <code>false</code>.
+        </p>
+        <p>
+          Set this value to <code>true</code> to enable this option. It is
+          currently disabled by default and is an experimental feature.
+        </p>
+        <p>
+         This only comes into effect if cycle-based assignment is enabled,
+         there is more than one non-isolated PMD present, and at least one of
+         them polls more than one queue.
+        </p>
+      </column>
+      <column name="other_config" key="pmd-auto-lb-rebal-interval"
+              type='{"type": "integer",
+                     "minInteger": 0, "maxInteger": 20000}'>
+        <p>
+         The minimum time (in minutes) between two consecutive PMD Auto Load
+         Balancing iterations.
+        </p>
+        <p>
+         The default value is 1 minute. If configured to 0, it is converted
+         to the default value, i.e. 1 minute.
+        </p>
+        <p>
+         This option can be configured to avoid frequent triggering of auto
+         load balancing of PMDs. For example, set the value (in minutes) such
+         that it occurs once in a few hours, a day, or a week.
+        </p>
+      </column>
     </group>
     <group title="Status">
       <column name="next_cfg">
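
As a closing note, the interval handling behind pmd-auto-lb-rebal-interval
documented above reduces to a minute-to-millisecond conversion plus a simple
timer gate in the main loop. A minimal sketch (illustrative only; the helper
names are hypothetical) follows:

    #include <stdbool.h>
    #include <stdint.h>

    #define MIN_TO_MSEC 60000ULL

    /* Converts the configured value (minutes) to msec; 0 falls back to the
     * 1 minute default, as described in the documentation above. */
    static uint64_t
    rebalance_interval_msec(uint64_t configured_min)
    {
        return configured_min ? configured_min * MIN_TO_MSEC : MIN_TO_MSEC;
    }

    /* Returns true when at least one interval has passed since the last
     * check, arming the timer on the first call. */
    static bool
    rebalance_due(uint64_t *poll_timer_ms, uint64_t interval_ms, uint64_t now_ms)
    {
        if (!*poll_timer_ms) {
            *poll_timer_ms = now_ms;
            return false;
        }
        if (*poll_timer_ms + interval_ms < now_ms) {
            *poll_timer_ms = now_ms;
            return true;
        }
        return false;
    }

    int
    main(void)
    {
        uint64_t timer = 0;
        uint64_t intvl = rebalance_interval_msec(5);    /* 5 minutes. */
        bool due;

        (void) rebalance_due(&timer, intvl, 1000);      /* Arms the timer. */
        due = rebalance_due(&timer, intvl, 1000 + intvl + 1);
        return due ? 0 : 1;                             /* due == true here. */
    }

Even when the timer fires, the patch only requests a datapath reconfigure if
some PMD has been overloaded and the dry run predicts a sufficient
improvement.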