
[ovs-dev,v10,2/3] dpif-netdev: Detailed performance stats for PMDs

Message ID 1521395709-4020-3-git-send-email-jan.scheurich@ericsson.com
State Changes Requested
Delegated to: Ian Stokes
Series dpif-netdev: Detailed PMD performance metrics and supervision

Commit Message

Jan Scheurich March 18, 2018, 5:55 p.m. UTC
This patch instruments the dpif-netdev datapath to record detailed
statistics of what is happening in every iteration of a PMD thread.

The collection of detailed statistics can be controlled by a new
Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
By default it is disabled. The run-time overhead, when enabled, is
on the order of 1%.

The covered metrics per iteration are:
  - cycles
  - packets
  - (rx) batches
  - packets/batch
  - max. vhostuser qlen
  - upcalls
  - cycles spent in upcalls

This raw recorded data is used threefold:

1. In histograms for each of the following metrics:
   - cycles/iteration (log.)
   - packets/iteration (log.)
   - cycles/packet
   - packets/batch
   - max. vhostuser qlen (log.)
   - upcalls
   - cycles/upcall (log)
   The histogram bins are divided linearly or logarithmically.

2. A cyclic history of the above statistics for 999 iterations

3. A cyclic history of the cumulative/average values per millisecond
   wall clock for the last 1000 milliseconds:
   - number of iterations
   - avg. cycles/iteration
   - packets (Kpps)
   - avg. packets/batch
   - avg. max vhost qlen
   - upcalls
   - avg. cycles/upcall

The gathered performance metrics can be printed at any time with the
new CLI command

ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
    [-pmd core] [dp]

The options are:

-nh:            Suppress the histograms
-it iter_len:   Display the last iter_len iteration stats
-ms ms_len:     Display the last ms_len millisecond stats
-pmd core:      Display only the specified PMD

The performance statistics are reset with the existing
dpif-netdev/pmd-stats-clear command.

The output always contains the following global PMD statistics,
similar to the pmd-stats-show command:

Time: 15:24:55.270
Measurement duration: 1.008 s

pmd thread numa_id 0 core_id 1:

  Cycles:            2419034712  (2.40 GHz)
  Iterations:            572817  (1.76 us/it)
  - idle:                486808  (15.9 % cycles)
  - busy:                 86009  (84.1 % cycles)
  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
  Datapath passes:      3599415  (1.50 passes/pkt)
  - EMC hits:            336472  ( 9.3 %)
  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
  - Lost upcalls:             0  ( 0.0 %)
  Tx packets:           2399607  (2381 Kpps)
  Tx batches:            171400  (14.00 pkts/batch)

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
---
 NEWS                        |   3 +
 lib/automake.mk             |   1 +
 lib/dpif-netdev-perf.c      | 350 +++++++++++++++++++++++++++++++++++++++++++-
 lib/dpif-netdev-perf.h      | 258 ++++++++++++++++++++++++++++++--
 lib/dpif-netdev-unixctl.man | 157 ++++++++++++++++++++
 lib/dpif-netdev.c           | 183 +++++++++++++++++++++--
 manpages.mk                 |   2 +
 vswitchd/ovs-vswitchd.8.in  |  27 +---
 vswitchd/vswitch.xml        |  12 ++
 9 files changed, 940 insertions(+), 53 deletions(-)
 create mode 100644 lib/dpif-netdev-unixctl.man

Comments

Billy O'Mahony March 18, 2018, 10:26 p.m. UTC | #1
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>

> -----Original Message-----
> From: Jan Scheurich [mailto:jan.scheurich@ericsson.com]
> Sent: Sunday, March 18, 2018 5:55 PM
> To: dev@openvswitch.org
> Cc: ktraynor@redhat.com; Stokes, Ian <ian.stokes@intel.com>;
> i.maximets@samsung.com; O Mahony, Billy <billy.o.mahony@intel.com>; Jan
> Scheurich <jan.scheurich@ericsson.com>
> Subject: [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> [...]
> 
> diff --git a/NEWS b/NEWS
> index 8d0b502..8f66fd3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -73,6 +73,9 @@ v2.9.0 - 19 Feb 2018
>       * Add support for vHost dequeue zero copy (experimental)
>     - Userspace datapath:
>       * Output packet batching support.
> +     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
> +     * Detailed PMD performance metrics available with new command
> +         ovs-appctl dpif-netdev/pmd-perf-show
>     - vswitchd:
>       * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>       * Configuring a controller, or unconfiguring all controllers, now deletes
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 5c26e0f..7a5632d 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -484,6 +484,7 @@ MAN_FRAGMENTS += \
>  	lib/dpctl.man \
>  	lib/memory-unixctl.man \
>  	lib/netdev-dpdk-unixctl.man \
> +	lib/dpif-netdev-unixctl.man \
>  	lib/ofp-version.man \
>  	lib/ovs.tmac \
>  	lib/service.man \
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index f06991a..2b36410 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -15,18 +15,324 @@
>   */
> 
>  #include <config.h>
> +#include <stdint.h>
> 
> +#include "dpif-netdev-perf.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "openvswitch/vlog.h"
> -#include "dpif-netdev-perf.h"
> +#include "ovs-thread.h"
>  #include "timeval.h"
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#ifdef DPDK_NETDEV
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +    return rte_get_tsc_hz();
> +}
> +#else
> +/* This function is only invoked from PMD threads which depend on DPDK.
> + * A dummy function is sufficient when building without DPDK_NETDEV. */
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +    return 1;
> +}
> +#endif
> +
> +/* Histogram functions. */
> +
> +static void
> +histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
> +{
> +    int i;
> +
> +    ovs_assert(min < max);
> +    for (i = 0; i < NUM_BINS-1; i++) {
> +        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
> +    }
> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> +}
> +
> +static void
> +histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
> +{
> +    int i, start, bins, wall;
> +    double log_min, log_max;
> +
> +    ovs_assert(min < max);
> +    if (min > 0) {
> +        log_min = log(min);
> +        log_max = log(max);
> +        start = 0;
> +        bins = NUM_BINS - 1;
> +    } else {
> +        hist->wall[0] = 0;
> +        log_min = log(1);
> +        log_max = log(max);
> +        start = 1;
> +        bins = NUM_BINS - 2;
> +    }
> +    wall = start;
> +    for (i = 0; i < bins; i++) {
> +        /* Make sure each wall is monotonically increasing. */
> +        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
> +        hist->wall[start + i] = wall++;
> +    }
> +    if (hist->wall[NUM_BINS-2] < max) {
> +        hist->wall[NUM_BINS-2] = max;
> +    }
> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> +}
> +
> +uint64_t
> +histogram_samples(const struct histogram *hist)
> +{
> +    uint64_t samples = 0;
> +
> +    for (int i = 0; i < NUM_BINS; i++) {
> +        samples += hist->bin[i];
> +    }
> +    return samples;
> +}
> +
> +static void
> +histogram_clear(struct histogram *hist)
> +{
> +    int i;
> +
> +    for (i = 0; i < NUM_BINS; i++) {
> +        hist->bin[i] = 0;
> +    }
> +}
> +
> +static void
> +history_init(struct history *h)
> +{
> +    memset(h, 0, sizeof(*h));
> +}
> +
>  void
>  pmd_perf_stats_init(struct pmd_perf_stats *s)
>  {
> -    memset(s, 0 , sizeof(*s));
> +    memset(s, 0, sizeof(*s));
> +    ovs_mutex_init(&s->stats_mutex);
> +    ovs_mutex_init(&s->clear_mutex);
> +    histogram_walls_set_log(&s->cycles, 500, 24000000);
> +    histogram_walls_set_log(&s->pkts, 0, 1000);
> +    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
> +    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
> +    histogram_walls_set_lin(&s->upcalls, 0, 30);
> +    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
> +    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
> +    s->start_ms = time_msec();
> +}
> +
> +void
> +pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> +                              double duration)
> +{
> +    uint64_t stats[PMD_N_STATS];
> +    double us_per_cycle = 1000000.0 / get_tsc_hz();
> +
> +    if (duration == 0) {
> +        return;
> +    }
> +
> +    pmd_perf_read_counters(s, stats);
> +    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
> +                          stats[PMD_CYCLES_ITER_BUSY];
> +    uint64_t rx_packets = stats[PMD_STAT_RECV];
> +    uint64_t tx_packets = stats[PMD_STAT_SENT_PKTS];
> +    uint64_t tx_batches = stats[PMD_STAT_SENT_BATCHES];
> +    uint64_t passes = stats[PMD_STAT_RECV] +
> +                      stats[PMD_STAT_RECIRC];
> +    uint64_t upcalls = stats[PMD_STAT_MISS];
> +    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
> +    uint64_t tot_iter = histogram_samples(&s->pkts);
> +    uint64_t idle_iter = s->pkts.bin[0];
> +    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +
> +    ds_put_format(str,
> +            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
> +            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
> +            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
> +            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
> +            tot_cycles, (tot_cycles / duration) / 1E9,
> +            tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +            idle_iter,
> +            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
> +            busy_iter,
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +    if (rx_packets > 0) {
> +        ds_put_format(str,
> +            "  Rx packets:      %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
> +            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
> +            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
> +            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl lookups/"
> +                                                                     "hit)\n"
> +            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f us/upcall)\n"
> +            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n",
> +            rx_packets, (rx_packets / duration) / 1000,
> +            1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets,
> +            passes, rx_packets ? 1.0 * passes / rx_packets : 0,
> +            stats[PMD_STAT_EXACT_HIT],
> +            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
> +            stats[PMD_STAT_MASKED_HIT],
> +            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
> +            stats[PMD_STAT_MASKED_HIT]
> +            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
> +            : 0,
> +            upcalls, 100.0 * upcalls / passes,
> +            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
> +            stats[PMD_STAT_LOST],
> +            100.0 * stats[PMD_STAT_LOST] / passes);
> +    } else {
> +        ds_put_format(str,
> +                "  Rx packets:      %12"PRIu64"\n",
> +                0ULL);
> +    }
> +    if (tx_packets > 0) {
> +        ds_put_format(str,
> +            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
> +            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
> +            "\n",
> +            tx_packets, (tx_packets / duration) / 1000,
> +            tx_batches, 1.0 * tx_packets / tx_batches);
> +    } else {
> +        ds_put_format(str,
> +                "  Tx packets:      %12"PRIu64"\n"
> +                "\n",
> +                0ULL);
> +    }
> +}
> +
> +void
> +pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
> +{
> +    int i;
> +
> +    ds_put_cstr(str, "Histograms\n");
> +    ds_put_format(str,
> +                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> +                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> +                  "max vhost qlen", "upcalls/it", "cycles/upcall");
> +    for (i = 0; i < NUM_BINS-1; i++) {
> +        ds_put_format(str,
> +            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> +            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> +            "  %-9d %-11"PRIu64"\n",
> +            s->cycles.wall[i], s->cycles.bin[i],
> +            s->pkts.wall[i],s->pkts.bin[i],
> +            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
> +            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
> +            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
> +            s->upcalls.wall[i], s->upcalls.bin[i],
> +            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
> +    }
> +    ds_put_format(str,
> +                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> +                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> +                  "  %-9s %-11"PRIu64"\n",
> +                  ">", s->cycles.bin[i],
> +                  ">", s->pkts.bin[i],
> +                  ">", s->cycles_per_pkt.bin[i],
> +                  ">", s->pkts_per_batch.bin[i],
> +                  ">", s->max_vhost_qfill.bin[i],
> +                  ">", s->upcalls.bin[i],
> +                  ">", s->cycles_per_upcall.bin[i]);
> +    if (s->totals.iterations > 0) {
> +        ds_put_cstr(str,
> +                    "-----------------------------------------------------"
> +                    "-----------------------------------------------------"
> +                    "------------------------------------------------\n");
> +        ds_put_format(str,
> +                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> +                      "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> +                      "vhost qlen", "upcalls/it", "cycles/upcall");
> +        ds_put_format(str,
> +                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
> +                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
> +                      s->totals.cycles / s->totals.iterations,
> +                      1.0 * s->totals.pkts / s->totals.iterations,
> +                      s->totals.pkts
> +                          ? s->totals.busy_cycles / s->totals.pkts : 0,
> +                      s->totals.batches
> +                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
> +                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
> +                      1.0 * s->totals.upcalls / s->totals.iterations,
> +                      s->totals.upcalls
> +                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
> +    }
> +}
> +
> +void
> +pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
> +                                  int n_iter)
> +{
> +    struct iter_stats *is;
> +    size_t index;
> +    int i;
> +
> +    if (n_iter == 0) {
> +        return;
> +    }
> +    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
> +                  "%-10s   %-10s   %-10s\n",
> +                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
> +                  "vhost qlen", "upcalls", "cycles/upcall");
> +    for (i = 1; i <= n_iter; i++) {
> +        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;
> +        is = &s->iterations.sample[index];
> +        ds_put_format(str,
> +                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
> +                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
> +                      "  %-11"PRIu32"  %-11"PRIu32"\n",
> +                      is->timestamp,
> +                      is->cycles,
> +                      is->pkts,
> +                      is->pkts ? is->cycles / is->pkts : 0,
> +                      is->batches ? is->pkts / is->batches : 0,
> +                      is->max_vhost_qfill,
> +                      is->upcalls,
> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> +    }
> +}
> +
> +void
> +pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
> +{
> +    struct iter_stats *is;
> +    size_t index;
> +    int i;
> +
> +    if (n_ms == 0) {
> +        return;
> +    }
> +    ds_put_format(str,
> +                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
> +                  "   %-10s   %-10s   %-10s   %-10s\n",
> +                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
> +                  "pkts/batch", "vhost qlen", "upcalls", "cycles/upcall");
> +    for (i = 1; i <= n_ms; i++) {
> +        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;
> +        is = &s->milliseconds.sample[index];
> +        ds_put_format(str,
> +                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
> +                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
> +                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
> +                      is->timestamp,
> +                      is->iterations,
> +                      is->iterations ? is->cycles / is->iterations : 0,
> +                      is->pkts,
> +                      is->pkts ? is->busy_cycles / is->pkts : 0,
> +                      is->batches ? is->pkts / is->batches : 0,
> +                      is->iterations
> +                          ? is->max_vhost_qfill / is->iterations : 0,
> +                      is->upcalls,
> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> +    }
>  }
> 
>  void
> @@ -51,10 +357,48 @@ pmd_perf_read_counters(struct pmd_perf_stats *s,
>      }
>  }
> 
> +/* This function clears the PMD performance counters from within the PMD
> + * thread or from another thread when the PMD thread is not executing its
> + * poll loop. */
>  void
> -pmd_perf_stats_clear(struct pmd_perf_stats *s)
> +pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
> +    OVS_REQUIRES(s->stats_mutex)
>  {
> +    ovs_mutex_lock(&s->clear_mutex);
>      for (int i = 0; i < PMD_N_STATS; i++) {
>          atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
>      }
> +    /* The following stats are only applicable in the PMD thread. */
> +    memset(&s->current, 0, sizeof(struct iter_stats));
> +    memset(&s->totals, 0, sizeof(struct iter_stats));
> +    histogram_clear(&s->cycles);
> +    histogram_clear(&s->pkts);
> +    histogram_clear(&s->cycles_per_pkt);
> +    histogram_clear(&s->upcalls);
> +    histogram_clear(&s->cycles_per_upcall);
> +    histogram_clear(&s->pkts_per_batch);
> +    histogram_clear(&s->max_vhost_qfill);
> +    history_init(&s->iterations);
> +    history_init(&s->milliseconds);
> +    s->start_ms = time_msec();
> +    s->milliseconds.sample[0].timestamp = s->start_ms;
> +    /* Clearing finished. */
> +    s->clear = false;
> +    ovs_mutex_unlock(&s->clear_mutex);
> +}
> +
> +/* This function can be called from anywhere to clear the stats
> + * of PMD and non-PMD threads. */
> +void
> +pmd_perf_stats_clear(struct pmd_perf_stats *s)
> +{
> +    if (ovs_mutex_trylock(&s->stats_mutex) == 0) {
> +        /* Locking successful. PMD not polling. */
> +        pmd_perf_stats_clear_lock(s);
> +        ovs_mutex_unlock(&s->stats_mutex);
> +    } else {
> +        /* Request the polling PMD to clear the stats. There is no need to
> +         * block here as stats retrieval is prevented during clearing. */
> +        s->clear = true;
> +    }
>  }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 5993c25..fd9b0fc 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -38,10 +38,18 @@
>  extern "C" {
>  #endif
> 
> -/* This module encapsulates data structures and functions to maintain PMD
> - * performance metrics such as packet counters, execution cycles. It
> - * provides a clean API for dpif-netdev to initialize, update and read and
> +/* This module encapsulates data structures and functions to maintain basic PMD
> + * performance metrics such as packet counters, execution cycles as well as
> + * histograms and time series recording for more detailed PMD metrics.
> + *
> + * It provides a clean API for dpif-netdev to initialize, update and read and
>   * reset these metrics.
> + *
> + * The basic set of PMD counters is implemented as atomic_uint64_t variables
> + * to guarantee correct reads also on 32-bit systems.
> + *
> + * The detailed PMD performance metrics are only supported on 64-bit systems
> + * with atomic 64-bit read and store semantics for plain uint64_t counters.
>   */
> 
>  /* Set of counter types maintained in pmd_perf_stats. */
> @@ -66,6 +74,7 @@ enum pmd_stat_type {
>      PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
>      PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
> +    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>      PMD_N_STATS
>  };
> 
> @@ -81,18 +90,87 @@ struct pmd_counters {
>      uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
>  };
> 
> -/* Container for all performance metrics of a PMD.
> - * Part of the struct dp_netdev_pmd_thread. */
> +/* Data structure to collect statistical distribution of an integer measurement
> + * type in the form of a histogram. The wall[] array contains the inclusive
> + * upper boundaries of the bins, while the bin[] array contains the actual
> + * counters per bin. The histogram walls are typically set automatically
> + * using the functions provided below. */
> +
> +#define NUM_BINS 32             /* Number of histogram bins. */
> +
> +struct histogram {
> +    uint32_t wall[NUM_BINS];
> +    uint64_t bin[NUM_BINS];
> +};
> +
> +/* Data structure to record detailed PMD execution metrics per iteration for
> + * a history period of up to HISTORY_LEN iterations in a circular buffer.
> + * Also used to record up to HISTORY_LEN millisecond averages/totals of these
> + * metrics. */
> +
> +struct iter_stats {
> +    uint64_t timestamp;         /* TSC or millisecond. */
> +    uint64_t cycles;            /* Number of TSC cycles spent in it. or ms. */
> +    uint64_t busy_cycles;       /* Cycles spent in busy iterations or ms. */
> +    uint32_t iterations;        /* Iterations in ms. */
> +    uint32_t pkts;              /* Packets processed in iteration or ms. */
> +    uint32_t upcalls;           /* Number of upcalls in iteration or ms. */
> +    uint32_t upcall_cycles;     /* Cycles spent in upcalls in it. or ms. */
> +    uint32_t batches;           /* Number of rx batches in iteration or ms. */
> +    uint32_t max_vhost_qfill;   /* Maximum fill level in iteration or ms. */
> +};
> +
> +#define HISTORY_LEN 1000        /* Length of recorded history
> +                                   (iterations and ms). */
> +#define DEF_HIST_SHOW 20        /* Default number of history samples to
> +                                   display. */
> +
> +struct history {
> +    size_t idx;                 /* Slot to which next call to history_store()
> +                                   will write. */
> +    struct iter_stats sample[HISTORY_LEN];
> +};
> +
> +/* Container for all performance metrics of a PMD within the struct
> + * dp_netdev_pmd_thread. The metrics must be updated from within the PMD
> + * thread but can be read from any thread. The basic PMD counters in
> + * struct pmd_counters can be read without protection against concurrent
> + * clearing. The other metrics may only be safely read with the clear_mutex
> + * held to protect against concurrent clearing. */
> 
>  struct pmd_perf_stats {
> -    /* Start of the current PMD iteration in TSC cycles.*/
> -    uint64_t start_it_tsc;
> +    /* Prevents interference between PMD polling and stats clearing. */
> +    struct ovs_mutex stats_mutex;
> +    /* Set by CLI thread to order clearing of PMD stats. */
> +    volatile bool clear;
> +    /* Prevents stats retrieval while clearing is in progress. */
> +    struct ovs_mutex clear_mutex;
> +    /* Start of the current performance measurement period. */
> +    uint64_t start_ms;
>      /* Latest TSC time stamp taken in PMD. */
>      uint64_t last_tsc;
> +    /* Used to space certain checks in time. */
> +    uint64_t next_check_tsc;
>      /* If non-NULL, outermost cycle timer currently running in PMD. */
>      struct cycle_timer *cur_timer;
>      /* Set of PMD counters with their zero offsets. */
>      struct pmd_counters counters;
> +    /* Statistics of the current iteration. */
> +    struct iter_stats current;
> +    /* Totals for the current millisecond. */
> +    struct iter_stats totals;
> +    /* Histograms for the PMD metrics. */
> +    struct histogram cycles;
> +    struct histogram pkts;
> +    struct histogram cycles_per_pkt;
> +    struct histogram upcalls;
> +    struct histogram cycles_per_upcall;
> +    struct histogram pkts_per_batch;
> +    struct histogram max_vhost_qfill;
> +    /* Iteration history buffer. */
> +    struct history iterations;
> +    /* Millisecond history buffer. */
> +    struct history milliseconds;
>  };
> 
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> @@ -175,8 +253,14 @@ cycle_timer_stop(struct pmd_perf_stats *s,
>      return now - timer->start;
>  }
> 
> +/* Functions to initialize and reset the PMD performance metrics. */
> +
>  void pmd_perf_stats_init(struct pmd_perf_stats *s);
>  void pmd_perf_stats_clear(struct pmd_perf_stats *s);
> +void pmd_perf_stats_clear_lock(struct pmd_perf_stats *s);
> +
> +/* Functions to read and update PMD counters. */
> +
>  void pmd_perf_read_counters(struct pmd_perf_stats *s,
>                              uint64_t stats[PMD_N_STATS]);
> 
> @@ -199,32 +283,182 @@ pmd_perf_update_counter(struct pmd_perf_stats *s,
>      atomic_store_relaxed(&s->counters.n[counter], tmp);
>  }
> 
> +/* Functions to manipulate a sample history. */
> +
> +static inline void
> +histogram_add_sample(struct histogram *hist, uint32_t val)
> +{
> +    /* TODO: Can do better with binary search? */
> +    for (int i = 0; i < NUM_BINS-1; i++) {
> +        if (val <= hist->wall[i]) {
> +            hist->bin[i]++;
> +            return;
> +        }
> +    }
> +    hist->bin[NUM_BINS-1]++;
> +}
> +
> +uint64_t histogram_samples(const struct histogram *hist);
> +
> +/* Add an offset to idx modulo HISTORY_LEN. */
> +static inline uint32_t
> +history_add(uint32_t idx, uint32_t offset)
> +{
> +    return (idx + offset) % HISTORY_LEN;
> +}
> +
> +/* Subtract idx2 from idx1 modulo HISTORY_LEN. */
> +static inline uint32_t
> +history_sub(uint32_t idx1, uint32_t idx2)
> +{
> +    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
> +}
> +
> +static inline struct iter_stats *
> +history_current(struct history *h)
> +{
> +    return &h->sample[h->idx];
> +}
> +
> +static inline struct iter_stats *
> +history_next(struct history *h)
> +{
> +    size_t next_idx = (h->idx + 1) % HISTORY_LEN;
> +    struct iter_stats *next = &h->sample[next_idx];
> +
> +    memset(next, 0, sizeof(*next));
> +    h->idx = next_idx;
> +    return next;
> +}
> +
> +static inline struct iter_stats *
> +history_store(struct history *h, struct iter_stats *is)
> +{
> +    if (is) {
> +        h->sample[h->idx] = *is;
> +    }
> +    /* Advance the history pointer */
> +    return history_next(h);
> +}
> +
> +/* Functions recording PMD metrics per iteration. */
> +
>  static inline void
>  pmd_perf_start_iteration(struct pmd_perf_stats *s)
>  {
> +    if (s->clear) {
> +        /* Clear the PMD stats before starting next iteration. */
> +        pmd_perf_stats_clear_lock(s);
> +    }
> +    /* Initialize the current interval stats. */
> +    memset(&s->current, 0, sizeof(struct iter_stats));
>      if (OVS_LIKELY(s->last_tsc)) {
>          /* We assume here that last_tsc was updated immediately prior at
>           * the end of the previous iteration, or just before the first
>           * iteration. */
> -        s->start_it_tsc = s->last_tsc;
> +        s->current.timestamp = s->last_tsc;
>      } else {
>          /* In case last_tsc has never been set before. */
> -        s->start_it_tsc = cycles_counter_update(s);
> +        s->current.timestamp = cycles_counter_update(s);
>      }
>  }
> 
>  static inline void
> -pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
> +pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> +                       int tx_packets, bool full_metrics)
>  {
> -    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
> +    uint64_t now_tsc = cycles_counter_update(s);
> +    struct iter_stats *cum_ms;
> +    uint64_t cycles, cycles_per_pkt = 0;
> 
> -    if (rx_packets > 0) {
> +    cycles = now_tsc - s->current.timestamp;
> +    s->current.cycles = cycles;
> +    s->current.pkts = rx_packets;
> +
> +    if (rx_packets + tx_packets > 0) {
>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
>      } else {
>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
>      }
> +    /* Add iteration samples to histograms. */
> +    histogram_add_sample(&s->cycles, cycles);
> +    histogram_add_sample(&s->pkts, rx_packets);
> +
> +    if (!full_metrics) {
> +        return;
> +    }
> +
> +    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
> +
> +    if (rx_packets > 0) {
> +        cycles_per_pkt = cycles / rx_packets;
> +        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
> +    }
> +    if (s->current.batches > 0) {
> +        histogram_add_sample(&s->pkts_per_batch,
> +                             rx_packets / s->current.batches);
> +    }
> +    histogram_add_sample(&s->upcalls, s->current.upcalls);
> +    if (s->current.upcalls > 0) {
> +        histogram_add_sample(&s->cycles_per_upcall,
> +                             s->current.upcall_cycles / s->current.upcalls);
> +    }
> +    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
> +
> +    /* Add iteration samples to millisecond stats. */
> +    cum_ms = history_current(&s->milliseconds);
> +    cum_ms->iterations++;
> +    cum_ms->cycles += cycles;
> +    if (rx_packets > 0) {
> +        cum_ms->busy_cycles += cycles;
> +    }
> +    cum_ms->pkts += s->current.pkts;
> +    cum_ms->upcalls += s->current.upcalls;
> +    cum_ms->upcall_cycles += s->current.upcall_cycles;
> +    cum_ms->batches += s->current.batches;
> +    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
> +
> +    /* Store in iteration history. This advances the iteration idx and
> +     * clears the next slot in the iteration history. */
> +    history_store(&s->iterations, &s->current);
> +    if (now_tsc > s->next_check_tsc) {
> +        /* Check if ms is completed and store in milliseconds history. */
> +        uint64_t now = time_msec();
> +        if (now != cum_ms->timestamp) {
> +            /* Add ms stats to totals. */
> +            s->totals.iterations += cum_ms->iterations;
> +            s->totals.cycles += cum_ms->cycles;
> +            s->totals.busy_cycles += cum_ms->busy_cycles;
> +            s->totals.pkts += cum_ms->pkts;
> +            s->totals.upcalls += cum_ms->upcalls;
> +            s->totals.upcall_cycles += cum_ms->upcall_cycles;
> +            s->totals.batches += cum_ms->batches;
> +            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
> +            cum_ms = history_next(&s->milliseconds);
> +            cum_ms->timestamp = now;
> +        }
> +        s->next_check_tsc = cycles_counter_update(s) + 10000;
> +    }
>  }
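The millisecond roll-up at the end of pmd_perf_end_iteration() reduces to a small model: iteration samples accumulate into the current millisecond bucket, and when the wall-clock millisecond changes, the completed bucket is folded into the running totals. A hedged sketch with illustrative struct names (the real code additionally defers the time_msec() call behind a TSC threshold to keep the fast path cheap):

```c
#include <assert.h>
#include <stdint.h>

struct ms_bucket {                /* One millisecond of aggregated stats. */
    uint64_t timestamp;           /* Wall-clock millisecond of this bucket. */
    uint64_t iterations;
    uint64_t cycles;
};

struct run_totals {
    uint64_t iterations;
    uint64_t cycles;
};

/* Charge one iteration to the current millisecond bucket; on a millisecond
 * boundary, fold the completed bucket into the totals first. */
static void
end_iteration(struct ms_bucket *cur, struct run_totals *tot,
              uint64_t now_ms, uint64_t iter_cycles)
{
    if (now_ms != cur->timestamp) {
        tot->iterations += cur->iterations;
        tot->cycles += cur->cycles;
        cur->iterations = 0;
        cur->cycles = 0;
        cur->timestamp = now_ms;
    }
    cur->iterations++;
    cur->cycles += iter_cycles;
}
```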
> 
> +/* Formatting the output of commands. */
> +
> +struct pmd_perf_params {
> +    int command_type;
> +    bool histograms;
> +    size_t iter_hist_len;
> +    size_t ms_hist_len;
> +};
> +
> +void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> +                                   double duration);
> +void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
> +void pmd_perf_format_iteration_history(struct ds *str,
> +                                       struct pmd_perf_stats *s,
> +                                       int n_iter);
> +void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
> +                                int n_ms);
> +
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man
> new file mode 100644
> index 0000000..76c3e4e
> --- /dev/null
> +++ b/lib/dpif-netdev-unixctl.man
> @@ -0,0 +1,157 @@
> +.SS "DPIF-NETDEV COMMANDS"
> +These commands are used to expose internal information (mostly statistics)
> +about the "dpif-netdev" userspace datapath. If there is only one datapath
> +(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> +argument can be omitted. By default the commands present data for all pmd
> +threads in the datapath. By specifying the "-pmd Core" option one can filter
> +the output for a single pmd in the datapath.
> +.
> +.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +Shows performance statistics for one or all pmd threads of the datapath
> +\fIdp\fR. The special thread "main" sums up the statistics of every non pmd
> +thread.
> +
> +The sum of "emc hits", "masked hits" and "miss" is the number of
> +packet lookups performed by the datapath. Beware that a recirculated packet
> +experiences one additional lookup per recirculation, so there may be
> +more lookups than forwarded packets in the datapath.
> +
> +Cycles are counted using the TSC or similar facilities (when available on
> +the platform). The duration of one cycle depends on the processing platform.
> +
> +"idle cycles" refers to cycles spent in PMD iterations not forwarding any
> +any packets. "processing cycles" refers to cycles spent in PMD iterations
> +forwarding at least one packet, including the cost for polling, processing and
> +transmitting said packets.
> +
> +To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
> +.
> +.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> +Resets to zero the per pmd thread performance numbers shown by the
> +\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR commands.
> +It will NOT reset datapath or bridge statistics, only the values shown by
> +the above commands.
> +.
> +.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR \fIiter_len\fR] \
> +[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +Shows detailed performance metrics for one or all pmd threads of the
> +userspace datapath.
> +
> +The collection of detailed statistics can be controlled by a new
> +configuration parameter "other_config:pmd-perf-metrics". By default it
> +is disabled. The run-time overhead, when enabled, is in the order of 1%.
> +
> +The covered metrics per iteration are:
> +.RS
> +.IP
> +.PD .4v
> +.IP \(em
> +used cycles
> +.IP \(em
> +forwarded packets
> +.IP \(em
> +number of rx batches
> +.IP \(em
> +packets/rx batch
> +.IP \(em
> +max. vhostuser queue fill level
> +.IP \(em
> +number of upcalls
> +.IP \(em
> +cycles spent in upcalls
> +.PD
> +.RE
> +.IP
> +This raw recorded data is used threefold:
> +
> +.RS
> +.IP
> +.PD .4v
> +.IP 1.
> +In histograms for each of the following metrics:
> +.RS
> +.IP \(em
> +cycles/iteration (logarithmic)
> +.IP \(em
> +packets/iteration (logarithmic)
> +.IP \(em
> +cycles/packet
> +.IP \(em
> +packets/batch
> +.IP \(em
> +max. vhostuser qlen (logarithmic)
> +.IP \(em
> +upcalls
> +.IP \(em
> +cycles/upcall (logarithmic)
> +The histogram bins are divided linearly or logarithmically.
> +.RE
> +.IP 2.
> +A cyclic history of the above metrics for 1024 iterations
> +.IP 3.
> +A cyclic history of the cumulative/average values per millisecond wall
> +clock for the last 1024 milliseconds:
> +.RS
> +.IP \(em
> +number of iterations
> +.IP \(em
> +avg. cycles/iteration
> +.IP \(em
> +packets (Kpps)
> +.IP \(em
> +avg. packets/batch
> +.IP \(em
> +avg. max vhost qlen
> +.IP \(em
> +upcalls
> +.IP \(em
> +avg. cycles/upcall
> +.RE
> +.PD
> +.RE
> +.IP
> +.
> +The command options are:
> +.RS
> +.IP "\fB-nh\fR"
> +Suppress the histograms
> +.IP "\fB-it\fR \fIiter_len\fR"
> +Display the last iter_len iteration stats
> +.IP "\fB-ms\fR \fIms_len\fR"
> +Display the last ms_len millisecond stats
> +.RE
> +.IP
> +The output always contains the following global PMD statistics:
> +.RS
> +.IP
> +Time: 15:24:55.270
> +.br
> +Measurement duration: 1.008 s
> +
> +pmd thread numa_id 0 core_id 1:
> +
> +  Cycles:            2419034712  (2.40 GHz)
> +  Iterations:            572817  (1.76 us/it)
> +  - idle:                486808  (15.9 % cycles)
> +  - busy:                 86009  (84.1 % cycles)
> +  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
> +  Datapath passes:      3599415  (1.50 passes/pkt)
> +  - EMC hits:            336472  ( 9.3 %)
> +  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
> +  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
> +  - Lost upcalls:             0  ( 0.0 %)
> +  Tx packets:           2399607  (2381 Kpps)
> +  Tx batches:            171400  (14.00 pkts/batch)
> +.RE
> +.IP
> +Here "Rx packets" actually reflects the number of packets forwarded by the
> +datapath. "Datapath passes" matches the number of packet lookups as
> +reported by the \fBdpif-netdev/pmd-stats-show\fR command.
> +
> +To reset the counters and start a new measurement use
> +\fBdpif-netdev/pmd-stats-clear\fR.
> +.
> +.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +For one or all pmd threads of the datapath \fIdp\fR show the list of queue-ids
> +with port names, which this thread polls.
> +.
> +.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> +Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 86d8739..f245ce2 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -49,6 +49,7 @@
>  #include "id-pool.h"
>  #include "latch.h"
>  #include "netdev.h"
> +#include "netdev-provider.h"
>  #include "netdev-vport.h"
>  #include "netlink.h"
>  #include "odp-execute.h"
> @@ -281,6 +282,8 @@ struct dp_netdev {
> 
>      /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
> +    /* Enable collection of PMD performance metrics. */
> +    atomic_bool pmd_perf_metrics;
> 
>      /* Protects access to ofproto-dpif-upcall interface during revalidator
>       * thread synchronization. */
> @@ -356,6 +359,7 @@ struct dp_netdev_rxq {
>                                            particular core. */
>      unsigned intrvl_idx;               /* Write index for 'cycles_intrvl'. */
>      struct dp_netdev_pmd_thread *pmd;  /* pmd thread that polls this queue. */
> +    bool is_vhost;                     /* Is rxq of a vhost port. */
> 
>      /* Counters of cycles spent successfully polling and processing pkts. */
>      atomic_ullong cycles[RXQ_N_CYCLES];
> @@ -717,6 +721,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
>  static void emc_clear_entry(struct emc_entry *ce);
> 
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
> +static inline bool
> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
> 
>  static void
>  emc_cache_init(struct emc_cache *flow_cache)
> @@ -800,7 +806,8 @@ get_dp_netdev(const struct dpif *dpif)
>  enum pmd_info_type {
>      PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
>      PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
> -    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
> +    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
> +    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
>  };
> 
>  static void
> @@ -891,6 +898,47 @@ pmd_info_show_stats(struct ds *reply,
>                    stats[PMD_CYCLES_ITER_BUSY], total_packets);
>  }
> 
> +static void
> +pmd_info_show_perf(struct ds *reply,
> +                   struct dp_netdev_pmd_thread *pmd,
> +                   struct pmd_perf_params *par)
> +{
> +    if (pmd->core_id != NON_PMD_CORE_ID) {
> +        char *time_str =
> +                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
> +        long long now = time_msec();
> +        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
> +
> +        ds_put_cstr(reply, "\n");
> +        ds_put_format(reply, "Time: %s\n", time_str);
> +        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
> +        ds_put_cstr(reply, "\n");
> +        format_pmd_thread(reply, pmd);
> +        ds_put_cstr(reply, "\n");
> +        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
> +        if (pmd_perf_metrics_enabled(pmd)) {
> +            /* Prevent parallel clearing of perf metrics. */
> +            ovs_mutex_lock(&pmd->perf_stats.clear_mutex);
> +            if (par->histograms) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_histograms(reply, &pmd->perf_stats);
> +            }
> +            if (par->iter_hist_len > 0) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_iteration_history(reply, &pmd->perf_stats,
> +                        par->iter_hist_len);
> +            }
> +            if (par->ms_hist_len > 0) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
> +                        par->ms_hist_len);
> +            }
> +            ovs_mutex_unlock(&pmd->perf_stats.clear_mutex);
> +        }
> +        free(time_str);
> +    }
> +}
> +
>  static int
>  compare_poll_list(const void *a_, const void *b_)
>  {
> @@ -1068,7 +1116,7 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>      ovs_mutex_lock(&dp_netdev_mutex);
> 
>      while (argc > 1) {
> -        if (!strcmp(argv[1], "-pmd") && argc >= 3) {
> +        if (!strcmp(argv[1], "-pmd") && argc > 2) {
>              if (str_to_uint(argv[2], 10, &core_id)) {
>                  filter_on_pmd = true;
>              }
> @@ -1108,6 +1156,8 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>              pmd_perf_stats_clear(&pmd->perf_stats);
>          } else if (type == PMD_INFO_SHOW_STATS) {
>              pmd_info_show_stats(&reply, pmd);
> +        } else if (type == PMD_INFO_PERF_SHOW) {
> +            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>          }
>      }
>      free(pmd_list);
> @@ -1117,6 +1167,48 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>      unixctl_command_reply(conn, ds_cstr(&reply));
>      ds_destroy(&reply);
>  }
> +
> +static void
> +pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
> +                          const char *argv[],
> +                          void *aux OVS_UNUSED)
> +{
> +    struct pmd_perf_params par;
> +    long int it_hist = 0, ms_hist = 0;
> +    par.histograms = true;
> +
> +    while (argc > 1) {
> +        if (!strcmp(argv[1], "-nh")) {
> +            par.histograms = false;
> +            argc -= 1;
> +            argv += 1;
> +        } else if (!strcmp(argv[1], "-it") && argc > 2) {
> +            it_hist = strtol(argv[2], NULL, 10);
> +            if (it_hist < 0) {
> +                it_hist = 0;
> +            } else if (it_hist > HISTORY_LEN) {
> +                it_hist = HISTORY_LEN;
> +            }
> +            argc -= 2;
> +            argv += 2;
> +        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
> +            ms_hist = strtol(argv[2], NULL, 10);
> +            if (ms_hist < 0) {
> +                ms_hist = 0;
> +            } else if (ms_hist > HISTORY_LEN) {
> +                ms_hist = HISTORY_LEN;
> +            }
> +            argc -= 2;
> +            argv += 2;
> +        } else {
> +            break;
> +        }
> +    }
> +    par.iter_hist_len = it_hist;
> +    par.ms_hist_len = ms_hist;
> +    par.command_type = PMD_INFO_PERF_SHOW;
> +    dpif_netdev_pmd_info(conn, argc, argv, &par);
> +}
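The -it/-ms option parsing above clamps the requested history lengths into [0, HISTORY_LEN]. A minimal sketch of that clamping (the `HISTORY_LEN` value here is illustrative):

```c
#include <assert.h>
#include <stdlib.h>

#define HISTORY_LEN 1000   /* Illustrative; matches the ring buffer length. */

/* Parse a history-length argument and clamp it into [0, HISTORY_LEN],
 * mirroring the -it/-ms handling in pmd_perf_show_cmd(). */
static long
clamp_hist_len(const char *arg)
{
    long v = strtol(arg, NULL, 10);
    if (v < 0) {
        v = 0;
    } else if (v > HISTORY_LEN) {
        v = HISTORY_LEN;
    }
    return v;
}
```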
>  

>  static int
>  dpif_netdev_init(void)
> @@ -1134,6 +1226,12 @@ dpif_netdev_init(void)
>      unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd core] [dp]",
>                               0, 3, dpif_netdev_pmd_info,
>                               (void *)&poll_aux);
> +    unixctl_command_register("dpif-netdev/pmd-perf-show",
> +                             "[-nh] [-it iter-history-len]"
> +                             " [-ms ms-history-len]"
> +                             " [-pmd core] [dp]",
> +                             0, 8, pmd_perf_show_cmd,
> +                             NULL);
>      unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
>                               0, 1, dpif_netdev_pmd_rebalance,
>                               NULL);
> @@ -3020,6 +3118,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
> 
> +    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
> +    bool cur_perf_enabled;
> +    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
> +    if (perf_enabled != cur_perf_enabled) {
> +        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
> +        if (perf_enabled) {
> +            VLOG_INFO("PMD performance metrics collection enabled");
> +        } else {
> +            VLOG_INFO("PMD performance metrics collection disabled");
> +        }
> +    }
> +
>      return 0;
>  }
> 
> @@ -3189,6 +3299,21 @@ dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned idx)
>      return processing_cycles;
>  }
> 
> +static inline bool
> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
> +{
> +    /* If stores and reads of 64-bit integers are not atomic, the
> +     * full PMD performance metrics are not available as locked
> +     * access to 64 bit integers would be prohibitively expensive. */
> +#if ATOMIC_LLONG_LOCK_FREE
> +    bool pmd_perf_enabled;
> +    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
> +    return pmd_perf_enabled;
> +#else
> +    return false;
> +#endif
> +}
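The guard above gates the detailed metrics on lock-free 64-bit atomics and reads the enable flag with relaxed ordering on the fast path, since the flag only toggles statistics collection and needs no ordering with other data. A minimal model of this pattern using plain C11 atomics rather than OVS's wrappers (names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Per-datapath enable flag: written by the configuration thread,
 * polled by PMD threads with relaxed ordering. */
static atomic_bool pmd_perf_metrics;

static bool
perf_metrics_enabled(void)
{
#if ATOMIC_LLONG_LOCK_FREE == 2
    return atomic_load_explicit(&pmd_perf_metrics, memory_order_relaxed);
#else
    /* Without always-lock-free 64-bit atomics, locked access to the
     * counters would be prohibitively expensive, so stay disabled. */
    return false;
#endif
}

static void
set_perf_metrics(bool enabled)
{
    atomic_store_explicit(&pmd_perf_metrics, enabled, memory_order_relaxed);
}
```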
> +
>  static int
>  dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
>                                     struct tx_port *p)
> @@ -3264,10 +3389,12 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>                             struct dp_netdev_rxq *rxq,
>                             odp_port_t port_no)
>  {
> +    struct pmd_perf_stats *s = &pmd->perf_stats;
>      struct dp_packet_batch batch;
>      struct cycle_timer timer;
>      int error;
> -    int batch_cnt = 0, output_cnt = 0;
> +    int batch_cnt = 0;
> +    int rem_qlen = 0, *qlen_p = NULL;
>      uint64_t cycles;
> 
>      /* Measure duration for polling and processing rx burst. */
> @@ -3276,20 +3403,37 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>      pmd->ctx.last_rxq = rxq;
>      dp_packet_batch_init(&batch);
> 
> -    error = netdev_rxq_recv(rxq->rx, &batch, NULL);
> +    /* Fetch the rx queue length only for vhostuser ports. */
> +    if (pmd_perf_metrics_enabled(pmd) && rxq->is_vhost) {
> +        qlen_p = &rem_qlen;
> +    }
> +
> +    error = netdev_rxq_recv(rxq->rx, &batch, qlen_p);
>      if (!error) {
>          /* At least one packet received. */
>          *recirc_depth_get() = 0;
>          pmd_thread_ctx_time_update(pmd);
> -
>          batch_cnt = batch.count;
> +        if (pmd_perf_metrics_enabled(pmd)) {
> +            /* Update batch histogram. */
> +            s->current.batches++;
> +            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
> +            /* Update the maximum vhost rx queue fill level. */
> +            if (rxq->is_vhost && rem_qlen >= 0) {
> +                uint32_t qfill = batch_cnt + rem_qlen;
> +                if (qfill > s->current.max_vhost_qfill) {
> +                    s->current.max_vhost_qfill = qfill;
> +                }
> +            }
> +        }
> +        /* Process packet batch. */
>          dp_netdev_input(pmd, &batch, port_no);
> 
>          /* Assign processing cycles to rx queue. */
>          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
>          dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
> 
> -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
> +        dp_netdev_pmd_flush_output_packets(pmd, false);
>      } else {
>          /* Discard cycles. */
>          cycle_timer_stop(&pmd->perf_stats, &timer);
> @@ -3303,7 +3447,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> 
>      pmd->ctx.last_rxq = NULL;
> 
> -    return batch_cnt + output_cnt;
> +    return batch_cnt;
>  }
> 
>  static struct tx_port *
> @@ -3359,6 +3503,7 @@ port_reconfigure(struct dp_netdev_port *port)
>          }
> 
>          port->rxqs[i].port = port;
> +        port->rxqs[i].is_vhost = !strncmp(port->type, "dpdkvhost", 9);
> 
>          err = netdev_rxq_open(netdev, &port->rxqs[i].rx, i);
>          if (err) {
> @@ -4137,23 +4282,26 @@ reload:
>      pmd->intrvl_tsc_prev = 0;
>      atomic_store_relaxed(&pmd->intrvl_cycles, 0);
>      cycles_counter_update(s);
> +    /* Protect pmd stats from external clearing while polling. */
> +    ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
>      for (;;) {
> -        uint64_t iter_packets = 0;
> +        uint64_t rx_packets = 0, tx_packets = 0;
> 
>          pmd_perf_start_iteration(s);
> +
>          for (i = 0; i < poll_cnt; i++) {
>              process_packets =
>                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
>                                             poll_list[i].port_no);
> -            iter_packets += process_packets;
> +            rx_packets += process_packets;
>          }
> 
> -        if (!iter_packets) {
> +        if (!rx_packets) {
>              /* We didn't receive anything in the process loop.
>               * Check if we need to send something.
>               * There was no time updates on current iteration. */
>              pmd_thread_ctx_time_update(pmd);
> -            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>          }
> 
>          if (lc++ > 1024) {
> @@ -4172,8 +4320,10 @@ reload:
>                  break;
>              }
>          }
> -        pmd_perf_end_iteration(s, iter_packets);
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> +                               pmd_perf_metrics_enabled(pmd));
>      }
> +    ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
> 
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>      exiting = latch_is_set(&pmd->exit_latch);
> @@ -5068,6 +5218,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>      struct match match;
>      ovs_u128 ufid;
>      int error;
> +    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
> 
>      match.tun_md.valid = false;
>      miniflow_expand(&key->mf, &match.flow);
> @@ -5121,6 +5272,14 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>          ovs_mutex_unlock(&pmd->flow_mutex);
>          emc_probabilistic_insert(pmd, key, netdev_flow);
>      }
> +    if (pmd_perf_metrics_enabled(pmd)) {
> +        /* Update upcall stats. */
> +        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
> +        struct pmd_perf_stats *s = &pmd->perf_stats;
> +        s->current.upcalls++;
> +        s->current.upcall_cycles += cycles;
> +        histogram_add_sample(&s->cycles_per_upcall, cycles);
> +    }
>      return error;
>  }
> 
> diff --git a/manpages.mk b/manpages.mk
> index d4bf0ec..aaf8bc2 100644
> --- a/manpages.mk
> +++ b/manpages.mk
> @@ -250,6 +250,7 @@ vswitchd/ovs-vswitchd.8: \
>  	lib/coverage-unixctl.man \
>  	lib/daemon.man \
>  	lib/dpctl.man \
> +	lib/dpif-netdev-unixctl.man \
>  	lib/memory-unixctl.man \
>  	lib/netdev-dpdk-unixctl.man \
>  	lib/service.man \
> @@ -266,6 +267,7 @@ lib/common.man:
>  lib/coverage-unixctl.man:
>  lib/daemon.man:
>  lib/dpctl.man:
> +lib/dpif-netdev-unixctl.man:
>  lib/memory-unixctl.man:
>  lib/netdev-dpdk-unixctl.man:
>  lib/service.man:
> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
> index 80e5f53..8b4034d 100644
> --- a/vswitchd/ovs-vswitchd.8.in
> +++ b/vswitchd/ovs-vswitchd.8.in
> @@ -256,32 +256,7 @@ type).
>  ..
>  .so lib/dpctl.man
>  .
> -.SS "DPIF-NETDEV COMMANDS"
> -These commands are used to expose internal information (mostly statistics)
> -about the ``dpif-netdev'' userspace datapath. If there is only one datapath
> -(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> -argument can be omitted.
> -.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
> -Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
> -The special thread ``main'' sums up the statistics of every non pmd thread.
> -The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
> -packets received by the datapath.  Cycles are counted using the TSC or similar
> -facilities (when available on the platform).  To reset these counters use
> -\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
> -measuring infrastructure. ``idle cycles'' refers to cycles spent polling
> -devices but not receiving any packets. ``processing cycles'' refers to cycles
> -spent polling devices and successfully receiving packets, plus the cycles
> -spent processing said packets.
> -.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> -Resets to zero the per pmd thread performance numbers shown by the
> -\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
> -bridge statistics, only the values shown by the above command.
> -.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
> -For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
> -port names, which this thread polls.
> -.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> -Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> -.
> +.so lib/dpif-netdev-unixctl.man
>  .so lib/netdev-dpdk-unixctl.man
>  .so ofproto/ofproto-dpif-unixctl.man
>  .so ofproto/ofproto-unixctl.man
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index f899a19..aac663f 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -375,6 +375,18 @@
>          </p>
>        </column>
> 
> +      <column name="other_config" key="pmd-perf-metrics"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Enables recording of detailed PMD performance metrics for analysis
> +          and trouble-shooting. This can have a performance impact in the
> +          order of 1%.
> +        </p>
> +        <p>
> +          Defaults to false but can be changed at any time.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="n-handler-threads"
>                type='{"type": "integer", "minInteger": 1}'>
>          <p>
> --
> 1.9.1
Aaron Conole March 26, 2018, 9:26 p.m. UTC | #2
Hi Jan,

Some stylistic type comments follow.  Sorry to jump in at the end - but
you asked for checkpatch changes, so I improved and ran it against your
patch and found some stuff for which I have an opinion. :)  Maybe
nothing to hold up merging but cleanup stuff.

Jan Scheurich <jan.scheurich@ericsson.com> writes:

> This patch instruments the dpif-netdev datapath to record detailed
> statistics of what is happening in every iteration of a PMD thread.
>
> The collection of detailed statistics can be controlled by a new
> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> By default it is disabled. The run-time overhead, when enabled, is
> in the order of 1%.
>
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
>
> This raw recorded data is used threefold:
>
> 1. In histograms for each of the following metrics:
>    - cycles/iteration (log.)
>    - packets/iteration (log.)
>    - cycles/packet
>    - packets/batch
>    - max. vhostuser qlen (log.)
>    - upcalls
>    - cycles/upcall (log)
>    The histograms bins are divided linear or logarithmic.
>
> 2. A cyclic history of the above statistics for 999 iterations
>
> 3. A cyclic history of the cummulative/average values per millisecond
>    wall clock for the last 1000 milliseconds:
>    - number of iterations
>    - avg. cycles/iteration
>    - packets (Kpps)
>    - avg. packets/batch
>    - avg. max vhost qlen
>    - upcalls
>    - avg. cycles/upcall
>
> The gathered performance metrics can be printed at any time with the
> new CLI command
>
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
>     [-pmd core] [dp]
>
> The options are
>
> -nh:            Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len:     Display the last ms_len millisecond stats
> -pmd core:      Display only the specified PMD
>
> The performance statistics are reset with the existing
> dpif-netdev/pmd-stats-clear command.
>
> The output always contains the following global PMD statistics,
> similar to the pmd-stats-show command:
>
> Time: 15:24:55.270
> Measurement duration: 1.008 s
>
> pmd thread numa_id 0 core_id 1:
>
>   Cycles:            2419034712  (2.40 GHz)
>   Iterations:            572817  (1.76 us/it)
>   - idle:                486808  (15.9 % cycles)
>   - busy:                 86009  (84.1 % cycles)
>   Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:      3599415  (1.50 passes/pkt)
>   - EMC hits:            336472  ( 9.3 %)
>   - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls:             0  ( 0.0 %)
>   Tx packets:           2399607  (2381 Kpps)
>   Tx batches:            171400  (14.00 pkts/batch)
>
> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
> ---
>  NEWS                        |   3 +
>  lib/automake.mk             |   1 +
>  lib/dpif-netdev-perf.c      | 350 +++++++++++++++++++++++++++++++++++++++++++-
>  lib/dpif-netdev-perf.h      | 258 ++++++++++++++++++++++++++++++--
>  lib/dpif-netdev-unixctl.man | 157 ++++++++++++++++++++
>  lib/dpif-netdev.c           | 183 +++++++++++++++++++++--
>  manpages.mk                 |   2 +
>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>  vswitchd/vswitch.xml        |  12 ++
>  9 files changed, 940 insertions(+), 53 deletions(-)
>  create mode 100644 lib/dpif-netdev-unixctl.man
>
> diff --git a/NEWS b/NEWS
> index 8d0b502..8f66fd3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -73,6 +73,9 @@ v2.9.0 - 19 Feb 2018
>       * Add support for vHost dequeue zero copy (experimental)
>     - Userspace datapath:
>       * Output packet batching support.
> +     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
> +     * Detailed PMD performance metrics available with new command
> +         ovs-appctl dpif-netdev/pmd-perf-show
>     - vswitchd:
>       * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>       * Configuring a controller, or unconfiguring all controllers, now deletes
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 5c26e0f..7a5632d 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -484,6 +484,7 @@ MAN_FRAGMENTS += \
>  	lib/dpctl.man \
>  	lib/memory-unixctl.man \
>  	lib/netdev-dpdk-unixctl.man \
> +	lib/dpif-netdev-unixctl.man \
>  	lib/ofp-version.man \
>  	lib/ovs.tmac \
>  	lib/service.man \
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index f06991a..2b36410 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -15,18 +15,324 @@
>   */
>  
>  #include <config.h>
> +#include <stdint.h>
>  
> +#include "dpif-netdev-perf.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "openvswitch/vlog.h"
> -#include "dpif-netdev-perf.h"
> +#include "ovs-thread.h"
>  #include "timeval.h"
>  
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
>  
> +#ifdef DPDK_NETDEV
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +    return rte_get_tsc_hz();
> +}
> +#else
> +/* This function is only invoked from PMD threads which depend on DPDK.
> + * A dummy function is sufficient when building without DPDK_NETDEV. */
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +    return 1;
> +}
> +#endif
> +
> +/* Histogram functions. */
> +
> +static void
> +histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
> +{
> +    int i;
> +
> +    ovs_assert(min < max);
> +    for (i = 0; i < NUM_BINS-1; i++) {
> +        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
> +    }
> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> +}
> +
> +static void
> +histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
> +{
> +    int i, start, bins, wall;
> +    double log_min, log_max;
> +
> +    ovs_assert(min < max);
> +    if (min > 0) {
> +        log_min = log(min);
> +        log_max = log(max);
> +        start = 0;
> +        bins = NUM_BINS - 1;
> +    } else {
> +        hist->wall[0] = 0;
> +        log_min = log(1);
> +        log_max = log(max);
> +        start = 1;
> +        bins = NUM_BINS - 2;
> +    }
> +    wall = start;
> +    for (i = 0; i < bins; i++) {
> +        /* Make sure each wall is monotonically increasing. */
> +        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
> +        hist->wall[start + i] = wall++;
> +    }
> +    if (hist->wall[NUM_BINS-2] < max) {
> +        hist->wall[NUM_BINS-2] = max;
> +    }
> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> +}
> +
> +uint64_t
> +histogram_samples(const struct histogram *hist)
> +{
> +    uint64_t samples = 0;
> +
> +    for (int i = 0; i < NUM_BINS; i++) {
> +        samples += hist->bin[i];
> +    }
> +    return samples;
> +}
> +
> +static void
> +histogram_clear(struct histogram *hist)
> +{
> +    int i;
> +
> +    for (i = 0; i < NUM_BINS; i++) {
> +        hist->bin[i] = 0;
> +    }
> +}
> +
> +static void
> +history_init(struct history *h)
> +{
> +    memset(h, 0, sizeof(*h));
> +}
> +
>  void
>  pmd_perf_stats_init(struct pmd_perf_stats *s)
>  {
> -    memset(s, 0 , sizeof(*s));
> +    memset(s, 0, sizeof(*s));
> +    ovs_mutex_init(&s->stats_mutex);
> +    ovs_mutex_init(&s->clear_mutex);

       Just include some comments on these constants to make it clear
       what is being measured, etc.

> +    histogram_walls_set_log(&s->cycles, 500, 24000000);
> +    histogram_walls_set_log(&s->pkts, 0, 1000);
> +    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
> +    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
> +    histogram_walls_set_lin(&s->upcalls, 0, 30);
> +    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
> +    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
> +    s->start_ms = time_msec();
> +}
> +
> +void
> +pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> +                              double duration)
> +{
> +    uint64_t stats[PMD_N_STATS];
> +    double us_per_cycle = 1000000.0 / get_tsc_hz();
> +
> +    if (duration == 0) {
> +        return;
> +    }
> +
> +    pmd_perf_read_counters(s, stats);
> +    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
> +                          stats[PMD_CYCLES_ITER_BUSY];
> +    uint64_t rx_packets = stats[PMD_STAT_RECV];
> +    uint64_t tx_packets = stats[PMD_STAT_SENT_PKTS];
> +    uint64_t tx_batches = stats[PMD_STAT_SENT_BATCHES];
> +    uint64_t passes = stats[PMD_STAT_RECV] +
> +                      stats[PMD_STAT_RECIRC];
> +    uint64_t upcalls = stats[PMD_STAT_MISS];
> +    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
> +    uint64_t tot_iter = histogram_samples(&s->pkts);
> +    uint64_t idle_iter = s->pkts.bin[0];
> +    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +
> +    ds_put_format(str,
> +            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
> +            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
> +            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
> +            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
> +            tot_cycles, (tot_cycles / duration) / 1E9,
> +            tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +            idle_iter,
> +            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
> +            busy_iter,
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +    if (rx_packets > 0) {
> +        ds_put_format(str,
> +            "  Rx packets:      %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
> +            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
> +            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
> +            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl lookups/"
> +                                                                     "hit)\n"
> +            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f us/upcall)\n"
> +            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n",
> +            rx_packets, (rx_packets / duration) / 1000,
> +            1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets,
> +            passes, rx_packets ? 1.0 * passes / rx_packets : 0,
> +            stats[PMD_STAT_EXACT_HIT],
> +            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
> +            stats[PMD_STAT_MASKED_HIT],
> +            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
> +            stats[PMD_STAT_MASKED_HIT]
> +            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
> +            : 0,
> +            upcalls, 100.0 * upcalls / passes,
> +            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
> +            stats[PMD_STAT_LOST],
> +            100.0 * stats[PMD_STAT_LOST] / passes);
> +    } else {
> +        ds_put_format(str,
> +                "  Rx packets:      %12"PRIu64"\n",
> +                0ULL);
> +    }
> +    if (tx_packets > 0) {
> +        ds_put_format(str,
> +            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
> +            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
> +            "\n",
> +            tx_packets, (tx_packets / duration) / 1000,
> +            tx_batches, 1.0 * tx_packets / tx_batches);
> +    } else {
> +        ds_put_format(str,
> +                "  Tx packets:      %12"PRIu64"\n"
> +                "\n",
> +                0ULL);
> +    }
> +}
> +
> +void
> +pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
> +{
> +    int i;
> +
> +    ds_put_cstr(str, "Histograms\n");
> +    ds_put_format(str,
> +                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> +                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> +                  "max vhost qlen", "upcalls/it", "cycles/upcall");
> +    for (i = 0; i < NUM_BINS-1; i++) {
> +        ds_put_format(str,
> +            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> +            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> +            "  %-9d %-11"PRIu64"\n",
> +            s->cycles.wall[i], s->cycles.bin[i],
> +            s->pkts.wall[i], s->pkts.bin[i],
> +            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
> +            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
> +            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
> +            s->upcalls.wall[i], s->upcalls.bin[i],
> +            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
> +    }
> +    ds_put_format(str,
> +                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> +                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> +                  "  %-9s %-11"PRIu64"\n",
> +                  ">", s->cycles.bin[i],
> +                  ">", s->pkts.bin[i],
> +                  ">", s->cycles_per_pkt.bin[i],
> +                  ">", s->pkts_per_batch.bin[i],
> +                  ">", s->max_vhost_qfill.bin[i],
> +                  ">", s->upcalls.bin[i],
> +                  ">", s->cycles_per_upcall.bin[i]);
> +    if (s->totals.iterations > 0) {
> +        ds_put_cstr(str,
> +                    "-----------------------------------------------------"
> +                    "-----------------------------------------------------"
> +                    "------------------------------------------------\n");
> +        ds_put_format(str,
> +                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> +                      "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> +                      "vhost qlen", "upcalls/it", "cycles/upcall");
> +        ds_put_format(str,
> +                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
> +                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
> +                      s->totals.cycles / s->totals.iterations,
> +                      1.0 * s->totals.pkts / s->totals.iterations,
> +                      s->totals.pkts
> +                          ? s->totals.busy_cycles / s->totals.pkts : 0,
> +                      s->totals.batches
> +                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
> +                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
> +                      1.0 * s->totals.upcalls / s->totals.iterations,
> +                      s->totals.upcalls
> +                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
> +    }
> +}
> +
> +void
> +pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
> +                                  int n_iter)
> +{
> +    struct iter_stats *is;
> +    size_t index;
> +    int i;
> +
> +    if (n_iter == 0) {
> +        return;
> +    }
> +    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
> +                  "%-10s   %-10s   %-10s\n",
> +                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
> +                  "vhost qlen", "upcalls", "cycles/upcall");
> +    for (i = 1; i <= n_iter; i++) {
> +        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;

    index = history_sub(s->iterations.idx, i);

> +        is = &s->iterations.sample[index];
> +        ds_put_format(str,
> +                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
> +                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
> +                      "  %-11"PRIu32"  %-11"PRIu32"\n",
> +                      is->timestamp,
> +                      is->cycles,
> +                      is->pkts,
> +                      is->pkts ? is->cycles / is->pkts : 0,
> +                      is->batches ? is->pkts / is->batches : 0,
> +                      is->max_vhost_qfill,
> +                      is->upcalls,
> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> +    }
> +}
> +
> +void
> +pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
> +{
> +    struct iter_stats *is;
> +    size_t index;
> +    int i;
> +
> +    if (n_ms == 0) {
> +        return;
> +    }
> +    ds_put_format(str,
> +                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
> +                  "   %-10s   %-10s   %-10s   %-10s\n",
> +                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
> +                  "pkts/batch", "vhost qlen", "upcalls", "cycles/upcall");
> +    for (i = 1; i <= n_ms; i++) {
> +        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;

    index = history_sub(s->milliseconds.idx, i);

> +        is = &s->milliseconds.sample[index];
> +        ds_put_format(str,
> +                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
> +                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
> +                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
> +                      is->timestamp,
> +                      is->iterations,
> +                      is->iterations ? is->cycles / is->iterations : 0,
> +                      is->pkts,
> +                      is->pkts ? is->busy_cycles / is->pkts : 0,
> +                      is->batches ? is->pkts / is->batches : 0,
> +                      is->iterations
> +                          ? is->max_vhost_qfill / is->iterations : 0,
> +                      is->upcalls,
> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> +    }
>  }
>  
>  void
> @@ -51,10 +357,48 @@ pmd_perf_read_counters(struct pmd_perf_stats *s,
>      }
>  }
>  
> +/* This function clears the PMD performance counters from within the PMD
> + * thread or from another thread when the PMD thread is not executing its
> + * poll loop. */
>  void
> -pmd_perf_stats_clear(struct pmd_perf_stats *s)
> +pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
> +    OVS_REQUIRES(s->stats_mutex)
>  {
> +    ovs_mutex_lock(&s->clear_mutex);
>      for (int i = 0; i < PMD_N_STATS; i++) {
>          atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
>      }
> +    /* The following stats are only applicable in the PMD thread. */
> +    memset(&s->current, 0, sizeof(struct iter_stats));
> +    memset(&s->totals, 0, sizeof(struct iter_stats));
> +    histogram_clear(&s->cycles);
> +    histogram_clear(&s->pkts);
> +    histogram_clear(&s->cycles_per_pkt);
> +    histogram_clear(&s->upcalls);
> +    histogram_clear(&s->cycles_per_upcall);
> +    histogram_clear(&s->pkts_per_batch);
> +    histogram_clear(&s->max_vhost_qfill);
> +    history_init(&s->iterations);
> +    history_init(&s->milliseconds);
> +    s->start_ms = time_msec();
> +    s->milliseconds.sample[0].timestamp = s->start_ms;
> +    /* Clearing finished. */
> +    s->clear = false;
> +    ovs_mutex_unlock(&s->clear_mutex);
> +}
> +
> +/* This function can be called from anywhere to clear the stats
> + * of PMD and non-PMD threads. */
> +void
> +pmd_perf_stats_clear(struct pmd_perf_stats *s)
> +{
> +    if (ovs_mutex_trylock(&s->stats_mutex) == 0) {
> +        /* Locking successful. PMD not polling. */
> +        pmd_perf_stats_clear_lock(s);
> +        ovs_mutex_unlock(&s->stats_mutex);
> +    } else {
> +        /* Request the polling PMD to clear the stats. There is no need to
> +         * block here as stats retrieval is prevented during clearing. */
> +        s->clear = true;
> +    }
>  }
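The trylock-or-flag scheme here is the interesting part of the clearing path. A minimal single-threaded sketch of the same pattern, with invented names and plain pthreads standing in for ovs_mutex, shows both branches:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical illustration of the clearing scheme in this patch: if the
 * PMD is not polling, the requesting thread takes stats_mutex and clears
 * directly; otherwise it only sets a flag that the PMD consumes at the
 * start of its next iteration. All names are invented for the sketch. */
struct demo_stats {
    pthread_mutex_t stats_mutex;   /* Held by the PMD while polling. */
    volatile bool clear;           /* Set by requester, consumed by PMD. */
    unsigned long counter;
};

static void
demo_clear_request(struct demo_stats *s)
{
    if (pthread_mutex_trylock(&s->stats_mutex) == 0) {
        s->counter = 0;            /* PMD idle: clear in place. */
        pthread_mutex_unlock(&s->stats_mutex);
    } else {
        s->clear = true;           /* PMD busy: defer to the PMD. */
    }
}

static void
demo_pmd_iteration_start(struct demo_stats *s)
{
    if (s->clear) {                /* Deferred clear, in PMD context. */
        s->counter = 0;
        s->clear = false;
    }
}
```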
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 5993c25..fd9b0fc 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -38,10 +38,18 @@
>  extern "C" {
>  #endif
>  
> -/* This module encapsulates data structures and functions to maintain PMD
> - * performance metrics such as packet counters, execution cycles. It
> - * provides a clean API for dpif-netdev to initialize, update and read and
> +/* This module encapsulates data structures and functions to maintain basic PMD
> + * performance metrics such as packet counters and execution cycles, as well as
> + * histograms and time series recording for more detailed PMD metrics.
> + *
> + * It provides a clean API for dpif-netdev to initialize, update and read and
>   * reset these metrics.
> + *
> + * The basic set of PMD counters is implemented as atomic_uint64_t variables
> + * to guarantee correct reads also on 32-bit systems.
> + *
> + * The detailed PMD performance metrics are only supported on 64-bit systems
> + * with atomic 64-bit read and store semantics for plain uint64_t counters.
>   */
>  
>  /* Set of counter types maintained in pmd_perf_stats. */
> @@ -66,6 +74,7 @@ enum pmd_stat_type {
>      PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
>      PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
> +    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>      PMD_N_STATS
>  };
>  
> @@ -81,18 +90,87 @@ struct pmd_counters {
>      uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
>  };
>  
> -/* Container for all performance metrics of a PMD.
> - * Part of the struct dp_netdev_pmd_thread. */
> +/* Data structure to collect statistical distribution of an integer measurement
> + * type in form of a histogram. The wall[] array contains the inclusive
> + * upper boundaries of the bins, while the bin[] array contains the actual
> + * counters per bin. The histogram walls are typically set automatically
> + * using the functions provided below.*/
> +
> +#define NUM_BINS 32             /* Number of histogram bins. */
> +
> +struct histogram {
> +    uint32_t wall[NUM_BINS];
> +    uint64_t bin[NUM_BINS];
> +};
> +
> +/* Data structure to record detailed PMD execution metrics per iteration for
> + * a history period of up to HISTORY_LEN iterations in a circular buffer.
> + * Also used to record up to HISTORY_LEN millisecond averages/totals of these
> + * metrics.*/
> +
> +struct iter_stats {
> +    uint64_t timestamp;         /* TSC or millisecond. */
> +    uint64_t cycles;            /* Number of TSC cycles spent in it. or ms. */
> +    uint64_t busy_cycles;       /* Cycles spent in busy iterations or ms. */
> +    uint32_t iterations;        /* Iterations in ms. */
> +    uint32_t pkts;              /* Packets processed in iteration or ms. */
> +    uint32_t upcalls;           /* Number of upcalls in iteration or ms. */
> +    uint32_t upcall_cycles;     /* Cycles spent in upcalls in it. or ms. */
> +    uint32_t batches;           /* Number of rx batches in iteration or ms. */
> +    uint32_t max_vhost_qfill;   /* Maximum fill level in iteration or ms. */
> +};
> +
> +#define HISTORY_LEN 1000        /* Length of recorded history
> +                                   (iterations and ms). */
> +#define DEF_HIST_SHOW 20        /* Default number of history samples to
> +                                   display. */
> +
> +struct history {
> +    size_t idx;                 /* Slot to which next call to history_store()
> +                                   will write. */
> +    struct iter_stats sample[HISTORY_LEN];
> +};
> +
> +/* Container for all performance metrics of a PMD within the struct
> + * dp_netdev_pmd_thread. The metrics must be updated from within the PMD
> + * thread but can be read from any thread. The basic PMD counters in
> + * struct pmd_counters can be read without protection against concurrent
> + * clearing. The other metrics may only be safely read with the clear_mutex
> + * held to protect against concurrent clearing. */
>  
>  struct pmd_perf_stats {
> -    /* Start of the current PMD iteration in TSC cycles.*/
> -    uint64_t start_it_tsc;
> +    /* Prevents interference between PMD polling and stats clearing. */
> +    struct ovs_mutex stats_mutex;
> +    /* Set by CLI thread to order clearing of PMD stats. */
> +    volatile bool clear;
> +    /* Prevents stats retrieval while clearing is in progress. */
> +    struct ovs_mutex clear_mutex;
> +    /* Start of the current performance measurement period. */
> +    uint64_t start_ms;
>      /* Latest TSC time stamp taken in PMD. */
>      uint64_t last_tsc;
> +    /* Used to space certain checks in time. */
> +    uint64_t next_check_tsc;
>      /* If non-NULL, outermost cycle timer currently running in PMD. */
>      struct cycle_timer *cur_timer;
>      /* Set of PMD counters with their zero offsets. */
>      struct pmd_counters counters;
> +    /* Statistics of the current iteration. */
> +    struct iter_stats current;
> +    /* Totals for the current millisecond. */
> +    struct iter_stats totals;
> +    /* Histograms for the PMD metrics. */
> +    struct histogram cycles;
> +    struct histogram pkts;
> +    struct histogram cycles_per_pkt;
> +    struct histogram upcalls;
> +    struct histogram cycles_per_upcall;
> +    struct histogram pkts_per_batch;
> +    struct histogram max_vhost_qfill;
> +    /* Iteration history buffer. */
> +    struct history iterations;
> +    /* Millisecond history buffer. */
> +    struct history milliseconds;
>  };
>  
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> @@ -175,8 +253,14 @@ cycle_timer_stop(struct pmd_perf_stats *s,
>      return now - timer->start;
>  }
>  
> +/* Functions to initialize and reset the PMD performance metrics. */
> +
>  void pmd_perf_stats_init(struct pmd_perf_stats *s);
>  void pmd_perf_stats_clear(struct pmd_perf_stats *s);
> +void pmd_perf_stats_clear_lock(struct pmd_perf_stats *s);
> +
> +/* Functions to read and update PMD counters. */
> +
>  void pmd_perf_read_counters(struct pmd_perf_stats *s,
>                              uint64_t stats[PMD_N_STATS]);
>  
> @@ -199,32 +283,182 @@ pmd_perf_update_counter(struct pmd_perf_stats *s,
>      atomic_store_relaxed(&s->counters.n[counter], tmp);
>  }
>  
> +/* Functions to manipulate a sample history. */
> +
> +static inline void
> +histogram_add_sample(struct histogram *hist, uint32_t val)
> +{
> +    /* TODO: Can we do better with a binary search? */
> +    for (int i = 0; i < NUM_BINS-1; i++) {
> +        if (val <= hist->wall[i]) {
> +            hist->bin[i]++;
> +            return;
> +        }
> +    }
> +    hist->bin[NUM_BINS-1]++;
> +}
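Regarding the TODO: since wall[] is sorted and ends in UINT32_MAX, the linear scan can indeed be replaced by a lower-bound binary search. A hedged standalone sketch (invented names, not tied to the patch's structs):

```c
#include <assert.h>
#include <stdint.h>

#define N_BINS 32

/* Sketch of the binary search the TODO hints at: find the first wall
 * that is >= val. Because walls are inclusive upper boundaries and
 * wall[N_BINS - 1] == UINT32_MAX, the search always terminates on a
 * valid bin index. */
static int
bin_index(const uint32_t wall[N_BINS], uint32_t val)
{
    int lo = 0, hi = N_BINS - 1;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (val <= wall[mid]) {
            hi = mid;       /* wall[mid] is a candidate; keep it in range. */
        } else {
            lo = mid + 1;   /* val is above wall[mid]; discard left half. */
        }
    }
    return lo;
}
```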
> +
> +uint64_t histogram_samples(const struct histogram *hist);
> +
> +/* Add an offset to idx modulo HISTORY_LEN. */
> +static inline uint32_t
> +history_add(uint32_t idx, uint32_t offset)
> +{
> +    return (idx + offset) % HISTORY_LEN;
> +}
> +
> +/* Subtract idx2 from idx1 modulo HISTORY_LEN. */

Do the comments on these two functions (history_add and history_sub)
really do anything to help the reader?  Maybe they should explain when
these functions would be used for calculating an index into the
timing history sample array?

> +static inline uint32_t
> +history_sub(uint32_t idx1, uint32_t idx2)
> +{
> +    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
> +}
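On the question of when these helpers are used: a standalone sketch (copies of the two helpers under invented names) showing the intended index arithmetic, e.g. that the sample recorded i iterations ago lives at history_sub(idx, i):

```c
#include <assert.h>
#include <stdint.h>

#define HIST_LEN 1000   /* Mirrors HISTORY_LEN in the patch. */

/* Add an offset to a ring-buffer index, wrapping at HIST_LEN. Used when
 * advancing the write index of the sample history. */
static inline uint32_t
hist_add(uint32_t idx, uint32_t offset)
{
    return (idx + offset) % HIST_LEN;
}

/* Subtract an offset, wrapping at HIST_LEN. Used when walking backwards
 * from the write index to read the i-th most recent sample. */
static inline uint32_t
hist_sub(uint32_t idx1, uint32_t idx2)
{
    return (idx1 + HIST_LEN - idx2) % HIST_LEN;
}
```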
> +
> +static inline struct iter_stats *
> +history_current(struct history *h)
> +{
> +    return &h->sample[h->idx];
> +}
> +
> +static inline struct iter_stats *
> +history_next(struct history *h)
> +{
> +    size_t next_idx = (h->idx + 1) % HISTORY_LEN;

Maybe:
       size_t next_idx = history_add(h->idx, 1);

> +    struct iter_stats *next = &h->sample[next_idx];
> +
> +    memset(next, 0, sizeof(*next));
> +    h->idx = next_idx;
> +    return next;
> +}
> +
> +static inline struct iter_stats *
> +history_store(struct history *h, struct iter_stats *is)
> +{
> +    if (is) {
> +        h->sample[h->idx] = *is;
> +    }
> +    /* Advance the history pointer */
> +    return history_next(h);
> +}
> +
> +/* Functions recording PMD metrics per iteration. */
> +
>  static inline void
>  pmd_perf_start_iteration(struct pmd_perf_stats *s)
>  {
> +    if (s->clear) {
> +        /* Clear the PMD stats before starting next iteration. */
> +        pmd_perf_stats_clear_lock(s);
> +    }
> +    /* Initialize the current interval stats. */
> +    memset(&s->current, 0, sizeof(struct iter_stats));
>      if (OVS_LIKELY(s->last_tsc)) {
>          /* We assume here that last_tsc was updated immediately prior at
>           * the end of the previous iteration, or just before the first
>           * iteration. */
> -        s->start_it_tsc = s->last_tsc;
> +        s->current.timestamp = s->last_tsc;
>      } else {
>          /* In case last_tsc has never been set before. */
> -        s->start_it_tsc = cycles_counter_update(s);
> +        s->current.timestamp = cycles_counter_update(s);
>      }
>  }
>  
>  static inline void
> -pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
> +pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> +                       int tx_packets, bool full_metrics)

This function is sufficiently complex that it should not be part of the
header.

>  {
> -    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
> +    uint64_t now_tsc = cycles_counter_update(s);
> +    struct iter_stats *cum_ms;
> +    uint64_t cycles, cycles_per_pkt = 0;
>  
> -    if (rx_packets > 0) {
> +    cycles = now_tsc - s->current.timestamp;
> +    s->current.cycles = cycles;
> +    s->current.pkts = rx_packets;
> +
> +    if (rx_packets + tx_packets > 0) {
>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
>      } else {
>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
>      }
> +    /* Add iteration samples to histograms. */
> +    histogram_add_sample(&s->cycles, cycles);
> +    histogram_add_sample(&s->pkts, rx_packets);
> +
> +    if (!full_metrics) {
> +        return;
> +    }
> +
> +    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
> +
> +    if (rx_packets > 0) {
> +        cycles_per_pkt = cycles / rx_packets;
> +        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
> +    }
> +    if (s->current.batches > 0) {
> +        histogram_add_sample(&s->pkts_per_batch,
> +                             rx_packets / s->current.batches);
> +    }
> +    histogram_add_sample(&s->upcalls, s->current.upcalls);
> +    if (s->current.upcalls > 0) {
> +        histogram_add_sample(&s->cycles_per_upcall,
> +                             s->current.upcall_cycles / s->current.upcalls);
> +    }
> +    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
> +
> +    /* Add iteration samples to millisecond stats. */
> +    cum_ms = history_current(&s->milliseconds);
> +    cum_ms->iterations++;
> +    cum_ms->cycles += cycles;
> +    if (rx_packets > 0) {
> +        cum_ms->busy_cycles += cycles;
> +    }
> +    cum_ms->pkts += s->current.pkts;
> +    cum_ms->upcalls += s->current.upcalls;
> +    cum_ms->upcall_cycles += s->current.upcall_cycles;
> +    cum_ms->batches += s->current.batches;
> +    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
> +
> +    /* Store in iteration history. This advances the iteration idx and
> +     * clears the next slot in the iteration history. */
> +    history_store(&s->iterations, &s->current);
> +    if (now_tsc > s->next_check_tsc) {
> +        /* Check if ms is completed and store in milliseconds history. */
> +        uint64_t now = time_msec();
> +        if (now != cum_ms->timestamp) {
> +            /* Add ms stats to totals. */
> +            s->totals.iterations += cum_ms->iterations;
> +            s->totals.cycles += cum_ms->cycles;
> +            s->totals.busy_cycles += cum_ms->busy_cycles;
> +            s->totals.pkts += cum_ms->pkts;
> +            s->totals.upcalls += cum_ms->upcalls;
> +            s->totals.upcall_cycles += cum_ms->upcall_cycles;
> +            s->totals.batches += cum_ms->batches;
> +            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
> +            cum_ms = history_next(&s->milliseconds);
> +            cum_ms->timestamp = now;
> +        }
> +        s->next_check_tsc = cycles_counter_update(s) + 10000;

This is spacing by 10ms?  Am I reading correctly?  Maybe a comment, or a
constant.

> +    }
>  }
>  
> +/* Formatting the output of commands. */
> +
> +struct pmd_perf_params {
> +    int command_type;
> +    bool histograms;
> +    size_t iter_hist_len;
> +    size_t ms_hist_len;
> +};
> +
> +void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> +                                   double duration);
> +void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
> +void pmd_perf_format_iteration_history(struct ds *str,
> +                                       struct pmd_perf_stats *s,
> +                                       int n_iter);
> +void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
> +                                int n_ms);
> +
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man
> new file mode 100644
> index 0000000..76c3e4e
> --- /dev/null
> +++ b/lib/dpif-netdev-unixctl.man
> @@ -0,0 +1,157 @@
> +.SS "DPIF-NETDEV COMMANDS"
> +These commands are used to expose internal information (mostly statistics)
> +about the "dpif-netdev" userspace datapath. If there is only one datapath
> +(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> +argument can be omitted. By default the commands present data for all pmd
> +threads in the datapath. By specifying the \fB-pmd\fR \fIcore\fR option one can
> +filter the output for a single pmd in the datapath.
> +.
> +.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +Shows performance statistics for one or all pmd threads of the datapath
> +\fIdp\fR. The special thread "main" sums up the statistics of every non-pmd
> +thread.
> +
> +The sum of "emc hits", "masked hits" and "miss" is the number of
> +packet lookups performed by the datapath. Beware that a recirculated packet
> +experiences one additional lookup per recirculation, so there may be
> +more lookups than forwarded packets in the datapath.
> +
> +Cycles are counted using the TSC or similar facilities (when available on
> +the platform). The duration of one cycle depends on the processing platform.
> +
> +"idle cycles" refers to cycles spent in PMD iterations not forwarding any
> +packets. "processing cycles" refers to cycles spent in PMD iterations
> +forwarding at least one packet, including the cost for polling, processing and
> +transmitting said packets.
> +
> +To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
> +.
> +.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> +Resets to zero the per pmd thread performance numbers shown by the
> +\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR commands.
> +It will NOT reset datapath or bridge statistics, only the values shown by
> +the above commands.
> +.
> +.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR \fIiter_len\fR] \
> +[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +Shows detailed performance metrics for one or all pmd threads of the
> +userspace datapath.
> +
> +The collection of detailed statistics can be controlled by a new
> +configuration parameter "other_config:pmd-perf-metrics". By default it
> +is disabled. The run-time overhead, when enabled, is in the order of 1%.
> +
> +The covered metrics per iteration are:
> +.RS
> +.IP
> +.PD .4v
> +.IP \(em
> +used cycles
> +.IP \(em
> +forwarded packets
> +.IP \(em
> +number of rx batches
> +.IP \(em
> +packets/rx batch
> +.IP \(em
> +max. vhostuser queue fill level
> +.IP \(em
> +number of upcalls
> +.IP \(em
> +cycles spent in upcalls
> +.PD
> +.RE
> +.IP
> +This raw recorded data is used threefold:
> +
> +.RS
> +.IP
> +.PD .4v
> +.IP 1.
> +In histograms for each of the following metrics:
> +.RS
> +.IP \(em
> +cycles/iteration (logarithmic)
> +.IP \(em
> +packets/iteration (logarithmic)
> +.IP \(em
> +cycles/packet
> +.IP \(em
> +packets/batch
> +.IP \(em
> +max. vhostuser qlen (logarithmic)
> +.IP \(em
> +upcalls
> +.IP \(em
> +cycles/upcall (logarithmic)
> +The histogram bins are divided linearly or logarithmically.
> +.RE
> +.IP 2.
> +A cyclic history of the above metrics for 999 iterations
> +.IP 3.
> +A cyclic history of the cumulative/average values per millisecond wall
> +clock for the last 1000 milliseconds:
> +.RS
> +.IP \(em
> +number of iterations
> +.IP \(em
> +avg. cycles/iteration
> +.IP \(em
> +packets (Kpps)
> +.IP \(em
> +avg. packets/batch
> +.IP \(em
> +avg. max vhost qlen
> +.IP \(em
> +upcalls
> +.IP \(em
> +avg. cycles/upcall
> +.RE
> +.PD
> +.RE
> +.IP
> +.
> +The command options are:
> +.RS
> +.IP "\fB-nh\fR"
> +Suppress the histograms
> +.IP "\fB-it\fR \fIiter_len\fR"
> +Display the last iter_len iteration stats
> +.IP "\fB-ms\fR \fIms_len\fR"
> +Display the last ms_len millisecond stats
> +.RE
> +.IP
> +The output always contains the following global PMD statistics:
> +.RS
> +.IP
> +Time: 15:24:55.270
> +.br
> +Measurement duration: 1.008 s
> +
> +pmd thread numa_id 0 core_id 1:
> +
> +  Cycles:            2419034712  (2.40 GHz)
> +  Iterations:            572817  (1.76 us/it)
> +  - idle:                486808  (15.9 % cycles)
> +  - busy:                 86009  (84.1 % cycles)
> +  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
> +  Datapath passes:      3599415  (1.50 passes/pkt)
> +  - EMC hits:            336472  ( 9.3 %)
> +  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
> +  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
> +  - Lost upcalls:             0  ( 0.0 %)
> +  Tx packets:           2399607  (2381 Kpps)
> +  Tx batches:            171400  (14.00 pkts/batch)
> +.RE
> +.IP
> +Here "Rx packets" actually reflects the number of packets forwarded by the
> +datapath. "Datapath passes" matches the number of packet lookups as
> +reported by the \fBdpif-netdev/pmd-stats-show\fR command.
> +
> +To reset the counters and start a new measurement use
> +\fBdpif-netdev/pmd-stats-clear\fR.
> +.
> +.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +For one or all pmd threads of the datapath \fIdp\fR, show the list of
> +queue-ids and port names that each thread polls.
> +.
> +.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> +Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 86d8739..f245ce2 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -49,6 +49,7 @@
>  #include "id-pool.h"
>  #include "latch.h"
>  #include "netdev.h"
> +#include "netdev-provider.h"
>  #include "netdev-vport.h"
>  #include "netlink.h"
>  #include "odp-execute.h"
> @@ -281,6 +282,8 @@ struct dp_netdev {
>  
>      /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
> +    /* Enable collection of PMD performance metrics. */
> +    atomic_bool pmd_perf_metrics;
>  
>      /* Protects access to ofproto-dpif-upcall interface during revalidator
>       * thread synchronization. */
> @@ -356,6 +359,7 @@ struct dp_netdev_rxq {
>                                            particular core. */
>      unsigned intrvl_idx;               /* Write index for 'cycles_intrvl'. */
>      struct dp_netdev_pmd_thread *pmd;  /* pmd thread that polls this queue. */
> +    bool is_vhost;                     /* Is rxq of a vhost port. */
>  
>      /* Counters of cycles spent successfully polling and processing pkts. */
>      atomic_ullong cycles[RXQ_N_CYCLES];
> @@ -717,6 +721,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
>  static void emc_clear_entry(struct emc_entry *ce);
>  
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
> +static inline bool
> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
>  
>  static void
>  emc_cache_init(struct emc_cache *flow_cache)
> @@ -800,7 +806,8 @@ get_dp_netdev(const struct dpif *dpif)
>  enum pmd_info_type {
>      PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
>      PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
> -    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
> +    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
> +    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
>  };
>  
>  static void
> @@ -891,6 +898,47 @@ pmd_info_show_stats(struct ds *reply,
>                    stats[PMD_CYCLES_ITER_BUSY], total_packets);
>  }
>  
> +static void
> +pmd_info_show_perf(struct ds *reply,
> +                   struct dp_netdev_pmd_thread *pmd,
> +                   struct pmd_perf_params *par)
> +{
> +    if (pmd->core_id != NON_PMD_CORE_ID) {
> +        char *time_str =
> +                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
> +        long long now = time_msec();
> +        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
> +
> +        ds_put_cstr(reply, "\n");
> +        ds_put_format(reply, "Time: %s\n", time_str);
> +        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
> +        ds_put_cstr(reply, "\n");
> +        format_pmd_thread(reply, pmd);
> +        ds_put_cstr(reply, "\n");
> +        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
> +        if (pmd_perf_metrics_enabled(pmd)) {
> +            /* Prevent parallel clearing of perf metrics. */
> +            ovs_mutex_lock(&pmd->perf_stats.clear_mutex);
> +            if (par->histograms) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_histograms(reply, &pmd->perf_stats);
> +            }
> +            if (par->iter_hist_len > 0) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_iteration_history(reply, &pmd->perf_stats,
> +                        par->iter_hist_len);
> +            }
> +            if (par->ms_hist_len > 0) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
> +                        par->ms_hist_len);
> +            }
> +            ovs_mutex_unlock(&pmd->perf_stats.clear_mutex);
> +        }
> +        free(time_str);
> +    }
> +}
> +
>  static int
>  compare_poll_list(const void *a_, const void *b_)
>  {
> @@ -1068,7 +1116,7 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>      ovs_mutex_lock(&dp_netdev_mutex);
>  
>      while (argc > 1) {
> -        if (!strcmp(argv[1], "-pmd") && argc >= 3) {
> +        if (!strcmp(argv[1], "-pmd") && argc > 2) {
>              if (str_to_uint(argv[2], 10, &core_id)) {
>                  filter_on_pmd = true;
>              }
> @@ -1108,6 +1156,8 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>              pmd_perf_stats_clear(&pmd->perf_stats);
>          } else if (type == PMD_INFO_SHOW_STATS) {
>              pmd_info_show_stats(&reply, pmd);
> +        } else if (type == PMD_INFO_PERF_SHOW) {
> +            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>          }
>      }
>      free(pmd_list);
> @@ -1117,6 +1167,48 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>      unixctl_command_reply(conn, ds_cstr(&reply));
>      ds_destroy(&reply);
>  }
> +
> +static void
> +pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
> +                          const char *argv[],
> +                          void *aux OVS_UNUSED)
> +{
> +    struct pmd_perf_params par;
> +    long int it_hist = 0, ms_hist = 0;
> +    par.histograms = true;
> +
> +    while (argc > 1) {
> +        if (!strcmp(argv[1], "-nh")) {
> +            par.histograms = false;
> +            argc -= 1;
> +            argv += 1;
> +        } else if (!strcmp(argv[1], "-it") && argc > 2) {
> +            it_hist = strtol(argv[2], NULL, 10);
> +            if (it_hist < 0) {
> +                it_hist = 0;
> +            } else if (it_hist > HISTORY_LEN) {
> +                it_hist = HISTORY_LEN;
> +            }
> +            argc -= 2;
> +            argv += 2;
> +        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
> +            ms_hist = strtol(argv[2], NULL, 10);
> +            if (ms_hist < 0) {
> +                ms_hist = 0;
> +            } else if (ms_hist > HISTORY_LEN) {
> +                ms_hist = HISTORY_LEN;
> +            }
> +            argc -= 2;
> +            argv += 2;
> +        } else {
> +            break;
> +        }
> +    }
> +    par.iter_hist_len = it_hist;
> +    par.ms_hist_len = ms_hist;
> +    par.command_type = PMD_INFO_PERF_SHOW;
> +    dpif_netdev_pmd_info(conn, argc, argv, &par);
> +}
>  
>  static int
>  dpif_netdev_init(void)
> @@ -1134,6 +1226,12 @@ dpif_netdev_init(void)
>      unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd core] [dp]",
>                               0, 3, dpif_netdev_pmd_info,
>                               (void *)&poll_aux);
> +    unixctl_command_register("dpif-netdev/pmd-perf-show",
> +                             "[-nh] [-it iter-history-len]"
> +                             " [-ms ms-history-len]"
> +                             " [-pmd core] [dp]",
> +                             0, 8, pmd_perf_show_cmd,
> +                             NULL);
>      unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
>                               0, 1, dpif_netdev_pmd_rebalance,
>                               NULL);
> @@ -3020,6 +3118,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
>  
> +    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
> +    bool cur_perf_enabled;
> +    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
> +    if (perf_enabled != cur_perf_enabled) {
> +        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
> +        if (perf_enabled) {
> +            VLOG_INFO("PMD performance metrics collection enabled");
> +        } else {
> +            VLOG_INFO("PMD performance metrics collection disabled");
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -3189,6 +3299,21 @@ dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned idx)
>      return processing_cycles;
>  }
>  
> +static inline bool
> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
> +{
> +    /* If stores and reads of 64-bit integers are not atomic, the
> +     * full PMD performance metrics are not available as locked
> +     * access to 64 bit integers would be prohibitively expensive. */
> +#if ATOMIC_LLONG_LOCK_FREE
> +    bool pmd_perf_enabled;
> +    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
> +    return pmd_perf_enabled;
> +#else
> +    return false;
> +#endif
> +}
> +
>  static int
>  dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
>                                     struct tx_port *p)
> @@ -3264,10 +3389,12 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>                             struct dp_netdev_rxq *rxq,
>                             odp_port_t port_no)
>  {
> +    struct pmd_perf_stats *s = &pmd->perf_stats;
>      struct dp_packet_batch batch;
>      struct cycle_timer timer;
>      int error;
> -    int batch_cnt = 0, output_cnt = 0;
> +    int batch_cnt = 0;
> +    int rem_qlen = 0, *qlen_p = NULL;
>      uint64_t cycles;
>  
>      /* Measure duration for polling and processing rx burst. */
> @@ -3276,20 +3403,37 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>      pmd->ctx.last_rxq = rxq;
>      dp_packet_batch_init(&batch);
>  
> -    error = netdev_rxq_recv(rxq->rx, &batch, NULL);
> +    /* Fetch the rx queue length only for vhostuser ports. */
> +    if (pmd_perf_metrics_enabled(pmd) && rxq->is_vhost) {
> +        qlen_p = &rem_qlen;
> +    }
> +
> +    error = netdev_rxq_recv(rxq->rx, &batch, qlen_p);
>      if (!error) {
>          /* At least one packet received. */
>          *recirc_depth_get() = 0;
>          pmd_thread_ctx_time_update(pmd);
> -
>          batch_cnt = batch.count;
> +        if (pmd_perf_metrics_enabled(pmd)) {
> +            /* Update batch histogram. */
> +            s->current.batches++;
> +            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
> +            /* Update the maximum vhost rx queue fill level. */
> +            if (rxq->is_vhost && rem_qlen >= 0) {
> +                uint32_t qfill = batch_cnt + rem_qlen;
> +                if (qfill > s->current.max_vhost_qfill) {
> +                    s->current.max_vhost_qfill = qfill;
> +                }
> +            }
> +        }
> +        /* Process packet batch. */
>          dp_netdev_input(pmd, &batch, port_no);
>  
>          /* Assign processing cycles to rx queue. */
>          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
>          dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
>  
> -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
> +        dp_netdev_pmd_flush_output_packets(pmd, false);
>      } else {
>          /* Discard cycles. */
>          cycle_timer_stop(&pmd->perf_stats, &timer);
> @@ -3303,7 +3447,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>  
>      pmd->ctx.last_rxq = NULL;
>  
> -    return batch_cnt + output_cnt;
> +    return batch_cnt;
>  }
>  
>  static struct tx_port *
> @@ -3359,6 +3503,7 @@ port_reconfigure(struct dp_netdev_port *port)
>          }
>  
>          port->rxqs[i].port = port;
> +        port->rxqs[i].is_vhost = !strncmp(port->type, "dpdkvhost", 9);
>  
>          err = netdev_rxq_open(netdev, &port->rxqs[i].rx, i);
>          if (err) {
> @@ -4137,23 +4282,26 @@ reload:
>      pmd->intrvl_tsc_prev = 0;
>      atomic_store_relaxed(&pmd->intrvl_cycles, 0);
>      cycles_counter_update(s);
> +    /* Protect pmd stats from external clearing while polling. */
> +    ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
>      for (;;) {
> -        uint64_t iter_packets = 0;
> +        uint64_t rx_packets = 0, tx_packets = 0;
>  
>          pmd_perf_start_iteration(s);
> +
>          for (i = 0; i < poll_cnt; i++) {
>              process_packets =
>                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
>                                             poll_list[i].port_no);
> -            iter_packets += process_packets;
> +            rx_packets += process_packets;
>          }
>  
> -        if (!iter_packets) {
> +        if (!rx_packets) {
>              /* We didn't receive anything in the process loop.
>               * Check if we need to send something.
>               * There was no time updates on current iteration. */
>              pmd_thread_ctx_time_update(pmd);
> -            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>          }
>  
>          if (lc++ > 1024) {
> @@ -4172,8 +4320,10 @@ reload:
>                  break;
>              }
>          }
> -        pmd_perf_end_iteration(s, iter_packets);
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> +                               pmd_perf_metrics_enabled(pmd));
>      }
> +    ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
>  
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>      exiting = latch_is_set(&pmd->exit_latch);
> @@ -5068,6 +5218,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>      struct match match;
>      ovs_u128 ufid;
>      int error;
> +    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
>  
>      match.tun_md.valid = false;
>      miniflow_expand(&key->mf, &match.flow);
> @@ -5121,6 +5272,14 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>          ovs_mutex_unlock(&pmd->flow_mutex);
>          emc_probabilistic_insert(pmd, key, netdev_flow);
>      }
> +    if (pmd_perf_metrics_enabled(pmd)) {
> +        /* Update upcall stats. */
> +        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
> +        struct pmd_perf_stats *s = &pmd->perf_stats;
> +        s->current.upcalls++;
> +        s->current.upcall_cycles += cycles;
> +        histogram_add_sample(&s->cycles_per_upcall, cycles);
> +    }
>      return error;
>  }
>  
> diff --git a/manpages.mk b/manpages.mk
> index d4bf0ec..aaf8bc2 100644
> --- a/manpages.mk
> +++ b/manpages.mk
> @@ -250,6 +250,7 @@ vswitchd/ovs-vswitchd.8: \
>  	lib/coverage-unixctl.man \
>  	lib/daemon.man \
>  	lib/dpctl.man \
> +	lib/dpif-netdev-unixctl.man \
>  	lib/memory-unixctl.man \
>  	lib/netdev-dpdk-unixctl.man \
>  	lib/service.man \
> @@ -266,6 +267,7 @@ lib/common.man:
>  lib/coverage-unixctl.man:
>  lib/daemon.man:
>  lib/dpctl.man:
> +lib/dpif-netdev-unixctl.man:
>  lib/memory-unixctl.man:
>  lib/netdev-dpdk-unixctl.man:
>  lib/service.man:
> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
> index 80e5f53..8b4034d 100644
> --- a/vswitchd/ovs-vswitchd.8.in
> +++ b/vswitchd/ovs-vswitchd.8.in
> @@ -256,32 +256,7 @@ type).
>  ..
>  .so lib/dpctl.man
>  .
> -.SS "DPIF-NETDEV COMMANDS"
> -These commands are used to expose internal information (mostly statistics)
> -about the ``dpif-netdev'' userspace datapath. If there is only one datapath
> -(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> -argument can be omitted.
> -.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
> -Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
> -The special thread ``main'' sums up the statistics of every non pmd thread.
> -The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
> -packets received by the datapath.  Cycles are counted using the TSC or similar
> -facilities (when available on the platform).  To reset these counters use
> -\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
> -measuring infrastructure. ``idle cycles'' refers to cycles spent polling
> -devices but not receiving any packets. ``processing cycles'' refers to cycles
> -spent polling devices and successfully receiving packets, plus the cycles
> -spent processing said packets.
> -.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> -Resets to zero the per pmd thread performance numbers shown by the
> -\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
> -bridge statistics, only the values shown by the above command.
> -.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
> -For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
> -port names, which this thread polls.
> -.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> -Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> -.
> +.so lib/dpif-netdev-unixctl.man
>  .so lib/netdev-dpdk-unixctl.man
>  .so ofproto/ofproto-dpif-unixctl.man
>  .so ofproto/ofproto-unixctl.man
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index f899a19..aac663f 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -375,6 +375,18 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="pmd-perf-metrics"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Enables recording of detailed PMD performance metrics for analysis
> +          and troubleshooting. This can have a performance impact in the
> +          order of 1%.
> +        </p>
> +        <p>
> +          Defaults to false but can be changed at any time.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="n-handler-threads"
>                type='{"type": "integer", "minInteger": 1}'>
>          <p>
Jan Scheurich March 26, 2018, 9:35 p.m. UTC | #3
Hi Aaron,

Thanks for the feedback. A few good suggestions are always welcome.
I will include fixes for your comments in the (hopefully) final version.

Regards, Jan 

> -----Original Message-----
> From: Aaron Conole [mailto:aconole@redhat.com]
> Sent: Monday, 26 March, 2018 23:27
> To: Jan Scheurich <jan.scheurich@ericsson.com>
> Cc: dev@openvswitch.org; i.maximets@samsung.com
> Subject: Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> Hi Jan,
> 
> Some stylistic type comments follow.  Sorry to jump in at the end - but
> you asked for checkpatch changes, so I improved and ran it against your
> patch and found some stuff for which I have an opinion. :)  Maybe
> nothing to hold up merging but cleanup stuff.
> 
> Jan Scheurich <jan.scheurich@ericsson.com> writes:
> 
> > This patch instruments the dpif-netdev datapath to record detailed
> > statistics of what is happening in every iteration of a PMD thread.
> >
> > The collection of detailed statistics can be controlled by a new
> > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> > By default it is disabled. The run-time overhead, when enabled, is
> > in the order of 1%.
> >
> > The covered metrics per iteration are:
> >   - cycles
> >   - packets
> >   - (rx) batches
> >   - packets/batch
> >   - max. vhostuser qlen
> >   - upcalls
> >   - cycles spent in upcalls
> >
> > This raw recorded data is used threefold:
> >
> > 1. In histograms for each of the following metrics:
> >    - cycles/iteration (log.)
> >    - packets/iteration (log.)
> >    - cycles/packet
> >    - packets/batch
> >    - max. vhostuser qlen (log.)
> >    - upcalls
> >    - cycles/upcall (log)
> >    The histograms bins are divided linear or logarithmic.
> >
> > 2. A cyclic history of the above statistics for 999 iterations
> >
> > 3. A cyclic history of the cummulative/average values per millisecond
> >    wall clock for the last 1000 milliseconds:
> >    - number of iterations
> >    - avg. cycles/iteration
> >    - packets (Kpps)
> >    - avg. packets/batch
> >    - avg. max vhost qlen
> >    - upcalls
> >    - avg. cycles/upcall
> >
> > The gathered performance metrics can be printed at any time with the
> > new CLI command
> >
> > ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
> >     [-pmd core] [dp]
> >
> > The options are
> >
> > -nh:            Suppress the histograms
> > -it iter_len:   Display the last iter_len iteration stats
> > -ms ms_len:     Display the last ms_len millisecond stats
> > -pmd core:      Display only the specified PMD
> >
> > The performance statistics are reset with the existing
> > dpif-netdev/pmd-stats-clear command.
> >
> > The output always contains the following global PMD statistics,
> > similar to the pmd-stats-show command:
> >
> > Time: 15:24:55.270
> > Measurement duration: 1.008 s
> >
> > pmd thread numa_id 0 core_id 1:
> >
> >   Cycles:            2419034712  (2.40 GHz)
> >   Iterations:            572817  (1.76 us/it)
> >   - idle:                486808  (15.9 % cycles)
> >   - busy:                 86009  (84.1 % cycles)
> >   Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
> >   Datapath passes:      3599415  (1.50 passes/pkt)
> >   - EMC hits:            336472  ( 9.3 %)
> >   - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
> >   - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
> >   - Lost upcalls:             0  ( 0.0 %)
> >   Tx packets:           2399607  (2381 Kpps)
> >   Tx batches:            171400  (14.00 pkts/batch)
> >
> > Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> > Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
> > ---
> >  NEWS                        |   3 +
> >  lib/automake.mk             |   1 +
> >  lib/dpif-netdev-perf.c      | 350 +++++++++++++++++++++++++++++++++++++++++++-
> >  lib/dpif-netdev-perf.h      | 258 ++++++++++++++++++++++++++++++--
> >  lib/dpif-netdev-unixctl.man | 157 ++++++++++++++++++++
> >  lib/dpif-netdev.c           | 183 +++++++++++++++++++++--
> >  manpages.mk                 |   2 +
> >  vswitchd/ovs-vswitchd.8.in  |  27 +---
> >  vswitchd/vswitch.xml        |  12 ++
> >  9 files changed, 940 insertions(+), 53 deletions(-)
> >  create mode 100644 lib/dpif-netdev-unixctl.man
> >
> > diff --git a/NEWS b/NEWS
> > index 8d0b502..8f66fd3 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -73,6 +73,9 @@ v2.9.0 - 19 Feb 2018
> >       * Add support for vHost dequeue zero copy (experimental)
> >     - Userspace datapath:
> >       * Output packet batching support.
> > +     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
> > +     * Detailed PMD performance metrics available with new command
> > +         ovs-appctl dpif-netdev/pmd-perf-show
> >     - vswitchd:
> >       * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
> >       * Configuring a controller, or unconfiguring all controllers, now deletes
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index 5c26e0f..7a5632d 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -484,6 +484,7 @@ MAN_FRAGMENTS += \
> >  	lib/dpctl.man \
> >  	lib/memory-unixctl.man \
> >  	lib/netdev-dpdk-unixctl.man \
> > +	lib/dpif-netdev-unixctl.man \
> >  	lib/ofp-version.man \
> >  	lib/ovs.tmac \
> >  	lib/service.man \
> > diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> > index f06991a..2b36410 100644
> > --- a/lib/dpif-netdev-perf.c
> > +++ b/lib/dpif-netdev-perf.c
> > @@ -15,18 +15,324 @@
> >   */
> >
> >  #include <config.h>
> > +#include <stdint.h>
> >
> > +#include "dpif-netdev-perf.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "openvswitch/vlog.h"
> > -#include "dpif-netdev-perf.h"
> > +#include "ovs-thread.h"
> >  #include "timeval.h"
> >
> >  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> >
> > +#ifdef DPDK_NETDEV
> > +static uint64_t
> > +get_tsc_hz(void)
> > +{
> > +    return rte_get_tsc_hz();
> > +}
> > +#else
> > +/* This function is only invoked from PMD threads which depend on DPDK.
> > + * A dummy function is sufficient when building without DPDK_NETDEV. */
> > +static uint64_t
> > +get_tsc_hz(void)
> > +{
> > +    return 1;
> > +}
> > +#endif
> > +
> > +/* Histogram functions. */
> > +
> > +static void
> > +histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
> > +{
> > +    int i;
> > +
> > +    ovs_assert(min < max);
> > +    for (i = 0; i < NUM_BINS-1; i++) {
> > +        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
> > +    }
> > +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> > +}
> > +
> > +static void
> > +histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
> > +{
> > +    int i, start, bins, wall;
> > +    double log_min, log_max;
> > +
> > +    ovs_assert(min < max);
> > +    if (min > 0) {
> > +        log_min = log(min);
> > +        log_max = log(max);
> > +        start = 0;
> > +        bins = NUM_BINS - 1;
> > +    } else {
> > +        hist->wall[0] = 0;
> > +        log_min = log(1);
> > +        log_max = log(max);
> > +        start = 1;
> > +        bins = NUM_BINS - 2;
> > +    }
> > +    wall = start;
> > +    for (i = 0; i < bins; i++) {
> > +        /* Make sure each wall is monotonically increasing. */
> > +        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
> > +        hist->wall[start + i] = wall++;
> > +    }
> > +    if (hist->wall[NUM_BINS-2] < max) {
> > +        hist->wall[NUM_BINS-2] = max;
> > +    }
> > +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> > +}
> > +
> > +uint64_t
> > +histogram_samples(const struct histogram *hist)
> > +{
> > +    uint64_t samples = 0;
> > +
> > +    for (int i = 0; i < NUM_BINS; i++) {
> > +        samples += hist->bin[i];
> > +    }
> > +    return samples;
> > +}
> > +
> > +static void
> > +histogram_clear(struct histogram *hist)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < NUM_BINS; i++) {
> > +        hist->bin[i] = 0;
> > +    }
> > +}
> > +
> > +static void
> > +history_init(struct history *h)
> > +{
> > +    memset(h, 0, sizeof(*h));
> > +}
> > +
> >  void
> >  pmd_perf_stats_init(struct pmd_perf_stats *s)
> >  {
> > -    memset(s, 0 , sizeof(*s));
> > +    memset(s, 0, sizeof(*s));
> > +    ovs_mutex_init(&s->stats_mutex);
> > +    ovs_mutex_init(&s->clear_mutex);
> 
>        Just include some comments on these constants so that it makes it
>        clear what is being measured, etc.
> 
> > +    histogram_walls_set_log(&s->cycles, 500, 24000000);
> > +    histogram_walls_set_log(&s->pkts, 0, 1000);
> > +    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
> > +    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
> > +    histogram_walls_set_lin(&s->upcalls, 0, 30);
> > +    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
> > +    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
> > +    s->start_ms = time_msec();
> > +}
> > +
> > +void
> > +pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> > +                              double duration)
> > +{
> > +    uint64_t stats[PMD_N_STATS];
> > +    double us_per_cycle = 1000000.0 / get_tsc_hz();
> > +
> > +    if (duration == 0) {
> > +        return;
> > +    }
> > +
> > +    pmd_perf_read_counters(s, stats);
> > +    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
> > +                          stats[PMD_CYCLES_ITER_BUSY];
> > +    uint64_t rx_packets = stats[PMD_STAT_RECV];
> > +    uint64_t tx_packets = stats[PMD_STAT_SENT_PKTS];
> > +    uint64_t tx_batches = stats[PMD_STAT_SENT_BATCHES];
> > +    uint64_t passes = stats[PMD_STAT_RECV] +
> > +                      stats[PMD_STAT_RECIRC];
> > +    uint64_t upcalls = stats[PMD_STAT_MISS];
> > +    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
> > +    uint64_t tot_iter = histogram_samples(&s->pkts);
> > +    uint64_t idle_iter = s->pkts.bin[0];
> > +    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> > +
> > +    ds_put_format(str,
> > +            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
> > +            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
> > +            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
> > +            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
> > +            tot_cycles, (tot_cycles / duration) / 1E9,
> > +            tot_iter, tot_cycles * us_per_cycle / tot_iter,
> > +            idle_iter,
> > +            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
> > +            busy_iter,
> > +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> > +    if (rx_packets > 0) {
> > +        ds_put_format(str,
> > +            "  Rx packets:      %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
> > +            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
> > +            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
> > +            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl lookups/"
> > +                                                                     "hit)\n"
> > +            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f us/upcall)\n"
> > +            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n",
> > +            rx_packets, (rx_packets / duration) / 1000,
> > +            1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets,
> > +            passes, rx_packets ? 1.0 * passes / rx_packets : 0,
> > +            stats[PMD_STAT_EXACT_HIT],
> > +            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
> > +            stats[PMD_STAT_MASKED_HIT],
> > +            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
> > +            stats[PMD_STAT_MASKED_HIT]
> > +            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
> > +            : 0,
> > +            upcalls, 100.0 * upcalls / passes,
> > +            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
> > +            stats[PMD_STAT_LOST],
> > +            100.0 * stats[PMD_STAT_LOST] / passes);
> > +    } else {
> > +        ds_put_format(str,
> > +                "  Rx packets:      %12"PRIu64"\n",
> > +                0ULL);
> > +    }
> > +    if (tx_packets > 0) {
> > +        ds_put_format(str,
> > +            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
> > +            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
> > +            "\n",
> > +            tx_packets, (tx_packets / duration) / 1000,
> > +            tx_batches, 1.0 * tx_packets / tx_batches);
> > +    } else {
> > +        ds_put_format(str,
> > +                "  Tx packets:      %12"PRIu64"\n"
> > +                "\n",
> > +                0ULL);
> > +    }
> > +}
> > +
> > +void
> > +pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
> > +{
> > +    int i;
> > +
> > +    ds_put_cstr(str, "Histograms\n");
> > +    ds_put_format(str,
> > +                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> > +                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> > +                  "max vhost qlen", "upcalls/it", "cycles/upcall");
> > +    for (i = 0; i < NUM_BINS-1; i++) {
> > +        ds_put_format(str,
> > +            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> > +            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> > +            "  %-9d %-11"PRIu64"\n",
> > +            s->cycles.wall[i], s->cycles.bin[i],
> > +            s->pkts.wall[i],s->pkts.bin[i],
> > +            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
> > +            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
> > +            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
> > +            s->upcalls.wall[i], s->upcalls.bin[i],
> > +            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
> > +    }
> > +    ds_put_format(str,
> > +                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> > +                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> > +                  "  %-9s %-11"PRIu64"\n",
> > +                  ">", s->cycles.bin[i],
> > +                  ">", s->pkts.bin[i],
> > +                  ">", s->cycles_per_pkt.bin[i],
> > +                  ">", s->pkts_per_batch.bin[i],
> > +                  ">", s->max_vhost_qfill.bin[i],
> > +                  ">", s->upcalls.bin[i],
> > +                  ">", s->cycles_per_upcall.bin[i]);
> > +    if (s->totals.iterations > 0) {
> > +        ds_put_cstr(str,
> > +                    "-----------------------------------------------------"
> > +                    "-----------------------------------------------------"
> > +                    "------------------------------------------------\n");
> > +        ds_put_format(str,
> > +                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> > +                      "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> > +                      "vhost qlen", "upcalls/it", "cycles/upcall");
> > +        ds_put_format(str,
> > +                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
> > +                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
> > +                      s->totals.cycles / s->totals.iterations,
> > +                      1.0 * s->totals.pkts / s->totals.iterations,
> > +                      s->totals.pkts
> > +                          ? s->totals.busy_cycles / s->totals.pkts : 0,
> > +                      s->totals.batches
> > +                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
> > +                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
> > +                      1.0 * s->totals.upcalls / s->totals.iterations,
> > +                      s->totals.upcalls
> > +                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
> > +    }
> > +}
> > +
> > +void
> > +pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
> > +                                  int n_iter)
> > +{
> > +    struct iter_stats *is;
> > +    size_t index;
> > +    int i;
> > +
> > +    if (n_iter == 0) {
> > +        return;
> > +    }
> > +    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
> > +                  "%-10s   %-10s   %-10s\n",
> > +                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
> > +                  "vhost qlen", "upcalls", "cycles/upcall");
> > +    for (i = 1; i <= n_iter; i++) {
> > +        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;
> 
>     index = history_sub(s->iterations.idx, i);
> 
> > +        is = &s->iterations.sample[index];
> > +        ds_put_format(str,
> > +                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
> > +                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
> > +                      "  %-11"PRIu32"  %-11"PRIu32"\n",
> > +                      is->timestamp,
> > +                      is->cycles,
> > +                      is->pkts,
> > +                      is->pkts ? is->cycles / is->pkts : 0,
> > +                      is->batches ? is->pkts / is->batches : 0,
> > +                      is->max_vhost_qfill,
> > +                      is->upcalls,
> > +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> > +    }
> > +}
> > +
> > +void
> > +pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
> > +{
> > +    struct iter_stats *is;
> > +    size_t index;
> > +    int i;
> > +
> > +    if (n_ms == 0) {
> > +        return;
> > +    }
> > +    ds_put_format(str,
> > +                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
> > +                  "   %-10s   %-10s   %-10s   %-10s\n",
> > +                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
> > +                  "pkts/batch", "vhost qlen", "upcalls", "cycles/upcall");
> > +    for (i = 1; i <= n_ms; i++) {
> > +        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;
> 
>     index = history_sub(s->milliseconds.idx, i);
> 
> > +        is = &s->milliseconds.sample[index];
> > +        ds_put_format(str,
> > +                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
> > +                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
> > +                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
> > +                      is->timestamp,
> > +                      is->iterations,
> > +                      is->iterations ? is->cycles / is->iterations : 0,
> > +                      is->pkts,
> > +                      is->pkts ? is->busy_cycles / is->pkts : 0,
> > +                      is->batches ? is->pkts / is->batches : 0,
> > +                      is->iterations
> > +                          ? is->max_vhost_qfill / is->iterations : 0,
> > +                      is->upcalls,
> > +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> > +    }
> >  }
> >
> >  void
> > @@ -51,10 +357,48 @@ pmd_perf_read_counters(struct pmd_perf_stats *s,
> >      }
> >  }
> >
> > +/* This function clears the PMD performance counters from within the PMD
> > + * thread or from another thread when the PMD thread is not executing its
> > + * poll loop. */
> >  void
> > -pmd_perf_stats_clear(struct pmd_perf_stats *s)
> > +pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
> > +    OVS_REQUIRES(s->stats_mutex)
> >  {
> > +    ovs_mutex_lock(&s->clear_mutex);
> >      for (int i = 0; i < PMD_N_STATS; i++) {
> >          atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
> >      }
> > +    /* The following stats are only applicable in PMD thread. */
> > +    memset(&s->current, 0, sizeof(struct iter_stats));
> > +    memset(&s->totals, 0, sizeof(struct iter_stats));
> > +    histogram_clear(&s->cycles);
> > +    histogram_clear(&s->pkts);
> > +    histogram_clear(&s->cycles_per_pkt);
> > +    histogram_clear(&s->upcalls);
> > +    histogram_clear(&s->cycles_per_upcall);
> > +    histogram_clear(&s->pkts_per_batch);
> > +    histogram_clear(&s->max_vhost_qfill);
> > +    history_init(&s->iterations);
> > +    history_init(&s->milliseconds);
> > +    s->start_ms = time_msec();
> > +    s->milliseconds.sample[0].timestamp = s->start_ms;
> > +    /* Clearing finished. */
> > +    s->clear = false;
> > +    ovs_mutex_unlock(&s->clear_mutex);
> > +}
> > +
> > +/* This function can be called from anywhere to clear the stats
> > + * of PMD and non-PMD threads. */
> > +void
> > +pmd_perf_stats_clear(struct pmd_perf_stats *s)
> > +{
> > +    if (ovs_mutex_trylock(&s->stats_mutex) == 0) {
> > +        /* Locking successful. PMD not polling. */
> > +        pmd_perf_stats_clear_lock(s);
> > +        ovs_mutex_unlock(&s->stats_mutex);
> > +    } else {
> > +        /* Request the polling PMD to clear the stats. There is no need to
> > +         * block here as stats retrieval is prevented during clearing. */
> > +        s->clear = true;
> > +    }
> >  }
> > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> > index 5993c25..fd9b0fc 100644
> > --- a/lib/dpif-netdev-perf.h
> > +++ b/lib/dpif-netdev-perf.h
> > @@ -38,10 +38,18 @@
> >  extern "C" {
> >  #endif
> >
> > -/* This module encapsulates data structures and functions to maintain PMD
> > - * performance metrics such as packet counters, execution cycles. It
> > - * provides a clean API for dpif-netdev to initialize, update and read and
> > +/* This module encapsulates data structures and functions to maintain basic PMD
> > + * performance metrics such as packet counters, execution cycles as well as
> > + * histograms and time series recording for more detailed PMD metrics.
> > + *
> > + * It provides a clean API for dpif-netdev to initialize, update and read and
> >   * reset these metrics.
> > + *
> > + * The basic set of PMD counters is implemented as atomic_uint64_t variables
> > + * to guarantee correct reads also on 32-bit systems.
> > + *
> > + * The detailed PMD performance metrics are only supported on 64-bit systems
> > + * with atomic 64-bit read and store semantics for plain uint64_t counters.
> >   */
> >
> >  /* Set of counter types maintained in pmd_perf_stats. */
> > @@ -66,6 +74,7 @@ enum pmd_stat_type {
> >      PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
> >      PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
> >      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
> > +    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> >      PMD_N_STATS
> >  };
> >
> > @@ -81,18 +90,87 @@ struct pmd_counters {
> >      uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
> >  };
> >
> > -/* Container for all performance metrics of a PMD.
> > - * Part of the struct dp_netdev_pmd_thread. */
> > +/* Data structure to collect the statistical distribution of an integer
> > + * measurement type in the form of a histogram. The wall[] array contains
> > + * the inclusive upper boundaries of the bins, while the bin[] array contains
> > + * the actual counters per bin. The histogram walls are typically set
> > + * automatically using the functions provided below. */
> > +
> > +#define NUM_BINS 32             /* Number of histogram bins. */
> > +
> > +struct histogram {
> > +    uint32_t wall[NUM_BINS];
> > +    uint64_t bin[NUM_BINS];
> > +};
> > +
> > +/* Data structure to record detailed PMD execution metrics per iteration for
> > + * a history period of up to HISTORY_LEN iterations in a circular buffer.
> > + * Also used to record up to HISTORY_LEN millisecond averages/totals of these
> > + * metrics. */
> > +
> > +struct iter_stats {
> > +    uint64_t timestamp;         /* TSC or millisecond. */
> > +    uint64_t cycles;            /* Number of TSC cycles spent in it. or ms. */
> > +    uint64_t busy_cycles;       /* Cycles spent in busy iterations or ms. */
> > +    uint32_t iterations;        /* Iterations in ms. */
> > +    uint32_t pkts;              /* Packets processed in iteration or ms. */
> > +    uint32_t upcalls;           /* Number of upcalls in iteration or ms. */
> > +    uint32_t upcall_cycles;     /* Cycles spent in upcalls in it. or ms. */
> > +    uint32_t batches;           /* Number of rx batches in iteration or ms. */
> > +    uint32_t max_vhost_qfill;   /* Maximum fill level in iteration or ms. */
> > +};
> > +
> > +#define HISTORY_LEN 1000        /* Length of recorded history
> > +                                   (iterations and ms). */
> > +#define DEF_HIST_SHOW 20        /* Default number of history samples to
> > +                                   display. */
> > +
> > +struct history {
> > +    size_t idx;                 /* Slot to which next call to history_store()
> > +                                   will write. */
> > +    struct iter_stats sample[HISTORY_LEN];
> > +};
> > +
> > +/* Container for all performance metrics of a PMD within the struct
> > + * dp_netdev_pmd_thread. The metrics must be updated from within the PMD
> > + * thread but can be read from any thread. The basic PMD counters in
> > + * struct pmd_counters can be read without protection against concurrent
> > + * clearing. The other metrics may only be safely read with the clear_mutex
> > + * held to protect against concurrent clearing. */
> >
> >  struct pmd_perf_stats {
> > -    /* Start of the current PMD iteration in TSC cycles.*/
> > -    uint64_t start_it_tsc;
> > +    /* Prevents interference between PMD polling and stats clearing. */
> > +    struct ovs_mutex stats_mutex;
> > +    /* Set by CLI thread to order clearing of PMD stats. */
> > +    volatile bool clear;
> > +    /* Prevents stats retrieval while clearing is in progress. */
> > +    struct ovs_mutex clear_mutex;
> > +    /* Start of the current performance measurement period. */
> > +    uint64_t start_ms;
> >      /* Latest TSC time stamp taken in PMD. */
> >      uint64_t last_tsc;
> > +    /* Used to space certain checks in time. */
> > +    uint64_t next_check_tsc;
> >      /* If non-NULL, outermost cycle timer currently running in PMD. */
> >      struct cycle_timer *cur_timer;
> >      /* Set of PMD counters with their zero offsets. */
> >      struct pmd_counters counters;
> > +    /* Statistics of the current iteration. */
> > +    struct iter_stats current;
> > +    /* Totals for the current millisecond. */
> > +    struct iter_stats totals;
> > +    /* Histograms for the PMD metrics. */
> > +    struct histogram cycles;
> > +    struct histogram pkts;
> > +    struct histogram cycles_per_pkt;
> > +    struct histogram upcalls;
> > +    struct histogram cycles_per_upcall;
> > +    struct histogram pkts_per_batch;
> > +    struct histogram max_vhost_qfill;
> > +    /* Iteration history buffer. */
> > +    struct history iterations;
> > +    /* Millisecond history buffer. */
> > +    struct history milliseconds;
> >  };
> >
> >  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> > @@ -175,8 +253,14 @@ cycle_timer_stop(struct pmd_perf_stats *s,
> >      return now - timer->start;
> >  }
> >
> > +/* Functions to initialize and reset the PMD performance metrics. */
> > +
> >  void pmd_perf_stats_init(struct pmd_perf_stats *s);
> >  void pmd_perf_stats_clear(struct pmd_perf_stats *s);
> > +void pmd_perf_stats_clear_lock(struct pmd_perf_stats *s);
> > +
> > +/* Functions to read and update PMD counters. */
> > +
> >  void pmd_perf_read_counters(struct pmd_perf_stats *s,
> >                              uint64_t stats[PMD_N_STATS]);
> >
> > @@ -199,32 +283,182 @@ pmd_perf_update_counter(struct pmd_perf_stats *s,
> >      atomic_store_relaxed(&s->counters.n[counter], tmp);
> >  }
> >
> > +/* Functions to manipulate a sample history. */
> > +
> > +static inline void
> > +histogram_add_sample(struct histogram *hist, uint32_t val)
> > +{
> > +    /* TODO: Can do better with binary search? */
> > +    for (int i = 0; i < NUM_BINS-1; i++) {
> > +        if (val <= hist->wall[i]) {
> > +            hist->bin[i]++;
> > +            return;
> > +        }
> > +    }
> > +    hist->bin[NUM_BINS-1]++;
> > +}
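On the TODO above: since the `wall[]` boundaries are monotonically ascending, the linear scan can indeed be replaced by a binary search for the first bin whose upper wall is >= the sample. A standalone sketch of that variant (not part of the patch), with the last bin still catching all overflow values:

```c
#include <stdint.h>

#define NUM_BINS 32

struct histogram {
    uint32_t wall[NUM_BINS];   /* Inclusive upper bin boundaries, ascending. */
    uint64_t bin[NUM_BINS];
};

/* Binary-search variant of histogram_add_sample(): find the lowest bin
 * whose wall is >= val in O(log NUM_BINS) comparisons.  If val exceeds
 * all walls, the search converges on the last (overflow) bin. */
static inline void
histogram_add_sample_bsearch(struct histogram *hist, uint32_t val)
{
    int lo = 0, hi = NUM_BINS - 1;

    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (val <= hist->wall[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    hist->bin[lo]++;
}
```

With only 32 bins and mostly small samples hitting the first few bins, the linear scan may well be as fast in practice; this just records what the TODO hints at.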
> > +
> > +uint64_t histogram_samples(const struct histogram *hist);
> > +
> > +/* Add an offset to idx modulo HISTORY_LEN. */
> > +static inline uint32_t
> > +history_add(uint32_t idx, uint32_t offset)
> > +{
> > +    return (idx + offset) % HISTORY_LEN;
> > +}
> > +
> > +/* Subtract idx2 from idx1 modulo HISTORY_LEN. */
> 
> Do the comments on these two functions (history_add and history_sub)
> really do anything to help the reader?  Maybe they should explain when
> these functions would be used for calculating an index into the
> timing history sample array?
> 
> > +static inline uint32_t
> > +history_sub(uint32_t idx1, uint32_t idx2)
> > +{
> > +    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
> > +}
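Following up on the review question about what these helpers are for: with `h->idx` pointing at the next write slot, `history_sub(h->idx, i)` yields the slot holding the i-th most recent stored sample, without the unsigned arithmetic ever going negative at the wrap-around. A standalone sketch reproducing the two helpers to illustrate:

```c
#include <stdint.h>

#define HISTORY_LEN 1000

/* Add an offset to a ring-buffer index modulo HISTORY_LEN. */
static inline uint32_t
history_add(uint32_t idx, uint32_t offset)
{
    return (idx + offset) % HISTORY_LEN;
}

/* Subtract idx2 from idx1 modulo HISTORY_LEN.  Adding HISTORY_LEN first
 * keeps the intermediate value non-negative for unsigned arithmetic. */
static inline uint32_t
history_sub(uint32_t idx1, uint32_t idx2)
{
    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
}
```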
> > +
> > +static inline struct iter_stats *
> > +history_current(struct history *h)
> > +{
> > +    return &h->sample[h->idx];
> > +}
> > +
> > +static inline struct iter_stats *
> > +history_next(struct history *h)
> > +{
> > +    size_t next_idx = (h->idx + 1) % HISTORY_LEN;
> 
> Maybe:
>        size_t next_idx = history_add(h->idx, 1);
> 
> > +    struct iter_stats *next = &h->sample[next_idx];
> > +
> > +    memset(next, 0, sizeof(*next));
> > +    h->idx = next_idx;
> > +    return next;
> > +}
> > +
> > +static inline struct iter_stats *
> > +history_store(struct history *h, struct iter_stats *is)
> > +{
> > +    if (is) {
> > +        h->sample[h->idx] = *is;
> > +    }
> > +    /* Advance the history pointer. */
> > +    return history_next(h);
> > +}
> > +
> > +/* Functions recording PMD metrics per iteration. */
> > +
> >  static inline void
> >  pmd_perf_start_iteration(struct pmd_perf_stats *s)
> >  {
> > +    if (s->clear) {
> > +        /* Clear the PMD stats before starting next iteration. */
> > +        pmd_perf_stats_clear_lock(s);
> > +    }
> > +    /* Initialize the current interval stats. */
> > +    memset(&s->current, 0, sizeof(struct iter_stats));
> >      if (OVS_LIKELY(s->last_tsc)) {
> >          /* We assume here that last_tsc was updated immediately prior at
> >           * the end of the previous iteration, or just before the first
> >           * iteration. */
> > -        s->start_it_tsc = s->last_tsc;
> > +        s->current.timestamp = s->last_tsc;
> >      } else {
> >          /* In case last_tsc has never been set before. */
> > -        s->start_it_tsc = cycles_counter_update(s);
> > +        s->current.timestamp = cycles_counter_update(s);
> >      }
> >  }
> >
> >  static inline void
> > -pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
> > +pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> > +                       int tx_packets, bool full_metrics)
> 
> This function is sufficiently complex enough to not be part of the
> header.
> 
> >  {
> > -    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
> > +    uint64_t now_tsc = cycles_counter_update(s);
> > +    struct iter_stats *cum_ms;
> > +    uint64_t cycles, cycles_per_pkt = 0;
> >
> > -    if (rx_packets > 0) {
> > +    cycles = now_tsc - s->current.timestamp;
> > +    s->current.cycles = cycles;
> > +    s->current.pkts = rx_packets;
> > +
> > +    if (rx_packets + tx_packets > 0) {
> >          pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
> >      } else {
> >          pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
> >      }
> > +    /* Add iteration samples to histograms. */
> > +    histogram_add_sample(&s->cycles, cycles);
> > +    histogram_add_sample(&s->pkts, rx_packets);
> > +
> > +    if (!full_metrics) {
> > +        return;
> > +    }
> > +
> > +    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
> > +
> > +    if (rx_packets > 0) {
> > +        cycles_per_pkt = cycles / rx_packets;
> > +        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
> > +    }
> > +    if (s->current.batches > 0) {
> > +        histogram_add_sample(&s->pkts_per_batch,
> > +                             rx_packets / s->current.batches);
> > +    }
> > +    histogram_add_sample(&s->upcalls, s->current.upcalls);
> > +    if (s->current.upcalls > 0) {
> > +        histogram_add_sample(&s->cycles_per_upcall,
> > +                             s->current.upcall_cycles / s->current.upcalls);
> > +    }
> > +    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
> > +
> > +    /* Add iteration samples to millisecond stats. */
> > +    cum_ms = history_current(&s->milliseconds);
> > +    cum_ms->iterations++;
> > +    cum_ms->cycles += cycles;
> > +    if (rx_packets > 0) {
> > +        cum_ms->busy_cycles += cycles;
> > +    }
> > +    cum_ms->pkts += s->current.pkts;
> > +    cum_ms->upcalls += s->current.upcalls;
> > +    cum_ms->upcall_cycles += s->current.upcall_cycles;
> > +    cum_ms->batches += s->current.batches;
> > +    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
> > +
> > +    /* Store in iteration history. This advances the iteration idx and
> > +     * clears the next slot in the iteration history. */
> > +    history_store(&s->iterations, &s->current);
> > +    if (now_tsc > s->next_check_tsc) {
> > +        /* Check if ms is completed and store in milliseconds history. */
> > +        uint64_t now = time_msec();
> > +        if (now != cum_ms->timestamp) {
> > +            /* Add ms stats to totals. */
> > +            s->totals.iterations += cum_ms->iterations;
> > +            s->totals.cycles += cum_ms->cycles;
> > +            s->totals.busy_cycles += cum_ms->busy_cycles;
> > +            s->totals.pkts += cum_ms->pkts;
> > +            s->totals.upcalls += cum_ms->upcalls;
> > +            s->totals.upcall_cycles += cum_ms->upcall_cycles;
> > +            s->totals.batches += cum_ms->batches;
> > +            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
> > +            cum_ms = history_next(&s->milliseconds);
> > +            cum_ms->timestamp = now;
> > +        }
> > +        s->next_check_tsc = cycles_counter_update(s) + 10000;
> 
> This is spacing by 10ms?  Am I reading correctly?  Maybe a comment, or a
> constant.
> 
> > +    }
> >  }
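On the reviewer's question about the 10000-cycle spacing: at TSC frequencies in the GHz range this is on the order of microseconds, not 10 ms, so the `time_msec()` check is merely rate-limited to every few microseconds. A quick sanity check (standalone helper, not part of the patch), using the 2.40 GHz figure from the example output in the man page below:

```c
#include <stdint.h>

/* Convert a TSC cycle count to nanoseconds for a given core frequency. */
static uint64_t
cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    /* 10000 cycles at 2.4 GHz comes out to roughly 4.2 us. */
    return cycles * UINT64_C(1000000000) / hz;
}
```

A named constant with a comment in the patch would make this intent explicit, as the reviewer suggests.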
> >
> > +/* Formatting the output of commands. */
> > +
> > +struct pmd_perf_params {
> > +    int command_type;
> > +    bool histograms;
> > +    size_t iter_hist_len;
> > +    size_t ms_hist_len;
> > +};
> > +
> > +void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> > +                                   double duration);
> > +void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
> > +void pmd_perf_format_iteration_history(struct ds *str,
> > +                                       struct pmd_perf_stats *s,
> > +                                       int n_iter);
> > +void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
> > +                                int n_ms);
> > +
> >  #ifdef  __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man
> > new file mode 100644
> > index 0000000..76c3e4e
> > --- /dev/null
> > +++ b/lib/dpif-netdev-unixctl.man
> > @@ -0,0 +1,157 @@
> > +.SS "DPIF-NETDEV COMMANDS"
> > +These commands are used to expose internal information (mostly statistics)
> > +about the "dpif-netdev" userspace datapath. If there is only one datapath
> > +(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> > +argument can be omitted. By default the commands present data for all pmd
> > +threads in the datapath. By specifying the "-pmd core" option one can filter
> > +the output for a single pmd in the datapath.
> > +.
> > +.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> > +Shows performance statistics for one or all pmd threads of the datapath
> > +\fIdp\fR. The special thread "main" sums up the statistics of every non-pmd
> > +thread.
> > +
> > +The sum of "emc hits", "masked hits" and "miss" is the number of
> > +packet lookups performed by the datapath. Beware that a recirculated packet
> > +experiences one additional lookup per recirculation, so there may be
> > +more lookups than forwarded packets in the datapath.
> > +
> > +Cycles are counted using the TSC or similar facilities (when available on
> > +the platform). The duration of one cycle depends on the processing platform.
> > +
> > +"idle cycles" refers to cycles spent in PMD iterations not forwarding any
> > +packets. "processing cycles" refers to cycles spent in PMD iterations
> > +forwarding at least one packet, including the cost for polling, processing and
> > +transmitting said packets.
> > +
> > +To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
> > +.
> > +.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> > +Resets to zero the per pmd thread performance numbers shown by the
> > +\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR commands.
> > +It will NOT reset datapath or bridge statistics, only the values shown by
> > +the above commands.
> > +.
> > +.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR \fIiter_len\fR] \
> > +[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> > +Shows detailed performance metrics for one or all pmd threads of the
> > +user space datapath.
> > +
> > +The collection of detailed statistics can be controlled by the
> > +configuration parameter "other_config:pmd-perf-metrics". By default it
> > +is disabled. The run-time overhead, when enabled, is in the order of 1%.
> > +
> > +The covered metrics per iteration are:
> > +.RS
> > +.IP
> > +.PD .4v
> > +.IP \(em
> > +used cycles
> > +.IP \(em
> > +forwarded packets
> > +.IP \(em
> > +number of rx batches
> > +.IP \(em
> > +packets/rx batch
> > +.IP \(em
> > +max. vhostuser queue fill level
> > +.IP \(em
> > +number of upcalls
> > +.IP \(em
> > +cycles spent in upcalls
> > +.PD
> > +.RE
> > +.IP
> > +This raw recorded data is used threefold:
> > +
> > +.RS
> > +.IP
> > +.PD .4v
> > +.IP 1.
> > +In histograms for each of the following metrics:
> > +.RS
> > +.IP \(em
> > +cycles/iteration (logarithmic)
> > +.IP \(em
> > +packets/iteration (logarithmic)
> > +.IP \(em
> > +cycles/packet
> > +.IP \(em
> > +packets/batch
> > +.IP \(em
> > +max. vhostuser qlen (logarithmic)
> > +.IP \(em
> > +upcalls
> > +.IP \(em
> > +cycles/upcall (logarithmic)
> > +The histogram bins are divided linearly or logarithmically.
> > +.RE
> > +.IP 2.
> > +A cyclic history of the above metrics for the last 1000 iterations
> > +.IP 3.
> > +A cyclic history of the cumulative/average values per millisecond wall
> > +clock for the last 1000 milliseconds:
> > +.RS
> > +.IP \(em
> > +number of iterations
> > +.IP \(em
> > +avg. cycles/iteration
> > +.IP \(em
> > +packets (Kpps)
> > +.IP \(em
> > +avg. packets/batch
> > +.IP \(em
> > +avg. max vhost qlen
> > +.IP \(em
> > +upcalls
> > +.IP \(em
> > +avg. cycles/upcall
> > +.RE
> > +.PD
> > +.RE
> > +.IP
> > +.
> > +The command options are:
> > +.RS
> > +.IP "\fB-nh\fR"
> > +Suppress the histograms.
> > +.IP "\fB-it\fR \fIiter_len\fR"
> > +Display the stats of the last \fIiter_len\fR iterations.
> > +.IP "\fB-ms\fR \fIms_len\fR"
> > +Display the stats of the last \fIms_len\fR milliseconds.
> > +.RE
> > +.IP
> > +The output always contains the following global PMD statistics:
> > +.RS
> > +.IP
> > +Time: 15:24:55.270
> > +.br
> > +Measurement duration: 1.008 s
> > +
> > +pmd thread numa_id 0 core_id 1:
> > +
> > +  Cycles:            2419034712  (2.40 GHz)
> > +  Iterations:            572817  (1.76 us/it)
> > +  - idle:                486808  (15.9 % cycles)
> > +  - busy:                 86009  (84.1 % cycles)
> > +  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
> > +  Datapath passes:      3599415  (1.50 passes/pkt)
> > +  - EMC hits:            336472  ( 9.3 %)
> > +  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
> > +  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
> > +  - Lost upcalls:             0  ( 0.0 %)
> > +  Tx packets:           2399607  (2381 Kpps)
> > +  Tx batches:            171400  (14.00 pkts/batch)
> > +.RE
> > +.IP
> > +Here "Rx packets" actually reflects the number of packets forwarded by the
> > +datapath. "Datapath passes" matches the number of packet lookups as
> > +reported by the \fBdpif-netdev/pmd-stats-show\fR command.
> > +
> > +To reset the counters and start a new measurement use
> > +\fBdpif-netdev/pmd-stats-clear\fR.
> > +.
> > +.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> > +For one or all pmd threads of the datapath \fIdp\fR show the list of queue-ids
> > +with port names, which this thread polls.
> > +.
> > +.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> > +Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 86d8739..f245ce2 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -49,6 +49,7 @@
> >  #include "id-pool.h"
> >  #include "latch.h"
> >  #include "netdev.h"
> > +#include "netdev-provider.h"
> >  #include "netdev-vport.h"
> >  #include "netlink.h"
> >  #include "odp-execute.h"
> > @@ -281,6 +282,8 @@ struct dp_netdev {
> >
> >      /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
> >      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
> > +    /* Enable collection of PMD performance metrics. */
> > +    atomic_bool pmd_perf_metrics;
> >
> >      /* Protects access to ofproto-dpif-upcall interface during revalidator
> >       * thread synchronization. */
> > @@ -356,6 +359,7 @@ struct dp_netdev_rxq {
> >                                            particular core. */
> >      unsigned intrvl_idx;               /* Write index for 'cycles_intrvl'. */
> >      struct dp_netdev_pmd_thread *pmd;  /* pmd thread that polls this queue. */
> > +    bool is_vhost;                     /* Is rxq of a vhost port. */
> >
> >      /* Counters of cycles spent successfully polling and processing pkts. */
> >      atomic_ullong cycles[RXQ_N_CYCLES];
> > @@ -717,6 +721,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
> >  static void emc_clear_entry(struct emc_entry *ce);
> >
> >  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
> > +static inline bool
> > +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
> >
> >  static void
> >  emc_cache_init(struct emc_cache *flow_cache)
> > @@ -800,7 +806,8 @@ get_dp_netdev(const struct dpif *dpif)
> >  enum pmd_info_type {
> >      PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
> >      PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
> > -    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
> > +    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
> > +    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
> >  };
> >
> >  static void
> > @@ -891,6 +898,47 @@ pmd_info_show_stats(struct ds *reply,
> >                    stats[PMD_CYCLES_ITER_BUSY], total_packets);
> >  }
> >
> > +static void
> > +pmd_info_show_perf(struct ds *reply,
> > +                   struct dp_netdev_pmd_thread *pmd,
> > +                   struct pmd_perf_params *par)
> > +{
> > +    if (pmd->core_id != NON_PMD_CORE_ID) {
> > +        char *time_str =
> > +                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
> > +        long long now = time_msec();
> > +        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
> > +
> > +        ds_put_cstr(reply, "\n");
> > +        ds_put_format(reply, "Time: %s\n", time_str);
> > +        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
> > +        ds_put_cstr(reply, "\n");
> > +        format_pmd_thread(reply, pmd);
> > +        ds_put_cstr(reply, "\n");
> > +        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
> > +        if (pmd_perf_metrics_enabled(pmd)) {
> > +            /* Prevent parallel clearing of perf metrics. */
> > +            ovs_mutex_lock(&pmd->perf_stats.clear_mutex);
> > +            if (par->histograms) {
> > +                ds_put_cstr(reply, "\n");
> > +                pmd_perf_format_histograms(reply, &pmd->perf_stats);
> > +            }
> > +            if (par->iter_hist_len > 0) {
> > +                ds_put_cstr(reply, "\n");
> > +                pmd_perf_format_iteration_history(reply, &pmd->perf_stats,
> > +                        par->iter_hist_len);
> > +            }
> > +            if (par->ms_hist_len > 0) {
> > +                ds_put_cstr(reply, "\n");
> > +                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
> > +                        par->ms_hist_len);
> > +            }
> > +            ovs_mutex_unlock(&pmd->perf_stats.clear_mutex);
> > +        }
> > +        free(time_str);
> > +    }
> > +}
> > +
> >  static int
> >  compare_poll_list(const void *a_, const void *b_)
> >  {
> > @@ -1068,7 +1116,7 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
> >      ovs_mutex_lock(&dp_netdev_mutex);
> >
> >      while (argc > 1) {
> > -        if (!strcmp(argv[1], "-pmd") && argc >= 3) {
> > +        if (!strcmp(argv[1], "-pmd") && argc > 2) {
> >              if (str_to_uint(argv[2], 10, &core_id)) {
> >                  filter_on_pmd = true;
> >              }
> > @@ -1108,6 +1156,8 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
> >              pmd_perf_stats_clear(&pmd->perf_stats);
> >          } else if (type == PMD_INFO_SHOW_STATS) {
> >              pmd_info_show_stats(&reply, pmd);
> > +        } else if (type == PMD_INFO_PERF_SHOW) {
> > +            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
> >          }
> >      }
> >      free(pmd_list);
> > @@ -1117,6 +1167,48 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
> >      unixctl_command_reply(conn, ds_cstr(&reply));
> >      ds_destroy(&reply);
> >  }
> > +
> > +static void
> > +pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
> > +                          const char *argv[],
> > +                          void *aux OVS_UNUSED)
> > +{
> > +    struct pmd_perf_params par;
> > +    long int it_hist = 0, ms_hist = 0;
> > +    par.histograms = true;
> > +
> > +    while (argc > 1) {
> > +        if (!strcmp(argv[1], "-nh")) {
> > +            par.histograms = false;
> > +            argc -= 1;
> > +            argv += 1;
> > +        } else if (!strcmp(argv[1], "-it") && argc > 2) {
> > +            it_hist = strtol(argv[2], NULL, 10);
> > +            if (it_hist < 0) {
> > +                it_hist = 0;
> > +            } else if (it_hist > HISTORY_LEN) {
> > +                it_hist = HISTORY_LEN;
> > +            }
> > +            argc -= 2;
> > +            argv += 2;
> > +        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
> > +            ms_hist = strtol(argv[2], NULL, 10);
> > +            if (ms_hist < 0) {
> > +                ms_hist = 0;
> > +            } else if (ms_hist > HISTORY_LEN) {
> > +                ms_hist = HISTORY_LEN;
> > +            }
> > +            argc -= 2;
> > +            argv += 2;
> > +        } else {
> > +            break;
> > +        }
> > +    }
> > +    par.iter_hist_len = it_hist;
> > +    par.ms_hist_len = ms_hist;
> > +    par.command_type = PMD_INFO_PERF_SHOW;
> > +    dpif_netdev_pmd_info(conn, argc, argv, &par);
> > +}
> >  

> >  static int
> >  dpif_netdev_init(void)
> > @@ -1134,6 +1226,12 @@ dpif_netdev_init(void)
> >      unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd core] [dp]",
> >                               0, 3, dpif_netdev_pmd_info,
> >                               (void *)&poll_aux);
> > +    unixctl_command_register("dpif-netdev/pmd-perf-show",
> > +                             "[-nh] [-it iter-history-len]"
> > +                             " [-ms ms-history-len]"
> > +                             " [-pmd core] [dp]",
> > +                             0, 8, pmd_perf_show_cmd,
> > +                             NULL);
> >      unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
> >                               0, 1, dpif_netdev_pmd_rebalance,
> >                               NULL);
> > @@ -3020,6 +3118,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
> >          }
> >      }
> >
> > +    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
> > +    bool cur_perf_enabled;
> > +    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
> > +    if (perf_enabled != cur_perf_enabled) {
> > +        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
> > +        if (perf_enabled) {
> > +            VLOG_INFO("PMD performance metrics collection enabled");
> > +        } else {
> > +            VLOG_INFO("PMD performance metrics collection disabled");
> > +        }
> > +    }
> > +
> >      return 0;
> >  }
> >
> > @@ -3189,6 +3299,21 @@ dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned idx)
> >      return processing_cycles;
> >  }
> >
> > +static inline bool
> > +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
> > +{
> > +    /* If stores and reads of 64-bit integers are not atomic, the
> > +     * full PMD performance metrics are not available as locked
> > +     * access to 64 bit integers would be prohibitively expensive. */
> > +#if ATOMIC_LLONG_LOCK_FREE
> > +    bool pmd_perf_enabled;
> > +    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
> > +    return pmd_perf_enabled;
> > +#else
> > +    return false;
> > +#endif
> > +}
> > +
> >  static int
> >  dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
> >                                     struct tx_port *p)
> > @@ -3264,10 +3389,12 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> >                             struct dp_netdev_rxq *rxq,
> >                             odp_port_t port_no)
> >  {
> > +    struct pmd_perf_stats *s = &pmd->perf_stats;
> >      struct dp_packet_batch batch;
> >      struct cycle_timer timer;
> >      int error;
> > -    int batch_cnt = 0, output_cnt = 0;
> > +    int batch_cnt = 0;
> > +    int rem_qlen = 0, *qlen_p = NULL;
> >      uint64_t cycles;
> >
> >      /* Measure duration for polling and processing rx burst. */
> > @@ -3276,20 +3403,37 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> >      pmd->ctx.last_rxq = rxq;
> >      dp_packet_batch_init(&batch);
> >
> > -    error = netdev_rxq_recv(rxq->rx, &batch, NULL);
> > +    /* Fetch the rx queue length only for vhostuser ports. */
> > +    if (pmd_perf_metrics_enabled(pmd) && rxq->is_vhost) {
> > +        qlen_p = &rem_qlen;
> > +    }
> > +
> > +    error = netdev_rxq_recv(rxq->rx, &batch, qlen_p);
> >      if (!error) {
> >          /* At least one packet received. */
> >          *recirc_depth_get() = 0;
> >          pmd_thread_ctx_time_update(pmd);
> > -
> >          batch_cnt = batch.count;
> > +        if (pmd_perf_metrics_enabled(pmd)) {
> > +            /* Update batch histogram. */
> > +            s->current.batches++;
> > +            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
> > +            /* Update the maximum vhost rx queue fill level. */
> > +            if (rxq->is_vhost && rem_qlen >= 0) {
> > +                uint32_t qfill = batch_cnt + rem_qlen;
> > +                if (qfill > s->current.max_vhost_qfill) {
> > +                    s->current.max_vhost_qfill = qfill;
> > +                }
> > +            }
> > +        }
> > +        /* Process packet batch. */
> >          dp_netdev_input(pmd, &batch, port_no);
> >
> >          /* Assign processing cycles to rx queue. */
> >          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
> >          dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
> >
> > -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
> > +        dp_netdev_pmd_flush_output_packets(pmd, false);
> >      } else {
> >          /* Discard cycles. */
> >          cycle_timer_stop(&pmd->perf_stats, &timer);
> > @@ -3303,7 +3447,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> >
> >      pmd->ctx.last_rxq = NULL;
> >
> > -    return batch_cnt + output_cnt;
> > +    return batch_cnt;
> >  }
> >
> >  static struct tx_port *
> > @@ -3359,6 +3503,7 @@ port_reconfigure(struct dp_netdev_port *port)
> >          }
> >
> >          port->rxqs[i].port = port;
> > +        port->rxqs[i].is_vhost = !strncmp(port->type, "dpdkvhost", 9);
> >
> >          err = netdev_rxq_open(netdev, &port->rxqs[i].rx, i);
> >          if (err) {
> > @@ -4137,23 +4282,26 @@ reload:
> >      pmd->intrvl_tsc_prev = 0;
> >      atomic_store_relaxed(&pmd->intrvl_cycles, 0);
> >      cycles_counter_update(s);
> > +    /* Protect pmd stats from external clearing while polling. */
> > +    ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
> >      for (;;) {
> > -        uint64_t iter_packets = 0;
> > +        uint64_t rx_packets = 0, tx_packets = 0;
> >
> >          pmd_perf_start_iteration(s);
> > +
> >          for (i = 0; i < poll_cnt; i++) {
> >              process_packets =
> >                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
> >                                             poll_list[i].port_no);
> > -            iter_packets += process_packets;
> > +            rx_packets += process_packets;
> >          }
> >
> > -        if (!iter_packets) {
> > +        if (!rx_packets) {
> >              /* We didn't receive anything in the process loop.
> >               * Check if we need to send something.
> >               * There was no time updates on current iteration. */
> >              pmd_thread_ctx_time_update(pmd);
> > -            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
> > +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> >          }
> >
> >          if (lc++ > 1024) {
> > @@ -4172,8 +4320,10 @@ reload:
> >                  break;
> >              }
> >          }
> > -        pmd_perf_end_iteration(s, iter_packets);
> > +        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> > +                               pmd_perf_metrics_enabled(pmd));
> >      }
> > +    ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
> >
> >      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> >      exiting = latch_is_set(&pmd->exit_latch);
> > @@ -5068,6 +5218,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> >      struct match match;
> >      ovs_u128 ufid;
> >      int error;
> > +    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
> >
> >      match.tun_md.valid = false;
> >      miniflow_expand(&key->mf, &match.flow);
> > @@ -5121,6 +5272,14 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> >          ovs_mutex_unlock(&pmd->flow_mutex);
> >          emc_probabilistic_insert(pmd, key, netdev_flow);
> >      }
> > +    if (pmd_perf_metrics_enabled(pmd)) {
> > +        /* Update upcall stats. */
> > +        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
> > +        struct pmd_perf_stats *s = &pmd->perf_stats;
> > +        s->current.upcalls++;
> > +        s->current.upcall_cycles += cycles;
> > +        histogram_add_sample(&s->cycles_per_upcall, cycles);
> > +    }
> >      return error;
> >  }
> >
> > diff --git a/manpages.mk b/manpages.mk
> > index d4bf0ec..aaf8bc2 100644
> > --- a/manpages.mk
> > +++ b/manpages.mk
> > @@ -250,6 +250,7 @@ vswitchd/ovs-vswitchd.8: \
> >  	lib/coverage-unixctl.man \
> >  	lib/daemon.man \
> >  	lib/dpctl.man \
> > +	lib/dpif-netdev-unixctl.man \
> >  	lib/memory-unixctl.man \
> >  	lib/netdev-dpdk-unixctl.man \
> >  	lib/service.man \
> > @@ -266,6 +267,7 @@ lib/common.man:
> >  lib/coverage-unixctl.man:
> >  lib/daemon.man:
> >  lib/dpctl.man:
> > +lib/dpif-netdev-unixctl.man:
> >  lib/memory-unixctl.man:
> >  lib/netdev-dpdk-unixctl.man:
> >  lib/service.man:
> > diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
> > index 80e5f53..8b4034d 100644
> > --- a/vswitchd/ovs-vswitchd.8.in
> > +++ b/vswitchd/ovs-vswitchd.8.in
> > @@ -256,32 +256,7 @@ type).
> >  ..
> >  .so lib/dpctl.man
> >  .
> > -.SS "DPIF-NETDEV COMMANDS"
> > -These commands are used to expose internal information (mostly statistics)
> > -about the ``dpif-netdev'' userspace datapath. If there is only one datapath
> > -(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> > -argument can be omitted.
> > -.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
> > -Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
> > -The special thread ``main'' sums up the statistics of every non pmd thread.
> > -The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
> > -packets received by the datapath.  Cycles are counted using the TSC or similar
> > -facilities (when available on the platform).  To reset these counters use
> > -\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
> > -measuring infrastructure. ``idle cycles'' refers to cycles spent polling
> > -devices but not receiving any packets. ``processing cycles'' refers to cycles
> > -spent polling devices and successfully receiving packets, plus the cycles
> > -spent processing said packets.
> > -.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> > -Resets to zero the per pmd thread performance numbers shown by the
> > -\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
> > -bridge statistics, only the values shown by the above command.
> > -.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
> > -For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
> > -port names, which this thread polls.
> > -.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> > -Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> > -.
> > +.so lib/dpif-netdev-unixctl.man
> >  .so lib/netdev-dpdk-unixctl.man
> >  .so ofproto/ofproto-dpif-unixctl.man
> >  .so ofproto/ofproto-unixctl.man
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index f899a19..aac663f 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -375,6 +375,18 @@
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="pmd-perf-metrics"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          Enables recording of detailed PMD performance metrics for analysis
> > +          and trouble-shooting. This can have a performance impact in the
> > +          order of 1%.
> > +        </p>
> > +        <p>
> > +          Defaults to false but can be changed at any time.
> > +        </p>
> > +      </column>
> > +
> >        <column name="other_config" key="n-handler-threads"
> >                type='{"type": "integer", "minInteger": 1}'>
> >          <p>
Ilya Maximets March 27, 2018, 1:10 p.m. UTC | #4
Comments inline.

Best regards, Ilya Maximets.

On 18.03.2018 20:55, Jan Scheurich wrote:
> This patch instruments the dpif-netdev datapath to record detailed
> statistics of what is happening in every iteration of a PMD thread.
> 
> The collection of detailed statistics can be controlled by a new
> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> By default it is disabled. The run-time overhead, when enabled, is
> in the order of 1%.
> 
> The covered metrics per iteration are:
>   - cycles
>   - packets
>   - (rx) batches
>   - packets/batch
>   - max. vhostuser qlen
>   - upcalls
>   - cycles spent in upcalls
> 
> This raw recorded data is used threefold:
> 
> 1. In histograms for each of the following metrics:
>    - cycles/iteration (log.)
>    - packets/iteration (log.)
>    - cycles/packet
>    - packets/batch
>    - max. vhostuser qlen (log.)
>    - upcalls
>    - cycles/upcall (log)
>    The histogram bins are divided linearly or logarithmically.
> 
> 2. A cyclic history of the above statistics for 999 iterations
> 
> 3. A cyclic history of the cumulative/average values per millisecond
>    wall clock for the last 1000 milliseconds:
>    - number of iterations
>    - avg. cycles/iteration
>    - packets (Kpps)
>    - avg. packets/batch
>    - avg. max vhost qlen
>    - upcalls
>    - avg. cycles/upcall
> 
> The gathered performance metrics can be printed at any time with the
> new CLI command
> 
> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
>     [-pmd core] [dp]
> 
> The options are
> 
> -nh:            Suppress the histograms
> -it iter_len:   Display the last iter_len iteration stats
> -ms ms_len:     Display the last ms_len millisecond stats
> -pmd core:      Display only the specified PMD
> 
> The performance statistics are reset with the existing
> dpif-netdev/pmd-stats-clear command.
> 
> The output always contains the following global PMD statistics,
> similar to the pmd-stats-show command:
> 
> Time: 15:24:55.270
> Measurement duration: 1.008 s
> 
> pmd thread numa_id 0 core_id 1:
> 
>   Cycles:            2419034712  (2.40 GHz)
>   Iterations:            572817  (1.76 us/it)
>   - idle:                486808  (15.9 % cycles)
>   - busy:                 86009  (84.1 % cycles)
>   Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
>   Datapath passes:      3599415  (1.50 passes/pkt)
>   - EMC hits:            336472  ( 9.3 %)
>   - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
>   - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
>   - Lost upcalls:             0  ( 0.0 %)
>   Tx packets:           2399607  (2381 Kpps)
>   Tx batches:            171400  (14.00 pkts/batch)
> 
> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
> ---
>  NEWS                        |   3 +
>  lib/automake.mk             |   1 +
>  lib/dpif-netdev-perf.c      | 350 +++++++++++++++++++++++++++++++++++++++++++-
>  lib/dpif-netdev-perf.h      | 258 ++++++++++++++++++++++++++++++--
>  lib/dpif-netdev-unixctl.man | 157 ++++++++++++++++++++
>  lib/dpif-netdev.c           | 183 +++++++++++++++++++++--
>  manpages.mk                 |   2 +
>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>  vswitchd/vswitch.xml        |  12 ++
>  9 files changed, 940 insertions(+), 53 deletions(-)
>  create mode 100644 lib/dpif-netdev-unixctl.man
> 
> diff --git a/NEWS b/NEWS
> index 8d0b502..8f66fd3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -73,6 +73,9 @@ v2.9.0 - 19 Feb 2018
>       * Add support for vHost dequeue zero copy (experimental)
>     - Userspace datapath:
>       * Output packet batching support.
> +     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
> +     * Detailed PMD performance metrics available with new command
> +         ovs-appctl dpif-netdev/pmd-perf-show

I guess this should go under Post-v2.9.0.

>     - vswitchd:
>       * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>       * Configuring a controller, or unconfiguring all controllers, now deletes
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 5c26e0f..7a5632d 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -484,6 +484,7 @@ MAN_FRAGMENTS += \
>  	lib/dpctl.man \
>  	lib/memory-unixctl.man \
>  	lib/netdev-dpdk-unixctl.man \
> +	lib/dpif-netdev-unixctl.man \
>  	lib/ofp-version.man \
>  	lib/ovs.tmac \
>  	lib/service.man \
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index f06991a..2b36410 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -15,18 +15,324 @@
>   */
>  
>  #include <config.h>
> +#include <stdint.h>
>  
> +#include "dpif-netdev-perf.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "openvswitch/vlog.h"
> -#include "dpif-netdev-perf.h"
> +#include "ovs-thread.h"
>  #include "timeval.h"
>  
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
>  
> +#ifdef DPDK_NETDEV
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +    return rte_get_tsc_hz();
> +}
> +#else
> +/* This function is only invoked from PMD threads which depend on DPDK.
> + * A dummy function is sufficient when building without DPDK_NETDEV. */
> +static uint64_t
> +get_tsc_hz(void)
> +{
> +    return 1;
> +}
> +#endif
> +
> +/* Histogram functions. */
> +
> +static void
> +histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
> +{
> +    int i;
> +
> +    ovs_assert(min < max);
> +    for (i = 0; i < NUM_BINS-1; i++) {
> +        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
> +    }
> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> +}
> +
> +static void
> +histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
> +{
> +    int i, start, bins, wall;
> +    double log_min, log_max;
> +
> +    ovs_assert(min < max);
> +    if (min > 0) {
> +        log_min = log(min);
> +        log_max = log(max);
> +        start = 0;
> +        bins = NUM_BINS - 1;
> +    } else {
> +        hist->wall[0] = 0;
> +        log_min = log(1);
> +        log_max = log(max);
> +        start = 1;
> +        bins = NUM_BINS - 2;
> +    }
> +    wall = start;
> +    for (i = 0; i < bins; i++) {
> +        /* Make sure each wall is monotonically increasing. */
> +        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
> +        hist->wall[start + i] = wall++;
> +    }
> +    if (hist->wall[NUM_BINS-2] < max) {
> +        hist->wall[NUM_BINS-2] = max;
> +    }
> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
> +}
> +
> +uint64_t
> +histogram_samples(const struct histogram *hist)
> +{
> +    uint64_t samples = 0;
> +
> +    for (int i = 0; i < NUM_BINS; i++) {
> +        samples += hist->bin[i];
> +    }
> +    return samples;
> +}
> +
> +static void
> +histogram_clear(struct histogram *hist)
> +{
> +    int i;
> +
> +    for (i = 0; i < NUM_BINS; i++) {
> +        hist->bin[i] = 0;
> +    }
> +}
> +
> +static void
> +history_init(struct history *h)
> +{
> +    memset(h, 0, sizeof(*h));
> +}
> +
>  void
>  pmd_perf_stats_init(struct pmd_perf_stats *s)
>  {
> -    memset(s, 0 , sizeof(*s));
> +    memset(s, 0, sizeof(*s));
> +    ovs_mutex_init(&s->stats_mutex);
> +    ovs_mutex_init(&s->clear_mutex);
> +    histogram_walls_set_log(&s->cycles, 500, 24000000);
> +    histogram_walls_set_log(&s->pkts, 0, 1000);
> +    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
> +    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
> +    histogram_walls_set_lin(&s->upcalls, 0, 30);
> +    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
> +    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
> +    s->start_ms = time_msec();
> +}
> +
> +void
> +pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> +                              double duration)
> +{
> +    uint64_t stats[PMD_N_STATS];
> +    double us_per_cycle = 1000000.0 / get_tsc_hz();
> +
> +    if (duration == 0) {
> +        return;
> +    }
> +
> +    pmd_perf_read_counters(s, stats);
> +    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
> +                          stats[PMD_CYCLES_ITER_BUSY];
> +    uint64_t rx_packets = stats[PMD_STAT_RECV];
> +    uint64_t tx_packets = stats[PMD_STAT_SENT_PKTS];
> +    uint64_t tx_batches = stats[PMD_STAT_SENT_BATCHES];
> +    uint64_t passes = stats[PMD_STAT_RECV] +
> +                      stats[PMD_STAT_RECIRC];
> +    uint64_t upcalls = stats[PMD_STAT_MISS];
> +    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
> +    uint64_t tot_iter = histogram_samples(&s->pkts);
> +    uint64_t idle_iter = s->pkts.bin[0];
> +    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +
> +    ds_put_format(str,
> +            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
> +            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
> +            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
> +            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
> +            tot_cycles, (tot_cycles / duration) / 1E9,
> +            tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +            idle_iter,
> +            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
> +            busy_iter,
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +    if (rx_packets > 0) {
> +        ds_put_format(str,
> +            "  Rx packets:      %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
> +            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
> +            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
> +            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl lookups/"
> +                                                                     "hit)\n"
> +            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f us/upcall)\n"
> +            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n",
> +            rx_packets, (rx_packets / duration) / 1000,
> +            1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets,
> +            passes, rx_packets ? 1.0 * passes / rx_packets : 0,
> +            stats[PMD_STAT_EXACT_HIT],
> +            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
> +            stats[PMD_STAT_MASKED_HIT],
> +            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
> +            stats[PMD_STAT_MASKED_HIT]
> +            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
> +            : 0,
> +            upcalls, 100.0 * upcalls / passes,
> +            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
> +            stats[PMD_STAT_LOST],
> +            100.0 * stats[PMD_STAT_LOST] / passes);
> +    } else {
> +        ds_put_format(str,
> +                "  Rx packets:      %12"PRIu64"\n",
> +                0ULL);
> +    }
> +    if (tx_packets > 0) {
> +        ds_put_format(str,
> +            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
> +            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
> +            "\n",
> +            tx_packets, (tx_packets / duration) / 1000,
> +            tx_batches, 1.0 * tx_packets / tx_batches);
> +    } else {
> +        ds_put_format(str,
> +                "  Tx packets:      %12"PRIu64"\n"
> +                "\n",
> +                0ULL);

I have a few interesting warnings on 64bit ARMv8.

Clang:

lib/dpif-netdev-perf.c:216:17: error: format specifies type 'unsigned long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
                0ULL);
                ^~~~
lib/dpif-netdev-perf.c:229:17: error: format specifies type 'unsigned long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
                0ULL);
                ^~~~

GCC:

lib/dpif-netdev-perf.c: In function ‘pmd_perf_format_overall_stats’:
lib/dpif-netdev-perf.c:215:17: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’ [-Werror=format=]
                 "  Rx packets:      %12"PRIu64"\n",
                 ^
lib/dpif-netdev-perf.c:227:17: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’ [-Werror=format=]
                 "  Tx packets:      %12"PRIu64"\n"
                 ^

Both are coming from the fact that PRIu64 expands to '%lu'.
Why do we need this printing at all? Can we just print 0 as part of the string?
Otherwise, the only way to fix these warnings is to cast 0 directly to uint64_t.

Also, what is the reason for all the line wraps, like moving "\n" or a single
argument to the next line in the code above?

> +    }
> +}
> +
> +void
> +pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
> +{
> +    int i;
> +
> +    ds_put_cstr(str, "Histograms\n");
> +    ds_put_format(str,
> +                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> +                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> +                  "max vhost qlen", "upcalls/it", "cycles/upcall");
> +    for (i = 0; i < NUM_BINS-1; i++) {
> +        ds_put_format(str,
> +            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> +            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
> +            "  %-9d %-11"PRIu64"\n",
> +            s->cycles.wall[i], s->cycles.bin[i],
> +            s->pkts.wall[i],s->pkts.bin[i],
> +            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
> +            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
> +            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
> +            s->upcalls.wall[i], s->upcalls.bin[i],
> +            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
> +    }
> +    ds_put_format(str,
> +                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> +                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
> +                  "  %-9s %-11"PRIu64"\n",
> +                  ">", s->cycles.bin[i],
> +                  ">", s->pkts.bin[i],
> +                  ">", s->cycles_per_pkt.bin[i],
> +                  ">", s->pkts_per_batch.bin[i],
> +                  ">", s->max_vhost_qfill.bin[i],
> +                  ">", s->upcalls.bin[i],
> +                  ">", s->cycles_per_upcall.bin[i]);
> +    if (s->totals.iterations > 0) {
> +        ds_put_cstr(str,
> +                    "-----------------------------------------------------"
> +                    "-----------------------------------------------------"
> +                    "------------------------------------------------\n");
> +        ds_put_format(str,
> +                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
> +                      "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
> +                      "vhost qlen", "upcalls/it", "cycles/upcall");
> +        ds_put_format(str,
> +                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
> +                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
> +                      s->totals.cycles / s->totals.iterations,
> +                      1.0 * s->totals.pkts / s->totals.iterations,
> +                      s->totals.pkts
> +                          ? s->totals.busy_cycles / s->totals.pkts : 0,
> +                      s->totals.batches
> +                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
> +                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
> +                      1.0 * s->totals.upcalls / s->totals.iterations,
> +                      s->totals.upcalls
> +                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
> +    }
> +}
> +
> +void
> +pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
> +                                  int n_iter)
> +{
> +    struct iter_stats *is;
> +    size_t index;
> +    int i;
> +
> +    if (n_iter == 0) {
> +        return;
> +    }
> +    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
> +                  "%-10s   %-10s   %-10s\n",
> +                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
> +                  "vhost qlen", "upcalls", "cycles/upcall");
> +    for (i = 1; i <= n_iter; i++) {
> +        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;
> +        is = &s->iterations.sample[index];
> +        ds_put_format(str,
> +                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
> +                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
> +                      "  %-11"PRIu32"  %-11"PRIu32"\n",
> +                      is->timestamp,
> +                      is->cycles,
> +                      is->pkts,
> +                      is->pkts ? is->cycles / is->pkts : 0,
> +                      is->batches ? is->pkts / is->batches : 0,
> +                      is->max_vhost_qfill,
> +                      is->upcalls,
> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> +    }
> +}
> +
> +void
> +pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
> +{
> +    struct iter_stats *is;
> +    size_t index;
> +    int i;
> +
> +    if (n_ms == 0) {
> +        return;
> +    }
> +    ds_put_format(str,
> +                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
> +                  "   %-10s   %-10s   %-10s   %-10s\n",
> +                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
> +                  "pkts/batch", "vhost qlen", "upcalls", "cycles/upcall");
> +    for (i = 1; i <= n_ms; i++) {
> +        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;
> +        is = &s->milliseconds.sample[index];
> +        ds_put_format(str,
> +                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
> +                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
> +                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
> +                      is->timestamp,
> +                      is->iterations,
> +                      is->iterations ? is->cycles / is->iterations : 0,
> +                      is->pkts,
> +                      is->pkts ? is->busy_cycles / is->pkts : 0,
> +                      is->batches ? is->pkts / is->batches : 0,
> +                      is->iterations
> +                          ? is->max_vhost_qfill / is->iterations : 0,
> +                      is->upcalls,
> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
> +    }
>  }
>  
>  void
> @@ -51,10 +357,48 @@ pmd_perf_read_counters(struct pmd_perf_stats *s,
>      }
>  }
>  
> +/* This function clears the PMD performance counters from within the PMD
> + * thread or from another thread when the PMD thread is not executing its
> + * poll loop. */
>  void
> -pmd_perf_stats_clear(struct pmd_perf_stats *s)
> +pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
> +    OVS_REQUIRES(s->stats_mutex)
>  {
> +    ovs_mutex_lock(&s->clear_mutex);
>      for (int i = 0; i < PMD_N_STATS; i++) {
>          atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
>      }
> +    /* The following stats are only applicable in the PMD thread. */
> +    memset(&s->current, 0, sizeof(struct iter_stats));
> +    memset(&s->totals, 0, sizeof(struct iter_stats));
> +    histogram_clear(&s->cycles);
> +    histogram_clear(&s->pkts);
> +    histogram_clear(&s->cycles_per_pkt);
> +    histogram_clear(&s->upcalls);
> +    histogram_clear(&s->cycles_per_upcall);
> +    histogram_clear(&s->pkts_per_batch);
> +    histogram_clear(&s->max_vhost_qfill);
> +    history_init(&s->iterations);
> +    history_init(&s->milliseconds);
> +    s->start_ms = time_msec();
> +    s->milliseconds.sample[0].timestamp = s->start_ms;
> +    /* Clearing finished. */
> +    s->clear = false;
> +    ovs_mutex_unlock(&s->clear_mutex);
> +}
> +
> +/* This function can be called from anywhere to clear the stats
> + * of PMD and non-PMD threads. */
> +void
> +pmd_perf_stats_clear(struct pmd_perf_stats *s)
> +{
> +    if (ovs_mutex_trylock(&s->stats_mutex) == 0) {
> +        /* Locking successful. PMD not polling. */
> +        pmd_perf_stats_clear_lock(s);
> +        ovs_mutex_unlock(&s->stats_mutex);
> +    } else {
> +        /* Request the polling PMD to clear the stats. There is no need to
> +         * block here as stats retrieval is prevented during clearing. */
> +        s->clear = true;
> +    }
>  }
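The trylock-or-defer handshake above is the central concurrency trick of this patch, so it may be worth spelling out. For reference, a standalone model of it (nothing below is code from the OVS tree: the mutex is replaced by a C11 atomic_flag and all names are made up for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the clear handshake: the poller "holds" 'busy' for the whole
 * poll loop; an external clearer either acquires it (poller idle) and
 * clears in place, or merely raises 'clear', which the poller honors at
 * the start of its next iteration. */
struct perf_model {
    atomic_flag busy;           /* Stand-in for stats_mutex. */
    volatile bool clear;        /* Stand-in for s->clear. */
    unsigned long counter;      /* Stand-in for the real stats. */
};

static bool
model_trylock(struct perf_model *m)
{
    return !atomic_flag_test_and_set(&m->busy);
}

static void
model_unlock(struct perf_model *m)
{
    atomic_flag_clear(&m->busy);
}

static void
model_do_clear(struct perf_model *m)
{
    m->counter = 0;
    m->clear = false;
}

/* Mirrors pmd_perf_stats_clear(): never blocks. */
static void
model_request_clear(struct perf_model *m)
{
    if (model_trylock(m)) {
        model_do_clear(m);      /* Poller not running: clear directly. */
        model_unlock(m);
    } else {
        m->clear = true;        /* Poller running: defer to it. */
    }
}

/* Mirrors the check at the top of pmd_perf_start_iteration(). */
static void
model_iteration_start(struct perf_model *m)
{
    if (m->clear) {
        model_do_clear(m);
    }
    m->counter++;
}
```

The point is that the clearer never blocks: either the poll loop is idle and the stats are cleared in place, or the PMD itself performs the clearing at its next iteration start.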
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 5993c25..fd9b0fc 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -38,10 +38,18 @@
>  extern "C" {
>  #endif
>  
> -/* This module encapsulates data structures and functions to maintain PMD
> - * performance metrics such as packet counters, execution cycles. It
> - * provides a clean API for dpif-netdev to initialize, update and read and
> +/* This module encapsulates data structures and functions to maintain basic PMD
> + * performance metrics such as packet counters, execution cycles as well as
> + * histograms and time series recording for more detailed PMD metrics.
> + *
> + * It provides a clean API for dpif-netdev to initialize, update, read, and
>   * reset these metrics.
> + *
> + * The basic set of PMD counters is implemented as atomic_uint64_t variables
> + * to guarantee correct reads also on 32-bit systems.
> + *
> + * The detailed PMD performance metrics are only supported on 64-bit systems
> + * with atomic 64-bit read and store semantics for plain uint64_t counters.
>   */
>  
>  /* Set of counter types maintained in pmd_perf_stats. */
> @@ -66,6 +74,7 @@ enum pmd_stat_type {
>      PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
>      PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
> +    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>      PMD_N_STATS
>  };
>  
> @@ -81,18 +90,87 @@ struct pmd_counters {
>      uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
>  };
>  
> -/* Container for all performance metrics of a PMD.
> - * Part of the struct dp_netdev_pmd_thread. */
> +/* Data structure to collect the statistical distribution of an integer
> + * measurement type in the form of a histogram. The wall[] array contains the
> + * inclusive upper boundaries of the bins, while the bin[] array contains the
> + * actual counters per bin. The histogram walls are typically set
> + * automatically using the functions provided below. */
> +
> +#define NUM_BINS 32             /* Number of histogram bins. */
> +
> +struct histogram {
> +    uint32_t wall[NUM_BINS];
> +    uint64_t bin[NUM_BINS];
> +};
> +
> +/* Data structure to record detailed PMD execution metrics per iteration for
> + * a history period of up to HISTORY_LEN iterations in a circular buffer.
> + * Also used to record up to HISTORY_LEN millisecond averages/totals of these
> + * metrics. */
> +
> +struct iter_stats {
> +    uint64_t timestamp;         /* TSC or millisecond. */
> +    uint64_t cycles;            /* Number of TSC cycles spent in it. or ms. */
> +    uint64_t busy_cycles;       /* Cycles spent in busy iterations or ms. */
> +    uint32_t iterations;        /* Iterations in ms. */
> +    uint32_t pkts;              /* Packets processed in iteration or ms. */
> +    uint32_t upcalls;           /* Number of upcalls in iteration or ms. */
> +    uint32_t upcall_cycles;     /* Cycles spent in upcalls in it. or ms. */
> +    uint32_t batches;           /* Number of rx batches in iteration or ms. */
> +    uint32_t max_vhost_qfill;   /* Maximum fill level in iteration or ms. */
> +};
> +
> +#define HISTORY_LEN 1000        /* Length of recorded history
> +                                   (iterations and ms). */
> +#define DEF_HIST_SHOW 20        /* Default number of history samples to
> +                                   display. */
> +
> +struct history {
> +    size_t idx;                 /* Slot to which next call to history_store()
> +                                   will write. */
> +    struct iter_stats sample[HISTORY_LEN];
> +};
> +
> +/* Container for all performance metrics of a PMD within the struct
> + * dp_netdev_pmd_thread. The metrics must be updated from within the PMD
> + * thread but can be read from any thread. The basic PMD counters in
> + * struct pmd_counters can be read without protection against concurrent
> + * clearing. The other metrics may only be safely read with the clear_mutex
> + * held to protect against concurrent clearing. */
>  
>  struct pmd_perf_stats {
> -    /* Start of the current PMD iteration in TSC cycles.*/
> -    uint64_t start_it_tsc;
> +    /* Prevents interference between PMD polling and stats clearing. */
> +    struct ovs_mutex stats_mutex;
> +    /* Set by CLI thread to order clearing of PMD stats. */
> +    volatile bool clear;
> +    /* Prevents stats retrieval while clearing is in progress. */
> +    struct ovs_mutex clear_mutex;
> +    /* Start of the current performance measurement period. */
> +    uint64_t start_ms;
>      /* Latest TSC time stamp taken in PMD. */
>      uint64_t last_tsc;
> +    /* Used to space certain checks in time. */
> +    uint64_t next_check_tsc;
>      /* If non-NULL, outermost cycle timer currently running in PMD. */
>      struct cycle_timer *cur_timer;
>      /* Set of PMD counters with their zero offsets. */
>      struct pmd_counters counters;
> +    /* Statistics of the current iteration. */
> +    struct iter_stats current;
> +    /* Totals for the current millisecond. */
> +    struct iter_stats totals;
> +    /* Histograms for the PMD metrics. */
> +    struct histogram cycles;
> +    struct histogram pkts;
> +    struct histogram cycles_per_pkt;
> +    struct histogram upcalls;
> +    struct histogram cycles_per_upcall;
> +    struct histogram pkts_per_batch;
> +    struct histogram max_vhost_qfill;
> +    /* Iteration history buffer. */
> +    struct history iterations;
> +    /* Millisecond history buffer. */
> +    struct history milliseconds;
>  };
>  
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
> @@ -175,8 +253,14 @@ cycle_timer_stop(struct pmd_perf_stats *s,
>      return now - timer->start;
>  }
>  
> +/* Functions to initialize and reset the PMD performance metrics. */
> +
>  void pmd_perf_stats_init(struct pmd_perf_stats *s);
>  void pmd_perf_stats_clear(struct pmd_perf_stats *s);
> +void pmd_perf_stats_clear_lock(struct pmd_perf_stats *s);
> +
> +/* Functions to read and update PMD counters. */
> +
>  void pmd_perf_read_counters(struct pmd_perf_stats *s,
>                              uint64_t stats[PMD_N_STATS]);
>  
> @@ -199,32 +283,182 @@ pmd_perf_update_counter(struct pmd_perf_stats *s,
>      atomic_store_relaxed(&s->counters.n[counter], tmp);
>  }
>  
> +/* Functions to manipulate a sample history. */
> +
> +static inline void
> +histogram_add_sample(struct histogram *hist, uint32_t val)
> +{
> +    /* TODO: Can do better with binary search? */
> +    for (int i = 0; i < NUM_BINS-1; i++) {
> +        if (val <= hist->wall[i]) {
> +            hist->bin[i]++;
> +            return;
> +        }
> +    }
> +    hist->bin[NUM_BINS-1]++;
> +}
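Regarding the TODO: since wall[] is sorted ascending, the scan can become a lower-bound binary search. A sketch under that assumption (struct and NUM_BINS copied from the patch; hist_find_bin() is a hypothetical helper, not part of this series):

```c
#include <stdint.h>

#define NUM_BINS 32             /* Number of histogram bins, as in the patch. */

struct histogram {
    uint32_t wall[NUM_BINS];    /* Inclusive upper bin boundaries, ascending. */
    uint64_t bin[NUM_BINS];
};

/* Return the index of the first bin whose inclusive upper wall is >= val,
 * or NUM_BINS - 1 if val exceeds every wall (overflow bin). */
static inline int
hist_find_bin(const struct histogram *hist, uint32_t val)
{
    int lo = 0, hi = NUM_BINS - 1;

    while (lo < hi) {
        int mid = (lo + hi) / 2;

        if (val <= hist->wall[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return lo;
}

static inline void
histogram_add_sample_bsearch(struct histogram *hist, uint32_t val)
{
    hist->bin[hist_find_bin(hist, val)]++;
}
```

With only 32 bins the linear scan is probably cheap enough, but the binary search bounds the cost at 5 comparisons per sample regardless of where the value lands.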
> +
> +uint64_t histogram_samples(const struct histogram *hist);
> +
> +/* Add an offset to idx modulo HISTORY_LEN. */
> +static inline uint32_t
> +history_add(uint32_t idx, uint32_t offset)
> +{
> +    return (idx + offset) % HISTORY_LEN;
> +}
> +
> +/* Subtract idx2 from idx1 modulo HISTORY_LEN. */
> +static inline uint32_t
> +history_sub(uint32_t idx1, uint32_t idx2)
> +{
> +    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
> +}
> +
> +static inline struct iter_stats *
> +history_current(struct history *h)
> +{
> +    return &h->sample[h->idx];
> +}
> +
> +static inline struct iter_stats *
> +history_next(struct history *h)
> +{
> +    size_t next_idx = (h->idx + 1) % HISTORY_LEN;
> +    struct iter_stats *next = &h->sample[next_idx];
> +
> +    memset(next, 0, sizeof(*next));
> +    h->idx = next_idx;
> +    return next;
> +}
> +
> +static inline struct iter_stats *
> +history_store(struct history *h, struct iter_stats *is)
> +{
> +    if (is) {
> +        h->sample[h->idx] = *is;
> +    }
> +    /* Advance the history pointer */
> +    return history_next(h);
> +}
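For readers unfamiliar with the ring-buffer convention here: history_store() writes the current slot and then pre-clears the next one, so the slot at idx is always a zeroed scratch area. A trimmed, self-contained illustration (HISTORY_LEN shrunk from 1000 to 4 and the struct reduced to two fields; this is not code from the patch):

```c
#include <stdint.h>
#include <string.h>

#define HISTORY_LEN 4            /* 1000 in the patch; shrunk for brevity. */

struct iter_stats {
    uint64_t timestamp;
    uint32_t pkts;
};

struct history {
    size_t idx;                  /* Next slot history_store() writes to. */
    struct iter_stats sample[HISTORY_LEN];
};

/* Advance idx modulo HISTORY_LEN and zero the slot it now points at. */
static struct iter_stats *
history_next(struct history *h)
{
    size_t next_idx = (h->idx + 1) % HISTORY_LEN;
    struct iter_stats *next = &h->sample[next_idx];

    memset(next, 0, sizeof *next);
    h->idx = next_idx;
    return next;
}

/* Store a sample in the current slot, then advance to a cleared slot. */
static struct iter_stats *
history_store(struct history *h, struct iter_stats *is)
{
    if (is) {
        h->sample[h->idx] = *is;
    }
    return history_next(h);
}
```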
> +
> +/* Functions recording PMD metrics per iteration. */
> +
>  static inline void
>  pmd_perf_start_iteration(struct pmd_perf_stats *s)
>  {
> +    if (s->clear) {
> +        /* Clear the PMD stats before starting next iteration. */
> +        pmd_perf_stats_clear_lock(s);
> +    }
> +    /* Initialize the current interval stats. */
> +    memset(&s->current, 0, sizeof(struct iter_stats));
>      if (OVS_LIKELY(s->last_tsc)) {
>          /* We assume here that last_tsc was updated immediately prior at
>           * the end of the previous iteration, or just before the first
>           * iteration. */
> -        s->start_it_tsc = s->last_tsc;
> +        s->current.timestamp = s->last_tsc;
>      } else {
>          /* In case last_tsc has never been set before. */
> -        s->start_it_tsc = cycles_counter_update(s);
> +        s->current.timestamp = cycles_counter_update(s);
>      }
>  }
>  
>  static inline void
> -pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
> +pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> +                       int tx_packets, bool full_metrics)
>  {
> -    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
> +    uint64_t now_tsc = cycles_counter_update(s);
> +    struct iter_stats *cum_ms;
> +    uint64_t cycles, cycles_per_pkt = 0;
>  
> -    if (rx_packets > 0) {
> +    cycles = now_tsc - s->current.timestamp;
> +    s->current.cycles = cycles;
> +    s->current.pkts = rx_packets;
> +
> +    if (rx_packets + tx_packets > 0) {
>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
>      } else {
>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
>      }
> +    /* Add iteration samples to histograms. */
> +    histogram_add_sample(&s->cycles, cycles);
> +    histogram_add_sample(&s->pkts, rx_packets);
> +
> +    if (!full_metrics) {
> +        return;
> +    }
> +
> +    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
> +
> +    if (rx_packets > 0) {
> +        cycles_per_pkt = cycles / rx_packets;
> +        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
> +    }
> +    if (s->current.batches > 0) {
> +        histogram_add_sample(&s->pkts_per_batch,
> +                             rx_packets / s->current.batches);
> +    }
> +    histogram_add_sample(&s->upcalls, s->current.upcalls);
> +    if (s->current.upcalls > 0) {
> +        histogram_add_sample(&s->cycles_per_upcall,
> +                             s->current.upcall_cycles / s->current.upcalls);
> +    }
> +    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
> +
> +    /* Add iteration samples to millisecond stats. */
> +    cum_ms = history_current(&s->milliseconds);
> +    cum_ms->iterations++;
> +    cum_ms->cycles += cycles;
> +    if (rx_packets > 0) {
> +        cum_ms->busy_cycles += cycles;
> +    }
> +    cum_ms->pkts += s->current.pkts;
> +    cum_ms->upcalls += s->current.upcalls;
> +    cum_ms->upcall_cycles += s->current.upcall_cycles;
> +    cum_ms->batches += s->current.batches;
> +    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
> +
> +    /* Store in iteration history. This advances the iteration idx and
> +     * clears the next slot in the iteration history. */
> +    history_store(&s->iterations, &s->current);
> +    if (now_tsc > s->next_check_tsc) {
> +        /* Check if ms is completed and store in milliseconds history. */
> +        uint64_t now = time_msec();
> +        if (now != cum_ms->timestamp) {
> +            /* Add ms stats to totals. */
> +            s->totals.iterations += cum_ms->iterations;
> +            s->totals.cycles += cum_ms->cycles;
> +            s->totals.busy_cycles += cum_ms->busy_cycles;
> +            s->totals.pkts += cum_ms->pkts;
> +            s->totals.upcalls += cum_ms->upcalls;
> +            s->totals.upcall_cycles += cum_ms->upcall_cycles;
> +            s->totals.batches += cum_ms->batches;
> +            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
> +            cum_ms = history_next(&s->milliseconds);
> +            cum_ms->timestamp = now;
> +        }
> +        s->next_check_tsc = cycles_counter_update(s) + 10000;
> +    }
>  }
>  
> +/* Formatting the output of commands. */
> +
> +struct pmd_perf_params {
> +    int command_type;
> +    bool histograms;
> +    size_t iter_hist_len;
> +    size_t ms_hist_len;
> +};
> +
> +void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
> +                                   double duration);
> +void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
> +void pmd_perf_format_iteration_history(struct ds *str,
> +                                       struct pmd_perf_stats *s,
> +                                       int n_iter);
> +void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
> +                                int n_ms);
> +
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man
> new file mode 100644
> index 0000000..76c3e4e
> --- /dev/null
> +++ b/lib/dpif-netdev-unixctl.man
> @@ -0,0 +1,157 @@
> +.SS "DPIF-NETDEV COMMANDS"
> +These commands are used to expose internal information (mostly statistics)
> +about the "dpif-netdev" userspace datapath. If there is only one datapath
> +(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> +argument can be omitted. By default the commands present data for all pmd
> +threads in the datapath. By specifying the \fB-pmd\fR \fIcore\fR option one
> +can filter the output for a single pmd in the datapath.
> +.
> +.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +Shows performance statistics for one or all pmd threads of the datapath
> +\fIdp\fR. The special thread "main" sums up the statistics of every non-pmd
> +thread.
> +
> +The sum of "emc hits", "masked hits" and "miss" is the number of
> +packet lookups performed by the datapath. Beware that a recirculated packet
> +experiences one additional lookup per recirculation, so there may be
> +more lookups than forwarded packets in the datapath.
> +
> +Cycles are counted using the TSC or similar facilities (when available on
> +the platform). The duration of one cycle depends on the processing platform.
> +
> +"idle cycles" refers to cycles spent in PMD iterations not forwarding any
> +packets. "processing cycles" refers to cycles spent in PMD iterations
> +forwarding at least one packet, including the cost for polling, processing and
> +transmitting said packets.
> +
> +To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
> +.
> +.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> +Resets to zero the per pmd thread performance numbers shown by the
> +\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR commands.
> +It will NOT reset datapath or bridge statistics, only the values shown by
> +the above commands.
> +.
> +.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR \fIiter_len\fR] \
> +[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +Shows detailed performance metrics for one or all pmd threads of the
> +userspace datapath.
> +
> +The collection of detailed statistics can be controlled by the
> +configuration parameter \fBother_config:pmd-perf-metrics\fR. By default it
> +is disabled. The run-time overhead, when enabled, is on the order of 1%.
> +
> +The covered metrics per iteration are:
> +.RS
> +.IP
> +.PD .4v
> +.IP \(em
> +used cycles
> +.IP \(em
> +forwarded packets
> +.IP \(em
> +number of rx batches
> +.IP \(em
> +packets/rx batch
> +.IP \(em
> +max. vhostuser queue fill level
> +.IP \(em
> +number of upcalls
> +.IP \(em
> +cycles spent in upcalls
> +.PD
> +.RE
> +.IP
> +This raw recorded data is used threefold:
> +
> +.RS
> +.IP
> +.PD .4v
> +.IP 1.
> +In histograms for each of the following metrics:
> +.RS
> +.IP \(em
> +cycles/iteration (logarithmic)
> +.IP \(em
> +packets/iteration (logarithmic)
> +.IP \(em
> +cycles/packet
> +.IP \(em
> +packets/batch
> +.IP \(em
> +max. vhostuser qlen (logarithmic)
> +.IP \(em
> +upcalls
> +.IP \(em
> +cycles/upcall (logarithmic)
> +The histogram bins are assigned in linear or logarithmic steps.
> +.RE
> +.IP 2.
> +A cyclic history of the above metrics for the last 1000 iterations
> +.IP 3.
> +A cyclic history of the cumulative/average values per millisecond wall
> +clock for the last 1000 milliseconds:
> +.RS
> +.IP \(em
> +number of iterations
> +.IP \(em
> +avg. cycles/iteration
> +.IP \(em
> +packets (Kpps)
> +.IP \(em
> +avg. packets/batch
> +.IP \(em
> +avg. max vhost qlen
> +.IP \(em
> +upcalls
> +.IP \(em
> +avg. cycles/upcall
> +.RE
> +.PD
> +.RE
> +.IP
> +.
> +The command options are:
> +.RS
> +.IP "\fB-nh\fR"
> +Suppress the histograms.
> +.IP "\fB-it\fR \fIiter_len\fR"
> +Display the stats of the last \fIiter_len\fR iterations.
> +.IP "\fB-ms\fR \fIms_len\fR"
> +Display the stats of the last \fIms_len\fR milliseconds.
> +.RE
> +.IP
> +The output always contains the following global PMD statistics:
> +.RS
> +.IP
> +Time: 15:24:55.270
> +.br
> +Measurement duration: 1.008 s
> +
> +pmd thread numa_id 0 core_id 1:
> +
> +  Cycles:            2419034712  (2.40 GHz)
> +  Iterations:            572817  (1.76 us/it)
> +  - idle:                486808  (15.9 % cycles)
> +  - busy:                 86009  (84.1 % cycles)
> +  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
> +  Datapath passes:      3599415  (1.50 passes/pkt)
> +  - EMC hits:            336472  ( 9.3 %)
> +  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
> +  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
> +  - Lost upcalls:             0  ( 0.0 %)
> +  Tx packets:           2399607  (2381 Kpps)
> +  Tx batches:            171400  (14.00 pkts/batch)
> +.RE
> +.IP
> +Here "Rx packets" actually reflects the number of packets forwarded by the
> +datapath. "Datapath passes" matches the number of packet lookups as
> +reported by the \fBdpif-netdev/pmd-stats-show\fR command.
> +
> +To reset the counters and start a new measurement use
> +\fBdpif-netdev/pmd-stats-clear\fR.
> +.
> +.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
> +For one or all pmd threads of the datapath \fIdp\fR show the list of
> +queue-ids and port names that each thread polls.
> +.
> +.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> +Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 86d8739..f245ce2 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -49,6 +49,7 @@
>  #include "id-pool.h"
>  #include "latch.h"
>  #include "netdev.h"
> +#include "netdev-provider.h"
>  #include "netdev-vport.h"
>  #include "netlink.h"
>  #include "odp-execute.h"
> @@ -281,6 +282,8 @@ struct dp_netdev {
>  
>      /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
> +    /* Enable collection of PMD performance metrics. */
> +    atomic_bool pmd_perf_metrics;
>  
>      /* Protects access to ofproto-dpif-upcall interface during revalidator
>       * thread synchronization. */
> @@ -356,6 +359,7 @@ struct dp_netdev_rxq {
>                                            particular core. */
>      unsigned intrvl_idx;               /* Write index for 'cycles_intrvl'. */
>      struct dp_netdev_pmd_thread *pmd;  /* pmd thread that polls this queue. */
> +    bool is_vhost;                     /* Is rxq of a vhost port. */
>  
>      /* Counters of cycles spent successfully polling and processing pkts. */
>      atomic_ullong cycles[RXQ_N_CYCLES];
> @@ -717,6 +721,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
>  static void emc_clear_entry(struct emc_entry *ce);
>  
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
> +static inline bool
> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
>  
>  static void
>  emc_cache_init(struct emc_cache *flow_cache)
> @@ -800,7 +806,8 @@ get_dp_netdev(const struct dpif *dpif)
>  enum pmd_info_type {
>      PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
>      PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
> -    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
> +    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
> +    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
>  };
>  
>  static void
> @@ -891,6 +898,47 @@ pmd_info_show_stats(struct ds *reply,
>                    stats[PMD_CYCLES_ITER_BUSY], total_packets);
>  }
>  
> +static void
> +pmd_info_show_perf(struct ds *reply,
> +                   struct dp_netdev_pmd_thread *pmd,
> +                   struct pmd_perf_params *par)
> +{
> +    if (pmd->core_id != NON_PMD_CORE_ID) {
> +        char *time_str =
> +                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
> +        long long now = time_msec();
> +        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
> +
> +        ds_put_cstr(reply, "\n");
> +        ds_put_format(reply, "Time: %s\n", time_str);
> +        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
> +        ds_put_cstr(reply, "\n");
> +        format_pmd_thread(reply, pmd);
> +        ds_put_cstr(reply, "\n");
> +        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
> +        if (pmd_perf_metrics_enabled(pmd)) {
> +            /* Prevent parallel clearing of perf metrics. */
> +            ovs_mutex_lock(&pmd->perf_stats.clear_mutex);
> +            if (par->histograms) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_histograms(reply, &pmd->perf_stats);
> +            }
> +            if (par->iter_hist_len > 0) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_iteration_history(reply, &pmd->perf_stats,
> +                        par->iter_hist_len);
> +            }
> +            if (par->ms_hist_len > 0) {
> +                ds_put_cstr(reply, "\n");
> +                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
> +                        par->ms_hist_len);
> +            }
> +            ovs_mutex_unlock(&pmd->perf_stats.clear_mutex);
> +        }
> +        free(time_str);
> +    }
> +}
> +
>  static int
>  compare_poll_list(const void *a_, const void *b_)
>  {
> @@ -1068,7 +1116,7 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>      ovs_mutex_lock(&dp_netdev_mutex);
>  
>      while (argc > 1) {
> -        if (!strcmp(argv[1], "-pmd") && argc >= 3) {
> +        if (!strcmp(argv[1], "-pmd") && argc > 2) {
>              if (str_to_uint(argv[2], 10, &core_id)) {
>                  filter_on_pmd = true;
>              }
> @@ -1108,6 +1156,8 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>              pmd_perf_stats_clear(&pmd->perf_stats);
>          } else if (type == PMD_INFO_SHOW_STATS) {
>              pmd_info_show_stats(&reply, pmd);
> +        } else if (type == PMD_INFO_PERF_SHOW) {
> +            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>          }
>      }
>      free(pmd_list);
> @@ -1117,6 +1167,48 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>      unixctl_command_reply(conn, ds_cstr(&reply));
>      ds_destroy(&reply);
>  }
> +
> +static void
> +pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
> +                          const char *argv[],
> +                          void *aux OVS_UNUSED)
> +{
> +    struct pmd_perf_params par;
> +    long int it_hist = 0, ms_hist = 0;
> +    par.histograms = true;
> +
> +    while (argc > 1) {
> +        if (!strcmp(argv[1], "-nh")) {
> +            par.histograms = false;
> +            argc -= 1;
> +            argv += 1;
> +        } else if (!strcmp(argv[1], "-it") && argc > 2) {
> +            it_hist = strtol(argv[2], NULL, 10);
> +            if (it_hist < 0) {
> +                it_hist = 0;
> +            } else if (it_hist > HISTORY_LEN) {
> +                it_hist = HISTORY_LEN;
> +            }
> +            argc -= 2;
> +            argv += 2;
> +        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
> +            ms_hist = strtol(argv[2], NULL, 10);
> +            if (ms_hist < 0) {
> +                ms_hist = 0;
> +            } else if (ms_hist > HISTORY_LEN) {
> +                ms_hist = HISTORY_LEN;
> +            }
> +            argc -= 2;
> +            argv += 2;
> +        } else {
> +            break;
> +        }
> +    }
> +    par.iter_hist_len = it_hist;
> +    par.ms_hist_len = ms_hist;
> +    par.command_type = PMD_INFO_PERF_SHOW;
> +    dpif_netdev_pmd_info(conn, argc, argv, &par);
> +}
>  
>  static int
>  dpif_netdev_init(void)
> @@ -1134,6 +1226,12 @@ dpif_netdev_init(void)
>      unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd core] [dp]",
>                               0, 3, dpif_netdev_pmd_info,
>                               (void *)&poll_aux);
> +    unixctl_command_register("dpif-netdev/pmd-perf-show",
> +                             "[-nh] [-it iter-history-len]"
> +                             " [-ms ms-history-len]"
> +                             " [-pmd core] [dp]",
> +                             0, 8, pmd_perf_show_cmd,
> +                             NULL);
>      unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
>                               0, 1, dpif_netdev_pmd_rebalance,
>                               NULL);
> @@ -3020,6 +3118,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
>  
> +    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
> +    bool cur_perf_enabled;
> +    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
> +    if (perf_enabled != cur_perf_enabled) {
> +        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
> +        if (perf_enabled) {
> +            VLOG_INFO("PMD performance metrics collection enabled");
> +        } else {
> +            VLOG_INFO("PMD performance metrics collection disabled");
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -3189,6 +3299,21 @@ dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned idx)
>      return processing_cycles;
>  }
>  
> +static inline bool
> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)

lib/dpif-netdev.c:3308:61: error: unused parameter 'pmd' [-Werror,-Wunused-parameter]
pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
                                                            ^

Need to mark as OVS_UNUSED.

> +{
> +    /* If stores and reads of 64-bit integers are not atomic, the
> +     * full PMD performance metrics are not available as locked
> +     * access to 64 bit integers would be prohibitively expensive. */
> +#if ATOMIC_LLONG_LOCK_FREE

Hmm. I have the following in the configure log:

checking value of __atomic_always_lock_free(1)... 1
checking value of __atomic_always_lock_free(2)... 1
checking value of __atomic_always_lock_free(4)... 1
checking value of __atomic_always_lock_free(8)... 1

But 'pmd' is still reported unused, so ATOMIC_LLONG_LOCK_FREE must have
evaluated to 0 here despite the configure result. We need to use
ATOMIC_ALWAYS_LOCK_FREE_8B for now, or fix the ATOMIC_LLONG_LOCK_FREE macro.


> +    bool pmd_perf_enabled;
> +    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
> +    return pmd_perf_enabled;
> +#else
> +    return false;
> +#endif
> +}
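To make the two remarks above concrete: in the #else branch 'pmd' is never referenced, so the declaration needs the annotation. A standalone model of just that branch (OVS_UNUSED is stubbed locally here; in the tree it comes from lib/compiler.h, and the GCC/Clang attribute is assumed):

```c
#include <stdbool.h>
#include <stddef.h>

#ifndef OVS_UNUSED
#define OVS_UNUSED __attribute__((__unused__))   /* Local stand-in. */
#endif

struct dp_netdev_pmd_thread;     /* Opaque for this sketch. */

/* Models the branch taken when 64-bit loads/stores are not lock-free:
 * 'pmd' is unreferenced, so OVS_UNUSED is required to compile with
 * -Werror=unused-parameter. */
static inline bool
pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd OVS_UNUSED)
{
    return false;
}
```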
> +
>  static int
>  dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
>                                     struct tx_port *p)
> @@ -3264,10 +3389,12 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>                             struct dp_netdev_rxq *rxq,
>                             odp_port_t port_no)
>  {
> +    struct pmd_perf_stats *s = &pmd->perf_stats;
>      struct dp_packet_batch batch;
>      struct cycle_timer timer;
>      int error;
> -    int batch_cnt = 0, output_cnt = 0;
> +    int batch_cnt = 0;
> +    int rem_qlen = 0, *qlen_p = NULL;
>      uint64_t cycles;
>  
>      /* Measure duration for polling and processing rx burst. */
> @@ -3276,20 +3403,37 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>      pmd->ctx.last_rxq = rxq;
>      dp_packet_batch_init(&batch);
>  
> -    error = netdev_rxq_recv(rxq->rx, &batch, NULL);
> +    /* Fetch the rx queue length only for vhostuser ports. */
> +    if (pmd_perf_metrics_enabled(pmd) && rxq->is_vhost) {
> +        qlen_p = &rem_qlen;
> +    }
> +
> +    error = netdev_rxq_recv(rxq->rx, &batch, qlen_p);
>      if (!error) {
>          /* At least one packet received. */
>          *recirc_depth_get() = 0;
>          pmd_thread_ctx_time_update(pmd);
> -
>          batch_cnt = batch.count;
> +        if (pmd_perf_metrics_enabled(pmd)) {
> +            /* Update batch histogram. */
> +            s->current.batches++;
> +            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
> +            /* Update the maximum vhost rx queue fill level. */
> +            if (rxq->is_vhost && rem_qlen >= 0) {
> +                uint32_t qfill = batch_cnt + rem_qlen;
> +                if (qfill > s->current.max_vhost_qfill) {
> +                    s->current.max_vhost_qfill = qfill;
> +                }
> +            }
> +        }
> +        /* Process packet batch. */
>          dp_netdev_input(pmd, &batch, port_no);
>  
>          /* Assign processing cycles to rx queue. */
>          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
>          dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
>  
> -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
> +        dp_netdev_pmd_flush_output_packets(pmd, false);
>      } else {
>          /* Discard cycles. */
>          cycle_timer_stop(&pmd->perf_stats, &timer);
> @@ -3303,7 +3447,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>  
>      pmd->ctx.last_rxq = NULL;
>  
> -    return batch_cnt + output_cnt;
> +    return batch_cnt;
>  }
>  
>  static struct tx_port *
> @@ -3359,6 +3503,7 @@ port_reconfigure(struct dp_netdev_port *port)
>          }
>  
>          port->rxqs[i].port = port;
> +        port->rxqs[i].is_vhost = !strncmp(port->type, "dpdkvhost", 9);
>  
>          err = netdev_rxq_open(netdev, &port->rxqs[i].rx, i);
>          if (err) {
> @@ -4137,23 +4282,26 @@ reload:
>      pmd->intrvl_tsc_prev = 0;
>      atomic_store_relaxed(&pmd->intrvl_cycles, 0);
>      cycles_counter_update(s);
> +    /* Protect pmd stats from external clearing while polling. */
> +    ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
>      for (;;) {
> -        uint64_t iter_packets = 0;
> +        uint64_t rx_packets = 0, tx_packets = 0;
>  
>          pmd_perf_start_iteration(s);
> +
>          for (i = 0; i < poll_cnt; i++) {
>              process_packets =
>                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
>                                             poll_list[i].port_no);
> -            iter_packets += process_packets;
> +            rx_packets += process_packets;
>          }
>  
> -        if (!iter_packets) {
> +        if (!rx_packets) {
>              /* We didn't receive anything in the process loop.
>               * Check if we need to send something.
>               * There was no time updates on current iteration. */
>              pmd_thread_ctx_time_update(pmd);
> -            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>          }
>  
>          if (lc++ > 1024) {
> @@ -4172,8 +4320,10 @@ reload:
>                  break;
>              }
>          }
> -        pmd_perf_end_iteration(s, iter_packets);
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> +                               pmd_perf_metrics_enabled(pmd));
>      }
> +    ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
>  
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>      exiting = latch_is_set(&pmd->exit_latch);
> @@ -5068,6 +5218,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>      struct match match;
>      ovs_u128 ufid;
>      int error;
> +    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
>  
>      match.tun_md.valid = false;
>      miniflow_expand(&key->mf, &match.flow);
> @@ -5121,6 +5272,14 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>          ovs_mutex_unlock(&pmd->flow_mutex);
>          emc_probabilistic_insert(pmd, key, netdev_flow);
>      }
> +    if (pmd_perf_metrics_enabled(pmd)) {
> +        /* Update upcall stats. */
> +        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
> +        struct pmd_perf_stats *s = &pmd->perf_stats;
> +        s->current.upcalls++;
> +        s->current.upcall_cycles += cycles;
> +        histogram_add_sample(&s->cycles_per_upcall, cycles);
> +    }
>      return error;
>  }
>  
> diff --git a/manpages.mk b/manpages.mk
> index d4bf0ec..aaf8bc2 100644
> --- a/manpages.mk
> +++ b/manpages.mk
> @@ -250,6 +250,7 @@ vswitchd/ovs-vswitchd.8: \
>  	lib/coverage-unixctl.man \
>  	lib/daemon.man \
>  	lib/dpctl.man \
> +	lib/dpif-netdev-unixctl.man \
>  	lib/memory-unixctl.man \
>  	lib/netdev-dpdk-unixctl.man \
>  	lib/service.man \
> @@ -266,6 +267,7 @@ lib/common.man:
>  lib/coverage-unixctl.man:
>  lib/daemon.man:
>  lib/dpctl.man:
> +lib/dpif-netdev-unixctl.man:
>  lib/memory-unixctl.man:
>  lib/netdev-dpdk-unixctl.man:
>  lib/service.man:
> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
> index 80e5f53..8b4034d 100644
> --- a/vswitchd/ovs-vswitchd.8.in
> +++ b/vswitchd/ovs-vswitchd.8.in
> @@ -256,32 +256,7 @@ type).
>  ..
>  .so lib/dpctl.man
>  .
> -.SS "DPIF-NETDEV COMMANDS"
> -These commands are used to expose internal information (mostly statistics)
> -about the ``dpif-netdev'' userspace datapath. If there is only one datapath
> -(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
> -argument can be omitted.
> -.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
> -Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
> -The special thread ``main'' sums up the statistics of every non pmd thread.
> -The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
> -packets received by the datapath.  Cycles are counted using the TSC or similar
> -facilities (when available on the platform).  To reset these counters use
> -\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
> -measuring infrastructure. ``idle cycles'' refers to cycles spent polling
> -devices but not receiving any packets. ``processing cycles'' refers to cycles
> -spent polling devices and successfully receiving packets, plus the cycles
> -spent processing said packets.
> -.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
> -Resets to zero the per pmd thread performance numbers shown by the
> -\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
> -bridge statistics, only the values shown by the above command.
> -.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
> -For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
> -port names, which this thread polls.
> -.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
> -Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
> -.
> +.so lib/dpif-netdev-unixctl.man
>  .so lib/netdev-dpdk-unixctl.man
>  .so ofproto/ofproto-dpif-unixctl.man
>  .so ofproto/ofproto-unixctl.man
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index f899a19..aac663f 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -375,6 +375,18 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="pmd-perf-metrics"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Enables recording of detailed PMD performance metrics for analysis
> +          and trouble-shooting. This can have a performance impact in the
> +          order of 1%.
> +        </p>
> +        <p>
> +          Defaults to false but can be changed at any time.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="n-handler-threads"
>                type='{"type": "integer", "minInteger": 1}'>
>          <p>
>
Stokes, Ian March 27, 2018, 2:20 p.m. UTC | #5
> Comments inline.
> 
> Best regards, Ilya Maximets.
> 
> On 18.03.2018 20:55, Jan Scheurich wrote:
> > This patch instruments the dpif-netdev datapath to record detailed
> > statistics of what is happening in every iteration of a PMD thread.
> >
> > The collection of detailed statistics can be controlled by a new
> > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> > By default it is disabled. The run-time overhead, when enabled, is
> > in the order of 1%.
> >

[snip]

> > +    }
> > +    if (tx_packets > 0) {
> > +        ds_put_format(str,
> > +            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
> > +            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
> > +            "\n",
> > +            tx_packets, (tx_packets / duration) / 1000,
> > +            tx_batches, 1.0 * tx_packets / tx_batches);
> > +    } else {
> > +        ds_put_format(str,
> > +                "  Tx packets:      %12"PRIu64"\n"
> > +                "\n",
> > +                0ULL);
> 
> I have a few interesting warnings on 64bit ARMv8.
> 
> Clang:
> 
> lib/dpif-netdev-perf.c:216:17: error: format specifies type 'unsigned
> long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
>                 0ULL);
>                 ^~~~
> lib/dpif-netdev-perf.c:229:17: error: format specifies type 'unsigned
> long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
>                 0ULL);
>                 ^~~~
> 
> GCC:
> 
> lib/dpif-netdev-perf.c: In function ‘pmd_perf_format_overall_stats’:
> lib/dpif-netdev-perf.c:215:17: error: format ‘%lu’ expects argument of
> type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’
> [-Werror=format=]
>                  "  Rx packets:      %12"PRIu64"\n",
>                  ^
> lib/dpif-netdev-perf.c:227:17: error: format ‘%lu’ expects argument of
> type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’
> [-Werror=format=]
>                  "  Tx packets:      %12"PRIu64"\n"
>                  ^
> 
> Both are coming from the fact that PRIu64 expands to '%lu'.
> Why we need this printing at all? Can we just print 0 in a string?
> Otherwise, the only way to fix these warnings is to cast 0 directly to
> uint64_t.

I see the same in Travis.

In the v9 of the series the format used was 0UL. This allowed compilation in Travis except for when compiling OVS with the 32 bit flag.
From the logs the introduction of 0ULL seems to avoid the issue for 32 bit compilation but introduces the problem for 64 bit compilation.

I don’t see a way around it either without casting.

Ian
Jan Scheurich March 27, 2018, 2:28 p.m. UTC | #6
> -----Original Message-----
> From: Stokes, Ian [mailto:ian.stokes@intel.com]
> Sent: Tuesday, 27 March, 2018 16:21
> To: Ilya Maximets <i.maximets@samsung.com>; Jan Scheurich <jan.scheurich@ericsson.com>; dev@openvswitch.org
> Cc: ktraynor@redhat.com; O Mahony, Billy <billy.o.mahony@intel.com>
> Subject: RE: [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> > Comments inline.
> >
> > Best regards, Ilya Maximets.
> >
> > On 18.03.2018 20:55, Jan Scheurich wrote:
> > > This patch instruments the dpif-netdev datapath to record detailed
> > > statistics of what is happening in every iteration of a PMD thread.
> > >
> > > The collection of detailed statistics can be controlled by a new
> > > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> > > By default it is disabled. The run-time overhead, when enabled, is
> > > in the order of 1%.
> > >
> 
> [snip]
> 
> > > +    }
> > > +    if (tx_packets > 0) {
> > > +        ds_put_format(str,
> > > +            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
> > > +            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
> > > +            "\n",
> > > +            tx_packets, (tx_packets / duration) / 1000,
> > > +            tx_batches, 1.0 * tx_packets / tx_batches);
> > > +    } else {
> > > +        ds_put_format(str,
> > > +                "  Tx packets:      %12"PRIu64"\n"
> > > +                "\n",
> > > +                0ULL);
> >
> > I have a few interesting warnings on 64bit ARMv8.
> >
> > Clang:
> >
> > lib/dpif-netdev-perf.c:216:17: error: format specifies type 'unsigned
> > long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
> >                 0ULL);
> >                 ^~~~
> > lib/dpif-netdev-perf.c:229:17: error: format specifies type 'unsigned
> > long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
> >                 0ULL);
> >                 ^~~~
> >
> > GCC:
> >
> > lib/dpif-netdev-perf.c: In function ‘pmd_perf_format_overall_stats’:
> > lib/dpif-netdev-perf.c:215:17: error: format ‘%lu’ expects argument of
> > type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’
> > [-Werror=format=]
> >                  "  Rx packets:      %12"PRIu64"\n",
> >                  ^
> > lib/dpif-netdev-perf.c:227:17: error: format ‘%lu’ expects argument of
> > type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’
> > [-Werror=format=]
> >                  "  Tx packets:      %12"PRIu64"\n"
> >                  ^
> >
> > Both are coming from the fact that PRIu64 expands to '%lu'.
> > Why we need this printing at all? Can we just print 0 in a string?
> > Otherwise, the only way to fix these warnings is to cast 0 directly to
> > uint64_t.
> 
> I see the same in Travis.
> 
> In the v9 of the series the format used was 0UL. This allowed compilation in Travis except for when compiling OVS with the 32 bit flag.
> From the logs the introduction of 0ULL seems to avoid the issue for 32 bit compilation but introduces the problem for 64 bit
> compilation.
> 
> I don’t see a way around it either without casting.
> 
> Ian

I'll work around this by printing "0" as a string :-)
diff mbox series

Patch

diff --git a/NEWS b/NEWS
index 8d0b502..8f66fd3 100644
--- a/NEWS
+++ b/NEWS
@@ -73,6 +73,9 @@  v2.9.0 - 19 Feb 2018
      * Add support for vHost dequeue zero copy (experimental)
    - Userspace datapath:
      * Output packet batching support.
+     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD.
+     * Detailed PMD performance metrics available with new command
+         ovs-appctl dpif-netdev/pmd-perf-show
    - vswitchd:
      * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
      * Configuring a controller, or unconfiguring all controllers, now deletes
diff --git a/lib/automake.mk b/lib/automake.mk
index 5c26e0f..7a5632d 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -484,6 +484,7 @@  MAN_FRAGMENTS += \
 	lib/dpctl.man \
 	lib/memory-unixctl.man \
 	lib/netdev-dpdk-unixctl.man \
+	lib/dpif-netdev-unixctl.man \
 	lib/ofp-version.man \
 	lib/ovs.tmac \
 	lib/service.man \
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index f06991a..2b36410 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -15,18 +15,324 @@ 
  */
 
 #include <config.h>
+#include <stdint.h>
 
+#include "dpif-netdev-perf.h"
 #include "openvswitch/dynamic-string.h"
 #include "openvswitch/vlog.h"
-#include "dpif-netdev-perf.h"
+#include "ovs-thread.h"
 #include "timeval.h"
 
 VLOG_DEFINE_THIS_MODULE(pmd_perf);
 
+#ifdef DPDK_NETDEV
+static uint64_t
+get_tsc_hz(void)
+{
+    return rte_get_tsc_hz();
+}
+#else
+/* This function is only invoked from PMD threads which depend on DPDK.
+ * A dummy function is sufficient when building without DPDK_NETDEV. */
+static uint64_t
+get_tsc_hz(void)
+{
+    return 1;
+}
+#endif
+
+/* Histogram functions. */
+
+static void
+histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
+{
+    int i;
+
+    ovs_assert(min < max);
+    for (i = 0; i < NUM_BINS-1; i++) {
+        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
+    }
+    hist->wall[NUM_BINS-1] = UINT32_MAX;
+}
+
+static void
+histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
+{
+    int i, start, bins, wall;
+    double log_min, log_max;
+
+    ovs_assert(min < max);
+    if (min > 0) {
+        log_min = log(min);
+        log_max = log(max);
+        start = 0;
+        bins = NUM_BINS - 1;
+    } else {
+        hist->wall[0] = 0;
+        log_min = log(1);
+        log_max = log(max);
+        start = 1;
+        bins = NUM_BINS - 2;
+    }
+    wall = start;
+    for (i = 0; i < bins; i++) {
+        /* Make sure each wall is monotonically increasing. */
+        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
+        hist->wall[start + i] = wall++;
+    }
+    if (hist->wall[NUM_BINS-2] < max) {
+        hist->wall[NUM_BINS-2] = max;
+    }
+    hist->wall[NUM_BINS-1] = UINT32_MAX;
+}
+
+uint64_t
+histogram_samples(const struct histogram *hist)
+{
+    uint64_t samples = 0;
+
+    for (int i = 0; i < NUM_BINS; i++) {
+        samples += hist->bin[i];
+    }
+    return samples;
+}
+
+static void
+histogram_clear(struct histogram *hist)
+{
+    int i;
+
+    for (i = 0; i < NUM_BINS; i++) {
+        hist->bin[i] = 0;
+    }
+}
+
+static void
+history_init(struct history *h)
+{
+    memset(h, 0, sizeof(*h));
+}
+
 void
 pmd_perf_stats_init(struct pmd_perf_stats *s)
 {
-    memset(s, 0 , sizeof(*s));
+    memset(s, 0, sizeof(*s));
+    ovs_mutex_init(&s->stats_mutex);
+    ovs_mutex_init(&s->clear_mutex);
+    histogram_walls_set_log(&s->cycles, 500, 24000000);
+    histogram_walls_set_log(&s->pkts, 0, 1000);
+    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
+    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
+    histogram_walls_set_lin(&s->upcalls, 0, 30);
+    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
+    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
+    s->start_ms = time_msec();
+}
+
+void
+pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
+                              double duration)
+{
+    uint64_t stats[PMD_N_STATS];
+    double us_per_cycle = 1000000.0 / get_tsc_hz();
+
+    if (duration == 0) {
+        return;
+    }
+
+    pmd_perf_read_counters(s, stats);
+    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
+                          stats[PMD_CYCLES_ITER_BUSY];
+    uint64_t rx_packets = stats[PMD_STAT_RECV];
+    uint64_t tx_packets = stats[PMD_STAT_SENT_PKTS];
+    uint64_t tx_batches = stats[PMD_STAT_SENT_BATCHES];
+    uint64_t passes = stats[PMD_STAT_RECV] +
+                      stats[PMD_STAT_RECIRC];
+    uint64_t upcalls = stats[PMD_STAT_MISS];
+    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
+    uint64_t tot_iter = histogram_samples(&s->pkts);
+    uint64_t idle_iter = s->pkts.bin[0];
+    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+
+    ds_put_format(str,
+            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
+            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
+            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
+            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
+            tot_cycles, (tot_cycles / duration) / 1E9,
+            tot_iter, tot_cycles * us_per_cycle / tot_iter,
+            idle_iter,
+            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
+            busy_iter,
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+    if (rx_packets > 0) {
+        ds_put_format(str,
+            "  Rx packets:      %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
+            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
+            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
+            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl lookups/"
+                                                                     "hit)\n"
+            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f us/upcall)\n"
+            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n",
+            rx_packets, (rx_packets / duration) / 1000,
+            1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets,
+            passes, rx_packets ? 1.0 * passes / rx_packets : 0,
+            stats[PMD_STAT_EXACT_HIT],
+            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
+            stats[PMD_STAT_MASKED_HIT],
+            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
+            stats[PMD_STAT_MASKED_HIT]
+            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
+            : 0,
+            upcalls, 100.0 * upcalls / passes,
+            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
+            stats[PMD_STAT_LOST],
+            100.0 * stats[PMD_STAT_LOST] / passes);
+    } else {
+        ds_put_format(str,
+                "  Rx packets:      %12"PRIu64"\n",
+                0ULL);
+    }
+    if (tx_packets > 0) {
+        ds_put_format(str,
+            "  Tx packets:      %12"PRIu64"  (%.0f Kpps)\n"
+            "  Tx batches:      %12"PRIu64"  (%.2f pkts/batch)"
+            "\n",
+            tx_packets, (tx_packets / duration) / 1000,
+            tx_batches, 1.0 * tx_packets / tx_batches);
+    } else {
+        ds_put_format(str,
+                "  Tx packets:      %12"PRIu64"\n"
+                "\n",
+                0ULL);
+    }
+}
+
+void
+pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
+{
+    int i;
+
+    ds_put_cstr(str, "Histograms\n");
+    ds_put_format(str,
+                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
+                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
+                  "max vhost qlen", "upcalls/it", "cycles/upcall");
+    for (i = 0; i < NUM_BINS-1; i++) {
+        ds_put_format(str,
+            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
+            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
+            "  %-9d %-11"PRIu64"\n",
+            s->cycles.wall[i], s->cycles.bin[i],
+            s->pkts.wall[i],s->pkts.bin[i],
+            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
+            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
+            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
+            s->upcalls.wall[i], s->upcalls.bin[i],
+            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
+    }
+    ds_put_format(str,
+                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
+                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s %-11"PRIu64
+                  "  %-9s %-11"PRIu64"\n",
+                  ">", s->cycles.bin[i],
+                  ">", s->pkts.bin[i],
+                  ">", s->cycles_per_pkt.bin[i],
+                  ">", s->pkts_per_batch.bin[i],
+                  ">", s->max_vhost_qfill.bin[i],
+                  ">", s->upcalls.bin[i],
+                  ">", s->cycles_per_upcall.bin[i]);
+    if (s->totals.iterations > 0) {
+        ds_put_cstr(str,
+                    "-----------------------------------------------------"
+                    "-----------------------------------------------------"
+                    "------------------------------------------------\n");
+        ds_put_format(str,
+                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
+                      "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
+                      "vhost qlen", "upcalls/it", "cycles/upcall");
+        ds_put_format(str,
+                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
+                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
+                      s->totals.cycles / s->totals.iterations,
+                      1.0 * s->totals.pkts / s->totals.iterations,
+                      s->totals.pkts
+                          ? s->totals.busy_cycles / s->totals.pkts : 0,
+                      s->totals.batches
+                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
+                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
+                      1.0 * s->totals.upcalls / s->totals.iterations,
+                      s->totals.upcalls
+                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
+    }
+}
+
+void
+pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
+                                  int n_iter)
+{
+    struct iter_stats *is;
+    size_t index;
+    int i;
+
+    if (n_iter == 0) {
+        return;
+    }
+    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
+                  "%-10s   %-10s   %-10s\n",
+                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
+                  "vhost qlen", "upcalls", "cycles/upcall");
+    for (i = 1; i <= n_iter; i++) {
+        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;
+        is = &s->iterations.sample[index];
+        ds_put_format(str,
+                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
+                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
+                      "  %-11"PRIu32"  %-11"PRIu32"\n",
+                      is->timestamp,
+                      is->cycles,
+                      is->pkts,
+                      is->pkts ? is->cycles / is->pkts : 0,
+                      is->batches ? is->pkts / is->batches : 0,
+                      is->max_vhost_qfill,
+                      is->upcalls,
+                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
+    }
+}
+
+void
+pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
+{
+    struct iter_stats *is;
+    size_t index;
+    int i;
+
+    if (n_ms == 0) {
+        return;
+    }
+    ds_put_format(str,
+                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
+                  "   %-10s   %-10s   %-10s   %-10s\n",
+                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
+                  "pkts/batch", "vhost qlen", "upcalls", "cycles/upcall");
+    for (i = 1; i <= n_ms; i++) {
+        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;
+        is = &s->milliseconds.sample[index];
+        ds_put_format(str,
+                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
+                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
+                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
+                      is->timestamp,
+                      is->iterations,
+                      is->iterations ? is->cycles / is->iterations : 0,
+                      is->pkts,
+                      is->pkts ? is->busy_cycles / is->pkts : 0,
+                      is->batches ? is->pkts / is->batches : 0,
+                      is->iterations
+                          ? is->max_vhost_qfill / is->iterations : 0,
+                      is->upcalls,
+                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
+    }
 }
 
 void
@@ -51,10 +357,48 @@  pmd_perf_read_counters(struct pmd_perf_stats *s,
     }
 }
 
+/* This function clears the PMD performance counters from within the PMD
+ * thread or from another thread when the PMD thread is not executing its
+ * poll loop. */
 void
-pmd_perf_stats_clear(struct pmd_perf_stats *s)
+pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
+    OVS_REQUIRES(s->stats_mutex)
 {
+    ovs_mutex_lock(&s->clear_mutex);
     for (int i = 0; i < PMD_N_STATS; i++) {
         atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
     }
+    /* The following stats are only applicable in the PMD thread. */
+    memset(&s->current, 0, sizeof(struct iter_stats));
+    memset(&s->totals, 0, sizeof(struct iter_stats));
+    histogram_clear(&s->cycles);
+    histogram_clear(&s->pkts);
+    histogram_clear(&s->cycles_per_pkt);
+    histogram_clear(&s->upcalls);
+    histogram_clear(&s->cycles_per_upcall);
+    histogram_clear(&s->pkts_per_batch);
+    histogram_clear(&s->max_vhost_qfill);
+    history_init(&s->iterations);
+    history_init(&s->milliseconds);
+    s->start_ms = time_msec();
+    s->milliseconds.sample[0].timestamp = s->start_ms;
+    /* Clearing finished. */
+    s->clear = false;
+    ovs_mutex_unlock(&s->clear_mutex);
+}
+
+/* This function can be called from anywhere to clear the stats
+ * of PMD and non-PMD threads. */
+void
+pmd_perf_stats_clear(struct pmd_perf_stats *s)
+{
+    if (ovs_mutex_trylock(&s->stats_mutex) == 0) {
+        /* Locking successful. PMD not polling. */
+        pmd_perf_stats_clear_lock(s);
+        ovs_mutex_unlock(&s->stats_mutex);
+    } else {
+        /* Request the polling PMD to clear the stats. There is no need to
+         * block here as stats retrieval is prevented during clearing. */
+        s->clear = true;
+    }
 }
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 5993c25..fd9b0fc 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -38,10 +38,18 @@ 
 extern "C" {
 #endif
 
-/* This module encapsulates data structures and functions to maintain PMD
- * performance metrics such as packet counters, execution cycles. It
- * provides a clean API for dpif-netdev to initialize, update and read and
+/* This module encapsulates data structures and functions to maintain basic PMD
+ * performance metrics such as packet counters, execution cycles as well as
+ * histograms and time series recording for more detailed PMD metrics.
+ *
+ * It provides a clean API for dpif-netdev to initialize, update, read and
  * reset these metrics.
+ *
+ * The basic set of PMD counters is implemented as atomic_uint64_t variables
+ * to guarantee correct reads also on 32-bit systems.
+ *
+ * The detailed PMD performance metrics are only supported on 64-bit systems
+ * with atomic 64-bit read and store semantics for plain uint64_t counters.
  */
 
 /* Set of counter types maintained in pmd_perf_stats. */
@@ -66,6 +74,7 @@  enum pmd_stat_type {
     PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
     PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
+    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
     PMD_N_STATS
 };
 
@@ -81,18 +90,87 @@  struct pmd_counters {
     uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
 };
 
-/* Container for all performance metrics of a PMD.
- * Part of the struct dp_netdev_pmd_thread. */
+/* Data structure to collect statistical distribution of an integer measurement
+ * type in form of a histogram. The wall[] array contains the inclusive
+ * upper boundaries of the bins, while the bin[] array contains the actual
+ * counters per bin. The histogram walls are typically set automatically
+ * using the functions provided below. */
+
+#define NUM_BINS 32             /* Number of histogram bins. */
+
+struct histogram {
+    uint32_t wall[NUM_BINS];
+    uint64_t bin[NUM_BINS];
+};
+
+/* Data structure to record detailed PMD execution metrics per iteration for
+ * a history period of up to HISTORY_LEN iterations in a circular buffer.
+ * Also used to record up to HISTORY_LEN millisecond averages/totals of these
+ * metrics. */
+
+struct iter_stats {
+    uint64_t timestamp;         /* TSC or millisecond. */
+    uint64_t cycles;            /* Number of TSC cycles spent in it. or ms. */
+    uint64_t busy_cycles;       /* Cycles spent in busy iterations or ms. */
+    uint32_t iterations;        /* Iterations in ms. */
+    uint32_t pkts;              /* Packets processed in iteration or ms. */
+    uint32_t upcalls;           /* Number of upcalls in iteration or ms. */
+    uint32_t upcall_cycles;     /* Cycles spent in upcalls in it. or ms. */
+    uint32_t batches;           /* Number of rx batches in iteration or ms. */
+    uint32_t max_vhost_qfill;   /* Maximum fill level in iteration or ms. */
+};
+
+#define HISTORY_LEN 1000        /* Length of recorded history
+                                   (iterations and ms). */
+#define DEF_HIST_SHOW 20        /* Default number of history samples to
+                                   display. */
+
+struct history {
+    size_t idx;                 /* Slot to which next call to history_store()
+                                   will write. */
+    struct iter_stats sample[HISTORY_LEN];
+};
+
+/* Container for all performance metrics of a PMD within the struct
+ * dp_netdev_pmd_thread. The metrics must be updated from within the PMD
+ * thread but can be read from any thread. The basic PMD counters in
+ * struct pmd_counters can be read without protection against concurrent
+ * clearing. The other metrics may only be safely read with the clear_mutex
+ * held to protect against concurrent clearing. */
 
 struct pmd_perf_stats {
-    /* Start of the current PMD iteration in TSC cycles.*/
-    uint64_t start_it_tsc;
+    /* Prevents interference between PMD polling and stats clearing. */
+    struct ovs_mutex stats_mutex;
+    /* Set by CLI thread to order clearing of PMD stats. */
+    volatile bool clear;
+    /* Prevents stats retrieval while clearing is in progress. */
+    struct ovs_mutex clear_mutex;
+    /* Start of the current performance measurement period. */
+    uint64_t start_ms;
     /* Latest TSC time stamp taken in PMD. */
     uint64_t last_tsc;
+    /* Used to space certain checks in time. */
+    uint64_t next_check_tsc;
     /* If non-NULL, outermost cycle timer currently running in PMD. */
     struct cycle_timer *cur_timer;
     /* Set of PMD counters with their zero offsets. */
     struct pmd_counters counters;
+    /* Statistics of the current iteration. */
+    struct iter_stats current;
+    /* Totals for the current millisecond. */
+    struct iter_stats totals;
+    /* Histograms for the PMD metrics. */
+    struct histogram cycles;
+    struct histogram pkts;
+    struct histogram cycles_per_pkt;
+    struct histogram upcalls;
+    struct histogram cycles_per_upcall;
+    struct histogram pkts_per_batch;
+    struct histogram max_vhost_qfill;
+    /* Iteration history buffer. */
+    struct history iterations;
+    /* Millisecond history buffer. */
+    struct history milliseconds;
 };
 
 /* Support for accurate timing of PMD execution on TSC clock cycle level.
@@ -175,8 +253,14 @@  cycle_timer_stop(struct pmd_perf_stats *s,
     return now - timer->start;
 }
 
+/* Functions to initialize and reset the PMD performance metrics. */
+
 void pmd_perf_stats_init(struct pmd_perf_stats *s);
 void pmd_perf_stats_clear(struct pmd_perf_stats *s);
+void pmd_perf_stats_clear_lock(struct pmd_perf_stats *s);
+
+/* Functions to read and update PMD counters. */
+
 void pmd_perf_read_counters(struct pmd_perf_stats *s,
                             uint64_t stats[PMD_N_STATS]);
 
@@ -199,32 +283,182 @@  pmd_perf_update_counter(struct pmd_perf_stats *s,
     atomic_store_relaxed(&s->counters.n[counter], tmp);
 }
 
+/* Functions to manipulate a sample history. */
+
+static inline void
+histogram_add_sample(struct histogram *hist, uint32_t val)
+{
+    /* TODO: Can do better with binary search? */
+    for (int i = 0; i < NUM_BINS-1; i++) {
+        if (val <= hist->wall[i]) {
+            hist->bin[i]++;
+            return;
+        }
+    }
+    hist->bin[NUM_BINS-1]++;
+}
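
The linear bin scan above is easy to verify in isolation. A minimal stand-alone sketch (with hypothetical bin walls; the real `struct histogram` in dpif-netdev-perf.h fills its `wall` array linearly or logarithmically per metric) behaves as follows:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BINS 4   /* Reduced from the real header for illustration. */

/* Simplified histogram: 'wall' holds the upper bound of each bin,
 * the last bin catches all larger samples. */
struct mini_hist {
    uint32_t wall[NUM_BINS];
    uint64_t bin[NUM_BINS];
};

/* Same linear scan as histogram_add_sample(): the sample lands in the
 * first bin whose wall is >= val, or in the overflow bin. */
static void
mini_hist_add(struct mini_hist *h, uint32_t val)
{
    for (int i = 0; i < NUM_BINS - 1; i++) {
        if (val <= h->wall[i]) {
            h->bin[i]++;
            return;
        }
    }
    h->bin[NUM_BINS - 1]++;
}
```

The per-iteration cost is O(NUM_BINS) in the worst case, which is why the author's TODO notes that a binary search over the sorted walls could do better.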
+
+uint64_t histogram_samples(const struct histogram *hist);
+
+/* Add an offset to idx modulo HISTORY_LEN. */
+static inline uint32_t
+history_add(uint32_t idx, uint32_t offset)
+{
+    return (idx + offset) % HISTORY_LEN;
+}
+
+/* Subtract idx2 from idx1 modulo HISTORY_LEN. */
+static inline uint32_t
+history_sub(uint32_t idx1, uint32_t idx2)
+{
+    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
+}
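
The two modular index helpers implement a fixed-size ring buffer over the history arrays. A self-contained sketch (assuming HISTORY_LEN is 1000, consistent with the 999-sample iteration history in the commit message) illustrates the wrap-around:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value; the real constant lives in dpif-netdev-perf.h. */
#define HISTORY_LEN 1000

/* Wrap-around add, as in history_add(). */
static inline uint32_t
ring_add(uint32_t idx, uint32_t offset)
{
    return (idx + offset) % HISTORY_LEN;
}

/* Wrap-around subtract, as in history_sub(). Adding HISTORY_LEN before
 * subtracting keeps the intermediate value non-negative, which matters
 * for unsigned arithmetic. */
static inline uint32_t
ring_sub(uint32_t idx1, uint32_t idx2)
{
    return (idx1 + HISTORY_LEN - idx2) % HISTORY_LEN;
}
```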
+
+static inline struct iter_stats *
+history_current(struct history *h)
+{
+    return &h->sample[h->idx];
+}
+
+static inline struct iter_stats *
+history_next(struct history *h)
+{
+    size_t next_idx = (h->idx + 1) % HISTORY_LEN;
+    struct iter_stats *next = &h->sample[next_idx];
+
+    memset(next, 0, sizeof(*next));
+    h->idx = next_idx;
+    return next;
+}
+
+static inline struct iter_stats *
+history_store(struct history *h, struct iter_stats *is)
+{
+    if (is) {
+        h->sample[h->idx] = *is;
+    }
+    /* Advance the history pointer. */
+    return history_next(h);
+}
+
+/* Functions recording PMD metrics per iteration. */
+
 static inline void
 pmd_perf_start_iteration(struct pmd_perf_stats *s)
 {
+    if (s->clear) {
+        /* Clear the PMD stats before starting next iteration. */
+        pmd_perf_stats_clear_lock(s);
+    }
+    /* Initialize the current interval stats. */
+    memset(&s->current, 0, sizeof(struct iter_stats));
     if (OVS_LIKELY(s->last_tsc)) {
         /* We assume here that last_tsc was updated immediately prior at
          * the end of the previous iteration, or just before the first
          * iteration. */
-        s->start_it_tsc = s->last_tsc;
+        s->current.timestamp = s->last_tsc;
     } else {
         /* In case last_tsc has never been set before. */
-        s->start_it_tsc = cycles_counter_update(s);
+        s->current.timestamp = cycles_counter_update(s);
     }
 }
 
 static inline void
-pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
+pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
+                       int tx_packets, bool full_metrics)
 {
-    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
+    uint64_t now_tsc = cycles_counter_update(s);
+    struct iter_stats *cum_ms;
+    uint64_t cycles, cycles_per_pkt = 0;
 
-    if (rx_packets > 0) {
+    cycles = now_tsc - s->current.timestamp;
+    s->current.cycles = cycles;
+    s->current.pkts = rx_packets;
+
+    if (rx_packets + tx_packets > 0) {
         pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
     } else {
         pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
     }
+    /* Add iteration samples to histograms. */
+    histogram_add_sample(&s->cycles, cycles);
+    histogram_add_sample(&s->pkts, rx_packets);
+
+    if (!full_metrics) {
+        return;
+    }
+
+    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
+
+    if (rx_packets > 0) {
+        cycles_per_pkt = cycles / rx_packets;
+        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
+    }
+    if (s->current.batches > 0) {
+        histogram_add_sample(&s->pkts_per_batch,
+                             rx_packets / s->current.batches);
+    }
+    histogram_add_sample(&s->upcalls, s->current.upcalls);
+    if (s->current.upcalls > 0) {
+        histogram_add_sample(&s->cycles_per_upcall,
+                             s->current.upcall_cycles / s->current.upcalls);
+    }
+    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
+
+    /* Add iteration samples to millisecond stats. */
+    cum_ms = history_current(&s->milliseconds);
+    cum_ms->iterations++;
+    cum_ms->cycles += cycles;
+    if (rx_packets > 0) {
+        cum_ms->busy_cycles += cycles;
+    }
+    cum_ms->pkts += s->current.pkts;
+    cum_ms->upcalls += s->current.upcalls;
+    cum_ms->upcall_cycles += s->current.upcall_cycles;
+    cum_ms->batches += s->current.batches;
+    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
+
+    /* Store in iteration history. This advances the iteration idx and
+     * clears the next slot in the iteration history. */
+    history_store(&s->iterations, &s->current);
+    if (now_tsc > s->next_check_tsc) {
+        /* Check if ms is completed and store in milliseconds history. */
+        uint64_t now = time_msec();
+        if (now != cum_ms->timestamp) {
+            /* Add ms stats to totals. */
+            s->totals.iterations += cum_ms->iterations;
+            s->totals.cycles += cum_ms->cycles;
+            s->totals.busy_cycles += cum_ms->busy_cycles;
+            s->totals.pkts += cum_ms->pkts;
+            s->totals.upcalls += cum_ms->upcalls;
+            s->totals.upcall_cycles += cum_ms->upcall_cycles;
+            s->totals.batches += cum_ms->batches;
+            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
+            cum_ms = history_next(&s->milliseconds);
+            cum_ms->timestamp = now;
+        }
+        s->next_check_tsc = cycles_counter_update(s) + 10000;
+    }
 }
 
+/* Formatting the output of commands. */
+
+struct pmd_perf_params {
+    int command_type;
+    bool histograms;
+    size_t iter_hist_len;
+    size_t ms_hist_len;
+};
+
+void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
+                                   double duration);
+void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
+void pmd_perf_format_iteration_history(struct ds *str,
+                                       struct pmd_perf_stats *s,
+                                       int n_iter);
+void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
+                                int n_ms);
+
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man
new file mode 100644
index 0000000..76c3e4e
--- /dev/null
+++ b/lib/dpif-netdev-unixctl.man
@@ -0,0 +1,157 @@ 
+.SS "DPIF-NETDEV COMMANDS"
+These commands are used to expose internal information (mostly statistics)
+about the "dpif-netdev" userspace datapath. If there is only one datapath
+(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
+argument can be omitted. By default the commands present data for all pmd
+threads in the datapath. By specifying the \fB-pmd\fR \fIcore\fR option one
+can filter the output for a single pmd in the datapath.
+.
+.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
+Shows performance statistics for one or all pmd threads of the datapath
+\fIdp\fR. The special thread "main" sums up the statistics of every non-pmd
+thread.
+
+The sum of "emc hits", "masked hits" and "miss" is the number of
+packet lookups performed by the datapath. Beware that a recirculated packet
+experiences one additional lookup per recirculation, so there may be
+more lookups than forwarded packets in the datapath.
+
+Cycles are counted using the TSC or similar facilities (when available on
+the platform). The duration of one cycle depends on the processing platform.
+
+"idle cycles" refers to cycles spent in PMD iterations not forwarding
+any packets. "processing cycles" refers to cycles spent in PMD iterations
+forwarding at least one packet, including the cost for polling, processing and
+transmitting said packets.
+
+To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
+.
+.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
+Resets to zero the per pmd thread performance numbers shown by the
+\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR commands.
+It will NOT reset datapath or bridge statistics, only the values shown by
+the above commands.
+.
+.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR \fIiter_len\fR] \
+[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
+Shows detailed performance metrics for one or all pmd threads of the
+userspace datapath.
+
+The collection of detailed statistics is controlled by the configuration
+parameter "other_config:pmd-perf-metrics". By default it is disabled.
+The run-time overhead, when enabled, is in the order of 1%.
+The covered metrics per iteration are:
+.RS
+.IP
+.PD .4v
+.IP \(em
+used cycles
+.IP \(em
+forwarded packets
+.IP \(em
+number of rx batches
+.IP \(em
+packets/rx batch
+.IP \(em
+max. vhostuser queue fill level
+.IP \(em
+number of upcalls
+.IP \(em
+cycles spent in upcalls
+.PD
+.RE
+.IP
+This raw recorded data is used threefold:
+
+.RS
+.IP
+.PD .4v
+.IP 1.
+In histograms for each of the following metrics:
+.RS
+.IP \(em
+cycles/iteration (logarithmic)
+.IP \(em
+packets/iteration (logarithmic)
+.IP \(em
+cycles/packet
+.IP \(em
+packets/batch
+.IP \(em
+max. vhostuser qlen (logarithmic)
+.IP \(em
+upcalls
+.IP \(em
+cycles/upcall (logarithmic)
+The histogram bins are divided linearly or logarithmically.
+.RE
+.IP 2.
+A cyclic history of the above metrics for 1024 iterations
+.IP 3.
+A cyclic history of the cumulative/average values per millisecond wall
+clock for the last 1024 milliseconds:
+.RS
+.IP \(em
+number of iterations
+.IP \(em
+avg. cycles/iteration
+.IP \(em
+packets (Kpps)
+.IP \(em
+avg. packets/batch
+.IP \(em
+avg. max vhost qlen
+.IP \(em
+upcalls
+.IP \(em
+avg. cycles/upcall
+.RE
+.PD
+.RE
+.IP
+.
+The command options are:
+.RS
+.IP "\fB-nh\fR"
+Suppress the histograms.
+.IP "\fB-it\fR \fIiter_len\fR"
+Display the stats of the last \fIiter_len\fR iterations.
+.IP "\fB-ms\fR \fIms_len\fR"
+Display the stats of the last \fIms_len\fR milliseconds.
+.RE
+.IP
+The output always contains the following global PMD statistics:
+.RS
+.IP
+Time: 15:24:55.270
+.br
+Measurement duration: 1.008 s
+
+pmd thread numa_id 0 core_id 1:
+
+  Cycles:            2419034712  (2.40 GHz)
+  Iterations:            572817  (1.76 us/it)
+  - idle:                486808  (15.9 % cycles)
+  - busy:                 86009  (84.1 % cycles)
+  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
+  Datapath passes:      3599415  (1.50 passes/pkt)
+  - EMC hits:            336472  ( 9.3 %)
+  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
+  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
+  - Lost upcalls:             0  ( 0.0 %)
+  Tx packets:           2399607  (2381 Kpps)
+  Tx batches:            171400  (14.00 pkts/batch)
+.RE
+.IP
+Here "Rx packets" actually reflects the number of packets forwarded by the
+datapath. "Datapath passes" matches the number of packet lookups as
+reported by the \fBdpif-netdev/pmd-stats-show\fR command.
+
+To reset the counters and start a new measurement use
+\fBdpif-netdev/pmd-stats-clear\fR.
+.
+.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
+For one or all pmd threads of the datapath \fIdp\fR show the list of queue-ids
+with port names, which this thread polls.
+.
+.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
+Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
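
As a usage illustration of the commands documented above (a hypothetical session; the core number and history lengths are arbitrary, and the single-datapath case lets \fIdp\fR be omitted):

```shell
# Enable detailed metrics collection (off by default, ~1% overhead).
ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true

# Show overall stats and histograms for the PMD on core 1, including
# the last 20 iteration samples and 20 millisecond samples.
ovs-appctl dpif-netdev/pmd-perf-show -it 20 -ms 20 -pmd 1

# Overall stats only, no histograms; then start a fresh measurement.
ovs-appctl dpif-netdev/pmd-perf-show -nh
ovs-appctl dpif-netdev/pmd-stats-clear
```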
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 86d8739..f245ce2 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -49,6 +49,7 @@ 
 #include "id-pool.h"
 #include "latch.h"
 #include "netdev.h"
+#include "netdev-provider.h"
 #include "netdev-vport.h"
 #include "netlink.h"
 #include "odp-execute.h"
@@ -281,6 +282,8 @@  struct dp_netdev {
 
     /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
+    /* Enable collection of PMD performance metrics. */
+    atomic_bool pmd_perf_metrics;
 
     /* Protects access to ofproto-dpif-upcall interface during revalidator
      * thread synchronization. */
@@ -356,6 +359,7 @@  struct dp_netdev_rxq {
                                           particular core. */
     unsigned intrvl_idx;               /* Write index for 'cycles_intrvl'. */
     struct dp_netdev_pmd_thread *pmd;  /* pmd thread that polls this queue. */
+    bool is_vhost;                     /* Is rxq of a vhost port. */
 
     /* Counters of cycles spent successfully polling and processing pkts. */
     atomic_ullong cycles[RXQ_N_CYCLES];
@@ -717,6 +721,8 @@  static inline bool emc_entry_alive(struct emc_entry *ce);
 static void emc_clear_entry(struct emc_entry *ce);
 
 static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
+static inline bool
+pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
 
 static void
 emc_cache_init(struct emc_cache *flow_cache)
@@ -800,7 +806,8 @@  get_dp_netdev(const struct dpif *dpif)
 enum pmd_info_type {
     PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
     PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
-    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
+    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
+    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
 };
 
 static void
@@ -891,6 +898,47 @@  pmd_info_show_stats(struct ds *reply,
                   stats[PMD_CYCLES_ITER_BUSY], total_packets);
 }
 
+static void
+pmd_info_show_perf(struct ds *reply,
+                   struct dp_netdev_pmd_thread *pmd,
+                   struct pmd_perf_params *par)
+{
+    if (pmd->core_id != NON_PMD_CORE_ID) {
+        char *time_str =
+                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
+        long long now = time_msec();
+        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
+
+        ds_put_cstr(reply, "\n");
+        ds_put_format(reply, "Time: %s\n", time_str);
+        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
+        ds_put_cstr(reply, "\n");
+        format_pmd_thread(reply, pmd);
+        ds_put_cstr(reply, "\n");
+        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
+        if (pmd_perf_metrics_enabled(pmd)) {
+            /* Prevent parallel clearing of perf metrics. */
+            ovs_mutex_lock(&pmd->perf_stats.clear_mutex);
+            if (par->histograms) {
+                ds_put_cstr(reply, "\n");
+                pmd_perf_format_histograms(reply, &pmd->perf_stats);
+            }
+            if (par->iter_hist_len > 0) {
+                ds_put_cstr(reply, "\n");
+                pmd_perf_format_iteration_history(reply, &pmd->perf_stats,
+                        par->iter_hist_len);
+            }
+            if (par->ms_hist_len > 0) {
+                ds_put_cstr(reply, "\n");
+                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
+                        par->ms_hist_len);
+            }
+            ovs_mutex_unlock(&pmd->perf_stats.clear_mutex);
+        }
+        free(time_str);
+    }
+}
+
 static int
 compare_poll_list(const void *a_, const void *b_)
 {
@@ -1068,7 +1116,7 @@  dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
     ovs_mutex_lock(&dp_netdev_mutex);
 
     while (argc > 1) {
-        if (!strcmp(argv[1], "-pmd") && argc >= 3) {
+        if (!strcmp(argv[1], "-pmd") && argc > 2) {
             if (str_to_uint(argv[2], 10, &core_id)) {
                 filter_on_pmd = true;
             }
@@ -1108,6 +1156,8 @@  dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
             pmd_perf_stats_clear(&pmd->perf_stats);
         } else if (type == PMD_INFO_SHOW_STATS) {
             pmd_info_show_stats(&reply, pmd);
+        } else if (type == PMD_INFO_PERF_SHOW) {
+            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
         }
     }
     free(pmd_list);
@@ -1117,6 +1167,48 @@  dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
     unixctl_command_reply(conn, ds_cstr(&reply));
     ds_destroy(&reply);
 }
+
+static void
+pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
+                          const char *argv[],
+                          void *aux OVS_UNUSED)
+{
+    struct pmd_perf_params par;
+    long int it_hist = 0, ms_hist = 0;
+    par.histograms = true;
+
+    while (argc > 1) {
+        if (!strcmp(argv[1], "-nh")) {
+            par.histograms = false;
+            argc -= 1;
+            argv += 1;
+        } else if (!strcmp(argv[1], "-it") && argc > 2) {
+            it_hist = strtol(argv[2], NULL, 10);
+            if (it_hist < 0) {
+                it_hist = 0;
+            } else if (it_hist > HISTORY_LEN) {
+                it_hist = HISTORY_LEN;
+            }
+            argc -= 2;
+            argv += 2;
+        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
+            ms_hist = strtol(argv[2], NULL, 10);
+            if (ms_hist < 0) {
+                ms_hist = 0;
+            } else if (ms_hist > HISTORY_LEN) {
+                ms_hist = HISTORY_LEN;
+            }
+            argc -= 2;
+            argv += 2;
+        } else {
+            break;
+        }
+    }
+    par.iter_hist_len = it_hist;
+    par.ms_hist_len = ms_hist;
+    par.command_type = PMD_INFO_PERF_SHOW;
+    dpif_netdev_pmd_info(conn, argc, argv, &par);
+}
 
 static int
 dpif_netdev_init(void)
@@ -1134,6 +1226,12 @@  dpif_netdev_init(void)
     unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd core] [dp]",
                              0, 3, dpif_netdev_pmd_info,
                              (void *)&poll_aux);
+    unixctl_command_register("dpif-netdev/pmd-perf-show",
+                             "[-nh] [-it iter-history-len]"
+                             " [-ms ms-history-len]"
+                             " [-pmd core] [dp]",
+                             0, 8, pmd_perf_show_cmd,
+                             NULL);
     unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
                              0, 1, dpif_netdev_pmd_rebalance,
                              NULL);
@@ -3020,6 +3118,18 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         }
     }
 
+    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
+    bool cur_perf_enabled;
+    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
+    if (perf_enabled != cur_perf_enabled) {
+        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
+        if (perf_enabled) {
+            VLOG_INFO("PMD performance metrics collection enabled");
+        } else {
+            VLOG_INFO("PMD performance metrics collection disabled");
+        }
+    }
+
     return 0;
 }
 
@@ -3189,6 +3299,21 @@  dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned idx)
     return processing_cycles;
 }
 
+static inline bool
+pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
+{
+    /* If stores and reads of 64-bit integers are not atomic, the
+     * full PMD performance metrics are not available as locked
+     * access to 64 bit integers would be prohibitively expensive. */
+#if ATOMIC_LLONG_LOCK_FREE
+    bool pmd_perf_enabled;
+    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
+    return pmd_perf_enabled;
+#else
+    return false;
+#endif
+}
+
 static int
 dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
                                    struct tx_port *p)
@@ -3264,10 +3389,12 @@  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                            struct dp_netdev_rxq *rxq,
                            odp_port_t port_no)
 {
+    struct pmd_perf_stats *s = &pmd->perf_stats;
     struct dp_packet_batch batch;
     struct cycle_timer timer;
     int error;
-    int batch_cnt = 0, output_cnt = 0;
+    int batch_cnt = 0;
+    int rem_qlen = 0, *qlen_p = NULL;
     uint64_t cycles;
 
     /* Measure duration for polling and processing rx burst. */
@@ -3276,20 +3403,37 @@  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
     pmd->ctx.last_rxq = rxq;
     dp_packet_batch_init(&batch);
 
-    error = netdev_rxq_recv(rxq->rx, &batch, NULL);
+    /* Fetch the rx queue length only for vhostuser ports. */
+    if (pmd_perf_metrics_enabled(pmd) && rxq->is_vhost) {
+        qlen_p = &rem_qlen;
+    }
+
+    error = netdev_rxq_recv(rxq->rx, &batch, qlen_p);
     if (!error) {
         /* At least one packet received. */
         *recirc_depth_get() = 0;
         pmd_thread_ctx_time_update(pmd);
-
         batch_cnt = batch.count;
+        if (pmd_perf_metrics_enabled(pmd)) {
+            /* Update batch histogram. */
+            s->current.batches++;
+            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
+            /* Update the maximum vhost rx queue fill level. */
+            if (rxq->is_vhost && rem_qlen >= 0) {
+                uint32_t qfill = batch_cnt + rem_qlen;
+                if (qfill > s->current.max_vhost_qfill) {
+                    s->current.max_vhost_qfill = qfill;
+                }
+            }
+        }
+        /* Process packet batch. */
         dp_netdev_input(pmd, &batch, port_no);
 
         /* Assign processing cycles to rx queue. */
         cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
         dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
 
-        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
+        dp_netdev_pmd_flush_output_packets(pmd, false);
     } else {
         /* Discard cycles. */
         cycle_timer_stop(&pmd->perf_stats, &timer);
@@ -3303,7 +3447,7 @@  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
 
     pmd->ctx.last_rxq = NULL;
 
-    return batch_cnt + output_cnt;
+    return batch_cnt;
 }
 
 static struct tx_port *
@@ -3359,6 +3503,7 @@  port_reconfigure(struct dp_netdev_port *port)
         }
 
         port->rxqs[i].port = port;
+        port->rxqs[i].is_vhost = !strncmp(port->type, "dpdkvhost", 9);
 
         err = netdev_rxq_open(netdev, &port->rxqs[i].rx, i);
         if (err) {
@@ -4137,23 +4282,26 @@  reload:
     pmd->intrvl_tsc_prev = 0;
     atomic_store_relaxed(&pmd->intrvl_cycles, 0);
     cycles_counter_update(s);
+    /* Protect pmd stats from external clearing while polling. */
+    ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
     for (;;) {
-        uint64_t iter_packets = 0;
+        uint64_t rx_packets = 0, tx_packets = 0;
 
         pmd_perf_start_iteration(s);
+
         for (i = 0; i < poll_cnt; i++) {
             process_packets =
                 dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
                                            poll_list[i].port_no);
-            iter_packets += process_packets;
+            rx_packets += process_packets;
         }
 
-        if (!iter_packets) {
+        if (!rx_packets) {
             /* We didn't receive anything in the process loop.
              * Check if we need to send something.
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
         }
 
         if (lc++ > 1024) {
@@ -4172,8 +4320,10 @@  reload:
                 break;
             }
         }
-        pmd_perf_end_iteration(s, iter_packets);
+        pmd_perf_end_iteration(s, rx_packets, tx_packets,
+                               pmd_perf_metrics_enabled(pmd));
     }
+    ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
 
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
     exiting = latch_is_set(&pmd->exit_latch);
@@ -5068,6 +5218,7 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
     struct match match;
     ovs_u128 ufid;
     int error;
+    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
 
     match.tun_md.valid = false;
     miniflow_expand(&key->mf, &match.flow);
@@ -5121,6 +5272,14 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
         ovs_mutex_unlock(&pmd->flow_mutex);
         emc_probabilistic_insert(pmd, key, netdev_flow);
     }
+    if (pmd_perf_metrics_enabled(pmd)) {
+        /* Update upcall stats. */
+        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
+        struct pmd_perf_stats *s = &pmd->perf_stats;
+        s->current.upcalls++;
+        s->current.upcall_cycles += cycles;
+        histogram_add_sample(&s->cycles_per_upcall, cycles);
+    }
     return error;
 }
 
diff --git a/manpages.mk b/manpages.mk
index d4bf0ec..aaf8bc2 100644
--- a/manpages.mk
+++ b/manpages.mk
@@ -250,6 +250,7 @@  vswitchd/ovs-vswitchd.8: \
 	lib/coverage-unixctl.man \
 	lib/daemon.man \
 	lib/dpctl.man \
+	lib/dpif-netdev-unixctl.man \
 	lib/memory-unixctl.man \
 	lib/netdev-dpdk-unixctl.man \
 	lib/service.man \
@@ -266,6 +267,7 @@  lib/common.man:
 lib/coverage-unixctl.man:
 lib/daemon.man:
 lib/dpctl.man:
+lib/dpif-netdev-unixctl.man:
 lib/memory-unixctl.man:
 lib/netdev-dpdk-unixctl.man:
 lib/service.man:
diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
index 80e5f53..8b4034d 100644
--- a/vswitchd/ovs-vswitchd.8.in
+++ b/vswitchd/ovs-vswitchd.8.in
@@ -256,32 +256,7 @@  type).
 ..
 .so lib/dpctl.man
 .
-.SS "DPIF-NETDEV COMMANDS"
-These commands are used to expose internal information (mostly statistics)
-about the ``dpif-netdev'' userspace datapath. If there is only one datapath
-(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
-argument can be omitted.
-.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
-Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
-The special thread ``main'' sums up the statistics of every non pmd thread.
-The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
-packets received by the datapath.  Cycles are counted using the TSC or similar
-facilities (when available on the platform).  To reset these counters use
-\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
-measuring infrastructure. ``idle cycles'' refers to cycles spent polling
-devices but not receiving any packets. ``processing cycles'' refers to cycles
-spent polling devices and successfully receiving packets, plus the cycles
-spent processing said packets.
-.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
-Resets to zero the per pmd thread performance numbers shown by the
-\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
-bridge statistics, only the values shown by the above command.
-.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
-For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
-port names, which this thread polls.
-.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
-Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
-.
+.so lib/dpif-netdev-unixctl.man
 .so lib/netdev-dpdk-unixctl.man
 .so ofproto/ofproto-dpif-unixctl.man
 .so ofproto/ofproto-unixctl.man
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index f899a19..aac663f 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -375,6 +375,18 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="pmd-perf-metrics"
+              type='{"type": "boolean"}'>
+        <p>
+          Enables recording of detailed PMD performance metrics for analysis
+          and trouble-shooting. This can have a performance impact in the
+          order of 1%.
+        </p>
+        <p>
+          Defaults to false but can be changed at any time.
+        </p>
+      </column>
+
       <column name="other_config" key="n-handler-threads"
               type='{"type": "integer", "minInteger": 1}'>
         <p>