
[ovs-dev,v3,2/2] dpif-netdev: Add load based PMD sleeping.

Message ID 20230106145937.693463-3-ktraynor@redhat.com
State Superseded
Series Add pmd sleeping.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Kevin Traynor Jan. 6, 2023, 2:59 p.m. UTC
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased on each iteration where the low-load
conditions remain, up to the max sleep time set by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged.
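
In rough pseudo-C, the per-iteration logic is (a simplified sketch for
orientation only; the exact code is in the pmd_thread_main() hunk below,
and NETDEV_MAX_BURST is 32 in OVS, so the threshold is 16 pkts):

    /* Per Rx queue, while polling: */
    if (process_packets >= NETDEV_MAX_BURST / 2) {
        sleep_time = 0;                 /* Load detected, stop sleeping. */
    }
    /* After the poll loop: */
    if (max_sleep && sleep_time) {
        xnanosleep_no_quiesce(sleep_time * 1000);           /* us -> ns. */
    }
    sleep_time = max_sleep ? MIN(sleep_time + 10, max_sleep) : 0;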

Also add new stats to pmd-perf-show to get visibility of the operation,
e.g.:
...
   - sleep iterations:       153994  ( 76.8 % of iterations)
   Sleep time:               9159399  us ( 46 us/iteration avg.)
...
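
(As a sanity check on the example numbers: 153994 sleeping iterations at
76.8 % implies roughly 153994 / 0.768 ≈ 200500 iterations in total, and
9159399 us / 200500 ≈ 46 us, i.e. the average is taken over all
iterations, not only the iterations that slept.)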

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
---
 Documentation/topics/dpdk/pmd.rst | 51 ++++++++++++++++++++++++
 NEWS                              |  3 ++
 lib/dpif-netdev-perf.c            | 24 +++++++++---
 lib/dpif-netdev-perf.h            |  5 ++-
 lib/dpif-netdev.c                 | 64 +++++++++++++++++++++++++++++--
 tests/pmd.at                      | 46 ++++++++++++++++++++++
 vswitchd/vswitch.xml              | 26 +++++++++++++
 7 files changed, 209 insertions(+), 10 deletions(-)

Comments

David Marchand Jan. 9, 2023, 3:23 p.m. UTC | #1
On Fri, Jan 6, 2023 at 4:00 PM Kevin Traynor <ktraynor@redhat.com> wrote:
>
> Sleep for an incremental amount of time if none of the Rx queues
> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
> on a polling iteration of the PMD.
>
> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
> sleep time to zero (i.e. no sleep).
>
> Sleep time will be increased on each iteration where the low-load
> conditions remain, up to the max sleep time set by the user, e.g.:
> ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
>
> The default pmd-maxsleep value is 0, which means that no sleeps
> will occur and the default behaviour is unchanged.
>
> Also add new stats to pmd-perf-show to get visibility of the operation,
> e.g.:
> ...
>    - sleep iterations:       153994  ( 76.8 % of iterations)
>    Sleep time:               9159399  us ( 46 us/iteration avg.)
> ...
>
> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> ---
>  Documentation/topics/dpdk/pmd.rst | 51 ++++++++++++++++++++++++
>  NEWS                              |  3 ++
>  lib/dpif-netdev-perf.c            | 24 +++++++++---
>  lib/dpif-netdev-perf.h            |  5 ++-
>  lib/dpif-netdev.c                 | 64 +++++++++++++++++++++++++++++--
>  tests/pmd.at                      | 46 ++++++++++++++++++++++
>  vswitchd/vswitch.xml              | 26 +++++++++++++
>  7 files changed, 209 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 9006fd40f..89f6b3052 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>  (in min) such that a reassignment is triggered at most every few hours.
>
> +PMD Power Saving (Experimental)
> +-------------------------------

I would stick to: "PMD load based sleeping"
The power saving comes from some external configuration that this patch
does not cover.

Maybe you could mention something about c-states, but it seems out of
scope for OVS itself.


> +
> +PMD threads constantly poll Rx queues which are assigned to them. In order to
> +reduce the CPU cycles they use, they can sleep for small periods of time
> +when there is no load or very-low load on all the Rx queues they poll.
> +
> +This can be enabled by setting the max requested sleep time (in microseconds)
> +for a PMD thread::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
> +
> +Non-zero values will be rounded up to the nearest 10 microseconds to avoid
> +requesting very small sleep times.
> +
> +With a non-zero max value, a PMD may request to sleep by an incrementing
> +amount of time, up to the maximum time. If at any point at least half a
> +batch of packets (i.e. 16) is received from an Rx queue that the PMD is
> +polling, the requested sleep time will be reset to 0. At that point no
> +sleeps will occur until the no/low load conditions return.
> +
> +Sleeping in a PMD thread will mean there is a period of time when the PMD
> +thread will not process packets. Sleep times requested are not guaranteed
> +and can differ significantly depending on system configuration. The actual
> +time not processing packets will be determined by the sleep and processor
> +wake-up times and should be tested with each system configuration.
> +
> +Sleep time statistics for 10 secs can be seen with::
> +
> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
> +
> +Example output, showing that during the last 10 seconds, 76.8% of iterations
> +had a sleep of some length. The total amount of sleep time was 9.15 seconds and
> +the average sleep time per iteration was 46 microseconds::
> +
> +   - sleep iterations:       153994  ( 76.8 % of iterations)
> +   Sleep time:               9159399  us ( 46 us/iteration avg.)
> +
> +.. note::
> +
> +    If there is a sudden spike of packets while the PMD thread is sleeping and
> +    the processor is in a low-power state it may result in some lost packets or
> +    extra latency before the PMD thread returns to processing packets at full
> +    rate.
> +
> +.. note::
> +
> +    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
> +    will add overhead to requested sleep time.
> +
>  .. _ovs-vswitchd(8):
>      http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
> diff --git a/NEWS b/NEWS
> index 2f6ededfe..54d97825e 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -31,4 +31,7 @@ Post-v3.0.0
>       * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
>         the pmd usage of an Rx queue over a configurable time period.
> +     * Add new experiemental PMD load based sleeping feature. PMD threads can

*experimental


> +       request to sleep up to a user configured 'pmd-maxsleep' value under no
> +       and low load conditions.
>
>
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index a2a7d8f0b..bc6b779a7 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>      uint64_t idle_iter = s->pkts.bin[0];
>      uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t sleep_iter = stats[PMD_PWR_SLEEP_ITER];
> +    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];

I would remove _PWR_.


>
>      ds_put_format(str,
> @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>              "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>              "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
> -            tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> +            "  - sleep iterations: %12"PRIu64"  (%5.1f %% of iterations)\n"
> +            " Sleep time:          %12.0f  us (%3.0f us/iteration avg.)\n",

This gives:

pmd thread numa_id 1 core_id 5:

  Iterations:               884937  (361.43 us/it)
  - Used TSC cycles:   24829488529  (  1.2 % of total cycles)
  - idle iterations:        563472  (  1.4 % of used cycles)
  - busy iterations:        321465  ( 98.6 % of used cycles)
  - sleep iterations:       569487  ( 64.4 % of iterations)
 Sleep time:             310297274  us (351 us/iteration avg.)
 ^^^
I would add another space before Sleep so it aligns with the rest.

And maybe put the unit as a comment, since no other stat detailed its
unit so far.
+            "  Sleep time (us):    %12.0f  (%3.0f us/iteration avg.)\n",

  Rx packets:             10000000  (13 Kpps, 2447 cycles/pkt)
  Datapath passes:        10000000  (1.00 passes/pkt)
  - PHWOL hits:                  0  (  0.0 %)
  - MFEX Opt hits:               0  (  0.0 %)
"""


> +            tot_iter,
> +            (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
>              tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
>              idle_iter,
>              100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>              busy_iter,
> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +            sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
> +            tot_sleep_cycles * us_per_cycle,
> +            tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
>      if (rx_packets > 0) {
>          ds_put_format(str,
> @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, uint64_t sleep_cycles,
> +                       bool full_metrics)
>  {
>      uint64_t now_tsc = cycles_counter_update(s);
> @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      char *reason = NULL;
>
> -    cycles = now_tsc - s->start_tsc;
> +    cycles = now_tsc - s->start_tsc - sleep_cycles;
>      s->current.timestamp = s->iteration_cnt;
>      s->current.cycles = cycles;
> @@ -540,4 +549,9 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      histogram_add_sample(&s->pkts, rx_packets);
>
> +    if (sleep_cycles) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_ITER, 1);
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
> +    }
> +
>      if (!full_metrics) {
>          return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 9673dddd8..ebf776827 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,6 @@ enum pmd_stat_type {
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>      PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_SLEEP_ITER,     /* Iterations where a sleep has taken place. */
> +    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
>      PMD_N_STATS
>  };
> @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, uint64_t sleep_cycles,
> +                       bool full_metrics);
>
>  /* Formatting the output of commands. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7127068fe..af97f9a83 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -172,4 +172,9 @@ static struct odp_support dp_netdev_support = {
>  #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>
> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
> +/* Time in uS to increment a pmd thread sleep time. */
> +#define PMD_PWR_INC_US 10

Idem, no _PWR_.


> +
>  struct dpcls {
>      struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */

The rest lgtm.
With this fixed, you can add:
Reviewed-by: David Marchand <david.marchand@redhat.com>
Robin Jarry Jan. 9, 2023, 4 p.m. UTC | #2
Kevin Traynor, Jan 06, 2023 at 15:59:
> Sleep for an incremental amount of time if none of the Rx queues
> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
> on a polling iteration of the PMD.
>
> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
> sleep time to zero (i.e. no sleep).
>
> Sleep time will be increased on each iteration where the low-load
> conditions remain, up to the max sleep time set by the user, e.g.:
> ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
>
> The default pmd-maxsleep value is 0, which means that no sleeps
> will occur and the default behaviour is unchanged.
>
> Also add new stats to pmd-perf-show to get visibility of the operation,
> e.g.:
> ...
>    - sleep iterations:       153994  ( 76.8 % of iterations)
>    Sleep time:               9159399  us ( 46 us/iteration avg.)
> ...
>
> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>

Hi Kevin,

For the record, here are a few numbers that were gathered on an HP DL360
Gen9 server (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz) with and without
this patch series applied.

Single socket, Physical to physical test, 2 cores in pmd-cpu-mask, power
measurement with pcm-power:

+------------+------------+------------+--------------+-----------------+
|            | Reference: | Powersave: | pmd-maxsleep | Power off       |
|            | disabled   |            | 500us        | unused cores    |
|            | c-states   | C6 enabled | C6 enabled   | (X remaining)   |
+------------+------------+------------+--------------+-----------------+
| No OvS     | 33 W       | 11.30W     | N/A          | 2 cores online  |
|            |            |            |              | All OFF: 11.30W |
+------------+------------+------------+--------------+-----------------+
| No traffic | 37W        | 26.5W      | 12W          | 12W             |
| 0 PPS      |            |            |              |                 |
+------------+------------+------------+--------------+-----------------+
| Idle       | 37W        | 26.5W      | 12W          | 12W             |
| 1k pps     |            |            |              |                 |
+------------+------------+------------+--------------+-----------------+
| Medium     | 37W        | 27W        | 15-20W       | 15-20W          |
| 1 Mpps     |            |            |              |                 |
+------------+------------+------------+--------------+-----------------+
| High       | 38W        | 28W        | 28W          | 28W             |
| 14 Mpps    |            |            |              |                 |
+------------+------------+------------+--------------+-----------------+

> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 9006fd40f..89f6b3052 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>  (in min) such that a reassignment is triggered at most every few hours.
>  
> +PMD Power Saving (Experimental)
> +-------------------------------
> +
> +PMD threads constantly poll Rx queues which are assigned to them. In order to
> +reduce the CPU cycles they use, they can sleep for small periods of time
> +when there is no load or very-low load on all the Rx queues they poll.
> +
> +This can be enabled by setting the max requested sleep time (in microseconds)
> +for a PMD thread::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
> +
> +Non-zero values will be rounded up to the nearest 10 microseconds to avoid
> +requesting very small sleep times.
> +
> +With a non-zero max value, a PMD may request to sleep by an incrementing
> +amount of time, up to the maximum time. If at any point at least half a
> +batch of packets (i.e. 16) is received from an Rx queue that the PMD is
> +polling, the requested sleep time will be reset to 0. At that point no
> +sleeps will occur until the no/low load conditions return.
> +
> +Sleeping in a PMD thread will mean there is a period of time when the PMD
> +thread will not process packets. Sleep times requested are not guaranteed
> +and can differ significantly depending on system configuration. The actual
> +time not processing packets will be determined by the sleep and processor
> +wake-up times and should be tested with each system configuration.
> +
> +Sleep time statistics for 10 secs can be seen with::
> +
> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
> +
> +Example output, showing that during the last 10 seconds, 76.8% of iterations
> +had a sleep of some length. The total amount of sleep time was 9.15 seconds and
> +the average sleep time per iteration was 46 microseconds::
> +
> +   - sleep iterations:       153994  ( 76.8 % of iterations)
> +   Sleep time:               9159399  us ( 46 us/iteration avg.)
> +
> +.. note::
> +
> +    If there is a sudden spike of packets while the PMD thread is sleeping and
> +    the processor is in a low-power state it may result in some lost packets or
> +    extra latency before the PMD thread returns to processing packets at full
> +    rate.
> +
> +.. note::
> +
> +    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
> +    will add overhead to requested sleep time.

I wonder if it would make sense to round up to the nearest hrtimer
resolution (if such info can be retrieved at runtime).

Cheers,

Reviewed-by: Robin Jarry <rjarry@redhat.com>
Kevin Traynor Jan. 10, 2023, 1:45 p.m. UTC | #3
On 09/01/2023 16:00, Robin Jarry wrote:
> Kevin Traynor, Jan 06, 2023 at 15:59:
>> Sleep for an incremental amount of time if none of the Rx queues
>> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
>> on a polling iteration of the PMD.
>>
>> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
>> sleep time to zero (i.e. no sleep).
>>
>> Sleep time will be increased on each iteration where the low-load
>> conditions remain, up to the max sleep time set by the user, e.g.:
>> ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
>>
>> The default pmd-maxsleep value is 0, which means that no sleeps
>> will occur and the default behaviour is unchanged.
>>
>> Also add new stats to pmd-perf-show to get visibility of the operation,
>> e.g.:
>> ...
>>     - sleep iterations:       153994  ( 76.8 % of iterations)
>>     Sleep time:               9159399  us ( 46 us/iteration avg.)
>> ...
>>
>> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> 
> Hi Kevin,
> 

Hi Robin,

> For the record, here are a few numbers that were gathered on an HP DL360
> Gen9 server (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz) with and without
> this patch series applied.
> 
> Single socket, Physical to physical test, 2 cores in pmd-cpu-mask, power
> measurement with pcm-power:
> 
> +------------+------------+------------+--------------+-----------------+
> |            | Reference: | Powersave: | pmd-maxsleep | Power off       |
> |            | disabled   |            | 500us        | unused cores    |
> |            | c-states   | C6 enabled | C6 enabled   | (X remaining)   |
> +------------+------------+------------+--------------+-----------------+
> | No OvS     | 33 W       | 11.30W     | N/A          | 2 cores online  |
> |            |            |            |              | All OFF: 11.30W |
> +------------+------------+------------+--------------+-----------------+
> | No traffic | 37W        | 26.5W      | 12W          | 12W             |
> | 0 PPS      |            |            |              |                 |
> +------------+------------+------------+--------------+-----------------+
> | Idle       | 37W        | 26.5W      | 12W          | 12W             |
> | 1k pps     |            |            |              |                 |
> +------------+------------+------------+--------------+-----------------+
> | Medium     | 37W        | 27W        | 15-20W       | 15-20W          |
> | 1 Mpps     |            |            |              |                 |
> +------------+------------+------------+--------------+-----------------+
> | High       | 38W        | 28W        | 28W          | 28W             |
> | 14 Mpps    |            |            |              |                 |
> +------------+------------+------------+--------------+-----------------+
> 

Interesting, thanks for trying it out. This is a good test showing that 
system configuration changes are also needed to save power.

One thing to note is that, in your test, the rest of the cores and most 
things in the package are probably not doing much else now that the pmd 
threads are sleeping.

Sleeping these 2 cores alone when there are workloads on other cores may 
result in much less power saving for the package overall. So YMMV 
depending on the system config and workloads.

>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index 9006fd40f..89f6b3052 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>>   (in min) such that a reassignment is triggered at most every few hours.
>>   
>> +PMD Power Saving (Experimental)
>> +-------------------------------
>> +
>> +PMD threads constantly poll Rx queues which are assigned to them. In order to
>> +reduce the CPU cycles they use, they can sleep for small periods of time
>> +when there is no load or very-low load on all the Rx queues they poll.
>> +
>> +This can be enabled by setting the max requested sleep time (in microseconds)
>> +for a PMD thread::
>> +
>> +    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
>> +
>> +Non-zero values will be rounded up to the nearest 10 microseconds to avoid
>> +requesting very small sleep times.
>> +
>> +With a non-zero max value, a PMD may request to sleep by an incrementing
>> +amount of time, up to the maximum time. If at any point at least half a
>> +batch of packets (i.e. 16) is received from an Rx queue that the PMD is
>> +polling, the requested sleep time will be reset to 0. At that point no
>> +sleeps will occur until the no/low load conditions return.
>> +
>> +Sleeping in a PMD thread will mean there is a period of time when the PMD
>> +thread will not process packets. Sleep times requested are not guaranteed
>> +and can differ significantly depending on system configuration. The actual
>> +time not processing packets will be determined by the sleep and processor
>> +wake-up times and should be tested with each system configuration.
>> +
>> +Sleep time statistics for 10 secs can be seen with::
>> +
>> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
>> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
>> +
>> +Example output, showing that during the last 10 seconds, 76.8% of iterations
>> +had a sleep of some length. The total amount of sleep time was 9.15 seconds and
>> +the average sleep time per iteration was 46 microseconds::
>> +
>> +   - sleep iterations:       153994  ( 76.8 % of iterations)
>> +   Sleep time:               9159399  us ( 46 us/iteration avg.)
>> +
>> +.. note::
>> +
>> +    If there is a sudden spike of packets while the PMD thread is sleeping and
>> +    the processor is in a low-power state it may result in some lost packets or
>> +    extra latency before the PMD thread returns to processing packets at full
>> +    rate.
>> +
>> +.. note::
>> +
>> +    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
>> +    will add overhead to requested sleep time.
> 
> I wonder if it would make sense to round up to the nearest hrtimer
> resolution (if such info can be retrieved at runtime).
> 

Hmm, I think I used the wrong word describing it as 'resolution'. iiuc, the 
kernel groups timer expirations so the timer expires later than 
expected. In this case, it manifests more like a fixed overhead and 
changing the resolution in OVS will not reduce the overhead.

David showed me that the slack timer could be changed to reduce 
overhead, but it's not something I would be comfortable to do at the 
moment as it could have some unintended consequences.
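
For reference, the slack being discussed is the kernel's per-thread timer
slack, which defaults to 50 us and can be read or changed per thread via
prctl(2) or /proc/<pid>/timerslack_ns. A minimal illustration (not
something this series does):

    #include <sys/prctl.h>

    /* Reduce this thread's timer slack from the default 50000 ns (50 us)
     * to 1 us. A slack of 0 restores the default. */
    prctl(PR_SET_TIMERSLACK, 1000UL, 0, 0, 0);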

I have changed the text to:
"By default Linux kernel groups timer expirations and this can add an
  overhead of up to 50 microseconds to a requested timer expiration."

Hope it's a bit clearer. Thanks for reviewing and your tests.

> Cheers,
> 
> Reviewed-by: Robin Jarry <rjarry@redhat.com>
>
Kevin Traynor Jan. 10, 2023, 1:46 p.m. UTC | #4
On 09/01/2023 15:23, David Marchand wrote:
> On Fri, Jan 6, 2023 at 4:00 PM Kevin Traynor <ktraynor@redhat.com> wrote:
>>
>> Sleep for an incremental amount of time if none of the Rx queues
>> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
>> on a polling iteration of the PMD.
>>
>> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
>> sleep time to zero (i.e. no sleep).
>>
>> Sleep time will be increased on each iteration where the low-load
>> conditions remain, up to the max sleep time set by the user, e.g.:
>> ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
>>
>> The default pmd-maxsleep value is 0, which means that no sleeps
>> will occur and the default behaviour is unchanged.
>>
>> Also add new stats to pmd-perf-show to get visibility of the operation,
>> e.g.:
>> ...
>>     - sleep iterations:       153994  ( 76.8 % of iterations)
>>     Sleep time:               9159399  us ( 46 us/iteration avg.)
>> ...
>>
>> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
>> ---
>>   Documentation/topics/dpdk/pmd.rst | 51 ++++++++++++++++++++++++
>>   NEWS                              |  3 ++
>>   lib/dpif-netdev-perf.c            | 24 +++++++++---
>>   lib/dpif-netdev-perf.h            |  5 ++-
>>   lib/dpif-netdev.c                 | 64 +++++++++++++++++++++++++++++--
>>   tests/pmd.at                      | 46 ++++++++++++++++++++++
>>   vswitchd/vswitch.xml              | 26 +++++++++++++
>>   7 files changed, 209 insertions(+), 10 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index 9006fd40f..89f6b3052 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>>   (in min) such that a reassignment is triggered at most every few hours.
>>
>> +PMD Power Saving (Experimental)
>> +-------------------------------
> 
> I would stick to: "PMD load based sleeping"
> The powersaving comes from some external configuration that this patch
> does not cover.
> 

Yes, you are right, I should have updated that title in v3.

> Maybe you could mention something about c-states, but it seems out of
> OVS scope itself.
> 

I gave it a quick mention so the reader is aware that there are external 
system configuration dependencies if they want to achieve some power 
saving, but agree that how to enable C-states etc. is out of scope for OVS.

"Any potential power saving from PMD load based sleeping is dependent on 
the system configuration (e.g. enabling processor C-states) and workloads."

> 
>> +
>> +PMD threads constantly poll Rx queues which are assigned to them. In order to
>> +reduce the CPU cycles they use, they can sleep for small periods of time
>> +when there is no load or very-low load on all the Rx queues they poll.
>> +
>> +This can be enabled by setting the max requested sleep time (in microseconds)
>> +for a PMD thread::
>> +
>> +    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
>> +
>> +Non-zero values will be rounded up to the nearest 10 microseconds to avoid
>> +requesting very small sleep times.
>> +
>> +With a non-zero max value, a PMD may request to sleep by an incrementing
>> +amount of time, up to the maximum time. If at any point at least half a
>> +batch of packets (i.e. 16) is received from an Rx queue that the PMD is
>> +polling, the requested sleep time will be reset to 0. At that point no
>> +sleeps will occur until the no/low load conditions return.
>> +
>> +Sleeping in a PMD thread will mean there is a period of time when the PMD
>> +thread will not process packets. Sleep times requested are not guaranteed
>> +and can differ significantly depending on system configuration. The actual
>> +time not processing packets will be determined by the sleep and processor
>> +wake-up times and should be tested with each system configuration.
>> +
>> +Sleep time statistics for 10 secs can be seen with::
>> +
>> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
>> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
>> +
>> +Example output, showing that during the last 10 seconds, 76.8% of iterations
>> +had a sleep of some length. The total amount of sleep time was 9.15 seconds and
>> +the average sleep time per iteration was 46 microseconds::
>> +
>> +   - sleep iterations:       153994  ( 76.8 % of iterations)
>> +   Sleep time:               9159399  us ( 46 us/iteration avg.)
>> +
>> +.. note::
>> +
>> +    If there is a sudden spike of packets while the PMD thread is sleeping and
>> +    the processor is in a low-power state it may result in some lost packets or
>> +    extra latency before the PMD thread returns to processing packets at full
>> +    rate.
>> +
>> +.. note::
>> +
>> +    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
>> +    will add overhead to requested sleep time.
>> +
>>   .. _ovs-vswitchd(8):
>>       http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
>> diff --git a/NEWS b/NEWS
>> index 2f6ededfe..54d97825e 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -31,4 +31,7 @@ Post-v3.0.0
>>        * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
>>          the pmd usage of an Rx queue over a configurable time period.
>> +     * Add new experiemental PMD load based sleeping feature. PMD threads can
> 
> *experimental
> 

Fixed

> 
>> +       request to sleep up to a user configured 'pmd-maxsleep' value under no
>> +       and low load conditions.
>>
>>
>> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
>> index a2a7d8f0b..bc6b779a7 100644
>> --- a/lib/dpif-netdev-perf.c
>> +++ b/lib/dpif-netdev-perf.c
>> @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>       uint64_t idle_iter = s->pkts.bin[0];
>>       uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
>> +    uint64_t sleep_iter = stats[PMD_PWR_SLEEP_ITER];
>> +    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];
> 
> I would remove _PWR_.
> 

Removed it. I also changed to 'PMD_CYCLES_SLEEP' as the other PMD cycles 
counters use the 'PMD_CYCLES_' format

> 
>>
>>       ds_put_format(str,
>> @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>               "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>>               "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
>> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
>> -            tot_iter, tot_cycles * us_per_cycle / tot_iter,
>> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
>> +            "  - sleep iterations: %12"PRIu64"  (%5.1f %% of iterations)\n"
>> +            " Sleep time:          %12.0f  us (%3.0f us/iteration avg.)\n",
> 
> This gives:
> 
> pmd thread numa_id 1 core_id 5:
> 
>    Iterations:               884937  (361.43 us/it)
>    - Used TSC cycles:   24829488529  (  1.2 % of total cycles)
>    - idle iterations:        563472  (  1.4 % of used cycles)
>    - busy iterations:        321465  ( 98.6 % of used cycles)
>    - sleep iterations:       569487  ( 64.4 % of iterations)
>   Sleep time:             310297274  us (351 us/iteration avg.)
>   ^^^
> I would add another space before Sleep so it aligns with the rest.
> 
> And maybe put the unit as a comment, since no other stat detailed its
> unit so far.
> +            "  Sleep time (us):    %12.0f  (%3.0f us/iteration avg.)\n",
> 

Reformatted to match this suggestion.

>    Rx packets:             10000000  (13 Kpps, 2447 cycles/pkt)
>    Datapath passes:        10000000  (1.00 passes/pkt)
>    - PHWOL hits:                  0  (  0.0 %)
>    - MFEX Opt hits:               0  (  0.0 %)
> """
> 
> 
>> +            tot_iter,
>> +            (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
>>               tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
>>               idle_iter,
>>               100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>>               busy_iter,
>> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
>> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
>> +            sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
>> +            tot_sleep_cycles * us_per_cycle,
>> +            tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
>>       if (rx_packets > 0) {
>>           ds_put_format(str,
>> @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
>>   void
>>   pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>> -                       int tx_packets, bool full_metrics)
>> +                       int tx_packets, uint64_t sleep_cycles,
>> +                       bool full_metrics)
>>   {
>>       uint64_t now_tsc = cycles_counter_update(s);
>> @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>>       char *reason = NULL;
>>
>> -    cycles = now_tsc - s->start_tsc;
>> +    cycles = now_tsc - s->start_tsc - sleep_cycles;
>>       s->current.timestamp = s->iteration_cnt;
>>       s->current.cycles = cycles;
>> @@ -540,4 +549,9 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>>       histogram_add_sample(&s->pkts, rx_packets);
>>
>> +    if (sleep_cycles) {
>> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_ITER, 1);
>> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
>> +    }
>> +
>>       if (!full_metrics) {
>>           return;
>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>> index 9673dddd8..ebf776827 100644
>> --- a/lib/dpif-netdev-perf.h
>> +++ b/lib/dpif-netdev-perf.h
>> @@ -81,4 +81,6 @@ enum pmd_stat_type {
>>       PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>>       PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>> +    PMD_PWR_SLEEP_ITER,     /* Iterations where a sleep has taken place. */
>> +    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
>>       PMD_N_STATS
>>   };
>> @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>>   void
>>   pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>> -                       int tx_packets, bool full_metrics);
>> +                       int tx_packets, uint64_t sleep_cycles,
>> +                       bool full_metrics);
>>
>>   /* Formatting the output of commands. */
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 7127068fe..af97f9a83 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -172,4 +172,9 @@ static struct odp_support dp_netdev_support = {
>>   #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>>
>> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
>> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
>> +/* Time in uS to increment a pmd thread sleep time. */
>> +#define PMD_PWR_INC_US 10
> 
> Idem, no _PWR_.

I removed the _PWR_, but PMD_INC_US looked a bit ambiguous, so I changed 
them to both have a 'PMD_SLEEP_' prefix, and then '_THRESH' and '_INC_US'.
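
For v4 (not shown in this thread) that would look roughly like:

    /* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
    #define PMD_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
    /* Time in us to increment a pmd thread sleep time. */
    #define PMD_SLEEP_INC_US 10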

> 
>> +
>>   struct dpcls {
>>       struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
> 
> The rest lgtm.
> With this fixed, you can add:
> Reviewed-by: David Marchand <david.marchand@redhat.com>
> 
> 

Thanks again for reviewing and trying out. The changes in v4 are 
probably small enough to keep your RvB, but maybe it's better to ask you 
to resend to confirm. The only functional change is updating the pmd ctx 
timer after sleep.
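
For anyone who wants to try the series, a quick test sequence based on the
commands from the docs above might be (illustrative only):

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
    $ ovs-appctl dpif-netdev/pmd-stats-clear \
        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show | grep -i sleep
    $ ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=0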

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index 9006fd40f..89f6b3052 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -325,4 +325,55 @@  reassignment due to PMD Auto Load Balance. For example, this could be set
 (in min) such that a reassignment is triggered at most every few hours.
 
+PMD Power Saving (Experimental)
+-------------------------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In order to
+reduce the CPU cycles they use, they can sleep for small periods of time
+when there is no load or very-low load on all the Rx queues they poll.
+
+This can be enabled by setting the max requested sleep time (in microseconds)
+for a PMD thread::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
+
+Non-zero values will be rounded up to the nearest 10 microseconds to avoid
+requesting very small sleep times.
+
+With a non-zero max value, a PMD may request to sleep by an incrementing
+amount of time, up to the maximum time. If at any point at least half a
+batch of packets (i.e. 16) is received from an Rx queue that the PMD is
+polling, the requested sleep time will be reset to 0. At that point no
+sleeps will occur until the no/low load conditions return.
+
+Sleeping in a PMD thread will mean there is a period of time when the PMD
+thread will not process packets. Sleep times requested are not guaranteed
+and can differ significantly depending on system configuration. The actual
+time not processing packets will be determined by the sleep and processor
+wake-up times and should be tested with each system configuration.
+
+Sleep time statistics for 10 secs can be seen with::
+
+    $ ovs-appctl dpif-netdev/pmd-stats-clear \
+        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
+
+Example output, showing that during the last 10 seconds, 76.8% of iterations
+had a sleep of some length. The total amount of sleep time was 9.15 seconds and
+the average sleep time per iteration was 46 microseconds::
+
+   - sleep iterations:       153994  ( 76.8 % of iterations)
+   Sleep time:               9159399  us ( 46 us/iteration avg.)
+
+.. note::
+
+    If there is a sudden spike of packets while the PMD thread is sleeping and
+    the processor is in a low-power state it may result in some lost packets or
+    extra latency before the PMD thread returns to processing packets at full
+    rate.
+
+.. note::
+
+    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
+    will add overhead to requested sleep time.
+
 .. _ovs-vswitchd(8):
     http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
diff --git a/NEWS b/NEWS
index 2f6ededfe..54d97825e 100644
--- a/NEWS
+++ b/NEWS
@@ -31,4 +31,7 @@  Post-v3.0.0
      * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
        the pmd usage of an Rx queue over a configurable time period.
+     * Add new experiemental PMD load based sleeping feature. PMD threads can
+       request to sleep up to a user configured 'pmd-maxsleep' value under no
+       and low load conditions.
 
 
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index a2a7d8f0b..bc6b779a7 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -231,4 +231,6 @@  pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
     uint64_t idle_iter = s->pkts.bin[0];
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+    uint64_t sleep_iter = stats[PMD_PWR_SLEEP_ITER];
+    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];
 
     ds_put_format(str,
@@ -236,11 +238,17 @@  pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
             "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
-            tot_iter, tot_cycles * us_per_cycle / tot_iter,
+            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - sleep iterations: %12"PRIu64"  (%5.1f %% of iterations)\n"
+            " Sleep time:          %12.0f  us (%3.0f us/iteration avg.)\n",
+            tot_iter,
+            (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
             tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
             idle_iter,
             100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
             busy_iter,
-            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+            sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
+            tot_sleep_cycles * us_per_cycle,
+            tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
     if (rx_packets > 0) {
         ds_put_format(str,
@@ -519,5 +527,6 @@  OVS_REQUIRES(s->stats_mutex)
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, uint64_t sleep_cycles,
+                       bool full_metrics)
 {
     uint64_t now_tsc = cycles_counter_update(s);
@@ -526,5 +535,5 @@  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     char *reason = NULL;
 
-    cycles = now_tsc - s->start_tsc;
+    cycles = now_tsc - s->start_tsc - sleep_cycles;
     s->current.timestamp = s->iteration_cnt;
     s->current.cycles = cycles;
@@ -540,4 +549,9 @@  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     histogram_add_sample(&s->pkts, rx_packets);
 
+    if (sleep_cycles) {
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_ITER, 1);
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
+    }
+
     if (!full_metrics) {
         return;
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 9673dddd8..ebf776827 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -81,4 +81,6 @@  enum pmd_stat_type {
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
     PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_PWR_SLEEP_ITER,     /* Iterations where a sleep has taken place. */
+    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
     PMD_N_STATS
 };
@@ -409,5 +411,6 @@  pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, uint64_t sleep_cycles,
+                       bool full_metrics);
 
 /* Formatting the output of commands. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 7127068fe..af97f9a83 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -172,4 +172,9 @@  static struct odp_support dp_netdev_support = {
 #define PMD_RCU_QUIESCE_INTERVAL 10000LL
 
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
+/* Time in uS to increment a pmd thread sleep time. */
+#define PMD_PWR_INC_US 10
+
 struct dpcls {
     struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
@@ -280,4 +285,6 @@  struct dp_netdev {
     /* Enable collection of PMD performance metrics. */
     atomic_bool pmd_perf_metrics;
+    /* Max load based sleep request. */
+    atomic_uint64_t pmd_max_sleep;
     /* Enable the SMC cache from ovsdb config */
     atomic_bool smc_enable_db;
@@ -4822,6 +4829,8 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     uint8_t cur_rebalance_load;
     uint32_t rebalance_load, rebalance_improve;
+    uint64_t  pmd_max_sleep, cur_pmd_max_sleep;
     bool log_autolb = false;
     enum sched_assignment_type pmd_rxq_assign_type;
+    static bool first_set_config = true;
 
     tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
@@ -4970,4 +4979,17 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
 
     set_pmd_auto_lb(dp, autolb_state, log_autolb);
+
+    pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0);
+    pmd_max_sleep = ROUND_UP(pmd_max_sleep, 10);
+    pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
+    atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
+    if (first_set_config || pmd_max_sleep != cur_pmd_max_sleep) {
+        atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
+        VLOG_INFO("PMD max sleep request is %"PRIu64" usecs.", pmd_max_sleep);
+        VLOG_INFO("PMD load based sleeps are %s.",
+                  pmd_max_sleep ? "enabled" : "disabled" );
+    }
+
+    first_set_config  = false;
     return 0;
 }
@@ -6930,4 +6952,5 @@  pmd_thread_main(void *f_)
     int i;
     int process_packets = 0;
+    uint64_t sleep_time = 0;
 
     poll_list = NULL;
@@ -6990,8 +7013,11 @@  reload:
     for (;;) {
         uint64_t rx_packets = 0, tx_packets = 0;
+        uint64_t time_slept = 0;
+        uint64_t max_sleep;
 
         pmd_perf_start_iteration(s);
 
         atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
+        atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);
 
         for (i = 0; i < poll_cnt; i++) {
@@ -7012,4 +7038,7 @@  reload:
                                            poll_list[i].port_no);
             rx_packets += process_packets;
+            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
+                sleep_time = 0;
+            }
         }
 
@@ -7019,5 +7048,27 @@  reload:
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
+                                                   max_sleep && sleep_time
+                                                   ? true : false);
+        }
+
+        if (max_sleep) {
+            /* Check if a sleep should happen on this iteration. */
+            if (sleep_time) {
+                struct cycle_timer sleep_timer;
+
+                cycle_timer_start(&pmd->perf_stats, &sleep_timer);
+                xnanosleep_no_quiesce(sleep_time * 1000);
+                time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer);
+            }
+            if (sleep_time < max_sleep) {
+                /* Increase sleep time for next iteration. */
+                sleep_time += PMD_PWR_INC_US;
+            } else {
+                sleep_time = max_sleep;
+            }
+        } else {
+            /* Reset sleep time as max sleep policy may have been changed. */
+            sleep_time = 0;
         }
 
@@ -7059,5 +7110,5 @@  reload:
         }
 
-        pmd_perf_end_iteration(s, rx_packets, tx_packets,
+        pmd_perf_end_iteration(s, rx_packets, tx_packets, time_slept,
                                pmd_perf_metrics_enabled(pmd));
     }
@@ -9910,5 +9961,5 @@  dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
 {
     struct dpcls *cls;
-    uint64_t tot_idle = 0, tot_proc = 0;
+    uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
     unsigned int pmd_load = 0;
 
@@ -9927,8 +9978,11 @@  dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
             tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
+            tot_sleep = pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES] -
+                        pmd->prev_stats[PMD_PWR_SLEEP_CYCLES];
 
             if (pmd_alb->is_enabled && !pmd->isolated) {
                 if (tot_proc) {
-                    pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
+                    pmd_load = ((tot_proc * 100) /
+                                    (tot_idle + tot_proc + tot_sleep));
                 }
 
@@ -9947,4 +10001,6 @@  dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
         pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
                         pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
+        pmd->prev_stats[PMD_PWR_SLEEP_CYCLES] =
+                        pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES];
 
         /* Get the cycles that were used to process each queue and store. */
diff --git a/tests/pmd.at b/tests/pmd.at
index ed90f88c4..e0f58f7a6 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -1255,2 +1255,48 @@  ovs-appctl: ovs-vswitchd: server returned an error
 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+dnl Check default state
+AT_SETUP([PMD - pmd sleep])
+OVS_VSWITCHD_START
+
+dnl Check default
+OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
+OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])
+
+dnl Check low value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+dnl Check high value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10000"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+dnl Check setting max sleep to zero
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])
+
+dnl Check above high value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+dnl Check rounding
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="490"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+dnl Check rounding
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="491"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 500 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+OVS_VSWITCHD_STOP
+AT_CLEANUP
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index f9bdb2d92..8c4acfb18 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -789,4 +789,30 @@ 
         </p>
       </column>
+      <column name="other_config" key="pmd-maxsleep"
+              type='{"type": "integer",
+                     "minInteger": 0, "maxInteger": 10000}'>
+        <p>
+          Specifies the maximum sleep time that will be requested in
+          microseconds per iteration for a PMD thread which has received zero
+          or a small amount of packets from the Rx queues it is polling.
+        </p>
+        <p>
+          The actual sleep time requested is based on the load
+          of the Rx queues that the PMD polls and may be less than
+          the maximum value.
+        </p>
+        <p>
+          The default value is <code>0 microseconds</code>, which means
+          that the PMD will not sleep regardless of the load from the
+          Rx queues that it polls.
+        </p>
+        <p>
+          To avoid requesting very small sleeps (e.g. less than 10 us) the
+          value will be rounded up to the nearest 10 us.
+        </p>
+        <p>
+          The maximum value is <code>10000 microseconds</code>.
+        </p>
+      </column>
       <column name="other_config" key="userspace-tso-enable"
               type='{"type": "boolean"}'>