
[ovs-dev] dpif-netdev: Load based PMD sleeping.

Message ID 20221129140131.361338-1-ktraynor@redhat.com
State Superseded
Series [ovs-dev] dpif-netdev: Load based PMD sleeping.

Checks

Context Check Description
ovsrobot/intel-ovs-compilation success test: success
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Kevin Traynor Nov. 29, 2022, 2:01 p.m. UTC
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.

Sleep time will be increased by 1 uS on each iteration where
the low-load conditions remain, up to a total of 250 uS.

The feature is off by default and can be enabled by:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true

Also add new stats to pmd-perf-show to get visibility of operation:
Iterations:
...
  - No-sleep thresh:           576  ( 98.1 % of busy it)
  - Max sleeps:              19424  (167 us/it avg.)
...

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
---
 Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
 lib/dpif-netdev-perf.c            | 25 ++++++++++++--
 lib/dpif-netdev-perf.h            |  7 +++-
 lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
 vswitchd/vswitch.xml              | 17 ++++++++++
 5 files changed, 118 insertions(+), 7 deletions(-)

Comments

Thilak Raj Surendra Babu Dec. 7, 2022, 9:28 a.m. UTC | #1
Hi Kevin,
Thank you for the patch. It is pretty clean for power saving.
I tried the patch; below is what I see:

Ping test:
Ping from physical machine to br0 ip-address.

Poll:
[root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.061 ms
64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.050 ms
64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.049 ms
64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.048 ms
64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.048 ms
64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.047 ms
64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.047 ms
64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.047 ms
64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.049 ms

--- 10.15.251.33 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9ms
rtt min/avg/max/mdev = 0.047/0.052/0.081/0.013 ms

With your changes:
[root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.189 ms
64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.367 ms
64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.351 ms
64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.315 ms
64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.240 ms
64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.204 ms
64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.204 ms
64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.138 ms
64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.137 ms
64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.417 ms

--- 10.15.251.33 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9ms
rtt min/avg/max/mdev = 0.137/0.256/0.417/0.094 ms

At a high level, I was thinking of the below.
Please let me know what you think?

1. Can we have the option of making a particular PMD thread sleep, instead of it being a global setting?
Some interfaces service flows that are very sensitive to latency (heartbeats).
2. I was thinking about using umwait to reduce the latency of the first packet (but portability becomes a challenge, I suppose). But I guess this patch is generic enough to be adopted on all platforms.
3. Instead of sleeping on the next immediate iteration if the previous iteration had less than half of the burst size,
can we have a logic where we record a window of packets received (say the last 60 iterations) and decide to sleep based on that? (A rough sketch of the idea follows below.)
I am trying to avoid scenarios where a PMD thread has two interfaces and one interface has < BURST/2 packets coming in, which induces latency for the packets received on the other interface.
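
For illustration only, here is a minimal standalone sketch of what such a windowed check could look like (the names, the 60-iteration window and the threshold are hypothetical, not part of this patch):

    #include <stdbool.h>
    #include <stdint.h>

    #define HISTORY_LEN 60         /* Iterations remembered (arbitrary). */
    #define NO_SLEEP_THRESH 16     /* Half of a 32-packet rx batch. */

    struct sleep_history {
        uint16_t max_batch[HISTORY_LEN];  /* Largest rxq batch per iteration. */
        unsigned int idx;                 /* Next slot to overwrite. */
    };

    /* Record the largest batch seen on any rxq during this iteration. */
    static void
    sleep_history_record(struct sleep_history *h, uint16_t max_batch)
    {
        h->max_batch[h->idx] = max_batch;
        h->idx = (h->idx + 1) % HISTORY_LEN;
    }

    /* Allow sleeping only if no iteration in the window hit the threshold. */
    static bool
    sleep_history_allows_sleep(const struct sleep_history *h)
    {
        for (unsigned int i = 0; i < HISTORY_LEN; i++) {
            if (h->max_batch[i] >= NO_SLEEP_THRESH) {
                return false;
            }
        }
        return true;
    }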

Thanks
Thilak Raj S

-----Original Message-----
From: Kevin Traynor <ktraynor@redhat.com> 
Sent: 29 November 2022 06:02
To: dev@openvswitch.org
Cc: david.marchand@redhat.com; cfontain@redhat.com; cian.ferriter@intel.com; Thilak Raj Surendra Babu <thilakraj.sb@nutanix.com>; Kevin Traynor <ktraynor@redhat.com>
Subject: [PATCH] dpif-netdev: Load based PMD sleeping.

Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling iteration of the PMD.

Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.

Sleep time will be increased by 1 uS on each iteration where the low load conditions remain up to a total of 250 uS.

The feature is off by default and can be enabled by:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true

Also add new stats to pmd-perf-show to get visibility of operation:
Iterations:
...
  - No-sleep thresh:           576  ( 98.1 % of busy it)
  - Max sleeps:              19424  (167 us/it avg.)
...

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
---
 Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
 lib/dpif-netdev-perf.c            | 25 ++++++++++++--
 lib/dpif-netdev-perf.h            |  7 +++-
 lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
 vswitchd/vswitch.xml              | 17 ++++++++++
 5 files changed, 118 insertions(+), 7 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..a42e34956 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -312,4 +312,24 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
 (in min) such that a reassignment is triggered at most every few hours.
 
+PMD Power Saving
+----------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In 
+order to reduce the CPU cycles they use, they can sleep for small 
+periods of time when there is no load or very-low load from the Rx queues they poll.
+
+This can be enabled by::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
Should we think about having this per PMD thread instead of applying it to all the PMD threads?
+
+With this enabled a PMD may sleep by an incrementing amount of time up 
+to a total of 250 uS after polling all Rx queues assigned to it, if 
+none of the Rx queues have at least half a batch of packets (i.e. 16).
+
+.. note::
+
+    Sleeping in a PMD thread may cause extra packet latency due to the sleep
+    and possible processor low-power states which may be invoked.
+
 .. _ovs-vswitchd(8):
      http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index a2a7d8f0b..1bdeda265 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -231,4 +231,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
     uint64_t idle_iter = s->pkts.bin[0];
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
+    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
+    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
 
     ds_put_format(str,
@@ -236,5 +239,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
             "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
+            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
+            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
             tot_iter, tot_cycles * us_per_cycle / tot_iter,
             tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
@@ -242,5 +247,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
             busy_iter,
-            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
+            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) / tot_iter : 0);
     if (rx_packets > 0) {
         ds_put_format(str,
@@ -519,5 +526,7 @@ OVS_REQUIRES(s->stats_mutex)
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, bool max_sleep_hit,
+                       bool no_sleep_hit, uint64_t sleep_time,
+                       bool full_metrics)
 {
     uint64_t now_tsc = cycles_counter_update(s);
@@ -540,4 +549,14 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     histogram_add_sample(&s->pkts, rx_packets);
 
+    if (no_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
+    }
+    if (max_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
+    }
+    if (sleep_time) {
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
+    }
+
     if (!full_metrics) {
         return;
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 9673dddd8..c3d6cd435 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -81,4 +81,7 @@ enum pmd_stat_type {
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
     PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
+    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
+    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
     PMD_N_STATS
 };
@@ -409,5 +412,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, bool max_sleep_hit,
+                       bool no_sleep_hit, uint64_t sleep_time,
+                       bool full_metrics);
 
 /* Formatting the output of commands. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71c8..91a323c74 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {
 #define PMD_RCU_QUIESCE_INTERVAL 10000LL
 
+/* Max time in nanoseconds for a pmd thread to sleep based on load. */
+#define PMD_PWR_MAX_SLEEP 250000
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
+/* Time in nanoseconds to increment a pmd thread sleep time. */
+#define PMD_PWR_INC 1000
Should we make them tunable/configurable (MAX_SLEEP and PWR_INC) instead of them being constants?
+
 struct dpcls {
     struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
@@ -278,4 +285,6 @@ struct dp_netdev {
     /* Enable collection of PMD performance metrics. */
     atomic_bool pmd_perf_metrics;
+    /* Enable PMD load based sleeping. */
+    atomic_bool pmd_powersave;
     /* Enable the SMC cache from ovsdb config */
     atomic_bool smc_enable_db;
@@ -4915,4 +4924,17 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
 
     set_pmd_auto_lb(dp, autolb_state, log_autolb);
+
+    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
+    bool cur_powersave;
+    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
+    if (powersave != cur_powersave) {
+        atomic_store_relaxed(&dp->pmd_powersave, powersave);
+        if (powersave) {
+            VLOG_INFO("PMD powersave is enabled.");
+        } else {
+            VLOG_INFO("PMD powersave is disabled.");
+        }
+    }
+
     return 0;
 }
@@ -6873,7 +6895,9 @@ pmd_thread_main(void *f_)
     bool exiting;
     bool reload;
+    bool powersave;
     int poll_cnt;
     int i;
     int process_packets = 0;
+    uint64_t sleep_time = 0;
 
     poll_list = NULL;
@@ -6935,7 +6959,14 @@ reload:
     for (;;) {
         uint64_t rx_packets = 0, tx_packets = 0;
+        bool nosleep_hit = false;
+        bool max_sleep_hit = false;
+        uint64_t time_slept = 0;
 
         pmd_perf_start_iteration(s);
-
+        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
+        if (!powersave) {
+            /* Reset sleep_time as policy may have changed. */
+            sleep_time = 0;
+        }
         atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
 
@@ -6957,4 +6988,8 @@ reload:
                                            poll_list[i].port_no);
             rx_packets += process_packets;
+            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
+                nosleep_hit = true;
+                sleep_time = 0;
+            }
         }
 
@@ -6964,5 +6999,19 @@ reload:
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
+                                                   sleep_time ? true : false);
+        }
+
+        if (powersave) {
+            if (sleep_time) {
+                xnanosleep(sleep_time);
+                time_slept = sleep_time;
+            }
+            if (sleep_time < PMD_PWR_MAX_SLEEP) {
+                /* Increase potential sleep time for next iteration. */
+                sleep_time += PMD_PWR_INC;
+            } else {
+                max_sleep_hit = true;
+            }
My understanding here is that the above code will lead to an interface with fewer packets coming in
causing latency for the packets arriving on the other interface handled by the same PMD.
Please correct me if I am wrong.

Let's say we have 3 interfaces being serviced by this PMD:
the first interface receives 8 pkts, the second receives 5 pkts and the third receives 30 pkts.
After servicing the first interface, sleep time increases to 1 us,
before servicing the second interface it will sleep 1 us and increment the sleep time,
and before servicing the third interface it will sleep 2 us (so even though the third interface receives > PMD_PWR_NO_SLEEP_THRESH packets, the packets on that interface will suffer a latency).
If the order of the interfaces was changed, the latency suffered by the packets on these interfaces would change.

I am thinking that if we hold a history of some x iterations and sleep based on this history, the order of the interfaces will not affect the latency suffered by the packets,
and we could possibly have even latencies across them.

         }
 
@@ -7004,5 +7053,6 @@ reload:
         }
 
-        pmd_perf_end_iteration(s, rx_packets, tx_packets,
+        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
+                               nosleep_hit, time_slept,
                                pmd_perf_metrics_enabled(pmd));
     }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 928821a82..d51a486d8 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -789,4 +789,21 @@
         </p>
       </column>
+      <column name="other_config" key="pmd-powersave"
+              type='{"type": "boolean"}'>
+        <p>
+         Enables PMD thread sleeping up to 250 uS per iteration based on
+         rx queue batch sizes.
+        </p>
+        <p>
+          The default value is <code>false</code>.
+        </p>
+        <p>
+          Set this value to <code>true</code> to enable this option.
+        </p>
+        <p>
+         Sleep time will be reset to 0 when any rx queue has a batch of
+         16 or more packets.
+        </p>
+      </column>
       <column name="other_config" key="userspace-tso-enable"
               type='{"type": "boolean"}'>
--
2.38.1
Kevin Traynor Dec. 8, 2022, 2:39 p.m. UTC | #2
On 07/12/2022 09:28, Thilak Raj Surendra Babu wrote:
> Hi Kevin,
> Thank you for the patch. It is pretty clean for power saving.

Hi Thilak, Thank you for testing.

> I tried the patch below is what I see
> 
> Ping test:
> Ping from physical machine to br0 ip-address.
> 
> Poll:
> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
> PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.081 ms
> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.061 ms
> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.050 ms
> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.049 ms
> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.048 ms
> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.048 ms
> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.049 ms
> 
> --- 10.15.251.33 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9ms
> rtt min/avg/max/mdev = 0.047/0.052/0.081/0.013 ms
> 
> With your changes:
> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
> PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.189 ms
> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.367 ms
> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.351 ms
> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.315 ms
> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.240 ms
> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.204 ms
> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.204 ms
> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.138 ms
> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.137 ms
> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.417 ms
> 
> --- 10.15.251.33 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9ms
> rtt min/avg/max/mdev = 0.137/0.256/0.417/0.094 ms
> 
> At a high level, I was thinking of the below.
> Please let me know what you think?
> 
> 1. Can we have the option of making a particular pmd thread sleep instead of it being at the global level.
> There are some interfaces for which the flows being serviced by them are very sensitive towards latency(heartbeats)

To just have a per pmd setting means that the user would need to know 
which PMDs were being used for that port (could be multiple) and react 
to any changes in assignments or changes to PMD cores.

It also means that anyone who wants it on all PMD cores now needs to be 
aware of which PMDs are being used and track changes and then re-set to 
the new PMDs.

So I think per-pmd config as a replacement for a global one is too 
difficult for a user.

*If* a per-pmd functionality was needed a better option would be to keep 
the global pmd-powersave setting AND add a per-pmd no-sleep override 
(default false).

The expectation then would be if a user is concerned about the latency 
on a particular port they would be prepared to tune and pin that port Rx 
queue(s) to a specific PMD and set that PMD to never sleep, regardless 
of the global pmd-powersave mode.

It preserves the easy-to-use general setting for users, but it gives the
extra functionality to a user who has special requirements for some
interfaces and is willing to go deeper into tuning.
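
Purely to illustrate the shape this could take (the field and its source are hypothetical, not proposed code), the main loop would then consult a per-PMD flag before sleeping, with the global setting still gating the feature:

    /* 'pmd->no_sleep' would come from a per-PMD override in the config
     * (hypothetical, default false), so global behaviour is unchanged. */
    if (powersave && !pmd->no_sleep && sleep_time) {
        xnanosleep(sleep_time);
    }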

It still does introduce some extra complexity for the user and a bit for
the code too, so we'd need to know that it's needed.

Note, by default it would not have an impact on a global pmd-powersave 
config option. In this way a per-pmd override is an add-on functionality 
and could be done at a later time.

That is certainly something that could be added if users felt it is 
needed and we can plan to do it but I feel it would be better to bed 
down that the core operation of sleeping is useful/correct first, 
before adding on extra ways to specify on/off.

> 2. I was thinking about using umwait to reduce the latency of the first packet(but portability becomes a challenge I suppose) . But I guess this patch is very generic enough to be adopted on all platforms.

At a quick look, yes, portability would be an issue. I will look into it
some more.

> 3. Instead of sleeping on the next immediate iteration if the previous iteration had less than half of the burst size,
> can we have a logic where we record a window of packets received(say last 60 iterations) and based on that decide to sleep?

We could delay sleeping until an arbitrary number of iterations have passed
with all Rx queues < BURST/2, but in some cases that might mean we don't sleep
where we might have been able to. The other issue would be picking the 
arbitrary number - I wouldn't want a user setting something so low level.

> I am trying to avoid scenarios where a PMD thread has two interfaces and one interface has < BURST/2 packets coming in which induces latency for the packets received on the other interface.
> 

It would be right for the PMD to sleep if both interfaces have <BURST/2.
Waiting an arbitrary number of iterations would likely just delay that.

I suppose one thing to mention is that this focuses on the transition 
and edge cases and it is hard to get them perfect for all traffic scenarios.

There will always be some trade-off between sleeping sooner or later. 
i.e. do we sleep sooner and try to save power or wait a bit longer and 
be more sure that the traffic is not bursty. I would hope the trade-off 
is only for a very small window and the majority of time it's clear that 
we should sleep or not sleep.

Thanks, really appreciate your thoughts and feedback.

Kevin.

> Thanks
> Thilak Raj S
> 
> -----Original Message-----
> From: Kevin Traynor <ktraynor@redhat.com>
> Sent: 29 November 2022 06:02
> To: dev@openvswitch.org
> Cc: david.marchand@redhat.com; cfontain@redhat.com; cian.ferriter@intel.com; Thilak Raj Surendra Babu <thilakraj.sb@nutanix.com>; Kevin Traynor <ktraynor@redhat.com>
> Subject: [PATCH] dpif-netdev: Load based PMD sleeping.
> 
> Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on an polling iteration of the PMD.
> 
> Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.
> 
> Sleep time will be increased by 1 uS on each iteration where the low load conditions remain up to a total of 250 uS.
> 
> The feature is off by default and can be enabled by:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
> 
> Also add new stats to pmd-perf-show to get visibility of operation:
> Iterations:
> ...
>    - No-sleep thresh:           576  ( 98.1 % of busy it)
>    - Max sleeps:              19424  (167 us/it avg.)
> ...
> 
> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> ---
>   Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
>   lib/dpif-netdev-perf.c            | 25 ++++++++++++--
>   lib/dpif-netdev-perf.h            |  7 +++-
>   lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
>   vswitchd/vswitch.xml              | 17 ++++++++++
>   5 files changed, 118 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..a42e34956 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,4 +312,24 @@ reassignment due to PMD Auto Load Balance. For example, this could be set  (in min) such that a reassignment is triggered at most every few hours.
>   
> +PMD Power Saving
> +----------------
> +
> +PMD threads constantly poll Rx queues which are assigned to them. In
> +order to reduce the CPU cycles they use, they can sleep for small
> +periods of time when there is no load or very-low load from the Rx queues it polls.
> +
> +This can be enabled by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
> Should we think about having this per PMD thread instead of applying for all the PMD threads?
> +
> +With this enabled a PMD may sleep by an incrementing amount of time up
> +to a total of 250 uS after polling all Rx queues assigned to it, if
> +none of the Rx queues have at least half a batch of packets (i.e. 16).
> +
> +.. note::
> +
> +    Sleeping in a PMD thread may cause extra packet latency due to the sleep
> +    and possible processor low-power states which may be invoked.
> +
>   .. _ovs-vswitchd(8):
>       https://urldefense.proofpoint.com/v2/url?u=http-3A__openvswitch.org_support_dist-2Ddocs_ovs-2Dvswitchd.8.html&d=DwIDAg&c=s883GpUCOChKOHiocYtGcg&r=1sd3waKor_ps6hs2j0tfqmW6ts2tlVvmmMySlXCPN6w&m=j2qb7PEBmEEdOCoDKnOOlknwNtbrbMajkVMwgyP9wSKZO5G-jtHdhk6hYvL3wgc1&s=iiXtQr8FYSg6QP3btMH7HA5LBoyrNKToSpvCx6LE3Sw&e=
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index a2a7d8f0b..1bdeda265 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>       uint64_t idle_iter = s->pkts.bin[0];
>       uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
> +    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
> +    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
>   
>       ds_put_format(str,
> @@ -236,5 +239,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>               "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>               "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> +            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
> +            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
>               tot_iter, tot_cycles * us_per_cycle / tot_iter,
>               tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz, @@ -242,5 +247,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>               100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>               busy_iter,
> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
> +            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) /
> + tot_iter : 0);
>       if (rx_packets > 0) {
>           ds_put_format(str,
> @@ -519,5 +526,7 @@ OVS_REQUIRES(s->stats_mutex)  void  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, bool max_sleep_hit,
> +                       bool no_sleep_hit, uint64_t sleep_time,
> +                       bool full_metrics)
>   {
>       uint64_t now_tsc = cycles_counter_update(s); @@ -540,4 +549,14 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>       histogram_add_sample(&s->pkts, rx_packets);
>   
> +    if (no_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
> +    }
> +    if (max_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
> +    }
> +    if (sleep_time) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
> +    }
> +
>       if (!full_metrics) {
>           return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 9673dddd8..c3d6cd435 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,7 @@ enum pmd_stat_type {
>       PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>       PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
> +    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
> +    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
>       PMD_N_STATS
>   };
> @@ -409,5 +412,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);  void  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, bool max_sleep_hit,
> +                       bool no_sleep_hit, uint64_t sleep_time,
> +                       bool full_metrics);
>   
>   /* Formatting the output of commands. */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 2c08a71c8..91a323c74 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {  #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>   
> +/* Max time in nanoseconds for a pmd thread to sleep based on load. */
> +#define PMD_PWR_MAX_SLEEP 250000
> +/* Number of pkts Rx on an interface that will stop pmd thread
> +sleeping. */ #define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
> +/* Time in nanosecond to increment a pmd thread sleep time. */ #define
> +PMD_PWR_INC 1000
> Should we make them Tunnable/configurable(MAX_SLEEP and PWR_INC) instead of them being constants?
> +
>   struct dpcls {
>       struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
> @@ -278,4 +285,6 @@ struct dp_netdev {
>       /* Enable collection of PMD performance metrics. */
>       atomic_bool pmd_perf_metrics;
> +    /* Enable PMD load based sleeping. */
> +    atomic_bool pmd_powersave;
>       /* Enable the SMC cache from ovsdb config */
>       atomic_bool smc_enable_db;
> @@ -4915,4 +4924,17 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>   
>       set_pmd_auto_lb(dp, autolb_state, log_autolb);
> +
> +    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
> +    bool cur_powersave;
> +    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
> +    if (powersave != cur_powersave) {
> +        atomic_store_relaxed(&dp->pmd_powersave, powersave);
> +        if (powersave) {
> +            VLOG_INFO("PMD powersave is enabled.");
> +        } else {
> +            VLOG_INFO("PMD powersave is disabled.");
> +        }
> +    }
> +
>       return 0;
>   }
> @@ -6873,7 +6895,9 @@ pmd_thread_main(void *f_)
>       bool exiting;
>       bool reload;
> +    bool powersave;
>       int poll_cnt;
>       int i;
>       int process_packets = 0;
> +    uint64_t sleep_time = 0;
>   
>       poll_list = NULL;
> @@ -6935,7 +6959,14 @@ reload:
>       for (;;) {
>           uint64_t rx_packets = 0, tx_packets = 0;
> +        bool nosleep_hit = false;
> +        bool max_sleep_hit = false;
> +        uint64_t time_slept = 0;
>   
>           pmd_perf_start_iteration(s);
> -
> +        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
> +        if (!powersave) {
> +            /* Reset sleep_time as policy may have changed. */
> +            sleep_time = 0;
> +        }
>           atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
>   
> @@ -6957,4 +6988,8 @@ reload:
>                                              poll_list[i].port_no);
>               rx_packets += process_packets;
> +            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
> +                nosleep_hit = true;
> +                sleep_time = 0;
> +            }
>           }
>   
> @@ -6964,5 +6999,19 @@ reload:
>                * There was no time updates on current iteration. */
>               pmd_thread_ctx_time_update(pmd);
> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
> +                                                   sleep_time ? true : false);
> +        }
> +
> +        if (powersave) {
> +            if (sleep_time) {
> +                xnanosleep(sleep_time);
> +                time_slept = sleep_time;
> +            }
> +            if (sleep_time < PMD_PWR_MAX_SLEEP) {
> +                /* Increase potential sleep time for next iteration. */
> +                sleep_time += PMD_PWR_INC;
> +            } else {
> +                max_sleep_hit = true;
> +            }
> M understanding here is that the above code will lead to a interface with less packets coming
> Cause latency for the packets arriving in the other interface handled by the same PMD.
> Please correct me if I am wrong.
> 
> Lets say we have 3 interfaces being serviced by this PMD.
> first interface receives 8 pkts , second receives 5 pkts and third received 30 pkts.
> After servicing the first interface, sleep time increases to 1 us and
> Before servicing the second interface it will sleep 1 us and increment the sleep time to 1us
> And before servicing third interface it will sleep 2 us(so even though the third interface receives > PMD_PWR_NO_SLEEP_THRESH packets on that interface will suffer a latency).
> If the order of the interface was changed the latency suffered by the packets on these interfaces will change.
> 
> I am thinking , if we hold an history of some x iterations and sleep based on this history, the order of the interfaces will not affect the latency suffered by the packets.
> and possibly have even latencies across.
> 
>           }
>   
> @@ -7004,5 +7053,6 @@ reload:
>           }
>   
> -        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
> +                               nosleep_hit, time_slept,
>                                  pmd_perf_metrics_enabled(pmd));
>       }
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 928821a82..d51a486d8 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -789,4 +789,21 @@
>           </p>
>         </column>
> +      <column name="other_config" key="pmd-powersave"
> +              type='{"type": "boolean"}'>
> +        <p>
> +         Enables PMD thread sleeping up to 250 uS per iteration based on
> +         rx queue batch sizes.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option.
> +        </p>
> +        <p>
> +         Sleep time will be reset to 0 when any rx queue has a batch of
> +         16 or more packets.
> +        </p>
> +      </column>
>         <column name="other_config" key="userspace-tso-enable"
>                 type='{"type": "boolean"}'>
> --
> 2.38.1
>
David Marchand Dec. 9, 2022, 3:13 p.m. UTC | #3
Hi Kevin,

Here are some comments on the stats side:

On Tue, Nov 29, 2022 at 3:01 PM Kevin Traynor <ktraynor@redhat.com> wrote:
> @@ -6964,5 +6999,19 @@ reload:
>               * There was no time updates on current iteration. */
>              pmd_thread_ctx_time_update(pmd);
> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
> +                                                   sleep_time ? true : false);
> +        }
> +
> +        if (powersave) {
> +            if (sleep_time) {
> +                xnanosleep(sleep_time);
> +                time_slept = sleep_time;

- time_slept is the requested time, and not the time that ovs actually slept.
PMD_PWR_SLEEP_TIME stat won't be accurate if xnanosleep() takes longer
than expected.
I would add a "timer" around xnanosleep() and store cycles internally
(and then translate back to us in the stats output).


- Both pmd idling (PMD_CYCLES_ITER_IDLE) and pmd processing
(PMD_CYCLES_ITER_BUSY) stats now include cycles spent sleeping.
Those stats are used for animating pmd stats ("avg processing cycles
per packet") and rxq stats ("overhead").
I think those "sleeping" cycles should be counted as idle so that
"busy" cycles do represent time spent processing the packets.


WDYT?
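
For illustration of the two points above, a rough sketch (not tested) against the posted patch, using only helpers that already appear in it:

    /* In pmd_thread_main(): measure what was actually slept rather than
     * recording the requested sleep_time. */
    uint64_t sleep_cycles = 0;

    if (powersave && sleep_time) {
        uint64_t before = cycles_counter_update(s);

        xnanosleep(sleep_time);
        sleep_cycles = cycles_counter_update(s) - before;
    }

    /* Then pass sleep_cycles down to pmd_perf_end_iteration(), which could
     * store it in PMD_PWR_SLEEP_TIME (converting to us only when printing)
     * and attribute it to idle rather than busy, e.g.:
     *
     *     pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_cycles);
     *     pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, sleep_cycles);
     */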

> +            }
> +            if (sleep_time < PMD_PWR_MAX_SLEEP) {
> +                /* Increase potential sleep time for next iteration. */
> +                sleep_time += PMD_PWR_INC;
> +            } else {
> +                max_sleep_hit = true;
> +            }
>          }
>
Thilak Raj Surendra Babu Dec. 15, 2022, 1:12 a.m. UTC | #4
Hi Kevin,
Apologize for the late reply. 
Please find my replies inline.

Thanks
Thilak Raj S

-----Original Message-----
From: Kevin Traynor <ktraynor@redhat.com> 
Sent: 08 December 2022 06:39
To: Thilak Raj Surendra Babu <thilakraj.sb@nutanix.com>; dev@openvswitch.org
Cc: david.marchand@redhat.com; cfontain@redhat.com; cian.ferriter@intel.com
Subject: Re: [PATCH] dpif-netdev: Load based PMD sleeping.

On 07/12/2022 09:28, Thilak Raj Surendra Babu wrote:
> Hi Kevin,
> Thank you for the patch. It is pretty clean for power saving.

Hi Thilak, Thank you for testing.

> I tried the patch below is what I see
> 
> Ping test:
> Ping from physical machine to br0 ip-address.
> 
> Poll:
> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10 PING 
> 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.081 ms
> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.061 ms
> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.050 ms
> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.049 ms
> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.048 ms
> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.048 ms
> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.049 ms
> 
> --- 10.15.251.33 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9ms rtt 
> min/avg/max/mdev = 0.047/0.052/0.081/0.013 ms
> 
> With your changes:
> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10 PING 
> 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.189 ms
> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.367 ms
> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.351 ms
> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.315 ms
> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.240 ms
> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.204 ms
> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.204 ms
> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.138 ms
> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.137 ms
> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.417 ms
> 
> --- 10.15.251.33 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9ms rtt 
> min/avg/max/mdev = 0.137/0.256/0.417/0.094 ms
> 
> At a high level, I was thinking of the below.
> Please let me know what you think?
> 
> 1. Can we have the option of making a particular pmd thread sleep instead of it being at the global level.
> There are some interfaces for which the flows being serviced by them 
> are very sensitive towards latency(heartbeats)

To just have a per pmd setting means that the user would need to know which PMDs were being used for that port (could be multiple) and react to any changes in assignments or changes to PMD cores.

It also means that anyone who wants it on all PMD cores now needs to be aware of which PMDs are being used and track changes and then re-set to the new PMDs.

So I think per-pmd config as a replacement for a global one is too difficult for a user.

*If* a per-pmd functionality was needed a better option would be to keep the global pmd-powersave setting AND add a per-pmd no-sleep override (default false).

The expectation then would be if a user is concerned about the latency on a particular port they would be prepared to tune and pin that port Rx
queue(s) to a specific PMD and set that PMD to never sleep, regardless of the global pmd-powersave mode.

It preserves the easy to use general setting for users, but it gives a user who has special requirements about some interfaces and is willing to go deeper into tuning the extra functionality.

It still does introduce some extra complexity for user and a bit for code too so we'd need to know that it's needed.

Note, by default it would not have an impact on a global pmd-powersave config option. In this way a per-pmd override is an add-on functionality and could be done at a later time.

That is certainly something that could be added if users felt it is needed and we can plan to do it but I feel it would be better to bed down that the core operation of sleeping is useful/correct first, before adding on extra ways to specify on/off.

So you are suggesting an override at the PMD level instead of having the option itself at the PMD level, i.e.:
ovs-vsctl set open_vswitch . other_config:pmd-powersave="true" and
ovs-vsctl set open_vswitch . other_config:pmd-powersave-no-sleep:<list of cores for which the override applies>

I was thinking we could have a config like this instead:
ovs-vsctl set open_vswitch . other_config:pmd-powersave-cores=<"list of cores which will sleep"/"all">
where giving "all" implies all cores will be in power-save mode.
I was also thinking that whenever there are both sleeping and non-sleeping PMDs, it would make sense to explicitly pin the RXQs to the PMD threads.

To my mind the above felt more natural, but I don't have a strong preference either way.
I also agree that we should get the core operation right first before spending much time on the knobs.

> 2. I was thinking about using umwait to reduce the latency of the first packet(but portability becomes a challenge I suppose) . But I guess this patch is very generic enough to be adopted on all platforms.

At a quick look, yes portability would be issue. I will look into it some more.

> 3. Instead of sleeping on the next immediate iteration if the previous 
> iteration had less than half of the burst size, can we have a logic where we record a window of packets received(say last 60 iterations) and based on that decide to sleep?

We could delay sleeping until an arbitrary number of iterations with all Rx queue < BURST/2, but in some cases that might mean we don't sleep where we might have been able to. The other issue would be picking the arbitrary number - I wouldn't want a user setting something so low level.

I was thinking along these lines, as I wanted to put the latency requirements for certain flows ahead of the power-saving feature.
But if we could have a system where there are threads which sleep to save power and threads which poll, we should be fine.

> I am trying to avoid scenarios where a PMD thread has two interfaces and one interface has < BURST/2 packets coming in which induces latency for the packets received on the other interface.
> 

It would be right for the PMD to sleep if both interfaces have <BURST/2.
Waiting an arbitrary amount of iterations would likely just delay that.
My bad, for some reason I initially thought the code which decides the sleep was within the for loop going over the rxqs.

I suppose one thing to mention is that this focuses on the transition and edge cases and it is hard to get them perfect for all traffic sencarios.

There will always be some trade-off between sleeping sooner or later. 
i.e. do we sleep sooner and try to save power or wait a bit longer and be more sure that the traffic is not bursty. I would hope the trade-off is only for a very small window and the majority of time it's clear that we should sleep or not sleep.
Agreed that there are quite a few edge cases. I think if we have the option to put only a few threads to sleep and leave other threads polling,
then traffic that is sensitive to the edge cases can be placed on a polling thread.

Thanks, really appreciate your thoughts and feedback.

Kevin.

> Thanks
> Thilak Raj S
> 
> -----Original Message-----
> From: Kevin Traynor <ktraynor@redhat.com>
> Sent: 29 November 2022 06:02
> To: dev@openvswitch.org
> Cc: david.marchand@redhat.com; cfontain@redhat.com; 
> cian.ferriter@intel.com; Thilak Raj Surendra Babu 
> <thilakraj.sb@nutanix.com>; Kevin Traynor <ktraynor@redhat.com>
> Subject: [PATCH] dpif-netdev: Load based PMD sleeping.
> 
> Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on an polling iteration of the PMD.
> 
> Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.
> 
> Sleep time will be increased by 1 uS on each iteration where the low load conditions remain up to a total of 250 uS.
> 
> The feature is off by default and can be enabled by:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
> 
> Also add new stats to pmd-perf-show to get visibility of operation:
> Iterations:
> ...
>    - No-sleep thresh:           576  ( 98.1 % of busy it)
>    - Max sleeps:              19424  (167 us/it avg.)
> ...
> 
> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> ---
>   Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
>   lib/dpif-netdev-perf.c            | 25 ++++++++++++--
>   lib/dpif-netdev-perf.h            |  7 +++-
>   lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
>   vswitchd/vswitch.xml              | 17 ++++++++++
>   5 files changed, 118 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst 
> b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..a42e34956 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,4 +312,24 @@ reassignment due to PMD Auto Load Balance. For example, this could be set  (in min) such that a reassignment is triggered at most every few hours.
>   
> +PMD Power Saving
> +----------------
> +
> +PMD threads constantly poll Rx queues which are assigned to them. In 
> +order to reduce the CPU cycles they use, they can sleep for small 
> +periods of time when there is no load or very-low load from the Rx queues it polls.
> +
> +This can be enabled by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
> Should we think about having this per PMD thread instead of applying for all the PMD threads?
> +
> +With this enabled a PMD may sleep by an incrementing amount of time 
> +up to a total of 250 uS after polling all Rx queues assigned to it, 
> +if none of the Rx queues have at least half a batch of packets (i.e. 16).
> +
> +.. note::
> +
> +    Sleeping in a PMD thread may cause extra packet latency due to the sleep
> +    and possible processor low-power states which may be invoked.
> +
>   .. _ovs-vswitchd(8):
>       
> https://urldefense.proofpoint.com/v2/url?u=http-3A__openvswitch.org_su
> pport_dist-2Ddocs_ovs-2Dvswitchd.8.html&d=DwIDAg&c=s883GpUCOChKOHiocYt
> Gcg&r=1sd3waKor_ps6hs2j0tfqmW6ts2tlVvmmMySlXCPN6w&m=j2qb7PEBmEEdOCoDKn
> OOlknwNtbrbMajkVMwgyP9wSKZO5G-jtHdhk6hYvL3wgc1&s=iiXtQr8FYSg6QP3btMH7H
> A5LBoyrNKToSpvCx6LE3Sw&e= diff --git a/lib/dpif-netdev-perf.c 
> b/lib/dpif-netdev-perf.c index a2a7d8f0b..1bdeda265 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>       uint64_t idle_iter = s->pkts.bin[0];
>       uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - 
> idle_iter : 0;
> +    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
> +    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
> +    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
>   
>       ds_put_format(str,
> @@ -236,5 +239,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>               "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>               "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> +            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
> +            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
>               tot_iter, tot_cycles * us_per_cycle / tot_iter,
>               tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz, @@ -242,5 +247,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>               100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>               busy_iter,
> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
> +            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) / 
> + tot_iter : 0);
>       if (rx_packets > 0) {
>           ds_put_format(str,
> @@ -519,5 +526,7 @@ OVS_REQUIRES(s->stats_mutex)  void  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, bool max_sleep_hit,
> +                       bool no_sleep_hit, uint64_t sleep_time,
> +                       bool full_metrics)
>   {
>       uint64_t now_tsc = cycles_counter_update(s); @@ -540,4 +549,14 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>       histogram_add_sample(&s->pkts, rx_packets);
>   
> +    if (no_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
> +    }
> +    if (max_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
> +    }
> +    if (sleep_time) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
> +    }
> +
>       if (!full_metrics) {
>           return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 
> 9673dddd8..c3d6cd435 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,7 @@ enum pmd_stat_type {
>       PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>       PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
> +    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
> +    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
>       PMD_N_STATS
>   };
> @@ -409,5 +412,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);  void  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, bool max_sleep_hit,
> +                       bool no_sleep_hit, uint64_t sleep_time,
> +                       bool full_metrics);
>   
>   /* Formatting the output of commands. */ diff --git 
> a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 2c08a71c8..91a323c74 
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {  
> #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>   
> +/* Max time in nanoseconds for a pmd thread to sleep based on load. 
> +*/ #define PMD_PWR_MAX_SLEEP 250000
> +/* Number of pkts Rx on an interface that will stop pmd thread 
> +sleeping. */ #define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
> +/* Time in nanosecond to increment a pmd thread sleep time. */ 
> +#define PMD_PWR_INC 1000
> Should we make them Tunnable/configurable(MAX_SLEEP and PWR_INC) instead of them being constants?
> +
>   struct dpcls {
>       struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
> @@ -278,4 +285,6 @@ struct dp_netdev {
>       /* Enable collection of PMD performance metrics. */
>       atomic_bool pmd_perf_metrics;
> +    /* Enable PMD load based sleeping. */
> +    atomic_bool pmd_powersave;
>       /* Enable the SMC cache from ovsdb config */
>       atomic_bool smc_enable_db;
> @@ -4915,4 +4924,17 @@ dpif_netdev_set_config(struct dpif *dpif, const 
> struct smap *other_config)
>   
>       set_pmd_auto_lb(dp, autolb_state, log_autolb);
> +
> +    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
> +    bool cur_powersave;
> +    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
> +    if (powersave != cur_powersave) {
> +        atomic_store_relaxed(&dp->pmd_powersave, powersave);
> +        if (powersave) {
> +            VLOG_INFO("PMD powersave is enabled.");
> +        } else {
> +            VLOG_INFO("PMD powersave is disabled.");
> +        }
> +    }
> +
>       return 0;
>   }
> @@ -6873,7 +6895,9 @@ pmd_thread_main(void *f_)
>       bool exiting;
>       bool reload;
> +    bool powersave;
>       int poll_cnt;
>       int i;
>       int process_packets = 0;
> +    uint64_t sleep_time = 0;
>   
>       poll_list = NULL;
> @@ -6935,7 +6959,14 @@ reload:
>       for (;;) {
>           uint64_t rx_packets = 0, tx_packets = 0;
> +        bool nosleep_hit = false;
> +        bool max_sleep_hit = false;
> +        uint64_t time_slept = 0;
>   
>           pmd_perf_start_iteration(s);
> -
> +        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
> +        if (!powersave) {
> +            /* Reset sleep_time as policy may have changed. */
> +            sleep_time = 0;
> +        }
>           atomic_read_relaxed(&pmd->dp->smc_enable_db, 
> &pmd->ctx.smc_enable_db);
>   
> @@ -6957,4 +6988,8 @@ reload:
>                                              poll_list[i].port_no);
>               rx_packets += process_packets;
> +            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
> +                nosleep_hit = true;
> +                sleep_time = 0;
> +            }
>           }
>   
> @@ -6964,5 +6999,19 @@ reload:
>                * There was no time updates on current iteration. */
>               pmd_thread_ctx_time_update(pmd);
> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
> +                                                   sleep_time ? true : false);
> +        }
> +
> +        if (powersave) {
> +            if (sleep_time) {
> +                xnanosleep(sleep_time);
> +                time_slept = sleep_time;
> +            }
> +            if (sleep_time < PMD_PWR_MAX_SLEEP) {
> +                /* Increase potential sleep time for next iteration. */
> +                sleep_time += PMD_PWR_INC;
> +            } else {
> +                max_sleep_hit = true;
> +            }
> M understanding here is that the above code will lead to a interface 
> with less packets coming Cause latency for the packets arriving in the other interface handled by the same PMD.
> Please correct me if I am wrong.
> 
> Lets say we have 3 interfaces being serviced by this PMD.
> first interface receives 8 pkts , second receives 5 pkts and third received 30 pkts.
> After servicing the first interface, sleep time increases to 1 us and 
> Before servicing the second interface it will sleep 1 us and increment 
> the sleep time to 1us And before servicing third interface it will sleep 2 us(so even though the third interface receives > PMD_PWR_NO_SLEEP_THRESH packets on that interface will suffer a latency).
> If the order of the interface was changed the latency suffered by the packets on these interfaces will change.
> 
> I am thinking , if we hold an history of some x iterations and sleep based on this history, the order of the interfaces will not affect the latency suffered by the packets.
> and possibly have even latencies across.
> 
>           }
>   
> @@ -7004,5 +7053,6 @@ reload:
>           }
>   
> -        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
> +                               nosleep_hit, time_slept,
>                                  pmd_perf_metrics_enabled(pmd));
>       }
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 
> 928821a82..d51a486d8 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -789,4 +789,21 @@
>           </p>
>         </column>
> +      <column name="other_config" key="pmd-powersave"
> +              type='{"type": "boolean"}'>
> +        <p>
> +         Enables PMD thread sleeping up to 250 uS per iteration based on
> +         rx queue batch sizes.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option.
> +        </p>
> +        <p>
> +         Sleep time will be reset to 0 when any rx queue has a batch of
> +         16 or more packets.
> +        </p>
> +      </column>
>         <column name="other_config" key="userspace-tso-enable"
>                 type='{"type": "boolean"}'>
> --
> 2.38.1
>
Kevin Traynor Dec. 15, 2022, 5:18 p.m. UTC | #5
On 15/12/2022 01:12, Thilak Raj Surendra Babu wrote:
> Hi Kevin,
> Apologize for the late reply.
> Please find my replies inline.
> 

Hi Thilak,

Thanks for your comments. Replies below.

Just a small thing on email format. If possible, could you send replies 
inline (i.e. not top-posted, with the original mail lines prefixed with ">") or, if not, 
mark your comments with your name. This is just to make it easier for 
anyone else reading the thread to know who is saying what!

thanks,
Kevin.

> Thanks
> Thilak Raj S
> 
> -----Original Message-----
> From: Kevin Traynor <ktraynor@redhat.com>
> Sent: 08 December 2022 06:39
> To: Thilak Raj Surendra Babu <thilakraj.sb@nutanix.com>; dev@openvswitch.org
> Cc: david.marchand@redhat.com; cfontain@redhat.com; cian.ferriter@intel.com
> Subject: Re: [PATCH] dpif-netdev: Load based PMD sleeping.
> 
> On 07/12/2022 09:28, Thilak Raj Surendra Babu wrote:
>> Hi Kevin,
>> Thank you for the patch. It is pretty clean for power saving.
> 
> Hi Thilak, Thank you for testing.
> 
>> I tried the patch below is what I see
>>
>> Ping test:
>> Ping from physical machine to br0 ip-address.
>>
>> Poll:
>> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10 PING
>> 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
>> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.081 ms
>> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.061 ms
>> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.050 ms
>> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.049 ms
>> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.048 ms
>> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.048 ms
>> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.047 ms
>> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.047 ms
>> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.047 ms
>> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.049 ms
>>
>> --- 10.15.251.33 ping statistics ---
>> 10 packets transmitted, 10 received, 0% packet loss, time 9ms rtt
>> min/avg/max/mdev = 0.047/0.052/0.081/0.013 ms
>>
>> With your changes:
>> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10 PING
>> 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
>> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.189 ms
>> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.367 ms
>> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.351 ms
>> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.315 ms
>> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.240 ms
>> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.204 ms
>> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.204 ms
>> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.138 ms
>> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.137 ms
>> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.417 ms
>>
>> --- 10.15.251.33 ping statistics ---
>> 10 packets transmitted, 10 received, 0% packet loss, time 9ms rtt
>> min/avg/max/mdev = 0.137/0.256/0.417/0.094 ms
>>
>> At a high level, I was thinking of the below.
>> Please let me know what you think?
>>
>> 1. Can we have the option of making a particular pmd thread sleep, instead of it being a global setting?
>> There are some interfaces for which the flows being serviced by them
>> are very sensitive to latency (heartbeats).
> 
> To just have a per pmd setting means that the user would need to know which PMDs were being used for that port (could be multiple) and react to any changes in assignments or changes to PMD cores.
> 
> It also means that anyone who wants it on all PMD cores now needs to be aware of which PMDs are being used and track changes and then re-set to the new PMDs.
> 
> So I think per-pmd config as a replacement for a global one is too difficult for a user.
> 
> *If* per-pmd functionality were needed, a better option would be to keep the global pmd-powersave setting AND add a per-pmd no-sleep override (default false).
> 
> The expectation then would be that if a user is concerned about the latency on a particular port, they would be prepared to tune and pin that port's Rx
> queue(s) to a specific PMD and set that PMD to never sleep, regardless of the global pmd-powersave mode.
> 
> It preserves the easy-to-use general setting for users, but gives the extra functionality to a user who has special requirements for some interfaces and is willing to go deeper into tuning.
> 
> It still introduces some extra complexity for the user, and a bit for the code too, so we'd need to know that it's needed.
> 
> Note, by default it would not have an impact on a global pmd-powersave config option. In this way a per-pmd override is an add-on functionality and could be done at a later time.
> 
> That is certainly something that could be added if users feel it is needed, and we can plan to do it, but I feel it would be better to first bed down that the core operation of sleeping is useful/correct, before adding extra ways to specify on/off.
> 
> So you are suggesting an override at PMD level instead of having this option itself at PMD level, i.e.
> ovs-vsctl set open_vswitch . other_config:pmd-powersave="true" and
> ovs-vsctl set open_vswitch . other_config:pmd-powersave-no-sleep=<list of cores for which the override applies>
> 

Yes, that's what I was thinking.

> I was thinking we could have a config like this:
> ovs-vsctl set open_vswitch . other_config:pmd-powersave-cores=<"list of cores which will sleep"/all>. If all is given, it implies all cores will be in power save mode.
> I was also thinking that whenever there are both sleeping and non-sleeping PMDs it would make sense to explicitly pin the RXQs to the PMD threads.
>
> To my mind the above felt more natural, but I don't have a strong preference either way.
> I also agree that we should get the core operation in place first before spending much time on the knobs.
> 

For mixed cases (where some pmds should be able to sleep and some not), I'd have a 
preference for opting some pmd cores out, rather than in.

The reasoning is that if they are used for pinning then the user must know 
about them and they are fixed. Other pmds could be added/removed/changed 
and the user would not need to know the specifics to keep the other pmds 
sleeping, but otoh it is nice to have a one-liner too. We can review 
again later when we're happy with the operation.
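
As a rough illustration of the opt-out/override idea discussed above (the global pmd-powersave setting plus a per-PMD no-sleep flag; pmd->no_sleep and no_sleep_override are hypothetical names, not code from the patch or any v2), the check at the top of the PMD loop might look like:

    bool powersave, no_sleep_override;

    atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
    /* Hypothetical per-PMD opt-out flag. */
    atomic_read_relaxed(&pmd->no_sleep, &no_sleep_override);

    if (!powersave || no_sleep_override) {
        /* Sleeping is disabled globally or for this PMD: drop any
         * accumulated back-off. */
        sleep_time = 0;
    }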

>> 2. I was thinking about using umwait to reduce the latency of the first packet (but portability becomes a challenge I suppose). But I guess this patch is generic enough to be adopted on all platforms.
> 
> At a quick look, yes, portability would be an issue. I will look into it some more.
> 
>> 3. Instead of sleeping on the next immediate iteration if the previous
>> iteration had less than half of the burst size, can we have logic where we record a window of packets received (say the last 60 iterations) and decide to sleep based on that?
> 
> We could delay sleeping until an arbitrary number of iterations have had all Rx queues < BURST/2, but in some cases that might mean we don't sleep where we might have been able to. The other issue would be picking the arbitrary number - I wouldn't want a user setting something so low level.
> 
> I was thinking along these lines, as I wanted to put the latency requirements of certain flows ahead of the power saving feature.
> But if we could have a system where there are threads which sleep to save power and threads which poll, we should be fine.
> 

ok, thanks for confirming.

>> I am trying to avoid scenarios where a PMD thread has two interfaces and one interface has < BURST/2 packets coming in which induces latency for the packets received on the other interface.
>>
> 
> It would be right for the PMD to sleep if both interfaces have <BURST/2.
> Waiting an arbitrary number of iterations would likely just delay that.
> My bad, for some reason I initially thought the code which decides the sleep was within the for loop going over the rxqs.
> 
> I suppose one thing to mention is that this focuses on the transition and edge cases, and it is hard to get them perfect for all traffic scenarios.
> 
> There will always be some trade-off between sleeping sooner or later,
> i.e. do we sleep sooner and try to save power, or wait a bit longer and be more sure that the traffic is not bursty. I would hope the trade-off is only for a very small window and that the majority of the time it's clear whether we should sleep or not.
> Agreed that there are quite a few edge cases. I think if we have the option to put only a few threads to sleep and leave the other threads polling,
> then traffic that is sensitive to the edge cases can be placed on a polling thread.
> 
> Thanks, really appreciate your thoughts and feedback.
> 
> Kevin.
> 
>> Thanks
>> Thilak Raj S
>>
>> -----Original Message-----
>> From: Kevin Traynor <ktraynor@redhat.com>
>> Sent: 29 November 2022 06:02
>> To: dev@openvswitch.org
>> Cc: david.marchand@redhat.com; cfontain@redhat.com;
>> cian.ferriter@intel.com; Thilak Raj Surendra Babu
>> <thilakraj.sb@nutanix.com>; Kevin Traynor <ktraynor@redhat.com>
>> Subject: [PATCH] dpif-netdev: Load based PMD sleeping.
>>
>> Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on an polling iteration of the PMD.
>>
>> Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.
>>
>> Sleep time will be increased by 1 uS on each iteration where the low load conditions remain up to a total of 250 uS.
>>
>> The feature is off by default and can be enabled by:
>> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
>>
>> Also add new stats to pmd-perf-show to get visibility of operation:
>> Iterations:
>> ...
>>     - No-sleep thresh:           576  ( 98.1 % of busy it)
>>     - Max sleeps:              19424  (167 us/it avg.)
>> ...
>>
>> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
>> ---
>>    Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
>>    lib/dpif-netdev-perf.c            | 25 ++++++++++++--
>>    lib/dpif-netdev-perf.h            |  7 +++-
>>    lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
>>    vswitchd/vswitch.xml              | 17 ++++++++++
>>    5 files changed, 118 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst
>> b/Documentation/topics/dpdk/pmd.rst
>> index b259cc8b3..a42e34956 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -312,4 +312,24 @@ reassignment due to PMD Auto Load Balance. For example, this could be set  (in min) such that a reassignment is triggered at most every few hours.
>>    
>> +PMD Power Saving
>> +----------------
>> +
>> +PMD threads constantly poll Rx queues which are assigned to them. In
>> +order to reduce the CPU cycles they use, they can sleep for small
>> +periods of time when there is no load or very-low load from the Rx queues it polls.
>> +
>> +This can be enabled by::
>> +
>> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
>> Should we think about having this per PMD thread instead of applying it to all the PMD threads?
>> +
>> +With this enabled a PMD may sleep by an incrementing amount of time
>> +up to a total of 250 uS after polling all Rx queues assigned to it,
>> +if none of the Rx queues have at least half a batch of packets (i.e. 16).
>> +
>> +.. note::
>> +
>> +    Sleeping in a PMD thread may cause extra packet latency due to the sleep
>> +    and possible processor low-power states which may be invoked.
>> +
>>    .. _ovs-vswitchd(8):
>>        http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
>> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
>> index a2a7d8f0b..1bdeda265 100644
>> --- a/lib/dpif-netdev-perf.c
>> +++ b/lib/dpif-netdev-perf.c
>> @@ -231,4 +231,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>        uint64_t idle_iter = s->pkts.bin[0];
>>        uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter -
>> idle_iter : 0;
>> +    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
>> +    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
>> +    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
>>    
>>        ds_put_format(str,
>> @@ -236,5 +239,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>                "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>>                "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
>> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
>> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
>> +            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
>> +            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
>>                tot_iter, tot_cycles * us_per_cycle / tot_iter,
>>                tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz, @@ -242,5 +247,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>                100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>>                busy_iter,
>> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
>> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
>> +            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
>> +            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) /
>> + tot_iter : 0);
>>        if (rx_packets > 0) {
>>            ds_put_format(str,
>> @@ -519,5 +526,7 @@ OVS_REQUIRES(s->stats_mutex)  void  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>> -                       int tx_packets, bool full_metrics)
>> +                       int tx_packets, bool max_sleep_hit,
>> +                       bool no_sleep_hit, uint64_t sleep_time,
>> +                       bool full_metrics)
>>    {
>>        uint64_t now_tsc = cycles_counter_update(s); @@ -540,4 +549,14 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>>        histogram_add_sample(&s->pkts, rx_packets);
>>    
>> +    if (no_sleep_hit) {
>> +        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
>> +    }
>> +    if (max_sleep_hit) {
>> +        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
>> +    }
>> +    if (sleep_time) {
>> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
>> +    }
>> +
>>        if (!full_metrics) {
>>            return;
>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index
>> 9673dddd8..c3d6cd435 100644
>> --- a/lib/dpif-netdev-perf.h
>> +++ b/lib/dpif-netdev-perf.h
>> @@ -81,4 +81,7 @@ enum pmd_stat_type {
>>        PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>>        PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>> +    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
>> +    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
>> +    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
>>        PMD_N_STATS
>>    };
>> @@ -409,5 +412,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);  void  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>> -                       int tx_packets, bool full_metrics);
>> +                       int tx_packets, bool max_sleep_hit,
>> +                       bool no_sleep_hit, uint64_t sleep_time,
>> +                       bool full_metrics);
>>    
>>    /* Formatting the output of commands. */ diff --git
>> a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 2c08a71c8..91a323c74
>> 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {
>> #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>>    
>> +/* Max time in nanoseconds for a pmd thread to sleep based on load.
>> +*/ #define PMD_PWR_MAX_SLEEP 250000
>> +/* Number of pkts Rx on an interface that will stop pmd thread
>> +sleeping. */ #define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
>> +/* Time in nanosecond to increment a pmd thread sleep time. */
>> +#define PMD_PWR_INC 1000
>> Should we make them tunable/configurable (MAX_SLEEP and PWR_INC) instead of them being constants?
>> +
>>    struct dpcls {
>>        struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
>> @@ -278,4 +285,6 @@ struct dp_netdev {
>>        /* Enable collection of PMD performance metrics. */
>>        atomic_bool pmd_perf_metrics;
>> +    /* Enable PMD load based sleeping. */
>> +    atomic_bool pmd_powersave;
>>        /* Enable the SMC cache from ovsdb config */
>>        atomic_bool smc_enable_db;
>> @@ -4915,4 +4924,17 @@ dpif_netdev_set_config(struct dpif *dpif, const
>> struct smap *other_config)
>>    
>>        set_pmd_auto_lb(dp, autolb_state, log_autolb);
>> +
>> +    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
>> +    bool cur_powersave;
>> +    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
>> +    if (powersave != cur_powersave) {
>> +        atomic_store_relaxed(&dp->pmd_powersave, powersave);
>> +        if (powersave) {
>> +            VLOG_INFO("PMD powersave is enabled.");
>> +        } else {
>> +            VLOG_INFO("PMD powersave is disabled.");
>> +        }
>> +    }
>> +
>>        return 0;
>>    }
>> @@ -6873,7 +6895,9 @@ pmd_thread_main(void *f_)
>>        bool exiting;
>>        bool reload;
>> +    bool powersave;
>>        int poll_cnt;
>>        int i;
>>        int process_packets = 0;
>> +    uint64_t sleep_time = 0;
>>    
>>        poll_list = NULL;
>> @@ -6935,7 +6959,14 @@ reload:
>>        for (;;) {
>>            uint64_t rx_packets = 0, tx_packets = 0;
>> +        bool nosleep_hit = false;
>> +        bool max_sleep_hit = false;
>> +        uint64_t time_slept = 0;
>>    
>>            pmd_perf_start_iteration(s);
>> -
>> +        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
>> +        if (!powersave) {
>> +            /* Reset sleep_time as policy may have changed. */
>> +            sleep_time = 0;
>> +        }
>>            atomic_read_relaxed(&pmd->dp->smc_enable_db,
>> &pmd->ctx.smc_enable_db);
>>    
>> @@ -6957,4 +6988,8 @@ reload:
>>                                               poll_list[i].port_no);
>>                rx_packets += process_packets;
>> +            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
>> +                nosleep_hit = true;
>> +                sleep_time = 0;
>> +            }
>>            }
>>    
>> @@ -6964,5 +6999,19 @@ reload:
>>                 * There was no time updates on current iteration. */
>>                pmd_thread_ctx_time_update(pmd);
>> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
>> +                                                   sleep_time ? true : false);
>> +        }
>> +
>> +        if (powersave) {
>> +            if (sleep_time) {
>> +                xnanosleep(sleep_time);
>> +                time_slept = sleep_time;
>> +            }
>> +            if (sleep_time < PMD_PWR_MAX_SLEEP) {
>> +                /* Increase potential sleep time for next iteration. */
>> +                sleep_time += PMD_PWR_INC;
>> +            } else {
>> +                max_sleep_hit = true;
>> +            }
>> M understanding here is that the above code will lead to a interface
>> with less packets coming Cause latency for the packets arriving in the other interface handled by the same PMD.
>> Please correct me if I am wrong.
>>
>> Lets say we have 3 interfaces being serviced by this PMD.
>> first interface receives 8 pkts , second receives 5 pkts and third received 30 pkts.
>> After servicing the first interface, sleep time increases to 1 us and
>> Before servicing the second interface it will sleep 1 us and increment
>> the sleep time to 1us And before servicing third interface it will sleep 2 us(so even though the third interface receives > PMD_PWR_NO_SLEEP_THRESH packets on that interface will suffer a latency).
>> If the order of the interface was changed the latency suffered by the packets on these interfaces will change.
>>
>> I am thinking , if we hold an history of some x iterations and sleep based on this history, the order of the interfaces will not affect the latency suffered by the packets.
>> and possibly have even latencies across.
>>
>>            }
>>    
>> @@ -7004,5 +7053,6 @@ reload:
>>            }
>>    
>> -        pmd_perf_end_iteration(s, rx_packets, tx_packets,
>> +        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
>> +                               nosleep_hit, time_slept,
>>                                   pmd_perf_metrics_enabled(pmd));
>>        }
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
>> 928821a82..d51a486d8 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -789,4 +789,21 @@
>>            </p>
>>          </column>
>> +      <column name="other_config" key="pmd-powersave"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +         Enables PMD thread sleeping up to 250 uS per iteration based on
>> +         rx queue batch sizes.
>> +        </p>
>> +        <p>
>> +          The default value is <code>false</code>.
>> +        </p>
>> +        <p>
>> +          Set this value to <code>true</code> to enable this option.
>> +        </p>
>> +        <p>
>> +         Sleep time will be reset to 0 when any rx queue has a batch of
>> +         16 or more packets.
>> +        </p>
>> +      </column>
>>          <column name="other_config" key="userspace-tso-enable"
>>                  type='{"type": "boolean"}'>
>> --
>> 2.38.1
>>
>
Kevin Traynor Dec. 16, 2022, 6:09 p.m. UTC | #6
On 09/12/2022 15:13, David Marchand wrote:
> Hi Kevin,
> 
> Here are some comments on the stats side:
> 
> On Tue, Nov 29, 2022 at 3:01 PM Kevin Traynor <ktraynor@redhat.com> wrote:
>> @@ -6964,5 +6999,19 @@ reload:
>>                * There was no time updates on current iteration. */
>>               pmd_thread_ctx_time_update(pmd);
>> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
>> +                                                   sleep_time ? true : false);
>> +        }
>> +
>> +        if (powersave) {
>> +            if (sleep_time) {
>> +                xnanosleep(sleep_time);
>> +                time_slept = sleep_time;
> 
> - time_slept is the requested time, and not the time that ovs actually slept.
> PMD_PWR_SLEEP_TIME stat won't be accurate if xnanosleep() takes longer
> than expected.
> I would add a "timer" around xnanosleep() and store cycles internally
> (and then translate back to us in the stats output).
> 

Good idea, I added this in v2.
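
For illustration, a minimal sketch of that timing suggestion (sleep_cycles is an assumed variable; xnanosleep() and cycles_counter_update() are the existing helpers used elsewhere in this patch):

    if (sleep_time) {
        uint64_t sleep_start = cycles_counter_update(s);

        xnanosleep(sleep_time);

        /* Account the cycles actually spent sleeping rather than the
         * requested nanoseconds; these can be converted back to us with
         * the TSC frequency when the stats are printed. */
        sleep_cycles = cycles_counter_update(s) - sleep_start;
    }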

> 
> - Both pmd idling (PMD_CYCLES_ITER_IDLE) and pmd processing
> (PMD_CYCLES_ITER_BUSY) stats now include cycles spent sleeping.
> Those stats are used for animating pmd stats ("avg processing cycles
> per packet") and rxq stats ("overhead").
> I think those "sleeping" cycles should be counted as idle so that
> "busy" cycles do represent time spent processing the packets.
> 

As we chatted about, I removed it from both idle and busy stats. It 
didn't seem right to load all the sleeps into idle, as they may 
have taken place during a busy iteration.

So now the idle and busy stats keep the same meaning whether 
powersave is true or false. For the sleeps there is a new stat for total 
time slept and the average per iteration. So we can see idle iterations/cycles, 
busy iterations/cycles and sleep time.

I also removed max sleeps but kept "no-sleep thresh hit".

Let me know if you think it's enough visibility. Thanks for the suggestion 
and review.
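
A rough sketch of what that accounting could look like in pmd_perf_end_iteration() (a guess at the v2 shape only; PMD_SLEEP_CYCLES and sleep_cycles are assumed names, and the iteration start TSC is assumed to be available as in the existing code):

    uint64_t it_cycles = now_tsc - s->start_tsc;       /* Whole iteration. */
    uint64_t work_cycles = it_cycles - sleep_cycles;   /* Minus the sleep. */

    /* Keep the slept cycles out of the idle/busy buckets... */
    pmd_perf_update_counter(s, rx_packets ? PMD_CYCLES_ITER_BUSY
                                          : PMD_CYCLES_ITER_IDLE, work_cycles);
    /* ...and track them in their own counter for the sleep-time stat. */
    if (sleep_cycles) {
        pmd_perf_update_counter(s, PMD_SLEEP_CYCLES, sleep_cycles);
    }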

> 
> WDYT?
> 
>> +            }
>> +            if (sleep_time < PMD_PWR_MAX_SLEEP) {
>> +                /* Increase potential sleep time for next iteration. */
>> +                sleep_time += PMD_PWR_INC;
>> +            } else {
>> +                max_sleep_hit = true;
>> +            }
>>           }
>>
> 
>
diff mbox series

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..a42e34956 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -312,4 +312,24 @@  reassignment due to PMD Auto Load Balance. For example, this could be set
 (in min) such that a reassignment is triggered at most every few hours.
 
+PMD Power Saving
+----------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In order to
+reduce the CPU cycles they use, they can sleep for small periods of time
+when there is no load or very-low load from the Rx queues it polls.
+
+This can be enabled by::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
+
+With this enabled a PMD may sleep by an incrementing amount of time up to
+a total of 250 uS after polling all Rx queues assigned to it, if none of the
+Rx queues have at least half a batch of packets (i.e. 16).
+
+.. note::
+
+    Sleeping in a PMD thread may cause extra packet latency due to the sleep
+    and possible processor low-power states which may be invoked.
+
 .. _ovs-vswitchd(8):
     http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index a2a7d8f0b..1bdeda265 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -231,4 +231,7 @@  pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
     uint64_t idle_iter = s->pkts.bin[0];
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
+    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
+    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
 
     ds_put_format(str,
@@ -236,5 +239,7 @@  pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
             "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
+            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
+            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
             tot_iter, tot_cycles * us_per_cycle / tot_iter,
             tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
@@ -242,5 +247,7 @@  pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
             busy_iter,
-            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
+            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) / tot_iter : 0);
     if (rx_packets > 0) {
         ds_put_format(str,
@@ -519,5 +526,7 @@  OVS_REQUIRES(s->stats_mutex)
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, bool max_sleep_hit,
+                       bool no_sleep_hit, uint64_t sleep_time,
+                       bool full_metrics)
 {
     uint64_t now_tsc = cycles_counter_update(s);
@@ -540,4 +549,14 @@  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     histogram_add_sample(&s->pkts, rx_packets);
 
+    if (no_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
+    }
+    if (max_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
+    }
+    if (sleep_time) {
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
+    }
+
     if (!full_metrics) {
         return;
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 9673dddd8..c3d6cd435 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -81,4 +81,7 @@  enum pmd_stat_type {
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
     PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
+    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
+    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
     PMD_N_STATS
 };
@@ -409,5 +412,7 @@  pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, bool max_sleep_hit,
+                       bool no_sleep_hit, uint64_t sleep_time,
+                       bool full_metrics);
 
 /* Formatting the output of commands. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71c8..91a323c74 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -170,4 +170,11 @@  static struct odp_support dp_netdev_support = {
 #define PMD_RCU_QUIESCE_INTERVAL 10000LL
 
+/* Max time in nanoseconds for a pmd thread to sleep based on load. */
+#define PMD_PWR_MAX_SLEEP 250000
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
+/* Time in nanosecond to increment a pmd thread sleep time. */
+#define PMD_PWR_INC 1000
+
 struct dpcls {
     struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
@@ -278,4 +285,6 @@  struct dp_netdev {
     /* Enable collection of PMD performance metrics. */
     atomic_bool pmd_perf_metrics;
+    /* Enable PMD load based sleeping. */
+    atomic_bool pmd_powersave;
     /* Enable the SMC cache from ovsdb config */
     atomic_bool smc_enable_db;
@@ -4915,4 +4924,17 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
 
     set_pmd_auto_lb(dp, autolb_state, log_autolb);
+
+    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
+    bool cur_powersave;
+    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
+    if (powersave != cur_powersave) {
+        atomic_store_relaxed(&dp->pmd_powersave, powersave);
+        if (powersave) {
+            VLOG_INFO("PMD powersave is enabled.");
+        } else {
+            VLOG_INFO("PMD powersave is disabled.");
+        }
+    }
+
     return 0;
 }
@@ -6873,7 +6895,9 @@  pmd_thread_main(void *f_)
     bool exiting;
     bool reload;
+    bool powersave;
     int poll_cnt;
     int i;
     int process_packets = 0;
+    uint64_t sleep_time = 0;
 
     poll_list = NULL;
@@ -6935,7 +6959,14 @@  reload:
     for (;;) {
         uint64_t rx_packets = 0, tx_packets = 0;
+        bool nosleep_hit = false;
+        bool max_sleep_hit = false;
+        uint64_t time_slept = 0;
 
         pmd_perf_start_iteration(s);
-
+        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
+        if (!powersave) {
+            /* Reset sleep_time as policy may have changed. */
+            sleep_time = 0;
+        }
         atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
 
@@ -6957,4 +6988,8 @@  reload:
                                            poll_list[i].port_no);
             rx_packets += process_packets;
+            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
+                nosleep_hit = true;
+                sleep_time = 0;
+            }
         }
 
@@ -6964,5 +6999,19 @@  reload:
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
+                                                   sleep_time ? true : false);
+        }
+
+        if (powersave) {
+            if (sleep_time) {
+                xnanosleep(sleep_time);
+                time_slept = sleep_time;
+            }
+            if (sleep_time < PMD_PWR_MAX_SLEEP) {
+                /* Increase potential sleep time for next iteration. */
+                sleep_time += PMD_PWR_INC;
+            } else {
+                max_sleep_hit = true;
+            }
         }
 
@@ -7004,5 +7053,6 @@  reload:
         }
 
-        pmd_perf_end_iteration(s, rx_packets, tx_packets,
+        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
+                               nosleep_hit, time_slept,
                                pmd_perf_metrics_enabled(pmd));
     }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 928821a82..d51a486d8 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -789,4 +789,21 @@ 
         </p>
       </column>
+      <column name="other_config" key="pmd-powersave"
+              type='{"type": "boolean"}'>
+        <p>
+         Enables PMD thread sleeping up to 250 uS per iteration based on
+         rx queue batch sizes.
+        </p>
+        <p>
+          The default value is <code>false</code>.
+        </p>
+        <p>
+          Set this value to <code>true</code> to enable this option.
+        </p>
+        <p>
+         Sleep time will be reset to 0 when any rx queue has a batch of
+         16 or more packets.
+        </p>
+      </column>
       <column name="other_config" key="userspace-tso-enable"
               type='{"type": "boolean"}'>