
[ovs-dev,v10] netdev-dpdk: Increase pmd thread priority.

Message ID 1498428406-29712-1-git-send-email-bhanuprakash.bodireddy@intel.com
State Rejected
Delegated to: Darrell Ball

Commit Message

Bodireddy, Bhanuprakash June 25, 2017, 10:06 p.m. UTC
Increase the DPDK pmd thread scheduling priority by lowering the nice
value. This advises the kernel scheduler to prioritize the pmd threads
over other processes and helps the PMDs provide deterministic
performance in out-of-the-box deployments.

This patch sets the nice value of PMD threads to '-20'.

  $ ps -eLo comm,policy,psr,nice | grep pmd

   COMMAND  POLICY  PROCESSOR    NICE
    pmd62     TS        3        -20
    pmd63     TS        0        -20
    pmd64     TS        1        -20
    pmd65     TS        2        -20

Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Tested-by: Billy O'Mahony <billy.o.mahony@intel.com>
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
---
v9->v10
* Return error code if setpriority fails.

v8->v9:
* Rebase

v7->v8:
* Rebase
* Update the documentation file @Documentation/intro/install/dpdk-advanced.rst

v6->v7:
* Remove realtime scheduling policy logic.
* Increase pmd thread scheduling priority by lowering nice value to -20.
* Update doc accordingly.

v5->v6:
* Prohibit spawning pmd thread on the lowest core in dpdk-lcore-mask if
  lcore-mask and pmd-mask affinity are identical.
* Updated Note section in INSTALL.DPDK-ADVANCED doc.
* Tested below cases to verify system stability with pmd priority patch

v4->v5:
* Reword Note section in DPDK-ADVANCED.md

v3->v4:
* Document update
* Use ovs_strerror for reporting errors in lib-numa.c

v2->v3:
* Move set_priority() function to lib/ovs-numa.c
* Apply realtime scheduling policy and priority to pmd thread only if
  pmd-cpu-mask is passed.
* Update INSTALL.DPDK-ADVANCED.

v1->v2:
* Removed #ifdef and introduced dummy function "pmd_thread_setpriority"
  in netdev-dpdk.h
* Rebase

 Documentation/intro/install/dpdk.rst |  8 +++++++-
 lib/dpif-netdev.c                    |  4 ++++
 lib/ovs-numa.c                       | 22 ++++++++++++++++++++++
 lib/ovs-numa.h                       |  1 +
 4 files changed, 34 insertions(+), 1 deletion(-)
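
The mechanism is a single setpriority() call made from within each PMD thread. Below is a minimal standalone sketch of that pattern, assuming Linux's per-thread nice semantics (PRIO_PROCESS with who == 0 targets the calling thread) and CAP_SYS_NICE or root for negative nice values; it is an illustration, not OVS code:

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/resource.h>

    /* Standalone sketch (not OVS code): on Linux the nice value is a per-thread
     * attribute, and setpriority(PRIO_PROCESS, 0, ...) called from inside a
     * thread changes only that thread. */
    static void *
    worker(void *arg)
    {
        (void) arg;
        if (setpriority(PRIO_PROCESS, 0, -20) == -1) {
            fprintf(stderr, "setpriority: %s\n", strerror(errno));
        }
        errno = 0;                  /* getpriority() can legitimately return -1. */
        printf("worker nice: %d\n", getpriority(PRIO_PROCESS, 0));
        return NULL;
    }

    int
    main(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, worker, NULL);
        pthread_join(tid, NULL);

        errno = 0;
        printf("main nice:   %d\n", getpriority(PRIO_PROCESS, 0));   /* Unchanged. */
        return 0;
    }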

Comments

Darrell Ball June 26, 2017, 3:02 a.m. UTC | #1
With this change and CFS in effect, it effectively means that the dpdk control threads need to be on
different cores than the PMD threads or the response latency may be too long for their control work ?
Have we tested having the control threads on the same cpu with -20 nice for the pmd thread ?

I see the comment is added below
+    It is recommended that the OVS control thread and pmd thread shouldn't be
+    pinned to the same core i.e 'dpdk-lcore-mask' and 'pmd-cpu-mask' cpu mask
+    settings should be non-overlapping.


I understand that other heavy threads would be a problem for PMD threads and we want to effectively
encourage these to be on different cores in the situation where we are using a pmd-cpu-mask.
However, here we are almost shutting down other threads by default on the same core as PMD
threads using -20 nice, even those with little cpu load but just needing a reasonable latency.

Will this aggravate the argument from some quarters that using dpdk requires too much
cpu reservation ?
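
For a rough sense of scale: the kernel's default CFS weight table (sched_prio_to_weight in kernel/sched/core.c) maps nice 0 to 1024 and nice -20 to 88761, so a single nice-0 thread sharing a core with a -20 PMD thread is entitled to only about 1% of that core. The short sketch below, assuming an unmodified scheduler, just does that arithmetic:

    #include <stdio.h>

    /* Back-of-the-envelope CFS share for one nice -20 thread competing with one
     * nice 0 thread on the same core.  Weights are the kernel's default
     * sched_prio_to_weight values; treat them as an assumption about an
     * unmodified scheduler. */
    int
    main(void)
    {
        const double w_minus_20 = 88761.0;   /* nice -20 */
        const double w_zero = 1024.0;        /* nice 0 */
        const double total = w_minus_20 + w_zero;

        printf("pmd (-20) share:  %.1f%%\n", 100.0 * w_minus_20 / total);  /* ~98.9% */
        printf("other (0) share:  %.1f%%\n", 100.0 * w_zero / total);      /* ~1.1%  */
        return 0;
    }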




Bodireddy, Bhanuprakash June 26, 2017, 12:56 p.m. UTC | #2
>With this change and CFS in effect, it effectively means that the dpdk control
>threads need to be on different cores than the PMD threads or the response
>latency may be too long for their control work ?
>Have we tested having the control threads on the same cpu with -20 nice for
>the pmd thread ?

Yes, I did some testing and had a reason to add the comment that recommends dpdk-lcore-mask and pmd-cpu-mask should be non-overlapping.
The testing was done with a simple script that adds and deletes 750 vHost User ports (script copied below). Time statistics were captured for each case.

   dpdk-lcore-mask | PMD thread | PMD nice | Time statistics
   ----------------+------------+----------+------------------------------------------------------------
   unspecified     | Core 3     |      -20 | real 1m5.610s  / user 0m0.706s / sys 0m0.023s  [with patch]
   Core 3          | Core 3     |      -20 | real 2m14.089s / user 0m0.717s / sys 0m0.017s  [with patch]
   unspecified     | Core 3     |        0 | real 1m5.209s  / user 0m0.711s / sys 0m0.020s  [master]
   Core 3          | Core 3     |        0 | real 1m7.209s  / user 0m0.711s / sys 0m0.020s  [master]

In all cases where the dpdk-lcore-mask is unspecified, the main thread floats between the available cores (0-27 in my case).

With this patch (PMD nice value at -20) and with the main and pmd threads pinned to core 3, port addition and deletion took twice as long. However, the most important thing to notice is that with active traffic and port addition/deletion in progress, throughput drops instantly *without* the patch: the vswitchd thread consumes 7% of the CPU time at one stage, thereby impacting forwarding performance.

With the patch, throughput is still affected but degrades gradually; the vswitchd thread consumed no more than 2% of the CPU time, which is why port addition/deletion took longer.

>I see the comment is added below
>+    It is recommended that the OVS control thread and pmd thread shouldn't be
>+    pinned to the same core i.e 'dpdk-lcore-mask' and 'pmd-cpu-mask' cpu mask
>+    settings should be non-overlapping.
>
>I understand that other heavy threads would be a problem for PMD threads
>and we want to effectively encourage these to be on different cores in the
>situation where we are using a pmd-cpu-mask.
>However, here we are almost shutting down other threads by default on the
>same core as PMD threads using -20 nice, even those with little cpu load but
>just needing a reasonable latency.


I had logic in the early versions of this patch that completely shut down other threads by assigning realtime priority to the PMD thread. But that seemed too dangerous, and changing the nice value is the safer bet. I agree that latency can go up for non-PMD threads with this patch, but it's the same situation that already exists: other kernel threads run at a -20 nice value, and some with 'rt' priority.
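
For reference, the realtime-policy logic dropped in v7 would have looked roughly like the sketch below; this is only an illustration of the general SCHED_FIFO pattern (the priority value 50 is arbitrary), not the earlier revision's actual code:

    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch only, not the earlier revision's code.  A busy-polling
     * SCHED_FIFO thread can starve every CFS task on its core (subject only to
     * the kernel's RT throttling limits), which is why the realtime approach was
     * dropped in favour of a lower nice value. */
    int
    main(void)
    {
        struct sched_param param = { .sched_priority = 50 };  /* 1..99 for SCHED_FIFO. */

        /* pid 0 targets the calling thread on Linux; needs CAP_SYS_NICE or root. */
        if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
            int err = errno;
            fprintf(stderr, "sched_setscheduler: %s\n", strerror(err));
            return err;
        }
        return 0;
    }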

>Will this aggravate the argument from some quarters that using dpdk requires
>too much cpu reservation ?

At least for the PMD threads, which are the heart of packet processing in OVS-DPDK.


More information on the commands used:

Script used to test port addition and deletion:

$ cat port_test.sh
   cmds=; for i in {1..750}; do cmds+=" -- add-port br0 dpdkvhostuser$i -- set Interface dpdkvhostuser$i type=dpdkvhostuser"; done
   ovs-vsctl $cmds

   sleep 1;

   cmds=; for i in {1..750}; do cmds+=" -- del-port br0 dpdkvhostuser$i"; done
   ovs-vsctl $cmds

$ time ./port_test.sh

dpdk-lcore-mask and pmd-cpu-mask explicitly set to CORE 3.
-------------------------------------------------------------------------------------------
$ ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=8
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=8
$ ps -eLo tid,psr,comm | grep -e revalidator -e handler -e ovs -e pmd -e urc -e eal
   110881  20 ovsdb-server
   110892   3 ovs-vswitchd
   110976   3 pmd61
   110898   3 eal-intr-thread
   110903   3 urcu3
   110947   3 handler60

dpdk-lcore-mask unspecified, pmd-cpu-mask explicitly set to CORE 3.
---------------------------------------------------------------------------------------------
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=8
$  ps -eLo tid,psr,comm | grep -e revalidator -e handler -e ovs -e pmd -e urc -e eal
    111474  14 ovsdb-server
    111483   6 ovs-vswitchd
    111566   3 pmd61
    111564  10 revalidator60
    111489   0 eal-intr-thread
    111493   8 urcu3

Regards,
Bhanuprakash.
Darrell Ball July 27, 2017, 4 a.m. UTC | #3
-----Original Message-----
From: "Bodireddy, Bhanuprakash" <bhanuprakash.bodireddy@intel.com>

Date: Monday, June 26, 2017 at 5:56 AM
To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org" <dev@openvswitch.org>
Subject: RE: [ovs-dev] [PATCH v10] netdev-dpdk: Increase pmd thread priority.

    >With this change and CFS in effect, it effectively means that the dpdk control
    >threads need to be on different cores than the PMD threads or the response
    >latency may be too long for their control work ?
    >Have we tested having the control threads on the same cpu with -20 nice for
    >the pmd thread ?
    
    Yes, I did some testing and had a reason to add the comment that recommends dpdk-lcore-mask and pmd-cpu-mask should be non-overlapping.
    The testing was done with a simple script that adds and deletes 750 vHost User ports(script copied below). The time statistics are captured in this case.
    
       dpdk-lcore-mask | PMD thread | PMD nice | Time statistics
       ----------------+------------+----------+------------------------------------------------------------
       unspecified     | Core 3     |      -20 | real 1m5.610s  / user 0m0.706s / sys 0m0.023s  [with patch]
       Core 3          | Core 3     |      -20 | real 2m14.089s / user 0m0.717s / sys 0m0.017s  [with patch]
       unspecified     | Core 3     |        0 | real 1m5.209s  / user 0m0.711s / sys 0m0.020s  [master]
       Core 3          | Core 3     |        0 | real 1m7.209s  / user 0m0.711s / sys 0m0.020s  [master]
    

[Darrell]
So if the lcore mask is either unspecified or specified to be non-conflicting, then the advantage is basically nil.
We should usually be able to do this, and when we cannot, I am not sure favoring throughput over management tasks such as port add is good,
as the potential relative impact of the management task is high while its share of total CPU time is low.
///////////



Patch

diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
index e83f852..b5c26ba 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -453,7 +453,8 @@  affinitized accordingly.
   to be affinitized to isolated cores for optimum performance.
 
   By setting a bit in the mask, a pmd thread is created and pinned to the
-  corresponding CPU core. e.g. to run a pmd thread on core 2::
+  corresponding CPU core with nice value set to -20.
+  e.g. to run a pmd thread on core 2::
 
       $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
 
@@ -493,6 +494,11 @@  improvements as there will be more total CPU occupancy available::
 
     NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1
 
+  .. note::
+    It is recommended that the OVS control thread and pmd thread shouldn't be
+    pinned to the same core i.e 'dpdk-lcore-mask' and 'pmd-cpu-mask' cpu mask
+    settings should be non-overlapping.
+
 DPDK Physical Port Rx Queues
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 4e29085..e952cf9 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -3712,6 +3712,10 @@  pmd_thread_main(void *f_)
     ovs_numa_thread_setaffinity_core(pmd->core_id);
     dpdk_set_lcore_id(pmd->core_id);
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
+
+    /* Set pmd thread's nice value to -20 */
+#define MIN_NICE -20
+    ovs_numa_thread_setpriority(MIN_NICE);
 reload:
     emc_cache_init(&pmd->flow_cache);
 
diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
index 98e97cb..9cf6bd4 100644
--- a/lib/ovs-numa.c
+++ b/lib/ovs-numa.c
@@ -23,6 +23,7 @@ 
 #include <dirent.h>
 #include <stddef.h>
 #include <string.h>
+#include <sys/resource.h>
 #include <sys/types.h>
 #include <unistd.h>
 #endif /* __linux__ */
@@ -570,3 +571,24 @@  int ovs_numa_thread_setaffinity_core(unsigned core_id OVS_UNUSED)
     return EOPNOTSUPP;
 #endif /* __linux__ */
 }
+
+int
+ovs_numa_thread_setpriority(int nice OVS_UNUSED)
+{
+    if (dummy_numa) {
+        return 0;
+    }
+
+#ifndef _WIN32
+    int err;
+    err = setpriority(PRIO_PROCESS, 0, nice);
+    if (err) {
+        VLOG_ERR("Thread priority error %s", ovs_strerror(err));
+        return err;
+    }
+
+    return 0;
+#else
+    return EOPNOTSUPP;
+#endif
+}
diff --git a/lib/ovs-numa.h b/lib/ovs-numa.h
index 6946cdc..e132483 100644
--- a/lib/ovs-numa.h
+++ b/lib/ovs-numa.h
@@ -62,6 +62,7 @@  bool ovs_numa_dump_contains_core(const struct ovs_numa_dump *,
 size_t ovs_numa_dump_count(const struct ovs_numa_dump *);
 void ovs_numa_dump_destroy(struct ovs_numa_dump *);
 int ovs_numa_thread_setaffinity_core(unsigned core_id);
+int ovs_numa_thread_setpriority(int nice);
 
 #define FOR_EACH_CORE_ON_DUMP(ITER, DUMP)                    \
     HMAP_FOR_EACH((ITER), hmap_node, &(DUMP)->cores)
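
A side note on the ovs_numa_thread_setpriority() hunk above: setpriority() reports failure by returning -1 and setting errno rather than returning an error code, so passing the raw return value to ovs_strerror() logs the wrong error string and returns -1 to the caller. A variant that reports errno instead might look like the standalone sketch below (a sketch, not the submitted patch):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/resource.h>

    /* Standalone sketch, not the submitted patch: setpriority() signals failure
     * by returning -1 and setting errno, so errno is what should be logged and
     * returned. */
    static int
    thread_setpriority(int nice_val)
    {
        if (setpriority(PRIO_PROCESS, 0, nice_val) == -1) {
            int err = errno;
            fprintf(stderr, "setpriority(%d) failed: %s\n", nice_val, strerror(err));
            return err;
        }
        return 0;
    }

    int
    main(void)
    {
        return thread_setpriority(-20);   /* Needs CAP_SYS_NICE for negative nice. */
    }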