Message ID: cover.1618236421.git.grive@u256.net
Series:     dpif-netdev: Parallel offload processing
Hi Gaetan,

On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> This patch series aims to improve the performance of the management
> of hw-offloads in dpif-netdev. In the current version, some setups
> will experience high memory usage and poor latency between a flow
> decision and its execution regarding hardware offloading.
>
> This series starts by measuring key metrics regarding both issues.
> Those patches are introduced first, to compare the current status
> with each improvement introduced.
> Offloads enqueued and inserted, as well as the latency
> from queue insertion to hardware insertion, are measured. A new
> command 'ovs-appctl dpctl/offload-stats-show' is introduced
> to show the current measurements.
>
> In my current performance test setup I am measuring an
> average latency hovering between 1~2 seconds.
> After the optimizations, it is reduced to 500~900 ms.
> Finally, when using multiple threads and with proper driver
> support[1], it is measured in the order of 1 ms.
>
> A few modules are introduced:
>
>  * An ID pool with reduced capabilities, simplifying its
>    operations and allowing better performance in both
>    single- and multi-thread setups.
>
>  * A lockless queue between PMDs / revalidators and
>    offload thread(s). As the number of PMDs increases,
>    contention can be high on the shared queue.
>    This queue is designed to serve as a message queue
>    between threads.
>
>  * A bounded lockless MPMC ring and some helpers for
>    calculating moving averages.
>
>  * A moving average module for Cumulative and Exponential
>    moving averages.
>
> The netdev-offload-dpdk module is made thread-safe.
> Internal maps are made per-netdev instead, and locks are
> taken for shorter critical sections within the module.
>
> CI result: https://github.com/grivet/ovs/actions/runs/554918929
>
> [1]: The rte_flow API was made thread-safe in the 20.11 DPDK
>      release. Drivers that do not implement those operations
>      concurrently are protected by a lock. Others will
>      allow better concurrency, which improves the results
>      of this series.
>
> v2:
>
>  * Improved the MPSC queue API to simplify usage.
>
>  * Moved the flush operation from the initiator thread to the
>    offload thread(s). This ensures offload metadata are shared
>    only among the offload thread pool.
>
>  * The flush operation needs additional thread synchronization.
>    The ovs_barrier currently triggers a UAF. Add a unit test to
>    validate its operation, and a fix for the UAF.
>
> CI result: https://github.com/grivet/ovs/actions/runs/741430135
> The error comes from a failure to download 'automake' on
> osx, unrelated to any change in this series.
>
> Gaetan Rivet (28):
>   netdev: Add flow API uninit function
>   netdev-offload-dpdk: Use per-netdev offload metadata
>   netdev-offload-dpdk: Implement hw-offload statistics read
>   dpctl: Add function to read hardware offload statistics
>   dpif-netdev: Rename flow offload thread
>   dpif-netdev: Rename offload thread structure
>   mov-avg: Add a moving average helper structure
>   dpif-netdev: Implement hardware offloads stats query
>   ovs-atomic: Expose atomic exchange operation
>   mpsc-queue: Module for lock-free message passing
>   llring: Add lockless MPMC bounded queue structure
>   seq-pool: Module for faster ID generation
>   netdev-offload: Add multi-thread API
>   dpif-netdev: Quiesce offload thread periodically
>   dpif-netdev: Postpone flow offload item freeing
>   dpif-netdev: Use seq-pool for mark allocation
>   dpif-netdev: Introduce tagged union of offload requests
>   tests: Add ovs-barrier unit test
>   ovs-thread: Fix barrier use-after-free
>   dpif-netdev: Execute flush from offload thread
>   netdev-offload-dpdk: Use per-thread HW offload stats
>   netdev-offload-dpdk: Lock rte_flow map access
>   netdev-offload-dpdk: Protect concurrent offload destroy/query
>   dpif-netdev: Use lockless queue to manage offloads
>   dpif-netdev: Make megaflow and mark mappings thread objects
>   dpif-netdev: Replace port mutex by rwlock
>   dpif-netdev: Use one or more offload threads
>   netdev-dpdk: Remove rte-flow API access locks
>
>  lib/automake.mk               |   7 +
>  lib/dpctl.c                   |  36 ++
>  lib/dpif-netdev.c             | 742 +++++++++++++++++++++++++---------
>  lib/dpif-netlink.c            |   1 +
>  lib/dpif-provider.h           |   7 +
>  lib/dpif.c                    |   8 +
>  lib/dpif.h                    |   9 +
>  lib/llring.c                  | 153 +++++++
>  lib/llring.h                  |  76 ++++
>  lib/mov-avg.h                 | 171 ++++++++
>  lib/mpsc-queue.c              | 251 ++++++++++++
>  lib/mpsc-queue.h              | 189 +++++++++
>  lib/netdev-dpdk.c             |   6 -
>  lib/netdev-offload-dpdk.c     | 277 +++++++++++--
>  lib/netdev-offload-provider.h |   4 +
>  lib/netdev-offload.c          |  92 ++++-
>  lib/netdev-offload.h          |  21 +
>  lib/ovs-atomic-c++.h          |   3 +
>  lib/ovs-atomic-clang.h        |   5 +
>  lib/ovs-atomic-gcc4+.h        |   5 +
>  lib/ovs-atomic-gcc4.7+.h      |   5 +
>  lib/ovs-atomic-i586.h         |   5 +
>  lib/ovs-atomic-locked.h       |   9 +
>  lib/ovs-atomic-msvc.h         |  22 +
>  lib/ovs-atomic-pthreads.h     |   5 +
>  lib/ovs-atomic-x86_64.h       |   5 +
>  lib/ovs-atomic.h              |   8 +-
>  lib/ovs-thread.c              |  61 ++-
>  lib/ovs-thread.h              |   6 +-
>  lib/seq-pool.c                | 198 +++++++++
>  lib/seq-pool.h                |  66 +++
>  tests/automake.mk             |   3 +
>  tests/library.at              |  15 +
>  tests/test-barrier.c          | 264 ++++++++++++
>  tests/test-mpsc-queue.c       | 727 +++++++++++++++++++++++++++++++++
>  tests/test-seq-pool.c         | 543 +++++++++++++++++++++++++
>  vswitchd/vswitch.xml          |  16 +
>  37 files changed, 3788 insertions(+), 233 deletions(-)
>  create mode 100644 lib/llring.c
>  create mode 100644 lib/llring.h
>  create mode 100644 lib/mov-avg.h
>  create mode 100644 lib/mpsc-queue.c
>  create mode 100644 lib/mpsc-queue.h
>  create mode 100644 lib/seq-pool.c
>  create mode 100644 lib/seq-pool.h
>  create mode 100644 tests/test-barrier.c
>  create mode 100644 tests/test-mpsc-queue.c
>  create mode 100644 tests/test-seq-pool.c
>
> --
> 2.31.1

The series looks almost good to me.
Please note that I haven't had a setup to test it this week, I'll try to
run some tests with it next week.

Thanks,
Maxime
On Fri, Apr 23, 2021, at 15:21, Maxime Coquelin wrote:
> Hi Gaetan,
>
> On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> > This patch series aims to improve the performance of the management
> > of hw-offloads in dpif-netdev.
> [...]
>
> The series looks almost good to me.
> Please note that I haven't had a setup to test it this week, I'll try to
> run some tests with it next week.
>
> Thanks,
> Maxime

Thank you Maxime,

I've made the modifications to the RCU quiescing, and moved the two fixes
from this series (flow offload thread renaming, ovs-barrier UAF) first, to
ensure backporting would be easy if needed.

In tests, it seems using ovsrcu_quiesce() actually improved latency, so I
decided to send the v3 so you would see the correct numbers in your tests.

v3 is here:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=240678

Cheers,