
[ovs-dev,v5,00/27] dpif-netdev: Parallel offload processing

Message ID cover.1631094144.git.grive@u256.net
Series dpif-netdev: Parallel offload processing

Message

Gaetan Rivet Sept. 8, 2021, 9:47 a.m. UTC
This patch series aims to improve the performance of hardware offload
management in dpif-netdev. In the current version, some setups
experience high memory usage and poor latency between a flow offload
decision and its execution in hardware.

This series starts by measuring key metrics regarding both issues.
These patches are introduced first to compare the current status
with each improvement introduced.
Both the number of offloads enqueued and inserted and the latency
from queue insertion to hardware insertion are measured. A new
command 'ovs-appctl dpctl/offload-stats-show' is introduced
to show the current measurements.

In my current performance test setup I measure an
average latency hovering between 1 and 2 seconds.
After the optimizations, it is reduced to 500~900 ms.
Finally, when using multiple threads and with proper driver
support[1], it is measured in the order of 1 ms.

A few modules are introduced:

  * An ID pool with reduced capabilities, simplifying its
    operations and allowing better performance in both
    single- and multi-threaded setups.

  * A lockless queue between the PMDs / revalidators and
    the offload thread(s). As the number of PMDs increases,
    contention on the shared queue can be high.
    This queue is designed to serve as a message queue
    between threads.

  * A bounded lockless MPMC ring and some helpers for
    calculating moving averages.

  * A moving average module for Cumulative and Exponential
    moving averages.
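For reference, the two kinds of averages can be sketched as follows (a minimal illustration; the function names are mine, not the mov-avg module's actual API):

```c
/* Cumulative moving average: 'cma' is the average of the 'n' samples
 * seen so far; returns the average including one more sample 'x'. */
static double
cma_update(double cma, double x, unsigned long long n)
{
    return cma + (x - cma) / (double) (n + 1);
}

/* Exponential moving average with smoothing factor 0 < alpha <= 1:
 * recent samples weigh more as alpha grows. */
static double
ema_update(double ema, double x, double alpha)
{
    return alpha * x + (1.0 - alpha) * ema;
}
```

Feeding per-offload latencies as samples gives a long-run cumulative average alongside an exponential average that tracks recent behavior.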

The netdev-offload-dpdk module is made thread-safe.
Internal maps are made per-netdev instead of global, and locks are
taken for shorter critical sections within the module.
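The per-netdev locking pattern can be sketched roughly like this (a toy illustration with invented names, not the module's actual data structures): each device owns its own table and lock, so offload work on different devices does not contend, and the lock covers only the map update, not the hardware call:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-device offload map: every device carries its own table and
 * lock, so threads working on different devices never contend. */
#define TOY_MAP_SIZE 64

struct toy_dev_offloads {
    pthread_mutex_t map_lock;
    struct { uint32_t mark; void *flow; } entries[TOY_MAP_SIZE];
    size_t n;
};

static void
toy_dev_offloads_init(struct toy_dev_offloads *dev)
{
    pthread_mutex_init(&dev->map_lock, NULL);
    dev->n = 0;
}

/* The critical section covers only the map insertion; slow hardware
 * programming would happen before the lock is taken. */
static int
toy_offload_insert(struct toy_dev_offloads *dev, uint32_t mark, void *flow)
{
    int ret = -1;

    pthread_mutex_lock(&dev->map_lock);
    if (dev->n < TOY_MAP_SIZE) {
        dev->entries[dev->n].mark = mark;
        dev->entries[dev->n].flow = flow;
        dev->n++;
        ret = 0;
    }
    pthread_mutex_unlock(&dev->map_lock);
    return ret;
}

static void *
toy_offload_find(struct toy_dev_offloads *dev, uint32_t mark)
{
    void *flow = NULL;

    pthread_mutex_lock(&dev->map_lock);
    for (size_t i = 0; i < dev->n; i++) {
        if (dev->entries[i].mark == mark) {
            flow = dev->entries[i].flow;
            break;
        }
    }
    pthread_mutex_unlock(&dev->map_lock);
    return flow;
}
```

A global map with one lock would serialize every offload thread; moving both the map and the lock into the device keeps the contention domain as small as possible.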

CI result: https://github.com/grivet/ovs/actions/runs/554918929

[1]: The rte_flow API was made thread-safe in the 20.11 DPDK
     release. Drivers that do not implement those operations
     concurrently are protected by a lock. Others will
     allow better concurrency, which improves the results
     of this series.

v2:

  * Improved the MPSC queue API to simplify usage.

  * Moved flush operation from initiator thread to offload
    thread(s). This ensures offload metadata are shared only
    among the offload thread pool.

  * The flush operation needs additional thread synchronization.
    The ovs_barrier currently triggers a use-after-free (UAF). Add a
    unit test to validate its operation, and a fix for the UAF.
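For context on the hazard class involved (a generic textbook sketch, not OVS's actual ovs_barrier code or fix): a naive barrier lets the last arriving thread return, and possibly free or reuse the barrier, while other threads are still reading its fields. A generation counter, with all state accessed under the mutex, is one classic way to make a barrier safely reusable across generations:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Generation-counted barrier (illustrative only). */
struct toy_barrier {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    unsigned int size;     /* Number of participating threads. */
    unsigned int count;    /* Threads arrived in the current generation. */
    uint64_t generation;
};

static void
toy_barrier_init(struct toy_barrier *b, unsigned int size)
{
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->size = size;
    b->count = 0;
    b->generation = 0;
}

static void
toy_barrier_block(struct toy_barrier *b)
{
    pthread_mutex_lock(&b->mutex);
    uint64_t gen = b->generation;
    if (++b->count == b->size) {
        b->count = 0;
        b->generation++;           /* Release the whole generation. */
        pthread_cond_broadcast(&b->cond);
    } else {
        /* Waiting on the generation, not on 'count', means a fast
         * thread re-entering the next generation cannot confuse us. */
        while (b->generation == gen) {
            pthread_cond_wait(&b->cond, &b->mutex);
        }
    }
    pthread_mutex_unlock(&b->mutex);
}

/* Demo: DEMO_THREADS threads each pass the barrier DEMO_ITERS times. */
#define DEMO_THREADS 4
#define DEMO_ITERS 100

static struct toy_barrier demo_barrier;
static _Atomic unsigned int demo_counter;

static void *
demo_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < DEMO_ITERS; i++) {
        atomic_fetch_add(&demo_counter, 1);
        toy_barrier_block(&demo_barrier);
    }
    return NULL;
}

static unsigned int
run_barrier_demo(void)
{
    pthread_t threads[DEMO_THREADS];

    toy_barrier_init(&demo_barrier, DEMO_THREADS);
    atomic_store(&demo_counter, 0);
    for (int i = 0; i < DEMO_THREADS; i++) {
        pthread_create(&threads[i], NULL, demo_worker, NULL);
    }
    for (int i = 0; i < DEMO_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&demo_counter);
}
```

Even with such a design, the barrier's memory must not be freed until every participant has returned from its last block, which is the lifetime question behind the UAF being fixed here.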

CI result: https://github.com/grivet/ovs/actions/runs/741430135
           The error comes from a failure to download 'automake' on
           osx, unrelated to any change in this series.

v3:

  * Re-ordered commits so that fixes come first. No conflicts are seen
    currently, but this might prevent them if requested changes to the
    series were to move code in the same areas.

  * Modified the periodic quiescing of the offload thread to use
    ovsrcu_quiesce(), and based next_rcu on the current time value
    (after quiescing has happened, however long it takes).

  * Added Reviewed-by tags to the relevant commits.

CI result: https://github.com/grivet/ovs/actions/runs/782655601

v4:

  * Modified the seq-pool to use batches of IDs with a spinlock
    instead of lockless rings.

  * The llring structure is removed.

  * Due to the extent of the changes to the structure, some
    Acked-by or Reviewed-by tags were not ported to the id-fpool patch.

CI result: https://github.com/grivet/ovs/actions/runs/921095015

v5:

  * Rebase on master.
    Conflicts were seen related to the vxlan-decap and pmd rebalance
    series.

  * Fix typo in xchg patch spotted by Maxime Coquelin.

  * Added Reviewed-by tags from Maxime Coquelin on 4 patches.

CI result: https://github.com/grivet/ovs/actions/runs/1212804378

Gaetan Rivet (27):
  ovs-thread: Fix barrier use-after-free
  dpif-netdev: Rename flow offload thread
  tests: Add ovs-barrier unit test
  netdev: Add flow API uninit function
  netdev-offload-dpdk: Use per-netdev offload metadata
  netdev-offload-dpdk: Implement hw-offload statistics read
  dpctl: Add function to read hardware offload statistics
  dpif-netdev: Rename offload thread structure
  mov-avg: Add a moving average helper structure
  dpif-netdev: Implement hardware offloads stats query
  ovs-atomic: Expose atomic exchange operation
  mpsc-queue: Module for lock-free message passing
  id-fpool: Module for fast ID generation
  netdev-offload: Add multi-thread API
  dpif-netdev: Quiesce offload thread periodically
  dpif-netdev: Postpone flow offload item freeing
  dpif-netdev: Use id-fpool for mark allocation
  dpif-netdev: Introduce tagged union of offload requests
  dpif-netdev: Execute flush from offload thread
  netdev-offload-dpdk: Use per-thread HW offload stats
  netdev-offload-dpdk: Lock rte_flow map access
  netdev-offload-dpdk: Protect concurrent offload destroy/query
  dpif-netdev: Use lockless queue to manage offloads
  dpif-netdev: Make megaflow and mark mappings thread objects
  dpif-netdev: Replace port mutex by rwlock
  dpif-netdev: Use one or more offload threads
  netdev-dpdk: Remove rte-flow API access locks

 lib/automake.mk               |   5 +
 lib/dpctl.c                   |  36 ++
 lib/dpif-netdev.c             | 734 ++++++++++++++++++++++++--------
 lib/dpif-netlink.c            |   1 +
 lib/dpif-provider.h           |   7 +
 lib/dpif.c                    |   8 +
 lib/dpif.h                    |   9 +
 lib/id-fpool.c                | 279 ++++++++++++
 lib/id-fpool.h                |  66 +++
 lib/mov-avg.h                 | 171 ++++++++
 lib/mpsc-queue.c              | 251 +++++++++++
 lib/mpsc-queue.h              | 190 +++++++++
 lib/netdev-dpdk.c             |   6 -
 lib/netdev-offload-dpdk.c     | 274 ++++++++++--
 lib/netdev-offload-provider.h |   4 +
 lib/netdev-offload.c          |  92 +++-
 lib/netdev-offload.h          |  21 +
 lib/ovs-atomic-c++.h          |   3 +
 lib/ovs-atomic-clang.h        |   5 +
 lib/ovs-atomic-gcc4+.h        |   5 +
 lib/ovs-atomic-gcc4.7+.h      |   5 +
 lib/ovs-atomic-i586.h         |   5 +
 lib/ovs-atomic-locked.h       |   9 +
 lib/ovs-atomic-msvc.h         |  22 +
 lib/ovs-atomic-pthreads.h     |   5 +
 lib/ovs-atomic-x86_64.h       |   5 +
 lib/ovs-atomic.h              |   8 +-
 lib/ovs-thread.c              |  61 ++-
 lib/ovs-thread.h              |   6 +-
 tests/automake.mk             |   3 +
 tests/library.at              |  14 +
 tests/test-barrier.c          | 264 ++++++++++++
 tests/test-id-fpool.c         | 615 +++++++++++++++++++++++++++
 tests/test-mpsc-queue.c       | 772 ++++++++++++++++++++++++++++++++++
 vswitchd/vswitch.xml          |  16 +
 35 files changed, 3749 insertions(+), 228 deletions(-)
 create mode 100644 lib/id-fpool.c
 create mode 100644 lib/id-fpool.h
 create mode 100644 lib/mov-avg.h
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-barrier.c
 create mode 100644 tests/test-id-fpool.c
 create mode 100644 tests/test-mpsc-queue.c

--
2.31.1

Comments

Maxime Coquelin Sept. 15, 2021, 1:45 p.m. UTC | #1
Hi,

On 9/8/21 11:47 AM, Gaetan Rivet wrote:
> [...]
> v4:
> 
>   * Modified the seq-pool to use batches of IDs with a spinlock
>     instead of lockless rings.
> 
>   * The llring structure is removed.
> 
>   * Due to the length of the changes to the structure, some
>     acked-by or reviewed-by were not ported to the id-fpool patch.
> 
> CI result: https://github.com/grivet/ovs/actions/runs/921095015
> 
> v5:
> 
>   * Rebase on master.
>     Conflicts were seen related to the vxlan-decap and pmd rebalance
>     series.
> 
>   * Fix typo in xchg patch spotted by Maxime Coquelin.
> 
>   * Added Reviewed-by Maxime Coquelin on 4 patches.


I went through the changes between v4 and v5, and would like to confirm
these are minor and OK with me.

Regards,
Maxime

> [...]
Ilya Maximets Jan. 19, 2022, 1:54 a.m. UTC | #2
On 9/15/21 15:45, Maxime Coquelin wrote:
> Hi,
> 
> On 9/8/21 11:47 AM, Gaetan Rivet wrote:
>> This patch series aims to improve the performance of the management
>> of hw-offloads in dpif-netdev. In the current version, some setup
>> will experience high memory usage and poor latency between a flow
>> decision and its execution regarding hardware offloading.
>>
>> This series starts by measuring key metrics regarding both issues
>> Those patches are introduced first to compare the current status
>> with each improvements introduced.
>> Offloads enqueued and inserted, as well as the latency
>> from queue insertion to hardware insertion is measured. A new
>> command 'ovs-appctl dpctl/offload-stats-show' is introduced
>> to show the current measure.
>>
>> In my current performance test setup I am measuring an
>> average latency hovering between 1~2 seconds.
>> After the optimizations, it is reduced to 500~900 ms.
>> Finally when using multiple threads and with proper driver
>> support[1], it is measured in the order of 1 ms.
>>
>> A few modules are introduced:
>>
>>   * An ID pool with reduced capabilities, simplifying its
>>     operations and allowing better performances in both
>>     single and multi-thread setup.
>>
>>   * A lockless queue between PMDs / revalidators and
>>     offload thread(s). As the number of PMDs increases,
>>     contention can be high on the shared queue.
>>     This queue is designed to serve as message queue
>>     between threads.
>>
>>   * A bounded lockless MPMC ring and some helpers for
>>     calculating moving averages.
>>
>>   * A moving average module for Cumulative and Exponential
>>     moving averages.
>>
>> The netdev-offload-dpdk module is made thread-safe.
>> Internal maps are made per-netdev instead, and locks are
>> taken for shorter critical sections within the module.
>>
>> CI result: https://github.com/grivet/ovs/actions/runs/554918929
>>
>> [1]: The rte_flow API was made thread-safe in the 20.11 DPDK
>>      release. Drivers that do not implement those operations
>>      concurrently are protected by a lock. Others will
>>      allow better concurrency, that improve the result
>>      of this series.
>>
>> v2:
>>
>>   * Improved the MPSC queue API to simplify usage.
>>
>>   * Moved flush operation from initiator thread to offload
>>     thread(s). This ensures offload metadata are shared only
>>     among the offload thread pool.
>>
>>   * Flush operation needs additional thread synchronization.
>>     The ovs_barrier currently triggers a UAF. Add a unit-test to
>>     validate its operations and a fix for the UAF.
>>
>> CI result: https://github.com/grivet/ovs/actions/runs/741430135
>>            The error comes from a failure to download 'automake' on
>>            osx, unrelated to any change in this series.
>>
>> v3:
>>
>>   * Re-ordered commits so fixes are first. No conflict seen currently,
>>     but it might prevent them if some requested changes to the series
>>     were to move code in the same parts.
>>
>>   * Modified the reduced quiescing the thread to use ovsrcu_quiesce(),
>>     and base next_rcu on the current time value (after quiescing happened,
>>     however long it takes).
>>
>>   * Added Reviewed-by tags to the relevant commits.
>>
>> CI result: https://github.com/grivet/ovs/actions/runs/782655601
>>
>> v4:
>>
>>   * Modified the seq-pool to use batches of IDs with a spinlock
>>     instead of lockless rings.
>>
>>   * The llring structure is removed.
>>
>>   * Due to the length of the changes to the structure, some
>>     acked-by or reviewed-by were not ported to the id-fpool patch.
>>
>> CI result: https://github.com/grivet/ovs/actions/runs/921095015
>>
>> v5:
>>
>>   * Rebase on master.
>>     Conflicts were seen related to the vxlan-decap and pmd rebalance
>>     series.
>>
>>   * Fix typo in xchg patch spotted by Maxime Coquelin.
>>
>>   * Added Reviewed-by Maxime Coquelin on 4 patches.
> 
> 
> I went through the changes between v4 and v5, and would like to confirm
> these are minor and OK with me.
> 
> Regards,
> Maxime
> 
>> [...]

Thanks, Gaetan and Maxime!

That's great work!  Though I still think that some parts of it
are a bit over-engineered, this version is sane enough. :)

I had to fix the Windows build, as the implementation of atomic_exchange
didn't compile.  I also added a NEWS entry and updated the man page
to specify that n-offload-threads currently makes sense only for the
userspace datapath.  And I disabled printing the number of threads if it
is not explicitly configured.  That is to avoid confusion among kernel
datapath users if they see the number of threads printed (we have enough
misunderstandings of how offloading works).

With that, applied.

I also spotted one bug where the flow stays offloaded if it gets added
and removed shortly after that.  But it's an old existing race condition,
not introduced by this patch set.  We basically have to enqueue the
flow deletion regardless of the flow->mark being set.
I'll send a patch for that issue, probably, in the next couple of days.
Or if you want to work on that, that's fine for me too.

Regarding the rte_flow API being thread-safe, that is a somewhat shady
concept, as only a few API calls are actually protected inside DPDK.
At the same time, DPDK seems to claim that all rte_flow APIs are
thread-safe.  Very confusing.

Best regards, Ilya Maximets.
Sriharsha Basavapatna Jan. 23, 2022, 4:28 p.m. UTC | #3
Hi Ilya,

On Wed, Jan 19, 2022 at 7:24 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 9/15/21 15:45, Maxime Coquelin wrote:
> > Hi,
> >
> > On 9/8/21 11:47 AM, Gaetan Rivet wrote:
> >> This patch series aims to improve the performance of the management
> >> of hw-offloads in dpif-netdev. In the current version, some setup
> >> will experience high memory usage and poor latency between a flow
> >> decision and its execution regarding hardware offloading.
> >>
> >> This series starts by measuring key metrics regarding both issues
> >> Those patches are introduced first to compare the current status
> >> with each improvements introduced.
> >> Offloads enqueued and inserted, as well as the latency
> >> from queue insertion to hardware insertion is measured. A new
> >> command 'ovs-appctl dpctl/offload-stats-show' is introduced
> >> to show the current measure.
> >>
> >> In my current performance test setup I am measuring an
> >> average latency hovering between 1~2 seconds.
> >> After the optimizations, it is reduced to 500~900 ms.
> >> Finally when using multiple threads and with proper driver
> >> support[1], it is measured in the order of 1 ms.
> >>
> >> A few modules are introduced:
> >>
> >>   * An ID pool with reduced capabilities, simplifying its
> >>     operations and allowing better performances in both
> >>     single and multi-thread setup.
> >>
> >>   * A lockless queue between PMDs / revalidators and
> >>     offload thread(s). As the number of PMDs increases,
> >>     contention can be high on the shared queue.
> >>     This queue is designed to serve as message queue
> >>     between threads.
> >>
> >>   * A bounded lockless MPMC ring and some helpers for
> >>     calculating moving averages.
> >>
> >>   * A moving average module for Cumulative and Exponential
> >>     moving averages.
> >>
> >> The netdev-offload-dpdk module is made thread-safe.
> >> Internal maps are made per-netdev instead, and locks are
> >> taken for shorter critical sections within the module.
> >>
> >> CI result: https://github.com/grivet/ovs/actions/runs/554918929
> >>
> >> [1]: The rte_flow API was made thread-safe in the 20.11 DPDK
> >>      release. Drivers that do not implement those operations
> >>      concurrently are protected by a lock. Others will
> >>      allow better concurrency, that improve the result
> >>      of this series.
> >>
> >> v2:
> >>
> >>   * Improved the MPSC queue API to simplify usage.
> >>
> >>   * Moved flush operation from initiator thread to offload
> >>     thread(s). This ensures offload metadata are shared only
> >>     among the offload thread pool.
> >>
> >>   * Flush operation needs additional thread synchronization.
> >>     The ovs_barrier currently triggers a UAF. Add a unit-test to
> >>     validate its operations and a fix for the UAF.
> >>
> >> CI result: https://github.com/grivet/ovs/actions/runs/741430135
> >>            The error comes from a failure to download 'automake' on
> >>            osx, unrelated to any change in this series.
> >>
> >> v3:
> >>
> >>   * Re-ordered commits so fixes are first. No conflict seen currently,
> >>     but it might prevent them if some requested changes to the series
> >>     were to move code in the same parts.
> >>
> >>   * Modified the reduced quiescing the thread to use ovsrcu_quiesce(),
> >>     and base next_rcu on the current time value (after quiescing happened,
> >>     however long it takes).
> >>
> >>   * Added Reviewed-by tags to the relevant commits.
> >>
> >> CI result: https://github.com/grivet/ovs/actions/runs/782655601
> >>
> >> v4:
> >>
> >>   * Modified the seq-pool to use batches of IDs with a spinlock
> >>     instead of lockless rings.
> >>
> >>   * The llring structure is removed.
> >>
> >>   * Due to the length of the changes to the structure, some
> >>     acked-by or reviewed-by were not ported to the id-fpool patch.
> >>
> >> CI result: https://github.com/grivet/ovs/actions/runs/921095015
> >>
> >> v5:
> >>
> >>   * Rebase on master.
> >>     Conflicts were seen related to the vxlan-decap and pmd rebalance
> >>     series.
> >>
> >>   * Fix typo in xchg patch spotted by Maxime Coquelin.
> >>
> >>   * Added Reviewed-by Maxime Coquelin on 4 patches.
> >
> >
> > I went through the changes between v4 and v5, and would like to confirm
> > these are are minor and OK to me.
> >
> > Regards,
> > Maxime
> >
> >> CI result: https://github.com/grivet/ovs/actions/runs/1212804378
> >>
> >> Gaetan Rivet (27):
> >>   ovs-thread: Fix barrier use-after-free
> >>   dpif-netdev: Rename flow offload thread
> >>   tests: Add ovs-barrier unit test
> >>   netdev: Add flow API uninit function
> >>   netdev-offload-dpdk: Use per-netdev offload metadata
> >>   netdev-offload-dpdk: Implement hw-offload statistics read
> >>   dpctl: Add function to read hardware offload statistics
> >>   dpif-netdev: Rename offload thread structure
> >>   mov-avg: Add a moving average helper structure
> >>   dpif-netdev: Implement hardware offloads stats query
> >>   ovs-atomic: Expose atomic exchange operation
> >>   mpsc-queue: Module for lock-free message passing
> >>   id-fpool: Module for fast ID generation
> >>   netdev-offload: Add multi-thread API
> >>   dpif-netdev: Quiesce offload thread periodically
> >>   dpif-netdev: Postpone flow offload item freeing
> >>   dpif-netdev: Use id-fpool for mark allocation
> >>   dpif-netdev: Introduce tagged union of offload requests
> >>   dpif-netdev: Execute flush from offload thread
> >>   netdev-offload-dpdk: Use per-thread HW offload stats
> >>   netdev-offload-dpdk: Lock rte_flow map access
> >>   netdev-offload-dpdk: Protect concurrent offload destroy/query
> >>   dpif-netdev: Use lockless queue to manage offloads
> >>   dpif-netdev: Make megaflow and mark mappings thread objects
> >>   dpif-netdev: Replace port mutex by rwlock
> >>   dpif-netdev: Use one or more offload threads
> >>   netdev-dpdk: Remove rte-flow API access locks
> >>
> >>  lib/automake.mk               |   5 +
> >>  lib/dpctl.c                   |  36 ++
> >>  lib/dpif-netdev.c             | 734 ++++++++++++++++++++++++--------
> >>  lib/dpif-netlink.c            |   1 +
> >>  lib/dpif-provider.h           |   7 +
> >>  lib/dpif.c                    |   8 +
> >>  lib/dpif.h                    |   9 +
> >>  lib/id-fpool.c                | 279 ++++++++++++
> >>  lib/id-fpool.h                |  66 +++
> >>  lib/mov-avg.h                 | 171 ++++++++
> >>  lib/mpsc-queue.c              | 251 +++++++++++
> >>  lib/mpsc-queue.h              | 190 +++++++++
> >>  lib/netdev-dpdk.c             |   6 -
> >>  lib/netdev-offload-dpdk.c     | 274 ++++++++++--
> >>  lib/netdev-offload-provider.h |   4 +
> >>  lib/netdev-offload.c          |  92 +++-
> >>  lib/netdev-offload.h          |  21 +
> >>  lib/ovs-atomic-c++.h          |   3 +
> >>  lib/ovs-atomic-clang.h        |   5 +
> >>  lib/ovs-atomic-gcc4+.h        |   5 +
> >>  lib/ovs-atomic-gcc4.7+.h      |   5 +
> >>  lib/ovs-atomic-i586.h         |   5 +
> >>  lib/ovs-atomic-locked.h       |   9 +
> >>  lib/ovs-atomic-msvc.h         |  22 +
> >>  lib/ovs-atomic-pthreads.h     |   5 +
> >>  lib/ovs-atomic-x86_64.h       |   5 +
> >>  lib/ovs-atomic.h              |   8 +-
> >>  lib/ovs-thread.c              |  61 ++-
> >>  lib/ovs-thread.h              |   6 +-
> >>  tests/automake.mk             |   3 +
> >>  tests/library.at              |  14 +
> >>  tests/test-barrier.c          | 264 ++++++++++++
> >>  tests/test-id-fpool.c         | 615 +++++++++++++++++++++++++++
> >>  tests/test-mpsc-queue.c       | 772 ++++++++++++++++++++++++++++++++++
> >>  vswitchd/vswitch.xml          |  16 +
> >>  35 files changed, 3749 insertions(+), 228 deletions(-)
> >>  create mode 100644 lib/id-fpool.c
> >>  create mode 100644 lib/id-fpool.h
> >>  create mode 100644 lib/mov-avg.h
> >>  create mode 100644 lib/mpsc-queue.c
> >>  create mode 100644 lib/mpsc-queue.h
> >>  create mode 100644 tests/test-barrier.c
> >>  create mode 100644 tests/test-id-fpool.c
> >>  create mode 100644 tests/test-mpsc-queue.c
> >>
> >> --
> >> 2.31.1
>
> Thanks, Gaetan and Maxime!
>
> That's a great work!  Though I still think that some parts of it
> are a bit over-engineered, this version is sane enough. :)
>
> I had to fix the windows build as the implementation of atomic_exchange
> didn't compile.  I also added a NEWS entry and updated the man page
> to specify that n-offload-threads makes sense only for userspace datapath
> right now.  And I disabled printing the number of threads, if not
> explicitly configured.  That is to avoid confusion among kernel datapath
> users if they see the number of threads printed (we have enough
> misunderstandings of how offloading works).
>
> With that, applied.
>
> I also spotted one bug where the flow stays offloaded if it gets added
> and removed shortly after that.  But it's an old existing race condition,
> not introduced by this patch set.  We basically have to enqueue the
> flow deletion regardless of the flow->mark being set.
> I'll send a patch for that issue, probably, in the next couple of days.
> Or if you want to work on that, that's fine for me too.

We are seeing a similar issue with OVS-2.16. While adding and deleting
flows in a loop, after a few iterations, rte_flow_destroy() is not
seen by the PMD for some of the offloaded flows. And those flows stay
offloaded in HW.  I tried the following change in 2.16, as you
suggested above, and it resolved the issue.

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d6bee2a5a..8cca57f1f 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2758,9 +2758,7 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
     ovs_assert(cls != NULL);
     dpcls_remove(cls, &flow->cr);
     cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
-    if (flow->mark != INVALID_FLOW_MARK) {
-        queue_netdev_flow_del(pmd, flow);
-    }
+    queue_netdev_flow_del(pmd, flow);
     flow->dead = true;

     dp_netdev_flow_unref(flow);
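For readers following along, here is a minimal sketch of the race in isolation (a hypothetical Python model, not OVS code; names such as `remove_flow` and `offload_thread_drain` are illustrative only):

```python
# Hypothetical model of the race described above (not OVS code): offload
# insertion happens asynchronously in an offload thread, which is also
# where the flow mark is assigned.  If a flow is deleted before the
# offload thread has processed the insertion, its mark is still INVALID,
# so a deletion enqueued only "if the mark is set" never happens and the
# entry is left behind in hardware.
import queue

INVALID_FLOW_MARK = None

class Flow:
    def __init__(self, ufid):
        self.ufid = ufid
        self.mark = INVALID_FLOW_MARK  # assigned later, by the offload thread

hw_offloaded = set()           # stands in for rte_flow entries in the NIC
offload_queue = queue.Queue()  # stands in for the dpif-netdev offload queue

def offload_thread_drain():
    """Offload thread: process queued add/del requests in FIFO order."""
    while not offload_queue.empty():
        op, flow = offload_queue.get()
        if op == 'add':
            flow.mark = flow.ufid          # mark is assigned only here
            hw_offloaded.add(flow.ufid)
        elif op == 'del':
            hw_offloaded.discard(flow.ufid)  # no-op if never offloaded

def remove_flow(flow, unconditional):
    # Buggy behavior: skip the enqueue when the mark was never assigned.
    if unconditional or flow.mark is not INVALID_FLOW_MARK:
        offload_queue.put(('del', flow))

# Buggy variant: add, then delete before the offload thread runs.
f1 = Flow(1)
offload_queue.put(('add', f1))
remove_flow(f1, unconditional=False)  # deletion silently skipped
offload_thread_drain()                # hw_offloaded == {1}: leaked entry

# Fixed variant (the patch above): always enqueue the deletion.
f2 = Flow(2)
offload_queue.put(('add', f2))
remove_flow(f2, unconditional=True)
offload_thread_drain()                # flow 2 added then removed; only 1 leaks
```

The model also shows why the unconditional enqueue is safe: deleting a flow that was never offloaded is a harmless no-op in the offload thread, so the FIFO ordering of the queue guarantees the deletion is processed after the pending insertion.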

Thanks,
-Harsha

>
> Regarding the rte_flow API being thread-safe, that is a somewhat shady
> concept, as only a few API calls are actually protected inside DPDK.
> At the same time, DPDK seems to claim that all rte_flow APIs are
> thread-safe.  Very confusing.
>
> Best regards, Ilya Maximets.
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ilya Maximets Jan. 25, 2022, 9:56 p.m. UTC | #4
On 1/23/22 17:28, Sriharsha Basavapatna wrote:
> Hi Ilya,
> 
> On Wed, Jan 19, 2022 at 7:24 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> <snip>
>>
>> I also spotted one bug where the flow stays offloaded if it gets added
>> and removed shortly after that.  But it's an old existing race condition,
>> not introduced by this patch set.  We basically have to enqueue the
>> flow deletion regardless of the flow->mark being set.
>> I'll send a patch for that issue, probably, in the next couple of days.
>> Or if you want to work on that, that's fine for me too.
> 
> We are seeing a similar issue with OVS-2.16. While adding and deleting
> flows in a loop, after a few iterations, rte_flow_destroy() is not
> seen by the PMD for some of the offloaded flows. And those flows stay
> offloaded in HW.  I tried the following change in 2.16 like you
> suggested above and it resolved the issue.
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d6bee2a5a..8cca57f1f 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -2758,9 +2758,7 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
>      ovs_assert(cls != NULL);
>      dpcls_remove(cls, &flow->cr);
>      cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
> -    if (flow->mark != INVALID_FLOW_MARK) {
> -        queue_netdev_flow_del(pmd, flow);
> -    }
> +    queue_netdev_flow_del(pmd, flow);
>      flow->dead = true;
> 
>      dp_netdev_flow_unref(flow);

Yeah, I was thinking about something similar.  Would you mind sending an
official patch?

Best regards, Ilya Maximets.
Sriharsha Basavapatna Jan. 26, 2022, 10:38 a.m. UTC | #5
On Wed, Jan 26, 2022 at 3:26 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 1/23/22 17:28, Sriharsha Basavapatna wrote:
> > Hi Ilya,
> >
> > On Wed, Jan 19, 2022 at 7:24 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > <snip>
> >>
> >> I also spotted one bug where the flow stays offloaded if it gets added
> >> and removed shortly after that.  But it's an old existing race condition,
> >> not introduced by this patch set.  We basically have to enqueue the
> >> flow deletion regardless of the flow->mark being set.
> >> I'll send a patch for that issue, probably, in the next couple of days.
> >> Or if you want to work on that, that's fine for me too.
> >
> > We are seeing a similar issue with OVS-2.16. While adding and deleting
> > flows in a loop, after a few iterations, rte_flow_destroy() is not
> > seen by the PMD for some of the offloaded flows. And those flows stay
> > offloaded in HW.  I tried the following change in 2.16 like you
> > suggested above and it resolved the issue.
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index d6bee2a5a..8cca57f1f 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -2758,9 +2758,7 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
> >      ovs_assert(cls != NULL);
> >      dpcls_remove(cls, &flow->cr);
> >      cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
> > -    if (flow->mark != INVALID_FLOW_MARK) {
> > -        queue_netdev_flow_del(pmd, flow);
> > -    }
> > +    queue_netdev_flow_del(pmd, flow);
> >      flow->dead = true;
> >
> >      dp_netdev_flow_unref(flow);
>
> Yeah, I was thinking about something similar.  Would you mind sending an
> official patch?
>
> Best regards, Ilya Maximets.

Sure, I will send it.
Thanks,
-Harsha