
[ovs-dev,v9,00/16] DPIF Framework + Optimizations

Message ID 20210212171718.2189798-1-harry.van.haaren@intel.com

Van Haaren, Harry Feb. 12, 2021, 5:17 p.m. UTC
v9 Summary:
- Added AVX512 POC work for DPIF and MFEX in single patch at end
-- Note that the AVX512 MFEX is for Ether()/IP()/UDP() traffic.
-- A significant performance boost is possible with these optimizations.

v8 Summary:
- Added NEWS entries for significant changes
- Added scalar optimizations for datapath TX
- Patchset is now ready for merge in my opinion.

v7 summary:
- OVS Conference included a DPIF overview, YouTube link:
--- https://youtu.be/5dWyPxiXEhg
- Rebased and tested on the DPDK 20.11 v4 patch
--- Link: https://patchwork.ozlabs.org/project/openvswitch/list/?series=220645
--- Tested this series for shared/static builds
--- Tested this series with/without -march=<native,skylake,nehalem>
- Minor code improvements in DPIF component (see commits for details)
- Improved CPU ISA checks, caching results
- Commit message improvements (.'s etc)
- Added performance data of patchset
--- Note that the benchmark below does not utilize the AVX512-vpopcntdq
--- optimizations, and performance is expected to improve when used.
--- Further optimizations that continue this work are planned.

Benchmark Details & Results

Intel® Xeon® Gold 6230 CPU @2.10GHz
OVS*-DPDK* Phy-Phy Performance 4x 25G Ports - Total 1 million flows
1C1T-4P, 64-byte frame size, performance in mpps:

Results Table:
DPIF  | Scalar | Scalar | AVX512 | AVX512 |
DPCLS | Scalar | AVX512 | Scalar | AVX512 |
mpps  |  6.955 |  7.530 |  7.530 |  7.962 |

By enabling both AVX512 DPIF and DPCLS, packet forwarding
is 7.962 / 6.955 = ~1.145x as fast, i.e. a ~14% speedup.
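
For anyone wanting to redo the arithmetic for the other columns, here is a
minimal sketch in C. The mpps figures are copied from the results table; the
struct and function names are made up for illustration and are not part of
the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* One column of the results table. Names here are illustrative only. */
struct bench_result {
    const char *dpif;   /* DPIF implementation used. */
    const char *dpcls;  /* DPCLS implementation used. */
    double mpps;        /* Measured throughput from the table. */
};

/* Figures copied from the results table; index 0 is the baseline. */
static const struct bench_result results[] = {
    { "Scalar", "Scalar", 6.955 },
    { "Scalar", "AVX512", 7.530 },
    { "AVX512", "Scalar", 7.530 },
    { "AVX512", "AVX512", 7.962 },
};

/* Returns how many times faster 'measured' is than 'baseline'. */
static double
speedup(double baseline, double measured)
{
    assert(baseline > 0.0);
    return measured / baseline;
}

/* Speedup of table column 'idx' relative to the all-scalar column. */
static double
speedup_vs_scalar(size_t idx)
{
    assert(idx < sizeof results / sizeof results[0]);
    return speedup(results[0].mpps, results[idx].mpps);
}
```

With these numbers, enabling only one of the two AVX512 components gives
roughly 1.08x, and enabling both gives roughly 1.14x.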

v6 summary:
- Rebase to DPDK 20.11 enabling patch
--- This creates a dependency, expect CI build failures on the last
    patch in this series if it is not applied!
- Small improvements to DPIF layer
--- EMC/SMC enabling in AVX512 DPIF cleanups
- CPU ISA flags are cached, lowering overhead
- Wildcard Classifier DPCLS
--- Refactor and cleanups for function names
--- Enable more subtable specializations
--- Enable AVX512 vpopcount instruction
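
To illustrate the "CPU ISA flags are cached" item above, here is a rough
sketch of the general idea: run the costly check once per flag and serve
later queries from a small table. Everything named here (the enum, the
probe function, the cache arrays) is hypothetical and only stands in for the
real code in lib/dpdk.c; the sketch is also not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag identifiers, for illustration only. */
enum isa_flag {
    ISA_AVX512F,
    ISA_AVX512_VPOPCNTDQ,
    ISA_FLAG_COUNT
};

/* Counts how often the slow probe runs; exists only for this demo. */
static int probe_calls;

/* Stand-in for a costly CPU feature query. The answer is arbitrary. */
static bool
isa_probe_slow(enum isa_flag flag)
{
    probe_calls++;
    return flag == ISA_AVX512F;
}

static bool cache_valid[ISA_FLAG_COUNT];
static bool cache_value[ISA_FLAG_COUNT];

/* Returns whether 'flag' is supported, probing at most once per flag. */
static bool
isa_supported(enum isa_flag flag)
{
    assert((int) flag >= 0 && (int) flag < ISA_FLAG_COUNT);
    if (!cache_valid[flag]) {
        cache_value[flag] = isa_probe_slow(flag);
        cache_valid[flag] = true;
    }
    return cache_value[flag];
}
```

Repeated calls to isa_supported() for the same flag only pay for the probe
on the first call, which is the overhead reduction the item refers to.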

v5 summary:
- Dropped MFEX optimizations, retargeting to a later release
--- This allows focus of community reviews & development on DPIF
--- Note OVS Conference talk still introduces both DPIF and MFEX topics
- DPIF improvements
--- Better EMC/SMC handling
--- HWOL is enabled in the avx512 DPIF
--- Documentation & NEWS items added
--- Various smaller improvements

v4 summary:
- Updated and improved DPIF component
--- SMC now implemented
--- EMC handling improved
--- Novel batching method using AVX512 implemented
--- see commits for details
- Updated Miniflow Extract component
--- Improved AVX512 code path performance
--- Implemented multiple TODO items in v3
--- Add "disable" implementation to return to scalar miniflow only
--- More fixes planned for v5/future revisions:
---- Rename command to better reflect usage
---- Improve dynamicness of patterns
---- Add more demo protocols to show usage
- Future work
--- Documentation/NEWS items
--- Statistics for optimized MFEX
- Note that this patchset will be discussed/presented at OvsConf soon :)

v3 update summary:
(Cian Ferriter helping with rebases, review and code cleanups)
- Split out partially related changes (these will be sent separately)
--- netdev output action optimization
--- avx512 dpcls 16-block support optimization
- Squash commit which moves netdev struct flow into the refactor commit:
--- Squash dpif-netdev: move netdev flow struct to header
--- Into dpif-netdev: Refactor to multiple header files
- Implement Miniflow extract for AVX-512 DPIF
--- A generic method of matching patterns and packets is implemented,
    providing traffic-pattern specific miniflow-extract acceleration.
--- The patterns today are hard-coded, however in a future patchset it
    is intended to make these runtime configurable, allowing users to
    optimize the SIMD miniflow extract for active traffic types.
- Notes:
--- 32 bit builds will be fixed in next release by adding flexible
    miniflow extract optimization selection.
--- AVX-512 VBMI ISA is not yet supported in OVS due to requiring the
    DPDK 20.11 update for RTE_CPUFLAG_*. Once on a newer DPDK this will
    be added.

v2 updates:
- Includes DPIF command switching at runtime
- Includes AVX512 DPIF implementation
- Includes some partially related changes (can be split out of set?)
--- netdev output action optimization
--- avx512 dpcls 16-block support optimization

This patchset is v9 of the work to make the DPIF components of the
userspace datapath more flexible. It has been refactored to be
more modular to encourage code reuse, and scalable in that
ISA-optimized implementations can be added and selected at runtime.

The approach previously used for DPCLS is reused here: a function
pointer allows selection of an implementation at runtime.
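
As a rough illustration of the function-pointer approach described above,
here is a minimal, self-contained sketch. All names in it (pkt_batch,
dp_input_func, dp_input_select, the "scalar" and "wide" implementations) are
hypothetical and chosen for this example; they are not the identifiers used
in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical batch type; the real datapath batch is far richer. */
struct pkt_batch {
    size_t count;            /* Number of packets in the batch. */
    const char *handled_by;  /* Set by whichever implementation ran. */
};

/* Signature shared by every input implementation. */
typedef int (*dp_input_func)(struct pkt_batch *batch);

static int
input_scalar(struct pkt_batch *batch)
{
    /* A real implementation would process each packet here. */
    batch->handled_by = "scalar";
    return (int) batch->count;
}

static int
input_wide(struct pkt_batch *batch)
{
    /* Stand-in for an ISA-optimized path with the same contract. */
    batch->handled_by = "wide";
    return (int) batch->count;
}

struct dp_impl {
    const char *name;
    dp_input_func func;
};

static const struct dp_impl impls[] = {
    { "scalar", input_scalar },
    { "wide", input_wide },
};

/* Pointer consulted on the packet path; defaults to scalar. */
static dp_input_func active_input = input_scalar;

/* Returns 0 on success, -1 if no implementation has that name. */
static int
dp_input_select(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof impls / sizeof impls[0]; i++) {
        if (!strcmp(impls[i].name, name)) {
            active_input = impls[i].func;
            return 0;
        }
    }
    return -1;
}

/* Packet-path entry point: one indirect call, no per-packet branching. */
static int
dp_input(struct pkt_batch *batch)
{
    return active_input(batch);
}
```

A runtime command would then only need to call something like
dp_input_select() with the requested name. An unknown name leaves the
currently active implementation untouched.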

Datapath features such as EMC, SMC and HWOL are shared between
implementations, hence they are refactored into separate header files.
The file splitting also improves maintainability, as dpif-netdev.c
has ~9000 LOC and is very hard to modify because many structs are
defined locally in the .c file, ruling out reuse in other .c files.

Questions welcomed! Regards, -Harry

Cian Ferriter (1):
  docs/dpdk/bridge: Add dpif performance section.

Harry van Haaren (15):
  dpif-netdev: Refactor to multiple header files.
  dpif-netdev: Split HWOL out to own header file.
  dpif-netdev: Add function pointer for netdev input.
  dpif-avx512: Add ISA implementation of dpif.
  dpif-avx512: Add HWOL support to avx512 dpif.
  dpif-netdev: Add command to switch dpif implementation.
  dpif-netdev: Add command to get dpif implementations.
  dpif-netdev/dpcls: Refactor function names to dpcls.
  dpif-netdev/dpcls-avx512: enable 16 block processing.
  dpif-netdev/dpcls: specialize more subtable signatures.
  dpdk: Cache result of CPU ISA checks.
  dpcls-avx512: enabling avx512 vector popcount instruction.
  dpif-netdev: Optimize dp output action
  netdev: Optimize netdev_send_prepare_batch
  dpif-netdev: POC of future DPIF and MFEX AVX512 optimizations

 Documentation/topics/dpdk/bridge.rst   |  37 ++
 NEWS                                   |  16 +-
 acinclude.m4                           |  15 +
 configure.ac                           |   1 +
 lib/automake.mk                        |  12 +-
 lib/dpdk.c                             |  30 +-
 lib/dpif-netdev-avx512.c               | 362 ++++++++++++
 lib/dpif-netdev-lookup-autovalidator.c |   1 -
 lib/dpif-netdev-lookup-avx512-gather.c | 278 ++++++---
 lib/dpif-netdev-lookup-generic.c       |   7 +-
 lib/dpif-netdev-lookup.h               |   2 +-
 lib/dpif-netdev-private-dfc.h          | 252 ++++++++
 lib/dpif-netdev-private-dpcls.h        | 127 ++++
 lib/dpif-netdev-private-dpif.c         |  99 ++++
 lib/dpif-netdev-private-dpif.h         |  85 +++
 lib/dpif-netdev-private-flow.h         | 162 +++++
 lib/dpif-netdev-private-hwol.h         |  63 ++
 lib/dpif-netdev-private-thread.h       | 225 +++++++
 lib/dpif-netdev-private.h              | 123 ++--
 lib/dpif-netdev.c                      | 779 +++++++------------------
 lib/flow_avx512.h                      | 117 ++++
 lib/netdev.c                           |  31 +-
 22 files changed, 2069 insertions(+), 755 deletions(-)
 create mode 100644 lib/dpif-netdev-avx512.c
 create mode 100644 lib/dpif-netdev-private-dfc.h
 create mode 100644 lib/dpif-netdev-private-dpcls.h
 create mode 100644 lib/dpif-netdev-private-dpif.c
 create mode 100644 lib/dpif-netdev-private-dpif.h
 create mode 100644 lib/dpif-netdev-private-flow.h
 create mode 100644 lib/dpif-netdev-private-hwol.h
 create mode 100644 lib/dpif-netdev-private-thread.h
 create mode 100644 lib/flow_avx512.h