diff mbox series

[ovs-dev] dpif-netdev-lookup: Allow AVX512 dpcls without building core OVS with SSE4.2.

Message ID 20210322105848.496988-1-i.maximets@ovn.org
State Changes Requested
Headers show
Series [ovs-dev] dpif-netdev-lookup: Allow AVX512 dpcls without building core OVS with SSE4.2. | expand

Commit Message

Ilya Maximets March 22, 2021, 10:58 a.m. UTC
It's only required to build lib/dpif-netdev-lookup-avx512-gather.c
with SSE4.2.  All hash functions in lib/hash.h are defined as
'static inline', so they will use fast crc instructions while building
this module.  The minimal requirement for core OVS could be left
intact.

With this change support for SSE4.2 is checked in ./configure and
-msse4.2  passed to build libopenvswitchavx512.  The rest of OVS is
built with default flags provided by user.  So, AVX512 dpcls could
be enabled in a build with default CFLAGS and the rest of the OVS
could run on older CPUs or in virtual environments without support
for sse4.2 instructions.

Same for popcnt.

Since the whole lib/dpif-netdev-lookup-avx512-gather.c is built with
special instruction sets enabled it's not safe to call any function
from where unless we're sure that these ISA is supported by current
CPU.  For this reason runtime ISA checks moved to common initialization
code and performed only once.  avx512-gather implementation will
not be registered and will not be available for a user if current
CPU doesn't support it.  Log in dpdk-stub lowered to DBG level because
now it will be printed always on startup and it's not really useful
because user will not have ability to enable avx512 implementation.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

I only tested the build, didn't check in runtime.

 Documentation/intro/install/dpdk.rst   | 10 ----
 Makefile.am                            | 10 ++++
 acinclude.m4                           |  2 +-
 configure.ac                           |  3 +-
 lib/automake.mk                        |  6 ++
 lib/dpdk-stub.c                        |  2 +-
 lib/dpdk.c                             |  2 +
 lib/dpif-netdev-lookup-autovalidator.c | 18 ++++--
 lib/dpif-netdev-lookup-avx512-gather.c | 20 ++++---
 lib/dpif-netdev-lookup-generic.c       | 12 ++++
 lib/dpif-netdev-lookup.c               | 80 +++++++++++++++++---------
 lib/dpif-netdev-lookup.h               | 19 +++---
 lib/dpif-netdev.c                      |  3 +
 m4/openvswitch.m4                      |  1 -
 14 files changed, 126 insertions(+), 62 deletions(-)

Comments

Van Haaren, Harry March 22, 2021, 11:45 a.m. UTC | #1
Hi Ilya,

> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Monday, March 22, 2021 10:59 AM
> To: ovs-dev@openvswitch.org; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: Stokes, Ian <ian.stokes@intel.com>; Timothy Redaelli
> <tredaelli@redhat.com>; Flavio Leitner <fbl@sysclose.org>; Ilya Maximets
> <i.maximets@ovn.org>
> Subject: [PATCH] dpif-netdev-lookup: Allow AVX512 dpcls without building core
> OVS with SSE4.2.
> 
> It's only required to build lib/dpif-netdev-lookup-avx512-gather.c
> with SSE4.2.  All hash functions in lib/hash.h are defined as
> 'static inline', so they will use fast crc instructions while building
> this module.  The minimal requirement for core OVS could be left
> intact.
> 
> With this change support for SSE4.2 is checked in ./configure and
> -msse4.2  passed to build libopenvswitchavx512.  The rest of OVS is
> built with default flags provided by user.  So, AVX512 dpcls could
> be enabled in a build with default CFLAGS and the rest of the OVS
> could run on older CPUs or in virtual environments without support
> for sse4.2 instructions.
> 
> Same for popcnt.
> 
> Since the whole lib/dpif-netdev-lookup-avx512-gather.c is built with
> special instruction sets enabled it's not safe to call any function
> from where unless we're sure that these ISA is supported by current
> CPU.  For this reason runtime ISA checks moved to common initialization
> code and performed only once.  avx512-gather implementation will
> not be registered and will not be available for a user if current
> CPU doesn't support it.  Log in dpdk-stub lowered to DBG level because
> now it will be printed always on startup and it's not really useful
> because user will not have ability to enable avx512 implementation.
> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>


Thanks for taking a look at the DPCLS AVX512 optimizations & functionality.
From the commit message I understand "what" you're changing, but I failed
to understand a clear "why"?

From initial review, it seems to do a few things:
1) Rework CPU flags (sse4.2,popcnt) from compile-time checks to runtime
2) Rework DPCLS impl from static registration to runtime/dynamic

Is the big-picture goal to avoid compiling core OVS with SSE4.2/popcnt enabled?
Perhaps a better question - what is the big picture goal?


Regarding 1) the requirement for SSE4.2 for CRC32Q instruction based hashing,
I think you have mistook the reason for *requiring* core OVS to be built with
SSE4.2 enabled. The hashing done in OVS core and DPCLS must be identical.
Allowing OVS core to use murmurhash (no SSE4.2), and DPCLS using CRC32 (yes SSE4.2)
means that upcall hash values, and datapath hash values don't match. For this
reason, it is invalid to mix-and-match SSE4.2 support in *compliation units* across
OVS.

To be clear, this is a side-effect of compile-time ISA-based hash function selection,
and has actually nothing to do with DPCLS per-se: it just shows up here first as it
is the first location ISA specialization is being done in OVS.

I don't see issues with 2) reworking registration, it can be a separate patch to
the 1) CPU-flag rework topic. Although related, it does not depend on it I think?

All in all, I'm OK with the registration rework if split into a separate patch.
Regarding topic 1) SSE4.2 ISA-changes, I think it is not valid.

> ---
> 
> I only tested the build, didn't check in runtime.
>

You state this wasn't runtime checked - likely issues would have shown up (DPCLS
misses on installed keys, resulting in every packet getting an upcall) if it was tested.
What needs to be done to ensure you have access to a machine with AVX512?


Regards, -Harry

<snip patch contents, as high-level review didn't investigate details>
Ilya Maximets March 22, 2021, 12:54 p.m. UTC | #2
On 3/22/21 12:45 PM, Van Haaren, Harry wrote:
> Hi Ilya,
> 
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets@ovn.org>
>> Sent: Monday, March 22, 2021 10:59 AM
>> To: ovs-dev@openvswitch.org; Van Haaren, Harry
>> <harry.van.haaren@intel.com>
>> Cc: Stokes, Ian <ian.stokes@intel.com>; Timothy Redaelli
>> <tredaelli@redhat.com>; Flavio Leitner <fbl@sysclose.org>; Ilya Maximets
>> <i.maximets@ovn.org>
>> Subject: [PATCH] dpif-netdev-lookup: Allow AVX512 dpcls without building core
>> OVS with SSE4.2.
>>
>> It's only required to build lib/dpif-netdev-lookup-avx512-gather.c
>> with SSE4.2.  All hash functions in lib/hash.h are defined as
>> 'static inline', so they will use fast crc instructions while building
>> this module.  The minimal requirement for core OVS could be left
>> intact.
>>
>> With this change support for SSE4.2 is checked in ./configure and
>> -msse4.2  passed to build libopenvswitchavx512.  The rest of OVS is
>> built with default flags provided by user.  So, AVX512 dpcls could
>> be enabled in a build with default CFLAGS and the rest of the OVS
>> could run on older CPUs or in virtual environments without support
>> for sse4.2 instructions.
>>
>> Same for popcnt.
>>
>> Since the whole lib/dpif-netdev-lookup-avx512-gather.c is built with
>> special instruction sets enabled it's not safe to call any function
>> from where unless we're sure that these ISA is supported by current
>> CPU.  For this reason runtime ISA checks moved to common initialization
>> code and performed only once.  avx512-gather implementation will
>> not be registered and will not be available for a user if current
>> CPU doesn't support it.  Log in dpdk-stub lowered to DBG level because
>> now it will be printed always on startup and it's not really useful
>> because user will not have ability to enable avx512 implementation.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> 
> 
> Thanks for taking a look at the DPCLS AVX512 optimizations & functionality.
> From the commit message I understand "what" you're changing, but I failed
> to understand a clear "why"?
> 
> From initial review, it seems to do a few things:
> 1) Rework CPU flags (sse4.2,popcnt) from compile-time checks to runtime

Some compile time checks are still there, but moved around a little.

> 2) Rework DPCLS impl from static registration to runtime/dynamic
> 
> Is the big-picture goal to avoid compiling core OVS with SSE4.2/popcnt enabled?
> Perhaps a better question - what is the big picture goal?

Yes, the big-picture goal was to avoid compiling core OVS with SSE4.2,
i.e. being able to run the same binary on old or virtual hardware and
being able to enable avx512 optimizations while running on high-end hardware.

> 
> 
> Regarding 1) the requirement for SSE4.2 for CRC32Q instruction based hashing,
> I think you have mistook the reason for *requiring* core OVS to be built with
> SSE4.2 enabled. The hashing done in OVS core and DPCLS must be identical.
> Allowing OVS core to use murmurhash (no SSE4.2), and DPCLS using CRC32 (yes SSE4.2)
> means that upcall hash values, and datapath hash values don't match. For this
> reason, it is invalid to mix-and-match SSE4.2 support in *compliation units* across
> OVS.

Thanks for pointing that out.  I missed that completely.  Basically, I forgot
about this.  We should probably add this information to code comments as it's
easy to miss that fact.  Or do we have it already and I just messed it?
Would be nice to have a build assertion.  I'm not sure how to implement it, though.

The thing that we need is that hash computation in netdev_flow_mask/key_init()
needs to match with hash in dpif-netdev-lookup implementation.
And this is likely achievable if we will create a helper static inline function
implemented somewhere inside dpif-netdev-lookup.h to calculate flow hash.
And since miniflow structure already exists at the point of hash calculation
we can choose the implementation the same way as we're choosing lookup
implementation.
But, yes, this will be a bigger change.

> 
> To be clear, this is a side-effect of compile-time ISA-based hash function selection,
> and has actually nothing to do with DPCLS per-se: it just shows up here first as it
> is the first location ISA specialization is being done in OVS.
> 
> I don't see issues with 2) reworking registration, it can be a separate patch to
> the 1) CPU-flag rework topic. Although related, it does not depend on it I think?
> 
> All in all, I'm OK with the registration rework if split into a separate patch.

It could be that rework is required as a bug fix, because current runtime
checks are done inside the code that already built with higher ISA enabled,
so it might trigger illegal instruction exception during the check, in theory.

> Regarding topic 1) SSE4.2 ISA-changes, I think it is not valid.
> 
>> ---
>>
>> I only tested the build, didn't check in runtime.
>>
> 
> You state this wasn't runtime checked - likely issues would have shown up (DPCLS
> misses on installed keys, resulting in every packet getting an upcall) if it was tested.
> What needs to be done to ensure you have access to a machine with AVX512?

In practice, I just have no enough time for proper runtime checking.
Writing this patch was pretty fast, OTOH. :)

> 
> 
> Regards, -Harry
> 
> <snip patch contents, as high-level review didn't investigate details>
>
Van Haaren, Harry March 23, 2021, 2:11 p.m. UTC | #3
> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Monday, March 22, 2021 12:55 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Ilya Maximets
> <i.maximets@ovn.org>; ovs-dev@openvswitch.org
> Cc: Stokes, Ian <ian.stokes@intel.com>; Timothy Redaelli
> <tredaelli@redhat.com>; Flavio Leitner <fbl@sysclose.org>
> Subject: Re: [PATCH] dpif-netdev-lookup: Allow AVX512 dpcls without building
> core OVS with SSE4.2.
> 
> On 3/22/21 12:45 PM, Van Haaren, Harry wrote:
> > Hi Ilya,
> >
> >> -----Original Message-----
> >> From: Ilya Maximets <i.maximets@ovn.org>
> >> Sent: Monday, March 22, 2021 10:59 AM
> >> To: ovs-dev@openvswitch.org; Van Haaren, Harry
> >> <harry.van.haaren@intel.com>
> >> Cc: Stokes, Ian <ian.stokes@intel.com>; Timothy Redaelli
> >> <tredaelli@redhat.com>; Flavio Leitner <fbl@sysclose.org>; Ilya Maximets
> >> <i.maximets@ovn.org>
> >> Subject: [PATCH] dpif-netdev-lookup: Allow AVX512 dpcls without building
> core
> >> OVS with SSE4.2.
> >>
> >> It's only required to build lib/dpif-netdev-lookup-avx512-gather.c
> >> with SSE4.2.  All hash functions in lib/hash.h are defined as
> >> 'static inline', so they will use fast crc instructions while building
> >> this module.  The minimal requirement for core OVS could be left
> >> intact.
> >>
> >> With this change support for SSE4.2 is checked in ./configure and
> >> -msse4.2  passed to build libopenvswitchavx512.  The rest of OVS is
> >> built with default flags provided by user.  So, AVX512 dpcls could
> >> be enabled in a build with default CFLAGS and the rest of the OVS
> >> could run on older CPUs or in virtual environments without support
> >> for sse4.2 instructions.
> >>
> >> Same for popcnt.
> >>
> >> Since the whole lib/dpif-netdev-lookup-avx512-gather.c is built with
> >> special instruction sets enabled it's not safe to call any function
> >> from where unless we're sure that these ISA is supported by current
> >> CPU.  For this reason runtime ISA checks moved to common initialization
> >> code and performed only once.  avx512-gather implementation will
> >> not be registered and will not be available for a user if current
> >> CPU doesn't support it.  Log in dpdk-stub lowered to DBG level because
> >> now it will be printed always on startup and it's not really useful
> >> because user will not have ability to enable avx512 implementation.
> >>
> >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >
> >
> > Thanks for taking a look at the DPCLS AVX512 optimizations & functionality.
> > From the commit message I understand "what" you're changing, but I failed
> > to understand a clear "why"?
> >
> > From initial review, it seems to do a few things:
> > 1) Rework CPU flags (sse4.2,popcnt) from compile-time checks to runtime
> 
> Some compile time checks are still there, but moved around a little.

Yep agreed.


> > 2) Rework DPCLS impl from static registration to runtime/dynamic
> >
> > Is the big-picture goal to avoid compiling core OVS with SSE4.2/popcnt enabled?
> > Perhaps a better question - what is the big picture goal?
> 
> Yes, the big-picture goal was to avoid compiling core OVS with SSE4.2,
> i.e. being able to run the same binary on old or virtual hardware and
> being able to enable avx512 optimizations while running on high-end hardware.

I'm failing to see the requirement for compiling ore OVS without SSE4.2 support.
Is this a "nice to have", or a real requirement?

Assuming these CPUs are actually going to *run* OVS, having SSE4.2 support enabled
means that performance actually increases for "free" by using the CRC32Q for hashing...
which I expect is desired in that case. Even when running a virtualized OVS, surely we
don't want to limit performance artificially by reducing baseline ISA without a benefit?


> > Regarding 1) the requirement for SSE4.2 for CRC32Q instruction based hashing,
> > I think you have mistook the reason for *requiring* core OVS to be built with
> > SSE4.2 enabled. The hashing done in OVS core and DPCLS must be identical.
> > Allowing OVS core to use murmurhash (no SSE4.2), and DPCLS using CRC32 (yes
> SSE4.2)
> > means that upcall hash values, and datapath hash values don't match. For this
> > reason, it is invalid to mix-and-match SSE4.2 support in *compliation units*
> across
> > OVS.
> 
> Thanks for pointing that out.  I missed that completely.  Basically, I forgot
> about this.  We should probably add this information to code comments as it's
> easy to miss that fact.  Or do we have it already and I just messed it?

Its noted in the docs that AVX512 disables if sse4.2 and popcnt are missing, and
it refers to the below source-code snippet;
https://docs.openvswitch.org/en/latest/intro/install/dpdk/#install-ovs

There's a comment in  lib/dpif-netdev-lookup.c + 52  about compile time requirements,
but it doesn't explicitly call out the reason (hashing code changes based on compile time isa).

We can document this better - if ovs community think that would be useful?


> Would be nice to have a build assertion.  I'm not sure how to implement it,
> though.
> 
> The thing that we need is that hash computation in netdev_flow_mask/key_init()
> needs to match with hash in dpif-netdev-lookup implementation.
> And this is likely achievable if we will create a helper static inline function
> implemented somewhere inside dpif-netdev-lookup.h to calculate flow hash.
> And since miniflow structure already exists at the point of hash calculation
> we can choose the implementation the same way as we're choosing lookup
> implementation.

I dislike this idea of "fixing up" one of the places where hashing is different.
There are numerous places where hashing is/could-be used, and the compiler
would not give warnings/errors on where hash-mis-matches due to ISA occur.

I noticed & fixed this issue when building AVX512 DPCLS in the way I did because
it was difficult to debug, and one would not suspect the hash_add() function in
OVS from providing "invalid" results due to the ISA enabled in the compilation unit.

If we did take this "fixing up" approach, all future hashing code in upcalls would have to
be manually reviewed in the context of "is a similar hash recomputed in datapath".
Some locations where I can imagine hashing is used in upcalls is for features like
conn-track, load-balancing actions, LAG... these would all *silently* break if we
didn't go and do runtime-testing with ISA on/off, and play "whack-a-mole" with
the resulting bugs... sounds nasty to me.


> But, yes, this will be a bigger change.

All the above complexity/effort, to gain what benefit? As per above question,
I don't see specific value in reducing core OVS compile-time ISA requirements
to below SSE4.2, as per first topic.


> > To be clear, this is a side-effect of compile-time ISA-based hash function
> selection,
> > and has actually nothing to do with DPCLS per-se: it just shows up here first as it
> > is the first location ISA specialization is being done in OVS.
> >
> > I don't see issues with 2) reworking registration, it can be a separate patch to
> > the 1) CPU-flag rework topic. Although related, it does not depend on it I think?
> >
> > All in all, I'm OK with the registration rework if split into a separate patch.
> 
> It could be that rework is required as a bug fix, because current runtime
> checks are done inside the code that already built with higher ISA enabled,
> so it might trigger illegal instruction exception during the check, in theory.

Ok, lets handle that as a separate topic; how about we annotate the probe()
function to use only "generic CPU with 64-bit extensions" ISA? Much smaller
diff, achieves the same thing, probe() function prototype becomes this:

dpcls_subtable_lookup_func  __attribute__((__target__(arch="x86-64"))
dpcls_subtable_<name>_probe(uint32_t u0, uint32_t u1)
{
     // ISA safe here, due to __target__ attribute
}

See target() attribute docs, and -march= options on 2nd link, and nice ISA table on wikipedia:
https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
https://en.wikipedia.org/wiki/X86-64#Microarchitecture_Levels


> 
> > Regarding topic 1) SSE4.2 ISA-changes, I think it is not valid.
> >
> >> ---
> >>
> >> I only tested the build, didn't check in runtime.
> >>
> >
> > You state this wasn't runtime checked - likely issues would have shown up
> (DPCLS
> > misses on installed keys, resulting in every packet getting an upcall) if it was
> tested.
> > What needs to be done to ensure you have access to a machine with AVX512?
> 
> In practice, I just have no enough time for proper runtime checking.
> Writing this patch was pretty fast, OTOH. :)

Hah, okay... Most open-source communities don't like reviewing patches
that haven't been tested. As the saying goes "untested code is broken code" :)


> > Regards, -Harry
> >
> > <snip patch contents, as high-level review didn't investigate details>
> >
diff mbox series

Patch

diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
index 3a24e54f9..91c2adbd5 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -155,16 +155,6 @@  has to be configured to build against the DPDK library (``--with-dpdk``).
      While ``--with-dpdk`` is required, you can pass any other configuration
      option described in :ref:`general-configuring`.
 
-   It is strongly recommended to build OVS with at least ``-msse4.2`` and
-   ``-mpopcnt`` optimization flags. If these flags are not enabled, the AVX512
-   optimized DPCLS implementation is not available in the resulting binary.
-   For technical details see the subtable registration code in the
-   ``lib/dpif-netdev-lookup.c`` file.
-
-   An example that enables the AVX512 optimizations is::
-
-       $ ./configure --with-dpdk=static CFLAGS="-Ofast -msse4.2 -mpopcnt"
-
 #. Build and install OVS, as described in :ref:`general-building`
 
 Additional information can be found in :doc:`general`.
diff --git a/Makefile.am b/Makefile.am
index 691a005ad..52b4f1e38 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -41,6 +41,16 @@  else
 DPDKSTRIP_FLAGS = --nodpdk
 endif
 
+if HAVE_SSE42
+if HAVE_POPCNT
+if HAVE_AVX512F
+if HAVE_LD_AVX512_GOOD
+AM_CFLAGS += -DHAVE_AVX512_DPCLS
+endif
+endif
+endif
+endif
+
 if NDEBUG
 AM_CPPFLAGS += -DNDEBUG
 AM_CFLAGS += -fomit-frame-pointer
diff --git a/acinclude.m4 b/acinclude.m4
index 15a54d636..941ff527d 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1220,7 +1220,7 @@  dnl gives unlimited permission to copy and/or distribute it,
 dnl with or without modifications, as long as this notice is preserved.
 
 AC_DEFUN([_OVS_CHECK_CC_OPTION], [dnl
-  m4_define([ovs_cv_name], [ovs_cv_[]m4_translit([$1], [-= ], [__])])dnl
+  m4_define([ovs_cv_name], [ovs_cv_[]m4_translit([$1], [-= .], [__])])dnl
   AC_CACHE_CHECK([whether $CC accepts $1], [ovs_cv_name], 
     [ovs_save_CFLAGS="$CFLAGS"
      dnl Include -Werror in the compiler options, because without -Werror
diff --git a/configure.ac b/configure.ac
index c077034d4..53255ef0b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -179,8 +179,9 @@  OVS_ENABLE_OPTION([-Wno-null-pointer-arithmetic])
 OVS_ENABLE_OPTION([-Warray-bounds-pointer-arithmetic])
 OVS_CONDITIONAL_CC_OPTION([-Wno-unused], [HAVE_WNO_UNUSED])
 OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter], [HAVE_WNO_UNUSED_PARAMETER])
+OVS_CONDITIONAL_CC_OPTION([-msse4.2], [HAVE_SSE42])
+OVS_CONDITIONAL_CC_OPTION([-mpopcnt], [HAVE_POPCNT])
 OVS_CONDITIONAL_CC_OPTION([-mavx512f], [HAVE_AVX512F])
-OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
 OVS_ENABLE_SPARSE
 OVS_CTAGS_IDENTIFIERS
diff --git a/lib/automake.mk b/lib/automake.mk
index 39afbff9d..6c6527e11 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -21,6 +21,8 @@  lib_libopenvswitch_la_LDFLAGS = \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
         $(AM_LDFLAGS)
 
+if HAVE_SSE42
+if HAVE_POPCNT
 if HAVE_AVX512F
 if HAVE_LD_AVX512_GOOD
 # Build library of avx512 code with CPU ISA CFLAGS enabled. This allows the
@@ -30,6 +32,8 @@  if HAVE_LD_AVX512_GOOD
 lib_LTLIBRARIES += lib/libopenvswitchavx512.la
 lib_libopenvswitch_la_LIBADD += lib/libopenvswitchavx512.la
 lib_libopenvswitchavx512_la_CFLAGS = \
+	-msse4.2 \
+	-mpopcnt \
 	-mavx512f \
 	-mavx512bw \
 	-mavx512dq \
@@ -42,6 +46,8 @@  lib_libopenvswitchavx512_la_LDFLAGS = \
 	-static
 endif
 endif
+endif
+endif
 
 # Build core vswitch libraries as before
 lib_libopenvswitch_la_SOURCES = \
diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
index b7d577870..fe24f9abd 100644
--- a/lib/dpdk-stub.c
+++ b/lib/dpdk-stub.c
@@ -83,7 +83,7 @@  bool
 dpdk_get_cpu_has_isa(const char *arch OVS_UNUSED,
                      const char *feature OVS_UNUSED)
 {
-    VLOG_ERR_ONCE("DPDK not supported in this version of Open vSwitch, "
+    VLOG_DBG_ONCE("DPDK not supported in this version of Open vSwitch, "
                   "cannot use CPU flag based optimizations");
     return false;
 }
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 319540394..206af73f8 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -636,6 +636,8 @@  dpdk_get_cpu_has_isa(const char *arch, const char *feature)
     /* CPU flags only defined for the architecture that support it. */
     CHECK_CPU_FEATURE(feature, "avx512f", RTE_CPUFLAG_AVX512F);
     CHECK_CPU_FEATURE(feature, "bmi2", RTE_CPUFLAG_BMI2);
+    CHECK_CPU_FEATURE(feature, "sse4.2", RTE_CPUFLAG_SSE4_2);
+    CHECK_CPU_FEATURE(feature, "popcnt", RTE_CPUFLAG_POPCNT);
 #endif
 
     VLOG_WARN("Unknown CPU arch,feature: %s,%s. Returning not supported.\n",
diff --git a/lib/dpif-netdev-lookup-autovalidator.c b/lib/dpif-netdev-lookup-autovalidator.c
index 97b59fdd0..4e2af6feb 100644
--- a/lib/dpif-netdev-lookup-autovalidator.c
+++ b/lib/dpif-netdev-lookup-autovalidator.c
@@ -38,10 +38,6 @@  VLOG_DEFINE_THIS_MODULE(dpif_lookup_autovalidator);
  * implementation, however the results (rules[] output) must be the same.
  */
 
-dpcls_subtable_lookup_func
-dpcls_subtable_autovalidator_probe(uint32_t u0 OVS_UNUSED,
-                                   uint32_t u1 OVS_UNUSED);
-
 static uint32_t
 dpcls_subtable_autovalidator(struct dpcls_subtable *subtable,
                              uint32_t keys_map,
@@ -101,10 +97,22 @@  dpcls_subtable_autovalidator(struct dpcls_subtable *subtable,
     return matches_good;
 }
 
-dpcls_subtable_lookup_func
+static dpcls_subtable_lookup_func
 dpcls_subtable_autovalidator_probe(uint32_t u0 OVS_UNUSED,
                                    uint32_t u1 OVS_UNUSED)
 {
     /* Always return the same validator tester, it works for all subtables. */
     return dpcls_subtable_autovalidator;
 }
+
+void
+dpcls_subtable_autovalidator_register(uint8_t priority)
+{
+    struct dpcls_subtable_lookup_info_t lookup = {
+        .prio = priority,
+        .probe = dpcls_subtable_autovalidator_probe,
+        .name = "autovalidator",
+    };
+
+    dpcls_subtable_lookup_register(&lookup);
+}
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
index 5e3634249..4f3ac3cd5 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -236,17 +236,11 @@  dpcls_avx512_gather_mf_any(struct dpcls_subtable *subtable, uint32_t keys_map,
                               subtable->mf_bits_set_unit1);
 }
 
-dpcls_subtable_lookup_func
+static dpcls_subtable_lookup_func
 dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits)
 {
     dpcls_subtable_lookup_func f = NULL;
 
-    int avx512f_available = dpdk_get_cpu_has_isa("x86_64", "avx512f");
-    int bmi2_available = dpdk_get_cpu_has_isa("x86_64", "bmi2");
-    if (!avx512f_available || !bmi2_available) {
-        return NULL;
-    }
-
     CHECK_LOOKUP_FUNCTION(5, 1);
     CHECK_LOOKUP_FUNCTION(4, 1);
     CHECK_LOOKUP_FUNCTION(4, 0);
@@ -260,5 +254,17 @@  dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits)
     return f;
 }
 
+void
+dpcls_subtable_avx512_gather_register(uint8_t priority)
+{
+    struct dpcls_subtable_lookup_info_t lookup = {
+        .prio = priority,
+        .probe = dpcls_subtable_avx512_gather_probe,
+        .name = "avx512_gather",
+    };
+
+    dpcls_subtable_lookup_register(&lookup);
+}
+
 #endif /* CHECKER */
 #endif /* __x86_64__ */
diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index b1a0cfc36..3835ca06e 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -318,3 +318,15 @@  dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
 
     return f;
 }
+
+void
+dpcls_subtable_generic_register(uint8_t priority)
+{
+    struct dpcls_subtable_lookup_info_t lookup = {
+        .prio = priority,
+        .probe = dpcls_subtable_generic_probe,
+        .name = "generic",
+    };
+
+    dpcls_subtable_lookup_register(&lookup);
+}
diff --git a/lib/dpif-netdev-lookup.c b/lib/dpif-netdev-lookup.c
index bd0a99abe..b678f6d5f 100644
--- a/lib/dpif-netdev-lookup.c
+++ b/lib/dpif-netdev-lookup.c
@@ -16,14 +16,35 @@ 
 
 #include <config.h>
 #include <errno.h>
+#include "dpdk.h"
 #include "dpif-netdev-lookup.h"
 
 #include "openvswitch/vlog.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif_netdev_lookup);
 
-/* Actual list of implementations goes here */
-static struct dpcls_subtable_lookup_info_t subtable_lookups[] = {
+#define LOOKUPS_MAX 3
+static int subtable_lookups_size = 0;
+static struct dpcls_subtable_lookup_info_t subtable_lookups[LOOKUPS_MAX];
+
+void
+dpcls_subtable_lookup_register(struct dpcls_subtable_lookup_info_t *lookup)
+{
+    VLOG_DBG("Registering dpcls subtable lookup implementation: %s"
+             ", priority: %d.", lookup->name, lookup->prio);
+    ovs_assert(subtable_lookups_size < LOOKUPS_MAX);
+    subtable_lookups[subtable_lookups_size++] = *lookup;
+}
+
+void
+dpcls_subtable_lookup_init(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+
+    if (!ovsthread_once_start(&once)) {
+        return;
+    }
+
     /* The autovalidator implementation will not be used by default, it must
      * be enabled at compile time to be the default lookup implementation. The
      * user may enable it at runtime using the normal "prio-set" command if
@@ -31,30 +52,36 @@  static struct dpcls_subtable_lookup_info_t subtable_lookups[] = {
      * tests to transparently run with the autovalidator.
      */
 #ifdef DPCLS_AUTOVALIDATOR_DEFAULT
-    { .prio = 255,
+    dpcls_subtable_autovalidator_register(255);
 #else
-    { .prio = 0,
+    dpcls_subtable_autovalidator_register(0);
 #endif
-      .probe = dpcls_subtable_autovalidator_probe,
-      .name = "autovalidator", },
-
-    /* The default scalar C code implementation. */
-    { .prio = 1,
-      .probe = dpcls_subtable_generic_probe,
-      .name = "generic", },
-
-#if (__x86_64__ && HAVE_AVX512F && HAVE_LD_AVX512_GOOD && __SSE4_2__)
-    /* Only available on x86_64 bit builds with SSE 4.2 used for OVS core. */
-    { .prio = 0,
-      .probe = dpcls_subtable_avx512_gather_probe,
-      .name = "avx512_gather", },
+
+    dpcls_subtable_generic_register(1);
+
+#if (__x86_64__ && HAVE_AVX512_DPCLS)
+    /* Checks below performed here and not inside the _avx512_gather_register
+     * function, because implementation of this function is already built with
+     * support of these instruction sets.  Need to check here to avoid possible
+     * illegal instruction execution. */
+    if (dpdk_get_cpu_has_isa("x86_64", "avx512f") &&
+        dpdk_get_cpu_has_isa("x86_64", "bmi2")    &&
+        dpdk_get_cpu_has_isa("x86_64", "sse4.2")  &&
+        dpdk_get_cpu_has_isa("x86_64", "popcnt")) {
+        /* Runtime checks succeeded.  Current CPU supports all required
+         * instruction sets for avx512 dpcls implementation. */
+        dpcls_subtable_avx512_gather_register(0);
+    }
+
 #else
-    /* Disabling AVX512 at compile time, as compile time requirements not met.
+    /* Not registering AVX512 support as compile time requirements not met.
      * This could be due to a number of reasons:
-     *  1) core OVS is not compiled with SSE4.2 instruction set.
-     *     The SSE42 instructions are required to use CRC32 ISA for high-
-     *     performance hashing. Consider ./configure of OVS with -msse42 (or
-     *     newer) to enable CRC32 hashing and higher performance.
+     *  1) AVX512 or SSE4.2 instruction set not supported by compiler or
+     *     explicitly disabled in compile time.
+     *     The SSE4.2 instructions are required to use CRC32 ISA for high-
+     *     performance hashing. Check that compiler supports -msse4.2 and
+     *     -mavx512f.  Or check that OVS is not configured with -mno-sse4.2,
+     *     -mno-avx512f or similar.
      *  2) The assembler in binutils versions 2.30 and 2.31 has bugs in AVX512
      *     assembly. Compile time probes check for this assembler issue, and
      *     disable the HAVE_LD_AVX512_GOOD check if an issue is detected.
@@ -62,7 +89,8 @@  static struct dpcls_subtable_lookup_info_t subtable_lookups[] = {
      *     2069ccaf8dc28ea699bd901fdd35d90613e4402a
      */
 #endif
-};
+     ovsthread_once_done(&once);
+}
 
 int32_t
 dpcls_subtable_lookup_info_get(struct dpcls_subtable_lookup_info_t **out_ptr)
@@ -72,14 +100,14 @@  dpcls_subtable_lookup_info_get(struct dpcls_subtable_lookup_info_t **out_ptr)
     }
 
     *out_ptr = subtable_lookups;
-    return ARRAY_SIZE(subtable_lookups);
+    return subtable_lookups_size;
 }
 
 /* sets the priority of the lookup function with "name". */
 int32_t
 dpcls_subtable_set_prio(const char *name, uint8_t priority)
 {
-    for (int i = 0; i < ARRAY_SIZE(subtable_lookups); i++) {
+    for (int i = 0; i < subtable_lookups_size; i++) {
         if (strcmp(name, subtable_lookups[i].name) == 0) {
                 subtable_lookups[i].prio = priority;
                 VLOG_INFO("Subtable function '%s' set priority to %d\n",
@@ -100,7 +128,7 @@  dpcls_subtable_get_best_impl(uint32_t u0_bit_count, uint32_t u1_bit_count)
     const char *name = NULL;
     dpcls_subtable_lookup_func best_func = NULL;
 
-    for (int i = 0; i < ARRAY_SIZE(subtable_lookups); i++) {
+    for (int i = 0; i < subtable_lookups_size; i++) {
         int32_t probed_prio = subtable_lookups[i].prio;
         if (probed_prio > prio) {
             dpcls_subtable_lookup_func probed_func;
diff --git a/lib/dpif-netdev-lookup.h b/lib/dpif-netdev-lookup.h
index bd72aa29b..e6b23fea7 100644
--- a/lib/dpif-netdev-lookup.h
+++ b/lib/dpif-netdev-lookup.h
@@ -29,11 +29,6 @@  typedef
 dpcls_subtable_lookup_func (*dpcls_subtable_probe_func)(uint32_t u0_bit_count,
                                                         uint32_t u1_bit_count);
 
-/* Prototypes for subtable implementations */
-dpcls_subtable_lookup_func
-dpcls_subtable_autovalidator_probe(uint32_t u0_bit_count,
-                                   uint32_t u1_bit_count);
-
 /* Probe function to select a specialized version of the generic lookup
  * implementation. This provides performance benefit due to compile-time
  * optimizations such as loop-unrolling. These are enabled by the compile-time
@@ -42,11 +37,6 @@  dpcls_subtable_autovalidator_probe(uint32_t u0_bit_count,
 dpcls_subtable_lookup_func
 dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
 
-/* Probe function for AVX-512 gather implementation */
-dpcls_subtable_lookup_func
-dpcls_subtable_avx512_gather_probe(uint32_t u0_bit_cnt, uint32_t u1_bit_cnt);
-
-
 /* Subtable registration and iteration helpers */
 struct dpcls_subtable_lookup_info_t {
     /* higher priority gets used over lower values. This allows deployments
@@ -64,6 +54,15 @@  struct dpcls_subtable_lookup_info_t {
     const char *name;
 };
 
+void dpcls_subtable_lookup_init(void);
+
+void dpcls_subtable_lookup_register(struct dpcls_subtable_lookup_info_t *);
+
+/* Register functions for different subtable lookup implementations. */
+void dpcls_subtable_autovalidator_register(uint8_t priority);
+void dpcls_subtable_generic_register(uint8_t priority);
+void dpcls_subtable_avx512_gather_register(uint8_t priority);
+
 int32_t dpcls_subtable_set_prio(const char *name, uint8_t priority);
 
 dpcls_subtable_lookup_func
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 816945375..36b6d1e41 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1650,6 +1650,9 @@  dpif_netdev_init(void)
     unixctl_command_register("dpif-netdev/subtable-lookup-prio-get", "",
                              0, 0, dpif_netdev_subtable_lookup_get,
                              NULL);
+
+    dpcls_subtable_lookup_init();
+
     return 0;
 }
 
diff --git a/m4/openvswitch.m4 b/m4/openvswitch.m4
index 244ea0fba..fa6b967ac 100644
--- a/m4/openvswitch.m4
+++ b/m4/openvswitch.m4
@@ -412,7 +412,6 @@  AC_DEFUN([OVS_CHECK_BINUTILS_AVX512],
      if ($CC -dumpmachine | grep x86_64) >/dev/null 2>&1; then
        if (objdump -d  --no-show-raw-insn $OBJFILE | grep -q $GATHER_PARAMS) >/dev/null 2>&1; then
          ovs_cv_binutils_avx512_good=yes
-         CFLAGS="$CFLAGS -DHAVE_LD_AVX512_GOOD"
        else
          ovs_cv_binutils_avx512_good=no
          dnl Explicitly disallow avx512f to stop compiler auto-vectorizing