
[ovs-dev,v13,12/12] dpcls-avx512: Enable avx512 vector popcount instruction.

Message ID 20210617161825.94741-13-cian.ferriter@intel.com
State Changes Requested
Series: DPIF Framework + Optimizations

Commit Message

Ferriter, Cian June 17, 2021, 4:18 p.m. UTC
From: Harry van Haaren <harry.van.haaren@intel.com>

This commit enables the AVX512-VPOPCNTDQ Vector Popcount
instruction. This instruction is not available on every CPU
that supports the AVX512-F Foundation ISA, hence it is enabled
only when the additional VPOPCNTDQ ISA check is passed.

The vector popcount instruction is used instead of the AVX512
popcount emulation code present in the avx512 optimized DPCLS today.
It provides higher performance in the SIMD miniflow processing
as that requires the popcount to calculate the miniflow block indexes.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

v13:
- Rebased and Improved comment on use_vpop variable (Ian)
---
 NEWS                                   |  3 +
 lib/dpdk.c                             |  1 +
 lib/dpif-netdev-lookup-avx512-gather.c | 85 ++++++++++++++++++++------
 3 files changed, 71 insertions(+), 18 deletions(-)

Comments

Flavio Leitner June 24, 2021, 3:57 a.m. UTC | #1
On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> From: Harry van Haaren <harry.van.haaren@intel.com>
> 
> This commit enables the AVX512-VPOPCNTDQ Vector Popcount
> instruction. This instruction is not available on every CPU
> that supports the AVX512-F Foundation ISA, hence it is enabled
> only when the additional VPOPCNTDQ ISA check is passed.
> 
> The vector popcount instruction is used instead of the AVX512
> popcount emulation code present in the avx512 optimized DPCLS today.
> It provides higher performance in the SIMD miniflow processing
> as that requires the popcount to calculate the miniflow block indexes.
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---

Acked-by: Flavio Leitner <fbl@sysclose.org>

This patch series implements low level optimizations by manually
coding instructions. I wonder if gcc couldn't achieve some relevant
level of vectorized optimization by refactoring and enabling
compiler flags. I assume the answer is no, but I would appreciate
some enlightenment on the matter.

Thanks,
fbl
Harry van Haaren June 24, 2021, 11:07 a.m. UTC | #2
> -----Original Message-----
> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> Sent: Thursday, June 24, 2021 4:57 AM
> To: Ferriter, Cian <cian.ferriter@intel.com>
> Cc: ovs-dev@openvswitch.org; i.maximets@ovn.org
> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> instruction.
> 
> On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > From: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > This commit enables the AVX512-VPOPCNTDQ Vector Popcount
> > instruction. This instruction is not available on every CPU
> > that supports the AVX512-F Foundation ISA, hence it is enabled
> > only when the additional VPOPCNTDQ ISA check is passed.
> >
> > The vector popcount instruction is used instead of the AVX512
> > popcount emulation code present in the avx512 optimized DPCLS today.
> > It provides higher performance in the SIMD miniflow processing
> > as that requires the popcount to calculate the miniflow block indexes.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> Acked-by: Flavio Leitner <fbl@sysclose.org>

Thanks for reviewing!

> This patch series implements low level optimizations by manually
> coding instructions. I wonder if gcc couldn't get some relevant
> level of vectorized optimizations refactoring and enabling
> compiling flags. I assume the answer is no, but I would appreciate
> some enlightenment on the matter.

Unfortunately no... there is no magic solution here to have the toolchain
provide fallbacks if the latest ISA is not available. You're 100% right, these
are manually implemented versions of new ISA, implemented in "older"
ISA, to allow usage of the functionality. In this case, Skylake grade "AVX512-F"
is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction:
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64%2520&expand=4368,4368)

I do like the idea of toolchain supporting ISA options a bit more, there is
so much compute performance available that is not widely used today.
Such an effort industry wide would be very beneficial to all for improving
performance, but would be a pretty large undertaking too... outside the
scope of this patchset! :)

I'll admit to being a bit of an ISA fan, but there's some magical instructions
that can do stuff in 1x instruction that otherwise take large amounts of
shifts & loops. Did I hear somebody ask for examples..??

Miniflow Bits processing with "BMI" (Bit Manipulation Instructions)
Introduced in Haswell era, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2
- Favorite instructions are pdep and pext (parallel bit deposit, and parallel bit extract)
- Very useful for dense bitfield unpacking, instead of "load - shift - AND" per field, can
   unpack up to 8 bitfields in a u64 and align them to byte-boundaries
- Its "opposite" "pext" also exists, extracting sparse bits from an integer into a packed layout
(pext is used in DPCLS, to pull sparse bits from the packet's miniflow into linear packed layout,
allowing it to be processed in a single packed AVX512 register)

Note that we're all benefitting from novel usage of the scalar "popcount" instruction too, since merging
commit a0b36b392. That instruction (introduced in SSE4.2, with CPUID flag POPCNT) is used in a bitmask &
popcount approach to index into the miniflow, improving on the previous "count and shift bits" approach to iterating miniflows.

There are likely multiple other places in OVS where we spend significant cycles
on processing data in ways that can be accelerated significantly by using all available ISA.
There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, allowing parsing
of multiple packet protocols at the same time (see here https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470)

I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
conversations about specific areas of compute in OVS if there's appetite for that! Feel free
to email to OVS Mailing list (with me on CC please :) or email directly OK too.

Regards, -Harry
Ilya Maximets June 24, 2021, 11:41 a.m. UTC | #3
On 6/24/21 1:07 PM, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
>> Sent: Thursday, June 24, 2021 4:57 AM
>> To: Ferriter, Cian <cian.ferriter@intel.com>
>> Cc: ovs-dev@openvswitch.org; i.maximets@ovn.org
>> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
>> instruction.
>>
>> On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
>>> From: Harry van Haaren <harry.van.haaren@intel.com>
>>>
>>> This commit enables the AVX512-VPOPCNTDQ Vector Popcount
>>> instruction. This instruction is not available on every CPU
>>> that supports the AVX512-F Foundation ISA, hence it is enabled
>>> only when the additional VPOPCNTDQ ISA check is passed.
>>>
>>> The vector popcount instruction is used instead of the AVX512
>>> popcount emulation code present in the avx512 optimized DPCLS today.
>>> It provides higher performance in the SIMD miniflow processing
>>> as that requires the popcount to calculate the miniflow block indexes.
>>>
>>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>>
>> Acked-by: Flavio Leitner <fbl@sysclose.org>
> 
> Thanks for reviewing!
> 
>> This patch series implements low level optimizations by manually
>> coding instructions. I wonder if gcc couldn't get some relevant
>> level of vectorized optimizations refactoring and enabling
>> compiling flags. I assume the answer is no, but I would appreciate
>> some enlightenment on the matter.
> 
> Unfortunately no... there is no magic solution here to have the toolchain
> provide fallbacks if the latest ISA is not available. You're 100% right, these
> are manually implemented versions of new ISA, implemented in "older"
> ISA, to allow usage of the functionality. In this case, Skylake grade "AVX512-F"
> is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction:
> (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64%2520&expand=4368,4368)
> 
> I do like the idea of toolchain supporting ISA options a bit more, there is
> so much compute performance available that is not widely used today.
> Such an effort industry wide would be very beneficial to all for improving
> performance, but would be a pretty large undertaking too... outside the
> scope of this patchset! :)
> 
> I'll admit to being a bit of an ISA fan, but there's some magical instructions
> that can do stuff in 1x instruction that otherwise take large amounts of
> shifts & loops. Did I hear somebody ask for examples..??
> 
> Miniflow Bits processing with "BMI" (Bit Manipulation Instructions)
> Introduced in Haswell era, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2
> - Favorite instructions are pdep and pext (parallel bit deposit, and parallel bit extract)
> - Very useful for dense bitfield unpacking, instead of "load - shift - AND" per field, can
>    unpack up to 8 bitfields in a u64 and align them to byte-boundaries
> - Its "opposite" "pext" also exists, extracting sparse bits from an integer into a packed layout
> (pext is used in DPCLS, to pull sparse bits from the packet's miniflow into linear packed layout,
> allowing it to be processed in a single packed AVX512 register)
> 
> Note that we're all benefitting from novel usage of the scalar "popcount" instruction too, since merging
> commit: a0b36b392 (introduced in SSE4.2, with CPUID flag POPCNT) It uses a bitmask & popcount approach
> to index into the miniflow, improving on the previous "count and shifts bits" to iterate miniflows approach.
> 
> There are likely multiple other places in OVS where we spend significant cycles
> on processing data in ways that can be accelerated significantly by using all available ISA.
> There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, allowing parsing
> of multiple packet protocols at the same time (see here https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470)
> 
> I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
> conversations about specific areas of compute in OVS if there's appetite for that! Feel free
> to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> 
> Regards, -Harry
> 

Speaking of "magic" compiler optimizations, I'm wondering what
kind of performance improvement we can have by just compiling
"generic" implementations of DPCLS and other stuff with the same
flags with which we're compiling hand-crafted avx512 code.
I mean, if we'll have a separate .c file that would include
lib/dpif-netdev-lookup-generic.c (With some MACRO tricks to
generate a different name for the classifier callback) and will
compile it as part of libopenvswitchavx512 and have a separate
implementation switch for it at runtime.  Did you consider this
kind of solution?

It would be interesting to compare manual optimizations with
automatic.  I'm pretty sure that manual will be faster, but
it would be great to know the difference.
Maybe you have numbers for comparison where the whole OVS
just built with the same instruction set available?

Best regards, Ilya Maximets.
Harry van Haaren June 24, 2021, 12:07 p.m. UTC | #4
> -----Original Message-----
> From: Ilya Maximets <i.maximets@ovn.org>
> Sent: Thursday, June 24, 2021 12:42 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Flavio Leitner
> <fbl@sysclose.org>; Ferriter, Cian <cian.ferriter@intel.com>
> Cc: ovs-dev@openvswitch.org; i.maximets@ovn.org; Amber, Kumar
> <kumar.amber@intel.com>
> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> instruction.
> 
> On 6/24/21 1:07 PM, Van Haaren, Harry wrote:

<snip lots of ISA discussion & commit message>

> > I'll stop promoting ISA here, but am happy to continue detailed discussions, or
> break out
> > conversations about specific areas of compute in OVS if there's appetite for that!
> Feel free
> > to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> >
> > Regards, -Harry
> >
> 
> Speaking of "magic" compiler optimizations, I'm wondering what
> kind of performance improvement we can have by just compiling
> "generic" implementations of DPCLS and other stuff with the same
> flags with which we're compiling hand-crafted avx512 code.

That's pretty easy to do? CFLAGS="-march=skylake-avx512" on a Skylake
or newer CPU will achieve that. Or "-march=native" for whatever CPU
it is you're compiling on will enable all available ISA on that machine.

Note that subtable search specialization is actually a *huge* help to the
compiler in this case, as it can (at compile time) know how many times
a specific loop can be unrolled... and loop unrolling into SIMD code is
often the easiest of transforms to do & validate as correct for the compiler.

Look up CPU SIMD vector optimization, and 99.999% of the time the
example given is a float matrix multiply. Why? It has a nice property of loop-
unrolling into a SIMD register, and this optimization is inside a hot loop.
It’s the "home run" of compiler auto-vectorization. Packet processing
is much more complex in nature, and I've never seen a complex scalar
function be neatly vectorized by a compiler yet...


> I mean, if we'll have a separate .c file that would include
> lib/dpif-netdev-lookup-generic.c (With some MACRO tricks to
> generate a different name for the classifier callback) and will
> compile it as part of libopenvswitchavx512 and have a separate
> implementation switch for it in runtime.  Did you consider this
> kind of solution?

Not really, because there's no actual benefit. Try compiling all
of OVS as above with CFLAGS="-march=skylake-avx512" and see
how much the compiler manages to actually vectorize into SIMD code...

Unfortunately, compiler transformations such as scalar -> vector code are
complex, and many "hazards" exist that disallow vectorization.

Things like two pointers of the same type being loaded/stored through are already
enough to stop the compiler, as those two pointers may overlap. This can be solved with
"restrict" C keyword, to inform the compiler that it is impossible to access the memory
region pointed to by that pointer through any other way... this would require large changes
to the OVS codebase to indicate that one struct flow* or struct miniflow* cannot overlap
with another.

Once that is done, we must rely on the compiler to actually understand the data
movement taking place, and be able to *guarantee correctness* for any input data.
As humans, we can reason about specific cases and rule them out. The compiler is not
allowed to do this, hence CPU auto-vectorization often just doesn't work.

Lastly, any if() conditions that contain stores must be made "branch free".
As x86-64 has Total Store Ordering, it is difficult for the compiler to take liberties
in re-ordering stores from program order. The result is that if the order of stores in
your program would change because the compiler auto-vectorized the code, the
transformation is not valid, so the compiler will not emit it.


> It would be interesting to compare manual optimizations with
> automatic.  I'm pretty sure that manual will be faster, but
> it would be great to know the difference.
> Maybe you have numbers for comparison where the whole OVS
> just built with the same instruction set available?

Note that glibc functions such as memcmp() etc. are already optimized for
the ISA that is available. The dynamic linker selects, at load time, which
ISA-specific implementation of such functions to bind (via IFUNC resolution),
so functions like memcmp() and memcpy() can use SIMD registers "under the hood".
This may help e.g. EMC for compares, if it's not memory-bound on loading the
data to be compared.

In my experience, the compiler is not able to automatically vectorize to use
AVX512 SIMD instructions in any meaningful way to accelerate datapath.
As a result, the performance is pretty similar to the scalar code, within a few %.

Feel free to test this, the CFLAGS="" string is above. I would be a bit surprised
if there were > 5% differences in Phy-Phy end-to-end performance. 

> Best regards, Ilya Maximets.

Regards, -Harry
Flavio Leitner June 24, 2021, 12:17 p.m. UTC | #5
Hi Harry,

On Thu, Jun 24, 2021 at 11:07:59AM +0000, Van Haaren, Harry wrote:
> > -----Original Message-----
> > From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> > Sent: Thursday, June 24, 2021 4:57 AM
> > To: Ferriter, Cian <cian.ferriter@intel.com>
> > Cc: ovs-dev@openvswitch.org; i.maximets@ovn.org
> > Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> > instruction.
> > 
> > On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > > From: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> > > This commit enables the AVX512-VPOPCNTDQ Vector Popcount
> > > instruction. This instruction is not available on every CPU
> > > that supports the AVX512-F Foundation ISA, hence it is enabled
> > > only when the additional VPOPCNTDQ ISA check is passed.
> > >
> > > The vector popcount instruction is used instead of the AVX512
> > > popcount emulation code present in the avx512 optimized DPCLS today.
> > > It provides higher performance in the SIMD miniflow processing
> > > as that requires the popcount to calculate the miniflow block indexes.
> > >
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > 
> > Acked-by: Flavio Leitner <fbl@sysclose.org>
> 
> Thanks for reviewing!
> 
> > This patch series implements low level optimizations by manually
> > coding instructions. I wonder if gcc couldn't get some relevant
> > level of vectorized optimizations refactoring and enabling
> > compiling flags. I assume the answer is no, but I would appreciate
> > some enlightenment on the matter.
> 
> Unfortunately no... there is no magic solution here to have the toolchain
> provide fallbacks if the latest ISA is not available. You're 100% right, these
> are manually implemented versions of new ISA, implemented in "older"
> ISA, to allow usage of the functionality. In this case, Skylake grade "AVX512-F"
> is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction:
> (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64%2520&expand=4368,4368)
> 
> I do like the idea of toolchain supporting ISA options a bit more, there is
> so much compute performance available that is not widely used today.
> Such an effort industry wide would be very beneficial to all for improving
> performance, but would be a pretty large undertaking too... outside the
> scope of this patchset! :)

Yeah, it is. I mean, if the toolchain is not ready yet and we think
the benefits are worth it, considering that most probably fewer people
will be able to contribute or maintain the code, then I see no other
way to solve the issue.

Do you think improving the toolchain is a larger commitment than
manually improving applications? A quick look on gcc gave me the
impression that it does support at least some basic vector
optimization capabilities.


> I'll admit to being a bit of an ISA fan, but there's some magical instructions
> that can do stuff in 1x instruction that otherwise take large amounts of
> shifts & loops. Did I hear somebody ask for examples..??

Out of curiosity, which tool are you using (if you are) to measure
the improvements at the cycle level? VTune?


> Miniflow Bits processing with "BMI" (Bit Manipulation Instructions)
> Introduced in Haswell era, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2
> - Favorite instructions are pdep and pext (parallel bit deposit, and parallel bit extract)
> - Very useful for dense bitfield unpacking, instead of "load - shift - AND" per field, can
>    unpack up to 8 bitfields in a u64 and align them to byte-boundaries
> - Its "opposite" "pext" also exists, extracting sparse bits from an integer into a packed layout
> (pext is used in DPCLS, to pull sparse bits from the packet's miniflow into linear packed layout,
> allowing it to be processed in a single packed AVX512 register)
> 
> Note that we're all benefitting from novel usage of the scalar "popcount" instruction too, since merging
> commit: a0b36b392 (introduced in SSE4.2, with CPUID flag POPCNT) It uses a bitmask & popcount approach
> to index into the miniflow, improving on the previous "count and shifts bits" to iterate miniflows approach.
> 
> There are likely multiple other places in OVS where we spend significant cycles
> on processing data in ways that can be accelerated significantly by using all available ISA.
> There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, allowing parsing
> of multiple packet protocols at the same time (see here https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470)
> 
> I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
> conversations about specific areas of compute in OVS if there's appetite for that! Feel free
> to email to OVS Mailing list (with me on CC please :) or email directly OK too.

I am definitely learning more about it and I appreciated your
longer reply.

Thanks,
Harry van Haaren June 24, 2021, 12:52 p.m. UTC | #6
> -----Original Message-----
> From: Flavio Leitner <fbl@sysclose.org>
> Sent: Thursday, June 24, 2021 1:18 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> i.maximets@ovn.org; Amber, Kumar <kumar.amber@intel.com>
> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> instruction.
> 
> 
> Hi Harry,
> 
> On Thu, Jun 24, 2021 at 11:07:59AM +0000, Van Haaren, Harry wrote:
> > > -----Original Message-----
> > > From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> > > Sent: Thursday, June 24, 2021 4:57 AM
> > > To: Ferriter, Cian <cian.ferriter@intel.com>
> > > Cc: ovs-dev@openvswitch.org; i.maximets@ovn.org
> > > Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> > > instruction.
> > >
> > > On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > > > From: Harry van Haaren <harry.van.haaren@intel.com>

<snip some previous discussion detail away>

> > I do like the idea of toolchain supporting ISA options a bit more, there is
> > so much compute performance available that is not widely used today.
> > Such an effort industry wide would be very beneficial to all for improving
> > performance, but would be a pretty large undertaking too... outside the
> > scope of this patchset! :)
> 
> Yeah, it is. I mean, if the toolchain is not ready yet and we think
> worth the benefits considering that most probably fewer people will
> be able to contribute or maintain, then I see no other way to solve
> the issue.

So the toolchain is "ready" in that we have a path to enable CPU ISA, and
see the benefits. We can dream about future toolchains, and how those might
improve our workflow in future, but pragmatically the approach here is the
best-known-method based on available tools today. DPDK uses the same
techniques (Function pointer, CPUID based ISA check, and plug in ISA if available).

Improving the toolchain would only solve the problem to allow the compiler to use the
CPU ISA. This does not solve the problem of the compiler not being able to understand
the data-movement & processing to be able to reason about it and auto-vectorize.

> Do you think improving the toolchain is a larger commitment than
> manually improving applications? A quick look on gcc gave me the
> impression that it does support at least some basic vector
> optimization capabilities.

Yes - you raise a good point, "basic vector optimization capabilities" are present
in various compilers (gcc and clang/llvm is what I test with). For the matrix-multiply
problem that is often used to showcase compiler auto-vectorization, it is an extremely
well-bounded and simple task in terms of understanding the work to be done.

Our emails crossed paths, there's more detail here about matrix multiply & basic vectorization.
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384377.html


> > I'll admit to being a bit of an ISA fan, but there's some magical instructions
> > that can do stuff in 1x instruction that otherwise take large amounts of
> > shifts & loops. Did I hear somebody ask for examples..??
> 
> Out of curiosity, which tool are you using (if you are) to measure
> the improvements at cycles level? vtune?

I use the Linux Perf tooling for performance measurements, along with OVS's
own per-packet cycle count reporting. Hardware performance counters (as used by
Linux Perf and VTune) provide all the info that's required.

For those not measuring performance at the function/ASM level, run the following
command and view the performance in your terminal:    perf top -C <pmd_core> -b

Based on that, focus on the area's where lots of cycles are spent, and investigate
alternative SIMD based implementations for that same functionality, making use
of the CPU ISA. That's the general workflow :)

For those particularly interested, I gave a "Measure Software Performance of Data Plane Applications"
talk at DPDK Userspace in 2019 covering workflow/method: https://www.youtube.com/watch?v=ZmwOKR5JyPk


<snip lots of ISA details>

> > I'll stop promoting ISA here, but am happy to continue detailed discussions, or
> break out
> > conversations about specific areas of compute in OVS if there's appetite for that!
> Feel free
> > to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> 
> I am definitely learning more about it and I appreciated your
> longer reply.

As you may notice, this is an area I'm passionate about. If there's specific interest,
I can volunteer to give a "measuring OVS's SW datapath performance" talk at a
future OVS conference.

Regards, -Harry
Flavio Leitner June 24, 2021, 6:04 p.m. UTC | #7
On Thu, Jun 24, 2021 at 12:52:49PM +0000, Van Haaren, Harry wrote:
> > On Thu, Jun 24, 2021 at 11:07:59AM +0000, Van Haaren, Harry wrote:
> > > > On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > > > > From: Harry van Haaren <harry.van.haaren@intel.com>
> 
> > > I do like the idea of toolchain supporting ISA options a bit more, there is
> > > so much compute performance available that is not widely used today.
> > > Such an effort industry wide would be very beneficial to all for improving
> > > performance, but would be a pretty large undertaking too... outside the
> > > scope of this patchset! :)
> > 
> > Yeah, it is. I mean, if the toolchain is not ready yet and we think
> > worth the benefits considering that most probably fewer people will
> > be able to contribute or maintain, then I see no other way to solve
> > the issue.
> 
> So the toolchain is "ready" in that we have a path to enable CPU ISA, and
> see the benefits. We can dream about future toolchains, and how those might
> improve our workflow in future, but pragmatically the approach here is the
> best-known-method based on available tools today. DPDK uses the same
> techniques (Function pointer, CPUID based ISA check, and plug in ISA if available).
> 
> Improving the toolchain would only solve the problem to allow the compiler to use the
> CPU ISA. This does not solve the problem of the compiler not being able to understand
> the data-movement & processing to be able to reason about it and auto-vectorize.

Yeah, the examples I found are straightforward uses of ISA as you said,
so I wasn't sure how much a compiler is able to help nowadays.


> > Do you think improving the toolchain is a larger commitment than
> > manually improving applications? A quick look on gcc gave me the
> > impression that it does support at least some basic vector
> > optimization capabilities.
> 
> Yes - you raise a good point, "basic vector optimization capabilities" are present
> in various compilers (gcc and clang/llvm is what I test with). For the matrix-multiply
> problem that is often used to showcase compiler auto-vectorization, it is an extremely
> well bounded, and simple task from understanding the work to be done.
> 
> Our emails crossed paths, there's more detail here about matrix multiply & basic vectorization.
> https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384377.html

Exactly :) For sure we want OVS to run faster, but there needs to be
a line on how low-level we can go, because it's always a trade-off with
complexity. In this case the line was blurry, at least to me, because
I wasn't aware of how far the toolchain can help us.

Do you think these optimizations will be a problem with Windows or
BSDs? I haven't found an alternative to Cirrus which I used before
to build on BSD.


> > > I'll admit to being a bit of an ISA fan, but there's some magical instructions
> > > that can do stuff in 1x instruction that otherwise take large amounts of
> > > shifts & loops. Did I hear somebody ask for examples..??
> > 
> > Out of curiosity, which tool are you using (if you are) to measure
> > the improvements at cycles level? vtune?
> 
> I use the Linux Perf tooling for performance measurements, along with OVS's
> own per-packet cycle count reporting. Hardware performance measuring (as Linux
> Perf and VTune use) provide all the info that's required.
> 
> For those not measuring performance at the function/ASM level, run the following
> commands and view the performance in your terminal:    perf top -C <pmd_core> -b
> 
> Based on that, focus on the area's where lots of cycles are spent, and investigate
> alternative SIMD based implementations for that same functionality, making use
> of the CPU ISA. That's the general workflow :)

Yup, I am familiar with most of those except VTune, so I wondered
if it provided more insight into the impact of AVX512 optimizations.

> For those particularly interested, I done a "Measure Software Performance of Data Plane Applications"
> talk at DPDK Userspace in 2019 talking about workflow/method: https://www.youtube.com/watch?v=ZmwOKR5JyPk

Great, thanks for sharing it.

> <snip lots of ISA details>
> 
> > > I'll stop promoting ISA here, but am happy to continue detailed discussions, or
> > break out
> > > conversations about specific areas of compute in OVS if there's appetite for that!
> > Feel free
> > > to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> > 
> > I am definitely learning more about it and I appreciated your
> > longer reply.
> 
> As you may notice, this is an area I'm passionate about. If there's specific interest,
> I can volunteer to try cover "measuring OVS's SW datapath performance" talk at a
> future OVS conference..

I'd say that interesting talks are always welcome! :)

One thing that maybe you have interest is to increase datapath
visibility with regards to performance. Today there are some
statistics, but maybe there could be more to potentially help
to monitor or pinpoint permanent (or transient) bottlenecks,
CPU cache misses, and so on. Given that the datapath deals
with traffic and flow tables, and that both can be unpredictable,
the more visibility we have into how efficiently it is running,
the better.

One idea that comes to mind after reviewing these patches as
an example, is that it seems cheap now to build a histogram
of how many different flows were used in a single batch. Say
that OVS received 32 packets in a batch, 30 of them matched
a single flow while the remaining 2 matched another flow. It
could build a histogram per port on how many flows were used
per batch.
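The flows-per-batch histogram described above can be sketched in a few lines. This is purely illustrative of the proposed statistic; the function name and the flow-id lists are hypothetical, not OVS code:

```python
from collections import Counter

def batch_flow_histogram(batches):
    """Count, over all rx batches, how many distinct flows each batch hit.

    batches: list of batches, each batch a list of per-packet flow ids.
    Returns {distinct_flow_count: number_of_batches_with_that_count}.
    """
    hist = Counter()
    for batch in batches:
        hist[len(set(batch))] += 1
    return dict(hist)

# The example from the text: a 32-packet batch where 30 packets match
# one flow and the remaining 2 match another, so that batch hit 2 flows.
batches = [["flow_a"] * 30 + ["flow_b"] * 2]
print(batch_flow_histogram(batches))  # {2: 1}
```

In the datapath itself this would amount to one small per-port counter array updated per batch, which is why it looks cheap relative to the per-packet work.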

Again, that is just an example of stats at the batching and
flow-processing level that would be helpful for understanding
workloads and that could apparently leverage AVX512.

Thanks,

diff --git a/NEWS b/NEWS
index f03bfeb5d..bc1db7948 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,9 @@  Post-v2.15.0
      * Enable AVX512 optimized DPCLS to search subtables with larger miniflows.
      * Add more specialized DPCLS subtables to cover common rules, enhancing
        the lookup performance.
+     * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if the
+       CPU supports it. This enhances performance by using the native vpopcount
+       instructions, instead of the emulated version of vpopcount.
    - ovs-ctl:
      * New option '--no-record-hostname' to disable hostname configuration
        in ovsdb on startup.
diff --git a/lib/dpdk.c b/lib/dpdk.c
index c883a4b8b..a9494a40f 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -655,6 +655,7 @@  dpdk_get_cpu_has_isa(const char *arch, const char *feature)
 #if __x86_64__
     /* CPU flags only defined for the architecture that support it. */
     CHECK_CPU_FEATURE(feature, "avx512f", RTE_CPUFLAG_AVX512F);
+    CHECK_CPU_FEATURE(feature, "avx512vpopcntdq", RTE_CPUFLAG_AVX512VPOPCNTDQ);
     CHECK_CPU_FEATURE(feature, "bmi2", RTE_CPUFLAG_BMI2);
 #endif
 
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
index 0b51ef9dc..bc359dc4a 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -53,6 +53,15 @@ 
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
 
+
+/* Wrapper function required to enable ISA. */
+static inline __m512i
+__attribute__((__target__("avx512vpopcntdq")))
+_mm512_popcnt_epi64_wrapper(__m512i v_in)
+{
+    return _mm512_popcnt_epi64(v_in);
+}
+
 static inline __m512i
 _mm512_popcnt_epi64_manual(__m512i v_in)
 {
@@ -131,6 +140,7 @@  netdev_rule_matches_key(const struct dpcls_rule *rule,
  *   pkt_mf_u0_pop: population count of bits in u0 of the packet.
  *   zero_mask: bitmask of lanes to zero as packet doesn't have mf bits set.
  *   u64_lanes_mask: bitmask of lanes to process.
+ *   use_vpop: compile-time constant indicating if VPOPCNT instruction allowed.
  */
 static inline ALWAYS_INLINE __m512i
 avx512_blocks_gather(__m512i v_u0,
@@ -141,7 +151,8 @@  avx512_blocks_gather(__m512i v_u0,
                      __mmask64 u1_bcast_msk,
                      const uint64_t pkt_mf_u0_pop,
                      __mmask64 zero_mask,
-                     __mmask64 u64_lanes_mask)
+                     __mmask64 u64_lanes_mask,
+                     const uint32_t use_vpop)
 {
         /* Suggest to compiler to load tbl blocks ahead of gather(). */
         __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask,
@@ -155,8 +166,15 @@  avx512_blocks_gather(__m512i v_u0,
                                                       tbl_mf_masks);
         __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks);
 
-        /* Manual AVX512 popcount for u64 lanes. */
-        __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
+        /* Calculate AVX512 popcount for u64 lanes using the native instruction
+         * if available, or using emulation if not available.
+         */
+        __m512i v_popcnts;
+        if (use_vpop) {
+            v_popcnts = _mm512_popcnt_epi64_wrapper(v_masks);
+        } else {
+            v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
+        }
 
         /* Add popcounts and offset for u1 bits. */
         __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk,
@@ -181,7 +199,8 @@  avx512_lookup_impl(struct dpcls_subtable *subtable,
                    const struct netdev_flow_key *keys[],
                    struct dpcls_rule **rules,
                    const uint32_t bit_count_u0,
-                   const uint32_t bit_count_u1)
+                   const uint32_t bit_count_u1,
+                   const uint32_t use_vpop)
 {
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[BLOCKS_CACHE_SIZE];
     uint32_t hashes[NETDEV_MAX_BURST];
@@ -233,7 +252,8 @@  avx512_lookup_impl(struct dpcls_subtable *subtable,
                                                 u1_bcast_mask,
                                                 pkt_mf_u0_pop,
                                                 zero_mask,
-                                                bit_count_total_mask);
+                                                bit_count_total_mask,
+                                                use_vpop);
         _mm512_storeu_si512(&block_cache[i * MF_BLOCKS_PER_PACKET], v_blocks);
 
         if (bit_count_total > 8) {
@@ -254,7 +274,8 @@  avx512_lookup_impl(struct dpcls_subtable *subtable,
                                                     u1_bcast_mask_gt8,
                                                     pkt_mf_u0_pop,
                                                     zero_mask_gt8,
-                                                    bit_count_gt8_mask);
+                                                    bit_count_gt8_mask,
+                                                    use_vpop);
             _mm512_storeu_si512(&block_cache[(i * MF_BLOCKS_PER_PACKET) + 8],
                                 v_blocks_gt8);
         }
@@ -303,7 +324,11 @@  avx512_lookup_impl(struct dpcls_subtable *subtable,
     return found_map;
 }
 
-/* Expand out specialized functions with U0 and U1 bit attributes. */
+/* Expand out specialized functions with U0 and U1 bit attributes. As the
+ * AVX512 vpopcnt instruction is not supported on all AVX512 capable CPUs,
+ * create two functions for each miniflow signature. This allows the runtime
+ * CPU detection in probe() to select the ideal implementation.
+ */
 #define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
     static uint32_t                                                           \
     dpcls_avx512_gather_mf_##U0##_##U1(struct dpcls_subtable *subtable,       \
@@ -311,7 +336,20 @@  avx512_lookup_impl(struct dpcls_subtable *subtable,
                                        const struct netdev_flow_key *keys[],  \
                                        struct dpcls_rule **rules)             \
     {                                                                         \
-        return avx512_lookup_impl(subtable, keys_map, keys, rules, U0, U1);   \
+        const uint32_t use_vpop = 0;                                          \
+        return avx512_lookup_impl(subtable, keys_map, keys, rules,            \
+                                  U0, U1, use_vpop);                          \
+    }                                                                         \
+                                                                              \
+    static uint32_t __attribute__((__target__("avx512vpopcntdq")))            \
+    dpcls_avx512_gather_mf_##U0##_##U1##_vpop(struct dpcls_subtable *subtable,\
+                                       uint32_t keys_map,                     \
+                                       const struct netdev_flow_key *keys[],  \
+                                       struct dpcls_rule **rules)             \
+    {                                                                         \
+        const uint32_t use_vpop = 1;                                          \
+        return avx512_lookup_impl(subtable, keys_map, keys, rules,            \
+                                  U0, U1, use_vpop);                          \
     }                                                                         \
 
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4)
@@ -321,11 +359,18 @@  DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)
 
-/* Check if a specialized function is valid for the required subtable. */
-#define CHECK_LOOKUP_FUNCTION(U0, U1)                                         \
+/* Check if a specialized function is valid for the required subtable.
+ * The use_vpop variable is used to decide if the VPOPCNT instruction can be
+ * used or not.
+ */
+#define CHECK_LOOKUP_FUNCTION(U0, U1, use_vpop)                               \
     ovs_assert((U0 + U1) <= (NUM_U64_IN_ZMM_REG * 2));                        \
     if (!f && u0_bits == U0 && u1_bits == U1) {                               \
-        f = dpcls_avx512_gather_mf_##U0##_##U1;                               \
+        if (use_vpop) {                                                       \
+            f = dpcls_avx512_gather_mf_##U0##_##U1##_vpop;                    \
+        } else {                                                              \
+            f = dpcls_avx512_gather_mf_##U0##_##U1;                           \
+        }                                                                     \
     }
 
 static uint32_t
@@ -333,9 +378,11 @@  dpcls_avx512_gather_mf_any(struct dpcls_subtable *subtable, uint32_t keys_map,
                            const struct netdev_flow_key *keys[],
                            struct dpcls_rule **rules)
 {
+    const uint32_t use_vpop = 0;
     return avx512_lookup_impl(subtable, keys_map, keys, rules,
                               subtable->mf_bits_set_unit0,
-                              subtable->mf_bits_set_unit1);
+                              subtable->mf_bits_set_unit1,
+                              use_vpop);
 }
 
 dpcls_subtable_lookup_func
@@ -349,12 +396,14 @@  dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits)
         return NULL;
     }
 
-    CHECK_LOOKUP_FUNCTION(9, 4);
-    CHECK_LOOKUP_FUNCTION(9, 1);
-    CHECK_LOOKUP_FUNCTION(5, 3);
-    CHECK_LOOKUP_FUNCTION(5, 1);
-    CHECK_LOOKUP_FUNCTION(4, 1);
-    CHECK_LOOKUP_FUNCTION(4, 0);
+    int use_vpop = dpdk_get_cpu_has_isa("x86_64", "avx512vpopcntdq");
+
+    CHECK_LOOKUP_FUNCTION(9, 4, use_vpop);
+    CHECK_LOOKUP_FUNCTION(9, 1, use_vpop);
+    CHECK_LOOKUP_FUNCTION(5, 3, use_vpop);
+    CHECK_LOOKUP_FUNCTION(5, 1, use_vpop);
+    CHECK_LOOKUP_FUNCTION(4, 1, use_vpop);
+    CHECK_LOOKUP_FUNCTION(4, 0, use_vpop);
 
     /* Check if the _any looping version of the code can perform this miniflow
      * lookup. Performance gain may be less pronounced due to non-specialized