Message ID | 1560422287-30298-1-git-send-email-Yanqin.Wei@arm.com |
---|---|
State | Accepted |
Commit | a0f7bf222030f4cdbce0bda66bd9f2bc6983d9db |
Headers | show |
Series | [ovs-dev,v1] util: implement count_1bits with Neon intrinsics or gcc built-in for aarch64. | expand |
On Thu, Jun 13, 2019 at 06:38:07PM +0800, Yanqin Wei wrote: > Userspace datapath needs to traverse through miniflow values many times. In > this process, 'count_1bits' operation for 'Flowmap' significantly impact > performance. On arm, this function was defined by portable implementation > because gcc for arm does not support popcnt feature. > But in the aarch64, VCNT neon instruction can accelerate "count_1bits". > From Gcc-7, the built-in function is implemented with neon intruction. > In this patch, count_1bits function will be impelmented with gcc built-in > from gcc-7 on, and with neon intrinsics in gcc-6. > Performance test was run in two aarch64 machines. In the NIC2NIC test, one > tuple dpcls lookup case achieves around 4% throughput improvement and > 10(average) tuples case achieves around 5% improvement. > > Tested-by: Malvika Gupta <malvika.gupta@arm.com> > Signed-off-by: Yanqin Wei <Yanqin.Wei@arm.com> Thanks! I applied this to master.
diff --git a/lib/util.h b/lib/util.h index 53354f1..2fd01f4 100644 --- a/lib/util.h +++ b/lib/util.h @@ -29,6 +29,9 @@ #include "compiler.h" #include "util.h" #include "openvswitch/util.h" +#if defined(__aarch64__) && __GNUC__ >= 6 +#include <arm_neon.h> +#endif extern char *program_name; @@ -353,8 +356,10 @@ log_2_ceil(uint64_t n) static inline unsigned int count_1bits(uint64_t x) { -#if __GNUC__ >= 4 && __POPCNT__ +#if (__GNUC__ >= 4 && __POPCNT__) || (defined(__aarch64__) && __GNUC__ >= 7) return __builtin_popcountll(x); +#elif defined(__aarch64__) && __GNUC__ >= 6 + return vaddv_u8(vcnt_u8(vcreate_u8(x))); #else /* This portable implementation is the fastest one we know of for 64 * bits, and about 3x faster than GCC 4.7 __builtin_popcountll(). */