diff mbox series

[ovs-dev,v1] util: implement count_1bits with Neon intrinsics or gcc built-in for aarch64.

Message ID 1560422287-30298-1-git-send-email-Yanqin.Wei@arm.com
State Accepted
Commit a0f7bf222030f4cdbce0bda66bd9f2bc6983d9db
Headers show
Series [ovs-dev,v1] util: implement count_1bits with Neon intrinsics or gcc built-in for aarch64. | expand

Commit Message

Yanqin Wei June 13, 2019, 10:38 a.m. UTC
Userspace datapath needs to traverse through miniflow values many times. In
this process, 'count_1bits' operation for 'Flowmap' significantly impact
performance. On arm, this function was defined by portable implementation
because gcc for arm does not support popcnt feature.
But in the aarch64, VCNT neon instruction can accelerate "count_1bits".
From Gcc-7, the built-in function is implemented with neon intruction.
In this patch, count_1bits function will be impelmented with gcc built-in
from gcc-7 on, and with neon intrinsics in gcc-6.
Performance test was run in two aarch64 machines. In the NIC2NIC test, one
tuple dpcls lookup case achieves around 4% throughput improvement and
10(average) tuples case achieves around 5% improvement.

Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Signed-off-by: Yanqin Wei <Yanqin.Wei@arm.com>
---
 lib/util.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Ben Pfaff June 13, 2019, 5:51 p.m. UTC | #1
On Thu, Jun 13, 2019 at 06:38:07PM +0800, Yanqin Wei wrote:
> Userspace datapath needs to traverse through miniflow values many times. In
> this process, 'count_1bits' operation for 'Flowmap' significantly impact
> performance. On arm, this function was defined by portable implementation
> because gcc for arm does not support popcnt feature.
> But in the aarch64, VCNT neon instruction can accelerate "count_1bits".
> From Gcc-7, the built-in function is implemented with neon intruction.
> In this patch, count_1bits function will be impelmented with gcc built-in
> from gcc-7 on, and with neon intrinsics in gcc-6.
> Performance test was run in two aarch64 machines. In the NIC2NIC test, one
> tuple dpcls lookup case achieves around 4% throughput improvement and
> 10(average) tuples case achieves around 5% improvement.
> 
> Tested-by: Malvika Gupta <malvika.gupta@arm.com>
> Signed-off-by: Yanqin Wei <Yanqin.Wei@arm.com>

Thanks!  I applied this to master.
diff mbox series

Patch

diff --git a/lib/util.h b/lib/util.h
index 53354f1..2fd01f4 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -29,6 +29,9 @@ 
 #include "compiler.h"
 #include "util.h"
 #include "openvswitch/util.h"
+#if defined(__aarch64__) && __GNUC__ >= 6
+#include <arm_neon.h>
+#endif
 
 extern char *program_name;
 
@@ -353,8 +356,10 @@  log_2_ceil(uint64_t n)
 static inline unsigned int
 count_1bits(uint64_t x)
 {
-#if __GNUC__ >= 4 && __POPCNT__
+#if (__GNUC__ >= 4 && __POPCNT__) || (defined(__aarch64__) && __GNUC__ >= 7)
     return __builtin_popcountll(x);
+#elif defined(__aarch64__) && __GNUC__ >= 6
+    return vaddv_u8(vcnt_u8(vcreate_u8(x)));
 #else
     /* This portable implementation is the fastest one we know of for 64
      * bits, and about 3x faster than GCC 4.7 __builtin_popcountll(). */