Message ID: 1467190029-694-2-git-send-email-vijayak@cavium.com
State: New
On 29 June 2016 at 09:47, <vijayak@cavium.com> wrote:
> From: Vijay <vijayak@cavium.com>
>
> Use Neon instructions to perform zero checking of
> buffer. This helps in reducing total migration time.
>
> diff --git a/util/cutils.c b/util/cutils.c
> index 5830a68..4779403 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>  #define SPLAT(p) _mm_set1_epi8(*(p))
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
> +#elif __aarch64__
> +#include "arm_neon.h"
> +#define VECTYPE uint64x2_t
> +#define ALL_EQ(v1, v2) \
> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
> +#define VEC_OR(v1, v2) ((v1) | (v2))

Should be '#elif defined(__aarch64__)'. I have made this
tweak and put this patch in target-arm.next.

thanks
-- PMM
On 06/30/2016 06:45 AM, Peter Maydell wrote:
> On 29 June 2016 at 09:47, <vijayak@cavium.com> wrote:
>> From: Vijay <vijayak@cavium.com>
>>
>> Use Neon instructions to perform zero checking of
>> buffer. This helps in reducing total migration time.
>>
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 5830a68..4779403 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>  #define SPLAT(p) _mm_set1_epi8(*(p))
>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>> +#elif __aarch64__
>> +#include "arm_neon.h"
>> +#define VECTYPE uint64x2_t
>> +#define ALL_EQ(v1, v2) \
>> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>
> Should be '#elif defined(__aarch64__)'. I have made this
> tweak and put this patch in target-arm.next.

Consider

  #define VECTYPE uint32x4_t
  #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)

which compiles down to

  1c: 6e211c00  eor   v0.16b, v0.16b, v1.16b
  20: 6eb0a800  umaxv s0, v0.4s
  24: 1e260000  fmov  w0, s0
  28: 6b1f001f  cmp   w0, wzr
  2c: 1a9f17e0  cset  w0, eq
  30: d65f03c0  ret

vs

  34: 4e083c20  mov   x0, v1.d[0]
  38: 4e083c01  mov   x1, v0.d[0]
  3c: eb00003f  cmp   x1, x0
  40: 52800000  mov   w0, #0
  44: 54000040  b.eq  4c <f0+0x18>
  48: d65f03c0  ret
  4c: 4e183c20  mov   x0, v1.d[1]
  50: 4e183c01  mov   x1, v0.d[1]
  54: eb00003f  cmp   x1, x0
  58: 1a9f17e0  cset  w0, eq
  5c: d65f03c0  ret

r~
On 1 July 2016 at 23:07, Richard Henderson <rth@twiddle.net> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>> On 29 June 2016 at 09:47, <vijayak@cavium.com> wrote:
>>> From: Vijay <vijayak@cavium.com>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This helps in reducing total migration time.
>>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p) _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
> Consider
>
>   #define VECTYPE uint32x4_t
>   #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)

Sounds good. Vijay, could you benchmark that variant, please?

thanks
-- PMM
On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>> On 29 June 2016 at 09:47, <vijayak@cavium.com> wrote:
>>> From: Vijay <vijayak@cavium.com>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This helps in reducing total migration time.
>>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p) _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
> Consider
>
>   #define VECTYPE uint32x4_t
>   #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>
> which compiles down to
>
>   1c: 6e211c00  eor   v0.16b, v0.16b, v1.16b
>   20: 6eb0a800  umaxv s0, v0.4s
>   24: 1e260000  fmov  w0, s0
>   28: 6b1f001f  cmp   w0, wzr
>   2c: 1a9f17e0  cset  w0, eq
>   30: d65f03c0  ret

For me this code compiles as below and migration time is ~100ms more.
See below 3 trials of migration time:

  7039cc: 6eb0a800  umaxv s0, v0.4s
  7039d0: 0e043c02  mov   w2, v0.s[0]
  7039d4: 350000c2  cbnz  w2, 7039ec <f0+0xf4>
  7039d8: 91002084  add   x4, x4, #0x8
  7039dc: 91020063  add   x3, x3, #0x80
  7039e0: eb01009f  cmp   x4, x1

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3070 milliseconds
downtime: 55 milliseconds
setup: 4 milliseconds
transferred ram: 300637 kbytes
throughput: 802.49 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062834 pages
skipped: 0 pages
normal: 70489 pages
normal bytes: 281956 kbytes
dirty sync count: 3

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 290277 kbytes
throughput: 775.61 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064185 pages
skipped: 0 pages
normal: 67901 pages
normal bytes: 271604 kbytes
dirty sync count: 3

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 34 milliseconds
setup: 5 milliseconds
transferred ram: 294614 kbytes
throughput: 787.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063365 pages
skipped: 0 pages
normal: 68985 pages
normal bytes: 275940 kbytes
dirty sync count: 3

> vs
>
>   34: 4e083c20  mov   x0, v1.d[0]
>   38: 4e083c01  mov   x1, v0.d[0]
>   3c: eb00003f  cmp   x1, x0
>   40: 52800000  mov   w0, #0
>   44: 54000040  b.eq  4c <f0+0x18>
>   48: d65f03c0  ret
>   4c: 4e183c20  mov   x0, v1.d[1]
>   50: 4e183c01  mov   x1, v0.d[1]
>   54: eb00003f  cmp   x1, x0
>   58: 1a9f17e0  cset  w0, eq
>   5c: d65f03c0  ret

My patch compiles to below code and
takes ~100ms less time:

  #define VECTYPE uint64x2_t
  #define ALL_EQ(v1, v2) \
      ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
       (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))

  7039d0: 4e083c02  mov  x2, v0.d[0]
  7039d4: b5000102  cbnz x2, 7039f4 <f0+0xfc>
  7039d8: 4e183c02  mov  x2, v0.d[1]
  7039dc: b50000c2  cbnz x2, 7039f4 <f0+0xfc>
  7039e0: 91002084  add  x4, x4, #0x8
  7039e4: 91020063  add  x3, x3, #0x80
  7039e8: eb04003f  cmp  x1, x4

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2973 milliseconds
downtime: 67 milliseconds
setup: 5 milliseconds
transferred ram: 293659 kbytes
throughput: 809.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062791 pages
skipped: 0 pages
normal: 68748 pages
normal bytes: 274992 kbytes
dirty sync count: 3

(qemu)
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2972 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 295972 kbytes
throughput: 816.10 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062861 pages
skipped: 0 pages
normal: 69325 pages
normal bytes: 277300 kbytes
dirty sync count: 3

(qemu)
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2982 milliseconds
downtime: 40 milliseconds
setup: 5 milliseconds
transferred ram: 293386 kbytes
throughput: 806.26 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063199 pages
skipped: 0 pages
normal: 68679 pages
normal bytes: 274716 kbytes
dirty sync count: 4

Regards
Vijay
On 5 July 2016 at 13:24, Vijay Kilari <vijay.kilari@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <rth@twiddle.net> wrote:
>> Consider
>>
>>   #define VECTYPE uint32x4_t
>>   #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>>
>> which compiles down to
>>
>>   1c: 6e211c00  eor   v0.16b, v0.16b, v1.16b
>>   20: 6eb0a800  umaxv s0, v0.4s
>>   24: 1e260000  fmov  w0, s0
>>   28: 6b1f001f  cmp   w0, wzr
>>   2c: 1a9f17e0  cset  w0, eq
>>   30: d65f03c0  ret
>
> For me this code compiles as below and migration time is ~100ms more.

Thanks for benchmarking this. I'll take your original patch
into target-arm.next.

-- PMM
diff --git a/util/cutils.c b/util/cutils.c
index 5830a68..4779403 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
 #define SPLAT(p) _mm_set1_epi8(*(p))
 #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
 #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
+#elif __aarch64__
+#include "arm_neon.h"
+#define VECTYPE uint64x2_t
+#define ALL_EQ(v1, v2) \
+    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
+     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
+#define VEC_OR(v1, v2) ((v1) | (v2))
 #else
 #define VECTYPE unsigned long
 #define SPLAT(p) (*(p) * (~0UL / 255))