[v3,1/1] target-arm: Use Neon for zero checking
diff mbox

Message ID 1467190029-694-2-git-send-email-vijayak@cavium.com
State New
Headers show

Commit Message

vijayak@cavium.com June 29, 2016, 8:47 a.m. UTC
From: Vijay <vijayak@cavium.com>

Use Neon instructions to perform zero checking of
buffer. This is helps in reducing total migration time.

Use case: Idle VM live migration with 4 VCPUS and 8GB ram
running CentOS 7.

Without Neon, the Total migration time is 3.5 Sec

Migration status: completed
total time: 3560 milliseconds
downtime: 33 milliseconds
setup: 5 milliseconds
transferred ram: 297907 kbytes
throughput: 685.76 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062760 pages
skipped: 0 pages
normal: 69808 pages
normal bytes: 279232 kbytes
dirty sync count: 3

With Neon, the total migration time is 2.9 Sec

Migration status: completed
total time: 2960 milliseconds
downtime: 65 milliseconds
setup: 4 milliseconds
transferred ram: 299869 kbytes
throughput: 830.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064313 pages
skipped: 0 pages
normal: 70294 pages
normal bytes: 281176 kbytes
dirty sync count: 3

Signed-off-by: Vijaya Kumar K <vijayak@cavium.com>
Signed-off-by: Suresh <ksuresh@cavium.com>
---
 util/cutils.c |    7 +++++++
 1 file changed, 7 insertions(+)

Comments

Peter Maydell June 30, 2016, 1:45 p.m. UTC | #1
On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
> From: Vijay <vijayak@cavium.com>
>
> Use Neon instructions to perform zero checking of
> buffer. This is helps in reducing total migration time.

> diff --git a/util/cutils.c b/util/cutils.c
> index 5830a68..4779403 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
> +#elif __aarch64__
> +#include "arm_neon.h"
> +#define VECTYPE        uint64x2_t
> +#define ALL_EQ(v1, v2) \
> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
> +#define VEC_OR(v1, v2) ((v1) | (v2))

Should be '#elif defined(__aarch64__)'. I have made this
tweak and put this patch in target-arm.next.

thanks
-- PMM
Richard Henderson July 1, 2016, 10:07 p.m. UTC | #2
On 06/30/2016 06:45 AM, Peter Maydell wrote:
> On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
>> From: Vijay <vijayak@cavium.com>
>>
>> Use Neon instructions to perform zero checking of
>> buffer. This is helps in reducing total migration time.
>
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 5830a68..4779403 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>> +#elif __aarch64__
>> +#include "arm_neon.h"
>> +#define VECTYPE        uint64x2_t
>> +#define ALL_EQ(v1, v2) \
>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>
> Should be '#elif defined(__aarch64__)'. I have made this
> tweak and put this patch in target-arm.next.

Consider

#define VECTYPE        uint32x4_t
#define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)


which compiles down to

   1c:	6e211c00 	eor	v0.16b, v0.16b, v1.16b
   20:	6eb0a800 	umaxv	s0, v0.4s
   24:	1e260000 	fmov	w0, s0
   28:	6b1f001f 	cmp	w0, wzr
   2c:	1a9f17e0 	cset	w0, eq
   30:	d65f03c0 	ret

vs

   34:	4e083c20 	mov	x0, v1.d[0]
   38:	4e083c01 	mov	x1, v0.d[0]
   3c:	eb00003f 	cmp	x1, x0
   40:	52800000 	mov	w0, #0
   44:	54000040 	b.eq	4c <f0+0x18>
   48:	d65f03c0 	ret
   4c:	4e183c20 	mov	x0, v1.d[1]
   50:	4e183c01 	mov	x1, v0.d[1]
   54:	eb00003f 	cmp	x1, x0
   58:	1a9f17e0 	cset	w0, eq
   5c:	d65f03c0 	ret


r~
Peter Maydell July 2, 2016, 9:42 a.m. UTC | #3
On 1 July 2016 at 23:07, Richard Henderson <rth@twiddle.net> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
>>>
>>> From: Vijay <vijayak@cavium.com>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This is helps in reducing total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)

Sounds good. Vijay, could you benchmark that variant, please?

thanks
-- PMM
Vijay Kilari July 5, 2016, 12:24 p.m. UTC | #4
On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
>>>
>>> From: Vijay <vijayak@cavium.com>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This is helps in reducing total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>
>
> which compiles down to
>
>   1c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
>   20:   6eb0a800        umaxv   s0, v0.4s
>   24:   1e260000        fmov    w0, s0
>   28:   6b1f001f        cmp     w0, wzr
>   2c:   1a9f17e0        cset    w0, eq
>   30:   d65f03c0        ret

For me this code compiles as below and migration time is ~100ms more.

See below 3 trails of migration time

  7039cc:       6eb0a800        umaxv   s0, v0.4s
  7039d0:       0e043c02        mov     w2, v0.s[0]
  7039d4:       350000c2        cbnz    w2, 7039ec <f0+0xf4>
  7039d8:       91002084        add     x4, x4, #0x8
  7039dc:       91020063        add     x3, x3, #0x80
  7039e0:       eb01009f        cmp     x4, x1

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3070 milliseconds
downtime: 55 milliseconds
setup: 4 milliseconds
transferred ram: 300637 kbytes
throughput: 802.49 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062834 pages
skipped: 0 pages
normal: 70489 pages
normal bytes: 281956 kbytes
dirty sync count: 3

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 290277 kbytes
throughput: 775.61 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064185 pages
skipped: 0 pages
normal: 67901 pages
normal bytes: 271604 kbytes
dirty sync count: 3
(qemu)

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 34 milliseconds
setup: 5 milliseconds
transferred ram: 294614 kbytes
throughput: 787.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063365 pages
skipped: 0 pages
normal: 68985 pages
normal bytes: 275940 kbytes
dirty sync count: 3

>
> vs
>
>   34:   4e083c20        mov     x0, v1.d[0]
>   38:   4e083c01        mov     x1, v0.d[0]
>   3c:   eb00003f        cmp     x1, x0
>   40:   52800000        mov     w0, #0
>   44:   54000040        b.eq    4c <f0+0x18>
>   48:   d65f03c0        ret
>   4c:   4e183c20        mov     x0, v1.d[1]
>   50:   4e183c01        mov     x1, v0.d[1]
>   54:   eb00003f        cmp     x1, x0
>   58:   1a9f17e0        cset    w0, eq
>   5c:   d65f03c0        ret
>

My patch compiles to below code and takes ~100ms less time

#define VECTYPE        uint64x2_t
#define ALL_EQ(v1, v2) \
        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))

  7039d0:       4e083c02        mov     x2, v0.d[0]
  7039d4:       b5000102        cbnz    x2, 7039f4 <f0+0xfc>
  7039d8:       4e183c02        mov     x2, v0.d[1]
  7039dc:       b50000c2        cbnz    x2, 7039f4 <f0+0xfc>
  7039e0:       91002084        add     x4, x4, #0x8
  7039e4:       91020063        add     x3, x3, #0x80
  7039e8:       eb04003f        cmp     x1, x4

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2973 milliseconds
downtime: 67 milliseconds
setup: 5 milliseconds
transferred ram: 293659 kbytes
throughput: 809.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062791 pages
skipped: 0 pages
normal: 68748 pages
normal bytes: 274992 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2972 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 295972 kbytes
throughput: 816.10 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062861 pages
skipped: 0 pages
normal: 69325 pages
normal bytes: 277300 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2982 milliseconds
downtime: 40 milliseconds
setup: 5 milliseconds
transferred ram: 293386 kbytes
throughput: 806.26 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063199 pages
skipped: 0 pages
normal: 68679 pages
normal bytes: 274716 kbytes
dirty sync count: 4
(qemu)

Regards
Vijay
Peter Maydell July 11, 2016, 5:55 p.m. UTC | #5
On 5 July 2016 at 13:24, Vijay Kilari <vijay.kilari@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <rth@twiddle.net> wrote:
>> Consider
>>
>> #define VECTYPE        uint32x4_t
>> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>>
>>
>> which compiles down to
>>
>>   1c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
>>   20:   6eb0a800        umaxv   s0, v0.4s
>>   24:   1e260000        fmov    w0, s0
>>   28:   6b1f001f        cmp     w0, wzr
>>   2c:   1a9f17e0        cset    w0, eq
>>   30:   d65f03c0        ret
>
> For me this code compiles as below and migration time is ~100ms more.

Thanks for benchmarking this. I'll take your original patch into
target-arm.next.

-- PMM

Patch
diff mbox

diff --git a/util/cutils.c b/util/cutils.c
index 5830a68..4779403 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -184,6 +184,13 @@  int qemu_fdatasync(int fd)
 #define SPLAT(p)       _mm_set1_epi8(*(p))
 #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
 #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
+#elif __aarch64__
+#include "arm_neon.h"
+#define VECTYPE        uint64x2_t
+#define ALL_EQ(v1, v2) \
+        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
+         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
+#define VEC_OR(v1, v2) ((v1) | (v2))
 #else
 #define VECTYPE        unsigned long
 #define SPLAT(p)       (*(p) * (~0UL / 255))