[ARM] Improve 64 bit division performance
diff mbox

Message ID CADnVucDOc85gQyAR+G45suuCRNHjznQAzcxrg9YB1yjF+QP-Ww@mail.gmail.com
State New
Headers show

Commit Message

Charles Baylis Feb. 27, 2014, 4:38 p.m. UTC
[resending as text/plain]

Hi

These patches optimise 64 bit division by removing the use of the
__gnu_[u]ldivmod_helper functions and hence avoiding the redundant
calculation of the remainder in those functions.

Bootstrapped, tested and checked for arm-unknown-linux-gnueabihf.

Benchmarked on Chromebook and Raspberry Pi using attached divbench3.c.
Loop1 varies the divisor and loop2 varies the dividend.

Chromebook:

before:
loop1 unsigned:         3.474419
loop2 unsigned:         6.564871
loop1 signed:           4.127967
loop2 signed:           6.071490

after:
loop1 unsigned:         2.781364
loop2 unsigned:         6.166478
loop1 signed:           2.800974
loop2 signed:           6.129588

Raspberry pi:
before
loop1 unsigned:        28.881753
loop2 unsigned:        19.876385
loop1 signed:          32.074941
loop2 signed:          20.594860

after:
loop1 unsigned:        24.893846
loop2 unsigned:        19.537562
loop1 signed:          25.334509
loop2 signed:          19.615088

Any comments? OK for stage 1?


Patch 1:

2014-02-27  Charles Baylis  <charles.baylis@linaro.org>

        * config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call
        to __udivmoddi4.


Patch 2:

2014-02-27  Charles Baylis  <charles.baylis@linaro.org>

        * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division via
        call to __udivmoddi4 and fixing up for negative operands.

Comments

Charles Baylis April 23, 2014, 11:18 a.m. UTC | #1
Ping?

Ramana mentioned at Linaro Connect that this should be tested on more platforms.

I've now checked this on qemu with no regressions on trunk for:
arm-unknown-linux-gnueabihf v7-A: ARM and Thumb-2
arm-unknown-linux-gnueabi v4t, v5t, v6: ARM

OK for trunk?

Archive link: http://gcc.gnu.org/ml/gcc-patches/2014-02/msg01611.html

On 27 February 2014 16:38, Charles Baylis <charles.baylis@linaro.org> wrote:
> [resending as text/plain]
>
> Hi
>
> These patches optimise 64 bit division by removing the use of the
> __gnu_[u]ldivmod_helper functions and hence avoiding the redundant
> calculation of the remainder in those functions.
>
> Bootstrapped, tested and checked for arm-unknown-linux-gnueabihf.
>
> Benchmarked on Chromebook and Raspberry Pi using attached divbench3.c.
> Loop1 varies the divisor and loop2 varies the dividend.
>
> Chromebook:
>
> before:
> loop1 unsigned:         3.474419
> loop2 unsigned:         6.564871
> loop1 signed:           4.127967
> loop2 signed:           6.071490
>
> after:
> loop1 unsigned:         2.781364
> loop2 unsigned:         6.166478
> loop1 signed:           2.800974
> loop2 signed:           6.129588
>
> Raspberry pi:
> before
> loop1 unsigned:        28.881753
> loop2 unsigned:        19.876385
> loop1 signed:          32.074941
> loop2 signed:          20.594860
>
> after:
> loop1 unsigned:        24.893846
> loop2 unsigned:        19.537562
> loop1 signed:          25.334509
> loop2 signed:          19.615088
>
> Any comments? OK for stage 1?
>
>
> Patch 1:
>
> 2014-02-27  Charles Baylis  <charles.baylis@linaro.org>
>
>         * config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call
>         to __udivmoddi4.
>
>
> Patch 2:
>
> 2014-02-27  Charles Baylis  <charles.baylis@linaro.org>
>
>         * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division via
>         call to __udivmoddi4 and fixing up for negative operands.
Richard Earnshaw May 1, 2014, 3:41 p.m. UTC | #2
On 27/02/14 16:38, Charles Baylis wrote:
> [resending as text/plain]
> 
> Hi
> 
> These patches optimise 64 bit division by removing the use of the
> __gnu_[u]ldivmod_helper functions and hence avoiding the redundant
> calculation of the remainder in those functions.
> 
> Bootstrapped, tested and checked for arm-unknown-linux-gnueabihf.
> 
> Benchmarked on Chromebook and Raspberry Pi using attached divbench3.c.
> Loop1 varies the divisor and loop2 varies the dividend.
> 
> Chromebook:
> 
> before:
> loop1 unsigned:         3.474419
> loop2 unsigned:         6.564871
> loop1 signed:           4.127967
> loop2 signed:           6.071490
> 
> after:
> loop1 unsigned:         2.781364
> loop2 unsigned:         6.166478
> loop1 signed:           2.800974
> loop2 signed:           6.129588
> 
> Raspberry pi:
> before
> loop1 unsigned:        28.881753
> loop2 unsigned:        19.876385
> loop1 signed:          32.074941
> loop2 signed:          20.594860
> 
> after:
> loop1 unsigned:        24.893846
> loop2 unsigned:        19.537562
> loop1 signed:          25.334509
> loop2 signed:          19.615088
> 
> Any comments? OK for stage 1?
> 
> 
> Patch 1:
> 
> 2014-02-27  Charles Baylis  <charles.baylis@linaro.org>
> 
>         * config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call
>         to __udivmoddi4.
> 
> 
> Patch 2:
> 
> 2014-02-27  Charles Baylis  <charles.baylis@linaro.org>
> 
>         * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division via
>         call to __udivmoddi4 and fixing up for negative operands.
> 
> 

Sorry for the delay reviewing this...

I think really, you've got three independent changes here:

1) Optimize the prologue/epilogue sequences when ldrd is available.
2) Replace the call to __gnu_ldivmod_helper with __udivmoddi4
3) Optimize the code to __aeabi_ldivmod.

Ideally, therefore, this is a three patch series, but it's then missing
a few bits.

4) Step 2 can also be trivially applied to bpabi-v6m.S as well, since
it's a direct swap of one function for another (unless I've misread the
changes, I think the ABI of the two helper functions are the same).
5) Step 4 then makes __gnu_ldivmod_helper in bpabi.c a dead function
which can be deleted.  This is good because currently pulling in either
64-bit division function causes both these helper functions to be pulled
in and thus the whole of the 64-bit div-mod code for both signed and
unsigned values.   That's particularly unfortunate for ARMv6m class
devices as that's potentially a lot of redundant code.

Finally, I know this was the original code, but the complete lack of
comments in this code made reviewing even the trivial parts a complete
nightmare -- it took me half an hour before I remembered that
__udivmoddi4 took three parameters, the third of which was on the stack:
thus the messing around with sp/ip in the prologue wasn't just trivial
padding but a necessary part of the function.  Please could you add, at
least some short comments clarifying the register disposition on input
and what that prologue code is up to...

Finally, how was this code tested?

Anyway, some additional comments below:

> 0001-Optimise-__aeabi_uldivmod.patch
> 
> 
> From 35254b813303e7fb40eb8aa0bb749216fd8f96fc Mon Sep 17 00:00:00 2001
> From: Charles Baylis <charles.baylis@linaro.org>
> Date: Tue, 25 Feb 2014 18:34:38 +0000
> Subject: [PATCH 1/2] Optimise __aeabi_uldivmod
> 
> 2014-02-25  Charles Baylis  <charles.baylis@linaro.org>
> 
> 	* config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call
> 	to __udivmoddi4.
> 	* config/arm/bpabi.S (__aeabi_uldivmod): Optimise stack pointer
> 	manipulation.

Don't repeat the function name for multiple tweaks to the same function;
as mentioned above, if these are really separate changes they should be
in separate submissions.  Mixing unrelated changes just makes the
reviewing step that much harder.

> ---
>  libgcc/config/arm/bpabi.S | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S
> index 7772301..e020af5 100644
> --- a/libgcc/config/arm/bpabi.S
> +++ b/libgcc/config/arm/bpabi.S
> @@ -120,6 +120,16 @@ ARM_FUNC_START aeabi_ulcmp
>  #endif
>  .endm
>  
> +/* we can use STRD/LDRD on v5TE and later, and any Thumb-2 architecture. */
> +#if (defined(__ARM_EABI__)                                            \
> +     && (defined(__thumb2__)                                          \
> +         || (__ARM_ARCH >= 5 && defined(__TARGET_FEATURE_DSP))))
> +#define CAN_USE_LDRD 1
> +#else
> +#define CAN_USE_LDRD 0
> +#endif
> +
> +
>  #ifdef L_aeabi_ldivmod
>  
>  ARM_FUNC_START aeabi_ldivmod
> @@ -149,18 +159,23 @@ ARM_FUNC_START aeabi_uldivmod
>  	cfi_start	__aeabi_uldivmod, LSYM(Lend_aeabi_uldivmod)
>  	test_div_by_zero unsigned
>  
> -	sub sp, sp, #8
> -#if defined(__thumb2__)
> -	mov ip, sp
> -	push {ip, lr}
> +#if defined(__thumb2__) && CAN_USE_LDRD
> +	sub ip, sp, #8
> +	strd ip,lr, [sp, #-16]!

Space after comma.

Also, since you've essentially rewritten the entire function, can you
please also reformat them to follow the coding style of the rest of the
file: namely "<tab>OP<tab>operands".

>  #else
> +	sub sp, sp, #8
>  	do_push {sp, lr}
>  #endif

Please add a comment that the value at *sp is the address of the the
slot for the remainder.

>  98:	cfi_push 98b - __aeabi_uldivmod, 0xe, -0xc, 0x10
> -	bl SYM(__gnu_uldivmod_helper) __PLT__
> +	bl SYM(__udivmoddi4) __PLT__
>  	ldr lr, [sp, #4]
> +#if CAN_USE_LDRD
> +	ldrd r2, r3, [sp, #8]
> +	add sp, sp, #16
> +#else
>  	add sp, sp, #8
>  	do_pop {r2, r3}
> +#endif
>  	RET
>  	cfi_end	LSYM(Lend_aeabi_uldivmod)
>  
> 
> 
> 0002-Optimise-__aeabi_ldivmod.patch
> 
> 
> From 975d9c624e77ee00476e6866250b0e2e31461fca Mon Sep 17 00:00:00 2001
> From: Charles Baylis <charles.baylis@linaro.org>
> Date: Tue, 25 Feb 2014 16:27:59 +0000
> Subject: [PATCH 2/2] Optimise __aeabi_ldivmod
> 
> 2014-02-25  Charles Baylis  <charles.baylis@linaro.org>
> 
>         * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division using
> 	unsigned division via call to __udivmoddi4 and additional logic.
> ---
>  libgcc/config/arm/bpabi.S | 74 +++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 69 insertions(+), 5 deletions(-)
> 
> diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S
> index e020af5..8b75a28 100644
> --- a/libgcc/config/arm/bpabi.S
> +++ b/libgcc/config/arm/bpabi.S
> @@ -136,20 +136,84 @@ ARM_FUNC_START aeabi_ldivmod
>  	cfi_start	__aeabi_ldivmod, LSYM(Lend_aeabi_ldivmod)
>  	test_div_by_zero signed
>  
> -	sub sp, sp, #8
> -#if defined(__thumb2__)
> -	mov ip, sp
> -	push {ip, lr}
> +#if defined(__thumb2__) && CAN_USE_LDRD
> +	sub ip, sp, #8
> +	strd ip,lr, [sp, #-16]!

Space after comma.

>  #else
> +	sub sp, sp, #8
>  	do_push {sp, lr}
>  #endif
> +	cmp xxh, #0
> +	blt 1f
> +	cmp yyh, #0
> +	blt 2f
> +
> +98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10

The CFI push should really precede your conditional tests, it relates to
the do_push expression.

> +	bl SYM(__udivmoddi4) __PLT__
> +	ldr lr, [sp, #4]
> +#if CAN_USE_LDRD
> +	ldrd r2, r3, [sp, #8]
> +	add sp, sp, #16
> +#else
> +	add sp, sp, #8
> +	do_pop {r2, r3}
> +#endif

You're missing a CFI pop, which is needed when the values on the stack
go out of scope.

> +	RET
> +1: /* xxh:xxl is negative */
> +	rsbs xxl, xxl, #0

We're using unified syntax, so NEGS is preferable.

> +	sbc xxh, xxh, xxh, lsl #1

Worthy of a comment, Thumb2 has no RSC instruction, so use X - 2X.

> +	cmp yyh, #0
> +	blt 3f
> +98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10

This CFI push looks wrong.  You've already pushed things earlier.  On
the other hand, you should save the state before the CFI pop above, so
that you can restore the state again for the next (ie this) block of code.

> +	bl SYM(__udivmoddi4) __PLT__
> +	ldr lr, [sp, #4]
> +#if CAN_USE_LDRD
> +	ldrd r2, r3, [sp, #8]
> +	add sp, sp, #16
> +#else
> +	add sp, sp, #8
> +	do_pop {r2, r3}
> +#endif
> +	rsbs xxl, xxl, #0
> +	sbc xxh, xxh, xxh, lsl #1
> +	rsbs yyl, yyl, #0
> +	sbc yyh, yyh, yyh, lsl #1
> +	RET
> +
> +2: /* only yyh:yyl is negative */
> +	rsbs yyl, yyl, #0
> +	sbc yyh, yyh, yyh, lsl #1
>  98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10

See above... and more below.

> -	bl SYM(__gnu_ldivmod_helper) __PLT__
> +	bl SYM(__udivmoddi4) __PLT__
>  	ldr lr, [sp, #4]
> +#if CAN_USE_LDRD
> +	ldrd r2, r3, [sp, #8]
> +	add sp, sp, #16
> +#else
>  	add sp, sp, #8
>  	do_pop {r2, r3}
> +#endif
> +	rsbs xxl, xxl, #0
> +	sbc xxh, xxh, xxh, lsl #1
>  	RET
> +
> +3: /* both xxh:xxl and yyh:yyl are negative */
> +	rsbs yyl, yyl, #0
> +	sbc yyh, yyh, yyh, lsl #1
>  	cfi_end	LSYM(Lend_aeabi_ldivmod)
> +98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10
> +	bl SYM(__udivmoddi4) __PLT__
> +	ldr lr, [sp, #4]
> +#if CAN_USE_LDRD
> +	ldrd r2, r3, [sp, #8]
> +	add sp, sp, #16
> +#else
> +	add sp, sp, #8
> +	do_pop {r2, r3}
> +#endif
> +	rsbs yyl, yyl, #0
> +	sbc yyh, yyh, yyh, lsl #1
> +	RET
>  	
>  #endif /* L_aeabi_ldivmod */
>  

You use the LDRD vs do_pop sequence identically several times.  To avoid
a lot of ifdefs, it might be worth considering a macro for this idiom to
reduce the overall amount of conditionalized code.

> 
> 
> divbench3.c
> 
> 
> 
> #include <stdint.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/time.h>
> 
> double tv_to_s(struct timeval tv)
> {
>   return tv.tv_sec + ((double)tv.tv_usec)/1.0e6;
> }
> 
> #define STEP (0x7fffffffffff0000/100000000)
> #define END  (0x7fffffffffff0001-STEP)
> #define START1 (37ll)
> #define START2 (3ll)
> 
> uint64_t __aeabi_uldivmod(uint64_t,uint64_t);
> int64_t __aeabi_ldivmod(int64_t,int64_t);
> 
> int main(int argc, char **argv)
> {
>   double time1, time2, time3, time4;
>   struct timeval start, end;
> 
> 
>   volatile uint64_t dummy;
>   uint64_t i;
> 
>   volatile int64_t sdummy;
>   int64_t si;
> 
>   gettimeofday (&start, NULL);
>   for (i = START2; i < END; i += STEP)
>     {
>       dummy = __aeabi_uldivmod(END, i);
>     }
>   gettimeofday (&end, NULL);
>   time1 = tv_to_s (end) - tv_to_s (start);
> 
>   gettimeofday (&start, NULL);
>   for (i = START1; i < END; i += STEP * 5)
>     {
>       dummy = __aeabi_uldivmod(i, 373459);
>     }
>   gettimeofday (&end, NULL);
>   time2 = tv_to_s (end) - tv_to_s (start);
> 
>   gettimeofday (&start, NULL);
>   for (si = START2; si < END; si += STEP)
>     {
>       sdummy = __aeabi_ldivmod(END, si);
>     }
>   gettimeofday (&end, NULL);
>   time3 = tv_to_s (end) - tv_to_s (start);
> 
>   gettimeofday (&start, NULL);
>   for (si = START1; si < END; si += STEP * 5)
>     {
>       sdummy = __aeabi_ldivmod(si, 373459);
>     }
>   gettimeofday (&end, NULL);
>   time4 = tv_to_s (end) - tv_to_s (start);
> 
>   printf ("loop1 unsigned:     %12f\n"
>           "loop2 unsigned:     %12f\n"
>           "loop1 signed:       %12f\n"
>           "loop2 signed:       %12f\n",
>           time1, time2, time3, time4);
> 
>   return 0;
> }
>

Patch
diff mbox

From 975d9c624e77ee00476e6866250b0e2e31461fca Mon Sep 17 00:00:00 2001
From: Charles Baylis <charles.baylis@linaro.org>
Date: Tue, 25 Feb 2014 16:27:59 +0000
Subject: [PATCH 2/2] Optimise __aeabi_ldivmod

2014-02-25  Charles Baylis  <charles.baylis@linaro.org>

        * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division using
	unsigned division via call to __udivmoddi4 and additional logic.
---
 libgcc/config/arm/bpabi.S | 74 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 69 insertions(+), 5 deletions(-)

diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S
index e020af5..8b75a28 100644
--- a/libgcc/config/arm/bpabi.S
+++ b/libgcc/config/arm/bpabi.S
@@ -136,20 +136,84 @@  ARM_FUNC_START aeabi_ldivmod
 	cfi_start	__aeabi_ldivmod, LSYM(Lend_aeabi_ldivmod)
 	test_div_by_zero signed
 
-	sub sp, sp, #8
-#if defined(__thumb2__)
-	mov ip, sp
-	push {ip, lr}
+#if defined(__thumb2__) && CAN_USE_LDRD
+	sub ip, sp, #8
+	strd ip,lr, [sp, #-16]!
 #else
+	sub sp, sp, #8
 	do_push {sp, lr}
 #endif
+	cmp xxh, #0
+	blt 1f
+	cmp yyh, #0
+	blt 2f
+
+98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10
+	bl SYM(__udivmoddi4) __PLT__
+	ldr lr, [sp, #4]
+#if CAN_USE_LDRD
+	ldrd r2, r3, [sp, #8]
+	add sp, sp, #16
+#else
+	add sp, sp, #8
+	do_pop {r2, r3}
+#endif
+	RET
+1: /* xxh:xxl is negative */
+	rsbs xxl, xxl, #0
+	sbc xxh, xxh, xxh, lsl #1
+	cmp yyh, #0
+	blt 3f
+98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10
+	bl SYM(__udivmoddi4) __PLT__
+	ldr lr, [sp, #4]
+#if CAN_USE_LDRD
+	ldrd r2, r3, [sp, #8]
+	add sp, sp, #16
+#else
+	add sp, sp, #8
+	do_pop {r2, r3}
+#endif
+	rsbs xxl, xxl, #0
+	sbc xxh, xxh, xxh, lsl #1
+	rsbs yyl, yyl, #0
+	sbc yyh, yyh, yyh, lsl #1
+	RET
+
+2: /* only yyh:yyl is negative */
+	rsbs yyl, yyl, #0
+	sbc yyh, yyh, yyh, lsl #1
 98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10
-	bl SYM(__gnu_ldivmod_helper) __PLT__
+	bl SYM(__udivmoddi4) __PLT__
 	ldr lr, [sp, #4]
+#if CAN_USE_LDRD
+	ldrd r2, r3, [sp, #8]
+	add sp, sp, #16
+#else
 	add sp, sp, #8
 	do_pop {r2, r3}
+#endif
+	rsbs xxl, xxl, #0
+	sbc xxh, xxh, xxh, lsl #1
 	RET
+
+3: /* both xxh:xxl and yyh:yyl are negative */
+	rsbs yyl, yyl, #0
+	sbc yyh, yyh, yyh, lsl #1
 	cfi_end	LSYM(Lend_aeabi_ldivmod)
+98:	cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10
+	bl SYM(__udivmoddi4) __PLT__
+	ldr lr, [sp, #4]
+#if CAN_USE_LDRD
+	ldrd r2, r3, [sp, #8]
+	add sp, sp, #16
+#else
+	add sp, sp, #8
+	do_pop {r2, r3}
+#endif
+	rsbs yyl, yyl, #0
+	sbc yyh, yyh, yyh, lsl #1
+	RET
 	
 #endif /* L_aeabi_ldivmod */
 
-- 
1.8.3.2