[arm] Minor optimization on thumb2 tail call

Message ID 000001d003a2$8b5b8300$a2128900$@arm.com
State: New

Commit Message

Joey Ye Nov. 19, 2014, 2:43 a.m. UTC
Current thumb2 -Os generates suboptimal code for the following tail call case:

int f4(int b, int a, int c, int d);
int g(int a, int b, int c, int d)
{ return f4(b, a, c, d); }

arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m3 test.c

push {r4, lr}
mov r4, r1
mov r1, r0
mov r0, r4
pop {r4, lr}
b f4

There are two issues. First, saving and restoring lr is not necessary, as
there is no return via pop pc. Second, even if we managed to avoid the lr
push/pop, ldmia.w sp!, {r4} would still be emitted, because there is no
pattern for a single-register pop, so code size is still not optimal.
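
With both issues addressed, a hedged sketch of the expected code for the
example above (assuming the compiler keeps the same register shuffle; the
exact output may differ):

push {r4}        @ lr no longer saved: the function leaves via b f4, not pop pc
mov r4, r1
mov r1, r0
mov r0, r4
pop {r4}         @ 2-byte pop, via the new single-register pop pattern
b f4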

This patch fixes both issues and introduces a shared test case. CSiBE with
thumb2 -Os shows a code size reduction across the board, except for one case
with a 4-byte regression. That case looks like:

void f ()
{
   if ()
     ...
   else if ()
     ...
   else g();
}

There are N=2 non-sibcall returns and S=1 sibcall return. Originally each
non-sibcall return is just pop {r4, r5, pc}; now they become
  b.n  .Lreturn

.Lreturn:
  pop {r4, r5}
  bx lr

The 2 bytes saved on the sibcall return do not win back the regression on
the non-sibcall returns. In the general scenario, the N non-sibcall returns
use b.n to branch to the merged tail, and the S sibcall returns each save 2
bytes by avoiding the pop of lr, for a regression of 4-2*S bytes. In the
worst scenario, each non-sibcall return has to use b.w to branch to the
merged tail, for a regression of (N-S)*2 bytes. The worst scenario is rare,
according to CSiBE, and the general scenario regresses by at most 2 bytes,
so I would not introduce additional complexity to handle the regression case.
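
For concreteness, a hedged per-instruction size accounting for the sequences
above (assuming the usual 16-bit Thumb-2 encodings apply):

pop {r4, r5, pc}   @ 2 bytes: original non-sibcall return
b.n .Lreturn       @ 2 bytes per rewritten return (b.w would be 4 bytes)
pop {r4, r5}       @ 2 bytes: merged tail
bx lr              @ 2 bytes: merged tail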

Make check cortex-m3: pass
thumb2 bootstrap (O2/Os): pass

    * config/arm/arm.c (arm_compute_save_reg_mask):
    Do not save lr in case of tail call.
    * config/arm/thumb2.md (*thumb2_pop_single): New pattern.

    * gcc.target/arm/thumb2-pop-single.c: New test.
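
The new test itself is not included in this excerpt. A hypothetical sketch
of what it might look like (the dg directives and the scan pattern are
assumptions, not the committed file):

/* Hypothetical sketch of gcc.target/arm/thumb2-pop-single.c.  */
/* { dg-do compile } */
/* { dg-require-effective-target arm_thumb2_ok } */
/* { dg-options "-Os -mthumb" } */

int f4 (int b, int a, int c, int d);

int
g (int a, int b, int c, int d)
{
  return f4 (b, a, c, d);
}

/* Expect the 2-byte single-register pop instead of ldmia.w sp!, {r4}.  */
/* { dg-final { scan-assembler "pop\t{r4}" } } */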


Comments

Joey Ye Jan. 13, 2015, 5:31 a.m. UTC | #1
Ping

On Wed, Nov 19, 2014 at 10:43 AM, Joey Ye <joey.ye@arm.com> wrote:
> [original message and patch quoted in full; snipped]
Ramana Radhakrishnan Jan. 13, 2015, 9:50 a.m. UTC | #2
On 19/11/14 02:43, Joey Ye wrote:
> [original message and patch quoted in full; snipped]

This is OK, thanks. Please CC me on ARM-specific patches; this one somehow
seems to have missed my filters.


Ramana

Patch

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 4f04707..20d0b9e 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -19190,6 +19190,7 @@  arm_compute_save_reg_mask (void)
       || (save_reg_mask
 	  && optimize_size
 	  && ARM_FUNC_TYPE (func_type) == ARM_FT_NORMAL
+	  && !crtl->tail_call_emit
 	  && !crtl->calls_eh_return))
     save_reg_mask |= 1 << LR_REGNUM;
 
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 64acfea..29cfb17 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -267,6 +267,17 @@ 
    (set_attr "type" "multiple")]
 )
 
+;; Pop a single register as its size is preferred over a post-incremental load
+(define_insn "*thumb2_pop_single"
+  [(set (match_operand:SI 0 "low_register_operand" "=r")
+        (mem:SI (post_inc:SI (reg:SI SP_REGNUM))))]
+  "TARGET_THUMB2 && (reload_in_progress || reload_completed)"
+  "pop\t{%0}"
+  [(set_attr "type" "load1")
+   (set_attr "length" "2")
+   (set_attr "predicable" "yes")]
+)
+
 ;; We have two alternatives here for memory loads (and similarly for stores)
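
For readers less familiar with machine descriptions, the RTL shape the new
pattern matches looks like this (operand 0 instantiated as r4 purely for
illustration):

(set (reg:SI 4 r4)
     (mem:SI (post_inc:SI (reg:SI 13 sp))))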