Patchwork [RFA/ARM,04/05] : STRD generation instead of PUSH in A15 ARM prologue.

Submitter Sameera Deshpande
Date Nov. 8, 2011, 10:57 a.m.
Message ID <1320749852.28506.86.camel@e102549-lin.cambridge.arm.com>
Permalink /patch/124321/
State New

Comments

Sameera Deshpande - Nov. 8, 2011, 10:57 a.m.
On Fri, 2011-10-21 at 13:45 +0100, Ramana Radhakrishnan wrote: 
> >+arm_emit_strd_push (unsigned long saved_regs_mask)
> 
> How different is this from the thumb2 version you sent out in Patch 03/05 ?
> 
Thumb-2 STRD can handle non-consecutive registers; ARM STRD cannot.
Because of this, in ARM mode we accumulate the non-consecutive registers and
emit an STM instruction for them. For consecutive registers, STRD is generated.
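
As a rough illustration of the register constraint described above, here is
a standalone sketch (not GCC code). The helper name arm_strd_pair_ok and the
use of plain integer register numbers are assumptions made for this example;
the checks mirror bad_reg_pair_for_arm_ldrd_strd from the patch:

```c
#include <assert.h>
#include <stdbool.h>

#define SP_REGNUM 13
#define PC_REGNUM 15

/* Hypothetical helper: returns true when (r1, r2) can be stored by a
   single ARM-mode STRD, i.e. r1 is even, r2 == r1 + 1, and r2 is
   neither SP nor PC.  */
static bool
arm_strd_pair_ok (int r1, int r2)
{
  /* Only core registers r0..r15 are meaningful here.  */
  assert (r1 >= 0 && r1 <= 15 && r2 >= 0 && r2 <= 15);

  return (r1 % 2 == 0)
         && (r2 == r1 + 1)
         && r2 != SP_REGNUM
         && r2 != PC_REGNUM;
}
```

Under this rule (R4, R5) qualifies, while (R5, R6) and (R4, R6) do not, so
those registers have to be stored with STR/STM instead.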

> >@@ -15958,7 +16081,8 @@ arm_get_frame_offsets (void)
> > 	     use 32-bit push/pop instructions.  */
> >  	  if (! any_sibcall_uses_r3 ()
> > 	      && arm_size_return_regs () <= 12
> >-	      && (offsets->saved_regs_mask & (1 << 3)) == 0)
> >+	      && (offsets->saved_regs_mask & (1 << 3)) == 0
> >+              && (TARGET_THUMB2 || !current_tune->prefer_ldrd_strd))
> 
> Not sure I completely follow this change yet.
> 
If the stack is not aligned, we need to adjust it in the prologue.
Here, instead of adjusting the stack, we PUSH register R3 onto the stack, so
that no additional ADD instruction is needed for the stack adjustment.
This works fine when we generate multi-register load/store instructions.

However, when we generate STRD in ARM mode, non-consecutive registers
are stored using STR/STM instructions. As the pair register of R3 (R2) is
never pushed onto the stack, we always end up generating an STR instruction
to PUSH R3. This is more expensive than doing ADD SP, SP, #4 for the
stack adjustment.

For example, if we are PUSHing registers {R4, R5, R6}, the stack is not
aligned, so we PUSH {R3, R4, R5, R6} instead.
The instructions generated are:
STR R6, [sp, #4]
STRD R4, R5, [sp, #12]
STR R3, [sp, #16]

However, if a callee-saved register is PUSHed instead of R3,
we push {R4, R5, R6, R7}, which generates
STRD R6, R7, [sp, #8]
STRD R4, R5, [sp, #16]

If no such register is available, we generate an ADD instruction,
which is still better than generating an STR.
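
The trade-off above can be checked with a small standalone model of the
ARM-mode push sequence. This is only a sketch: the function name
count_arm_push_insns is hypothetical, it counts store instructions rather
than emitting RTL, and it simplifies the loop in arm_emit_strd_push from
the patch:

```c
#include <assert.h>

#define LAST_ARM_REGNUM 15
#define SP_REGNUM 13
#define PC_REGNUM 15

/* Hypothetical model: returns how many store instructions (STRD or
   multi-register PUSH) are needed to save the registers set in
   saved_regs_mask, walking from the highest register down.  */
static int
count_arm_push_insns (unsigned long saved_regs_mask)
{
  unsigned long pending = 0;   /* Registers waiting for a multi-register PUSH.  */
  int insns = 0;
  int j;

  /* As in the patch, SP and PC must not be in the mask.  */
  assert ((saved_regs_mask
           & ((1UL << SP_REGNUM) | (1UL << PC_REGNUM))) == 0);

  for (j = LAST_ARM_REGNUM; j >= 0; j--)
    {
      if (!(saved_regs_mask & (1UL << j)))
        continue;

      if (j % 2 == 1 && (saved_regs_mask & (1UL << (j - 1))))
        {
          /* Register j is stored later by the STRD for (j - 1, j).
             Flush the accumulated registers with one PUSH first.  */
          if (pending)
            {
              insns++;
              pending = 0;
            }
          continue;
        }

      if (j % 2 == 0 && (saved_regs_mask & (1UL << (j + 1))))
        {
          insns++;              /* One STRD for the pair (j, j + 1).  */
          continue;
        }

      pending |= 1UL << j;      /* Lone register: accumulate for PUSH.  */
    }

  if (pending)
    insns++;                    /* Final PUSH for leftover registers.  */

  return insns;
}
```

In this model, the mask for {R3, R4, R5, R6} (0x78) needs three stores, while
{R4, R5, R6, R7} (0xF0) needs only two STRDs, matching the two sequences
listed above.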
> 
> Hmmm the question remains if we want to put these into ldmstm.md since
> it was theoretically
> auto-generated from ldmstm.ml. If this has to be marked to be separate
> then I'd like
> to regenerate ldmstm.md from ldmstm.ml and differentiate between the
> bits that can be auto-generated
> and the bits that have been added since.
> 
The current patterns are quite different from the patterns generated by
arm-ldmstm.ml. I will submit an updated arm-ldmstm.ml that generates the
ldrd/strd patterns as a new patch. Is that fine?

The patch is tested with check-gcc, check-gdb and bootstrap.

I see a regression in gcc:
FAIL: gcc.c-torture/execute/vector-compare-1.c compilation,  -O3
-fomit-frame-pointer -funroll-loops with error message 
/tmp/ccC13odV.s: Assembler messages:
/tmp/ccC13odV.s:544: Error: co-processor offset out of range

This seems to be a latent bug uncovered by the patch, and I am looking into it.

- Thanks and regards,
  Sameera D.

Patch

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index e71ead5..ccf05c7 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -163,6 +163,7 @@  extern const char *arm_output_memory_barrier (rtx *);
 extern const char *arm_output_sync_insn (rtx, rtx *);
 extern unsigned int arm_sync_loop_insns (rtx , rtx *);
 extern int arm_attr_length_push_multi(rtx, rtx);
+extern bool bad_reg_pair_for_arm_ldrd_strd (rtx, rtx);
 
 #if defined TREE_CODE
 extern void arm_init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 334a25f..deee78b 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -93,6 +93,7 @@  static bool arm_assemble_integer (rtx, unsigned int, int);
 static void arm_print_operand (FILE *, rtx, int);
 static void arm_print_operand_address (FILE *, rtx);
 static bool arm_print_operand_punct_valid_p (unsigned char code);
+static rtx emit_multi_reg_push (unsigned long);
 static const char *fp_const_from_val (REAL_VALUE_TYPE *);
 static arm_cc get_arm_condition_code (rtx);
 static HOST_WIDE_INT int_log2 (HOST_WIDE_INT);
@@ -15438,6 +15439,117 @@  arm_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED,
     }
 }
 
+/* STRD in ARM mode needs consecutive registers to be stored.  This function
+   keeps accumulating non-consecutive registers until first consecutive register
+   pair is found.  It then generates multi register PUSH for all accumulated
+   registers, and then generates STRD with write-back for consecutive register
+   pair.  This process is repeated until all the registers are stored on stack.
+   multi register PUSH takes care of lone registers as well.  */
+static void
+arm_emit_strd_push (unsigned long saved_regs_mask)
+{
+  int num_regs = 0;
+  int i, j;
+  rtx par = NULL_RTX;
+  rtx dwarf = NULL_RTX;
+  rtx insn = NULL_RTX;
+  rtx tmp, tmp1;
+  unsigned long regs_to_be_pushed_mask;
+
+  for (i = 0; i <= LAST_ARM_REGNUM; i++)
+    if (saved_regs_mask & (1 << i))
+      num_regs++;
+
+  gcc_assert (num_regs && num_regs <= 16);
+
+  /* Var j iterates over all registers to gather all registers in
+     saved_regs_mask.  Var i is used to count number of registers stored on
+     stack.  regs_to_be_pushed_mask accumulates non-consecutive registers
+     that can be pushed using multi register PUSH before STRD is
+     generated.  */
+  for (i=0, j = LAST_ARM_REGNUM, regs_to_be_pushed_mask = 0; i < num_regs; j--)
+    if (saved_regs_mask & (1 << j))
+      {
+        gcc_assert (j != SP_REGNUM);
+        gcc_assert (j != PC_REGNUM);
+        i++;
+
+        if ((j % 2 == 1)
+            && (saved_regs_mask & (1 << (j - 1)))
+            && regs_to_be_pushed_mask)
+          {
+            /* Current register and previous register form register pair for
+               which STRD can be generated.  Hence, emit PUSH for accumulated
+               registers and reset regs_to_be_pushed_mask.  */
+            insn = emit_multi_reg_push (regs_to_be_pushed_mask);
+            regs_to_be_pushed_mask = 0;
+            RTX_FRAME_RELATED_P (insn) = 1;
+            continue;
+          }
+
+        regs_to_be_pushed_mask |= (1 << j);
+
+        if ((j % 2) == 0 && (saved_regs_mask & (1 << (j + 1))))
+          {
+            /* We have found 2 consecutive registers, for which STRD can be
+               generated.  Generate pattern to emit STRD as accumulated
+               registers have already been pushed.  */
+            par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (3));
+            dwarf = gen_rtx_SEQUENCE (VOIDmode, rtvec_alloc (3));
+
+            tmp = gen_rtx_SET (VOIDmode,
+                               stack_pointer_rtx,
+                               plus_constant (stack_pointer_rtx, -8));
+            tmp1 = gen_rtx_SET (VOIDmode,
+                                stack_pointer_rtx,
+                                plus_constant (stack_pointer_rtx, -8));
+            RTX_FRAME_RELATED_P (tmp) = 1;
+            RTX_FRAME_RELATED_P (tmp1) = 1;
+            XVECEXP (par, 0, 0) = tmp;
+            XVECEXP (dwarf, 0, 0) = tmp1;
+
+            tmp = gen_rtx_SET (SImode,
+                               gen_frame_mem (SImode, stack_pointer_rtx),
+                               gen_rtx_REG (SImode, j));
+            tmp1 = gen_rtx_SET (SImode,
+                                gen_frame_mem (SImode, stack_pointer_rtx),
+                                gen_rtx_REG (SImode, j));
+            RTX_FRAME_RELATED_P (tmp) = 1;
+            RTX_FRAME_RELATED_P (tmp1) = 1;
+            XVECEXP (par, 0, 1) = tmp;
+            XVECEXP (dwarf, 0, 1) = tmp1;
+
+            tmp = gen_rtx_SET (SImode,
+                          gen_frame_mem (SImode,
+                                    plus_constant (stack_pointer_rtx, 4)),
+                          gen_rtx_REG (SImode, j + 1));
+            tmp1 = gen_rtx_SET (SImode,
+                           gen_frame_mem (SImode,
+                                     plus_constant (stack_pointer_rtx, 4)),
+                           gen_rtx_REG (SImode, j + 1));
+            RTX_FRAME_RELATED_P (tmp) = 1;
+            RTX_FRAME_RELATED_P (tmp1) = 1;
+            XVECEXP (par, 0, 2) = tmp;
+            XVECEXP (dwarf, 0, 2) = tmp1;
+
+            insn = emit_insn (par);
+            add_reg_note (insn, REG_FRAME_RELATED_EXPR, dwarf);
+            RTX_FRAME_RELATED_P (insn) = 1;
+            regs_to_be_pushed_mask = 0;
+          }
+      }
+
+  /* Check if any accumulated registers are yet to be pushed, and generate
+     multi register PUSH for them.  */
+  if (regs_to_be_pushed_mask)
+    {
+      insn = emit_multi_reg_push (regs_to_be_pushed_mask);
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+
+  return;
+}
+
 /* Generate and emit a pattern that will be recognized as STRD pattern.  If even
    number of registers are being pushed, multiple STRD patterns are created for
    all register pairs.  If odd number of registers are pushed, emit a
@@ -15826,6 +15938,17 @@  arm_emit_vfp_multi_reg_pop (int first_reg, int num_regs, rtx base_reg)
 }
 
 bool
+bad_reg_pair_for_arm_ldrd_strd (rtx src1, rtx src2)
+{
+  return (GET_CODE (src1) != REG
+          || GET_CODE (src2) != REG
+          || ((REGNO (src1) + 1) != REGNO (src2))
+          || ((REGNO (src1) % 2) != 0)
+          || (REGNO (src2) == PC_REGNUM)
+          || (REGNO (src2) == SP_REGNUM));
+}
+
+bool
 bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2)
 {
   return (GET_CODE (src1) != REG
@@ -16249,7 +16372,8 @@  arm_get_frame_offsets (void)
 	     use 32-bit push/pop instructions.  */
  	  if (! any_sibcall_uses_r3 ()
 	      && arm_size_return_regs () <= 12
-	      && (offsets->saved_regs_mask & (1 << 3)) == 0)
+	      && (offsets->saved_regs_mask & (1 << 3)) == 0
+              && (TARGET_THUMB2 || !current_tune->prefer_ldrd_strd))
 	    {
 	      reg = 3;
 	    }
@@ -16718,11 +16842,13 @@  arm_expand_prologue (void)
 	    }
 	}
 
-      if (TARGET_THUMB2
-          && current_tune->prefer_ldrd_strd
+      if (current_tune->prefer_ldrd_strd
           && !optimize_function_for_size_p (cfun))
         {
-          thumb2_emit_strd_push (live_regs_mask);
+          if (TARGET_THUMB2)
+            thumb2_emit_strd_push (live_regs_mask);
+          else
+            arm_emit_strd_push (live_regs_mask);
         }
       else
         {
diff --git a/gcc/config/arm/ldmstm.md b/gcc/config/arm/ldmstm.md
index e3dcd4f..ffa675d 100644
--- a/gcc/config/arm/ldmstm.md
+++ b/gcc/config/arm/ldmstm.md
@@ -73,6 +73,42 @@ 
   [(set_attr "type" "store2")
    (set_attr "predicable" "yes")])
 
+(define_insn "*arm_strd_base_update"
+  [(set (match_operand:SI 0 "arm_hard_register_operand" "+&rk")
+        (plus:SI (match_dup 0)
+                 (const_int -8)))
+   (set (mem:SI (match_dup 0))
+        (match_operand:SI 1 "arm_hard_register_operand" "r"))
+   (set (mem:SI (plus:SI (match_dup 0)
+                         (const_int 4)))
+        (match_operand:SI 2 "arm_hard_register_operand" "r"))]
+  "(TARGET_ARM && current_tune->prefer_ldrd_strd
+     && (!bad_reg_pair_for_arm_ldrd_strd (operands[1], operands[2]))
+     && (REGNO (operands[1]) != REGNO (operands[0]))
+     && (REGNO (operands[2]) != REGNO (operands[0])))"
+  "str%(d%)\t%1, %2, [%0, #-8]!"
+  [(set_attr "type" "store2")
+   (set_attr "predicable" "yes")])
+
+(define_peephole2
+  [(parallel
+    [(set (match_operand:SI 0 "arm_hard_register_operand" "")
+        (plus:SI (match_dup 0)
+                 (const_int -8)))
+     (set (mem:SI (match_dup 0))
+          (match_operand:SI 1 "arm_hard_register_operand" ""))
+     (set (mem:SI (plus:SI (match_dup 0)
+                           (const_int 4)))
+          (match_operand:SI 2 "arm_hard_register_operand" ""))])]
+  "(TARGET_ARM && current_tune->prefer_ldrd_strd
+     && (!bad_reg_pair_for_arm_ldrd_strd (operands[1], operands[2]))
+     && (REGNO (operands[1]) != REGNO (operands[0]))
+     && (REGNO (operands[2]) != REGNO (operands[0])))"
+  [(set (mem:DI (pre_dec:SI (match_dup 0)))
+        (match_dup 1))]
+  "operands[1] = gen_rtx_REG (DImode, REGNO (operands[1]));"
+)
+
 (define_insn "*ldm4_ia"
   [(match_parallel 0 "load_multiple_operation"
     [(set (match_operand:SI 1 "arm_hard_register_operand" "")