diff mbox

[AArch64] Add support for 64-bit vector-mode ldp/stp

Message ID 5620F47B.9010107@arm.com
State New
Headers show

Commit Message

Kyrylo Tkachov Oct. 16, 2015, 12:58 p.m. UTC
Hi all,

We already support load/store-pair operations on the D-registers when they contain an FP value, but the peepholes/sched-fusion machinery that
do all the hard work currently ignore 64-bit vector modes.

This patch adds support for fusing loads/stores of 64-bit vector operands into ldp and stp instructions.
I've seen this trigger a few times in SPEC2006. Not too many times, but the times it did trigger the code seemed objectively better
i.e. long sequences of ldr and str instructions essentially halved in size.

Bootstrapped and tested on aarch64-none-linux-gnu.

Ok for trunk?

Thanks,
Kyrill

2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
     New function.
     (fusion_load_store): Use it.
     * config/aarch64/aarch64-ldpstp.md: Add new peephole2s for
     ldp and stp in VD modes.
     * config/aarch64/aarch64-simd.md (load_pair<mode>, VD): New pattern.
     (store_pair<mode>, VD): Likewise.

2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/stp_vec_64_1.c: New test.
     * gcc.target/aarch64/ldp_vec_64_1.c: New test.

Comments

Marcus Shawcroft Oct. 20, 2015, 4:05 p.m. UTC | #1
On 16 October 2015 at 13:58, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> Hi all,
>
> We already support load/store-pair operations on the D-registers when they
> contain an FP value, but the peepholes/sched-fusion machinery that
> do all the hard work currently ignore 64-bit vector modes.
>
> This patch adds support for fusing loads/stores of 64-bit vector operands
> into ldp and stp instructions.
> I've seen this trigger a few times in SPEC2006. Not too many times, but the
> times it did trigger the code seemed objectively better
> i.e. long sequences of ldr and str instructions essentially halved in size.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):

We have several different flavours of fusion in the backend, this one
is specifically load/stores, perhaps making that clear in the name of
this predicate will avoid confusion further down the line?

>     New function.
>     (fusion_load_store): Use it.
>     * config/aarch64/aarch64-ldpstp.md: Add new peephole2s for
>     ldp and stp in VD modes.
>     * config/aarch64/aarch64-simd.md (load_pair<mode>, VD): New pattern.
>     (store_pair<mode>, VD): Likewise.
>
> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.target/aarch64/stp_vec_64_1.c: New test.
>     * gcc.target/aarch64/ldp_vec_64_1.c: New test.

Otherwise OK /Marcus
Kyrylo Tkachov Oct. 20, 2015, 4:26 p.m. UTC | #2
Hi Marcus,

On 20/10/15 17:05, Marcus Shawcroft wrote:
> On 16 October 2015 at 13:58, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>> Hi all,
>>
>> We already support load/store-pair operations on the D-registers when they
>> contain an FP value, but the peepholes/sched-fusion machinery that
>> do all the hard work currently ignore 64-bit vector modes.
>>
>> This patch adds support for fusing loads/stores of 64-bit vector operands
>> into ldp and stp instructions.
>> I've seen this trigger a few times in SPEC2006. Not too many times, but the
>> times it did trigger the code seemed objectively better
>> i.e. long sequences of ldr and str instructions essentially halved in size.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Ok for trunk?
>>
>> Thanks,
>> Kyrill
>>
>> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
> We have several different flavours of fusion in the backend, this one
> is specifically load/stores, perhaps making that clear in the name of
> this predicate will avoid confusion further down the line?
Thanks for the review,

This particular type of fusion is called sched_fusion in various
places in the compiler and its implementation in aarch64 is only for
load/store merging (indeed, the only usage of sched_fusion currently
is to merge loads/stores in arm and aarch64).

So, I think that sched_fusion in the name already conveys the information
that it's the ldp/stp one rather than macro fusion. In fact, there is a
macro fusion of ADRP and an LDR instruction,
so having sched_fusion in the name is actually a better differentiator than
mentioning loads/stores as both types of fusion deal with loads in some way.

Is it ok to keep the name as is?

Thanks,
Kyrill

>
>>      New function.
>>      (fusion_load_store): Use it.
>>      * config/aarch64/aarch64-ldpstp.md: Add new peephole2s for
>>      ldp and stp in VD modes.
>>      * config/aarch64/aarch64-simd.md (load_pair<mode>, VD): New pattern.
>>      (store_pair<mode>, VD): Likewise.
>>
>> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * gcc.target/aarch64/stp_vec_64_1.c: New test.
>>      * gcc.target/aarch64/ldp_vec_64_1.c: New test.
> Otherwise OK /Marcus
>
Marcus Shawcroft Oct. 20, 2015, 5:14 p.m. UTC | #3
On 20 October 2015 at 17:26, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> Hi Marcus,
>
> On 20/10/15 17:05, Marcus Shawcroft wrote:
>>
>> On 16 October 2015 at 13:58, Kyrill Tkachov <kyrylo.tkachov@arm.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> We already support load/store-pair operations on the D-registers when
>>> they
>>> contain an FP value, but the peepholes/sched-fusion machinery that
>>> do all the hard work currently ignore 64-bit vector modes.
>>>
>>> This patch adds support for fusing loads/stores of 64-bit vector operands
>>> into ldp and stp instructions.
>>> I've seen this trigger a few times in SPEC2006. Not too many times, but
>>> the
>>> times it did trigger the code seemed objectively better
>>> i.e. long sequences of ldr and str instructions essentially halved in
>>> size.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
>>
>> We have several different flavours of fusion in the backend, this one
>> is specifically load/stores, perhaps making that clear in the name of
>> this predicate will avoid confusion further down the line?
>
> Thanks for the review,
>
> This particular type of fusion is called sched_fusion in various
> places in the compiler and its implementation in aarch64 is only for
> load/store merging (indeed, the only usage of sched_fusion currently
> is to merge loads/stores in arm and aarch64).
>
> So, I think that sched_fusion in the name already conveys the information
> that it's the ldp/stp one rather than macro fusion. In fact, there is a
> macro fusion of ADRP and an LDR instruction,
> so having sched_fusion in the name is actually a better differentiator than
> mentioning loads/stores as both types of fusion deal with loads in some way.
>
> Is it ok to keep the name as is?

Thanks for the justification, patch is OK to commit /Marcus
diff mbox

Patch

commit b5f4a5b87a7315fb8a4d88da3e4c4afc52d16052
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 6 12:08:24 2015 +0100

    [AArch64] Add support for 64-bit vector-mode ldp/stp

diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 8d6d882..458829c 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -98,6 +98,47 @@  (define_peephole2
     }
 })
 
+(define_peephole2
+  [(set (match_operand:VD 0 "register_operand" "")
+	(match_operand:VD 1 "aarch64_mem_pair_operand" ""))
+   (set (match_operand:VD 2 "register_operand" "")
+	(match_operand:VD 3 "memory_operand" ""))]
+  "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	      (set (match_dup 2) (match_dup 3))])]
+{
+  rtx base, offset_1, offset_2;
+
+  extract_base_offset_in_addr (operands[1], &base, &offset_1);
+  extract_base_offset_in_addr (operands[3], &base, &offset_2);
+  if (INTVAL (offset_1) > INTVAL (offset_2))
+    {
+      std::swap (operands[0], operands[2]);
+      std::swap (operands[1], operands[3]);
+    }
+})
+
+(define_peephole2
+  [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "")
+	(match_operand:VD 1 "register_operand" ""))
+   (set (match_operand:VD 2 "memory_operand" "")
+	(match_operand:VD 3 "register_operand" ""))]
+  "TARGET_SIMD && aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	      (set (match_dup 2) (match_dup 3))])]
+{
+  rtx base, offset_1, offset_2;
+
+  extract_base_offset_in_addr (operands[0], &base, &offset_1);
+  extract_base_offset_in_addr (operands[2], &base, &offset_2);
+  if (INTVAL (offset_1) > INTVAL (offset_2))
+    {
+      std::swap (operands[0], operands[2]);
+      std::swap (operands[1], operands[3]);
+    }
+})
+
+
 ;; Handle sign/zero extended consecutive load/store.
 
 (define_peephole2
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 6a2ab61..bf051c3 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -153,6 +153,34 @@  (define_insn "*aarch64_simd_mov<mode>"
    (set_attr "length" "4,4,4,8,8,8,4")]
 )
 
+(define_insn "load_pair<mode>"
+  [(set (match_operand:VD 0 "register_operand" "=w")
+	(match_operand:VD 1 "aarch64_mem_pair_operand" "Ump"))
+   (set (match_operand:VD 2 "register_operand" "=w")
+	(match_operand:VD 3 "memory_operand" "m"))]
+  "TARGET_SIMD
+   && rtx_equal_p (XEXP (operands[3], 0),
+		   plus_constant (Pmode,
+				  XEXP (operands[1], 0),
+				  GET_MODE_SIZE (<MODE>mode)))"
+  "ldp\\t%d0, %d2, %1"
+  [(set_attr "type" "neon_ldp")]
+)
+
+(define_insn "store_pair<mode>"
+  [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "=Ump")
+	(match_operand:VD 1 "register_operand" "w"))
+   (set (match_operand:VD 2 "memory_operand" "=m")
+	(match_operand:VD 3 "register_operand" "w"))]
+  "TARGET_SIMD
+   && rtx_equal_p (XEXP (operands[2], 0),
+		   plus_constant (Pmode,
+				  XEXP (operands[0], 0),
+				  GET_MODE_SIZE (<MODE>mode)))"
+  "stp\\t%d1, %d3, %0"
+  [(set_attr "type" "neon_stp")]
+)
+
 (define_split
   [(set (match_operand:VQ 0 "register_operand" "")
       (match_operand:VQ 1 "register_operand" ""))]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index d7d05b8..7682417 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3491,6 +3491,18 @@  offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
 	  && offset % GET_MODE_SIZE (mode) == 0);
 }
 
+/* Return true if MODE is one of the modes for which we
+   support LDP/STP operations.  */
+
+static bool
+aarch64_mode_valid_for_sched_fusion_p (machine_mode mode)
+{
+  return mode == SImode || mode == DImode
+	 || mode == SFmode || mode == DFmode
+	 || (aarch64_vector_mode_supported_p (mode)
+	     && GET_MODE_SIZE (mode) == 8);
+}
+
 /* Return true if X is a valid address for machine mode MODE.  If it is,
    fill in INFO appropriately.  STRICT_P is true if REG_OK_STRICT is in
    effect.  OUTER_CODE is PARALLEL for a load/store pair.  */
@@ -12863,8 +12875,9 @@  fusion_load_store (rtx_insn *insn, rtx *base, rtx *offset)
   src = SET_SRC (x);
   dest = SET_DEST (x);
 
-  if (GET_MODE (dest) != SImode && GET_MODE (dest) != DImode
-      && GET_MODE (dest) != SFmode && GET_MODE (dest) != DFmode)
+  machine_mode dest_mode = GET_MODE (dest);
+
+  if (!aarch64_mode_valid_for_sched_fusion_p (dest_mode))
     return SCHED_FUSION_NONE;
 
   if (GET_CODE (src) == SIGN_EXTEND)
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_vec_64_1.c b/gcc/testsuite/gcc.target/aarch64/ldp_vec_64_1.c
new file mode 100644
index 0000000..62213f3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_vec_64_1.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast" } */
+
+typedef int int32x2_t __attribute__ ((__vector_size__ ((8))));
+
+void
+foo (int32x2_t *foo, int32x2_t *bar)
+{
+  int i = 0;
+  int32x2_t val = { 3, 2 };
+
+  for (i = 0; i < 1024; i+=2)
+    foo[i] = bar[i] + bar[i + 1];
+}
+
+/* { dg-final { scan-assembler "ldp\td\[0-9\]+, d\[0-9\]" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_64_1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_64_1.c
new file mode 100644
index 0000000..11e757a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_64_1.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast" } */
+
+
+typedef int int32x2_t __attribute__ ((__vector_size__ ((8))));
+
+void
+bar (int32x2_t *foo)
+{
+  int i = 0;
+  int32x2_t val = { 3, 2 };
+
+  for (i = 0; i < 256; i+=2)
+    {
+      foo[i] = val;
+      foo[i+1] = val;
+    }
+}
+
+/* { dg-final { scan-assembler "stp\td\[0-9\]+, d\[0-9\]" } } */