[AArch64] Improve LDP/STP generation that requires a base register

Message ID 5B0D719B.4070500@foss.arm.com
State New

Commit Message

Kyrill Tkachov May 29, 2018, 3:28 p.m. UTC
[sending on behalf of Jackson Woodruff]

Hi all,

This patch generalizes the formation of LDP/STP that require a base register.

In AArch64, LDP/STP instructions have different sized immediate offsets than
normal LDR/STR instructions. This part of the backend attempts to spot groups
of four LDR/STR instructions that can be turned into LDP/STP instructions by
using a base register.

Previously, we would only accept address pairs that were ordered in ascending
or descending order, and only strictly sequential loads/stores. In fact, the
instructions that we generate from this should be able to consider any order
of loads or stores (provided that they can be re-ordered). They should also be
able to accept non-sequential loads and stores provided that the two pairs of
addresses are amenable to pairing. The current code is also overly restrictive
on the range of addresses that are accepted, as LDP/STP instructions may take
negative offsets as well as positive ones.
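To make the offset range concrete, here is a minimal C sketch of the signed, scaled 7-bit immediate that LDP/STP accept. The helper name `ldp_stp_offset_ok` is hypothetical and is not a GCC function; its limits mirror `stp_off_lower_limit` and `stp_off_upper_limit` in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of GCC: return true if OFFSET (in bytes)
   fits the LDP/STP immediate for an access of MSIZE bytes.  The immediate
   is a signed 7-bit value scaled by MSIZE, so the reachable range is
   [-0x40 * MSIZE, 0x3f * MSIZE] in steps of MSIZE.  */
static bool
ldp_stp_offset_ok (int64_t offset, int64_t msize)
{
  if (offset % msize != 0)
    return false;
  return offset >= -0x40 * msize && offset <= 0x3f * msize;
}
```

For 4-byte accesses this gives -256 to 252 bytes, which is why an offset such as 1600 can only be reached by first adjusting a base register.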

This patch improves that by allowing us to accept all orders of loads/stores
that are valid, and extending the range that the LDP/STP addresses can reach.
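The relaxed acceptance test can be sketched in standalone C as follows. This is an illustration, not the GCC code itself: `offsets_pairable` and `compare_offsets` are hypothetical names, `int64_t` stands in for `HOST_WIDE_INT`, and the three checks follow the ones the patch adds to `aarch64_operands_adjust_ok_for_ldpstp`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for int64_t values (stand-in for HOST_WIDE_INT).  */
static int
compare_offsets (const void *x, const void *y)
{
  int64_t a = *(const int64_t *) x;
  int64_t b = *(const int64_t *) y;
  return (a > b) - (a < b);
}

/* Hypothetical helper: return true if the four byte OFFSETS, in any
   order, can form two LDP/STP pairs of MSIZE-byte accesses that share
   one adjusted base register.  */
static bool
offsets_pairable (const int64_t offsets[4], int64_t msize)
{
  int64_t sorted[4];
  for (int i = 0; i < 4; i++)
    sorted[i] = offsets[i];
  qsort (sorted, 4, sizeof (int64_t), compare_offsets);

  /* Each pair must consist of two adjacent accesses.  */
  if (sorted[1] != sorted[0] + msize || sorted[3] != sorted[2] + msize)
    return false;

  /* Both pairs must be reachable from one base: 7-bit scaled immediate.  */
  if (sorted[2] - sorted[0] >= msize * 0x80)
    return false;

  /* The two pairs must be aligned with respect to each other.  */
  return sorted[0] % msize == sorted[2] % msize;
}
```

With 4-byte accesses, the offsets 2108, 1600, 1604, 2112 (out of order and not strictly sequential) are accepted, whereas 0, 4, 512, 516 are rejected because the two pairs are too far apart.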

OK for trunk?

Jackson

ChangeLog:

gcc/

2017-05-29  Jackson Woodruff  <jackson.woodruff@arm.com>

     * config/aarch64/aarch64.c (aarch64_host_wide_int_compare): New.
     (aarch64_ldrstr_offset_compare): New.
     (aarch64_operands_adjust_ok_for_ldpstp): Update to consider all
     load/store orderings.
     (aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite

2017-05-29  Jackson Woodruff  <jackson.woodruff@arm.com>

     * gcc.target/aarch64/ldp_stp_9.c: New.
     * gcc.target/aarch64/ldp_stp_10.c: New.
     * gcc.target/aarch64/ldp_stp_11.c: New.
     * gcc.target/aarch64/ldp_stp_12.c: New.

Comments

James Greenhalgh May 29, 2018, 4:02 p.m. UTC | #1
On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
> [sending on behalf of Jackson Woodruff]
> 
> Hi all,
> 
> This patch generalizes the formation of LDP/STP that require a base register.
> 
> In AArch64, LDP/STP instructions have different sized immediate offsets than
> normal LDR/STR instructions. This part of the backend attempts to spot groups
> of four LDR/STR instructions that can be turned into LDP/STP instructions by
> using a base register.
> 
> Previously, we would only accept address pairs that were ordered in ascending
> or descending order, and only strictly sequential loads/stores. In fact, the
> instructions that we generate from this should be able to consider any order
> of loads or stores (provided that they can be re-ordered). They should also be
> able to accept non-sequential loads and stores provided that the two pairs of
> addresses are amenable to pairing. The current code is also overly restrictive
> on the range of addresses that are accepted, as LDP/STP instructions may take
> negative offsets as well as positive ones.
> 
> This patch improves that by allowing us to accept all orders of loads/stores
> that are valid, and extending the range that the LDP/STP addresses can reach.

OK.

Thanks,
James
Christophe Lyon May 31, 2018, 8:38 a.m. UTC | #2
Hi,

On 29 May 2018 at 18:02, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
>> [sending on behalf of Jackson Woodruff]
>>
>> Hi all,
>>
>> This patch generalizes the formation of LDP/STP that require a base register.
>>
>> In AArch64, LDP/STP instructions have different sized immediate offsets than
>> normal LDR/STR instructions. This part of the backend attempts to spot groups
>> of four LDR/STR instructions that can be turned into LDP/STP instructions by
>> using a base register.
>>
>> Previously, we would only accept address pairs that were ordered in ascending
>> or descending order, and only strictly sequential loads/stores. In fact, the
>> instructions that we generate from this should be able to consider any order
>> of loads or stores (provided that they can be re-ordered). They should also be
>> able to accept non-sequential loads and stores provided that the two pairs of
>> addresses are amenable to pairing. The current code is also overly restrictive
>> on the range of addresses that are accepted, as LDP/STP instructions may take
>> negative offsets as well as positive ones.
>>
>> This patch improves that by allowing us to accept all orders of loads/stores
>> that are valid, and extending the range that the LDP/STP addresses can reach.
>
> OK.
>

The new test ldp_stp_10.c fails in ILP32 mode:
FAIL:    gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tw[0-9]+, w[0-9]+,  2
FAIL:    gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tx[0-9]+, x[0-9]+,  2

Christophe

Kyrill Tkachov May 31, 2018, 9:53 a.m. UTC | #3
Hi Christophe,

On 31/05/18 09:38, Christophe Lyon wrote:
> Hi,
>
> On 29 May 2018 at 18:02, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> > On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
> >> [sending on behalf of Jackson Woodruff]
> >>
> >> Hi all,
> >>
> >> This patch generalizes the formation of LDP/STP that require a base register.
> >>
> >> In AArch64, LDP/STP instructions have different sized immediate offsets than
> >> normal LDR/STR instructions. This part of the backend attempts to spot groups
> >> of four LDR/STR instructions that can be turned into LDP/STP instructions by
> >> using a base register.
> >>
> >> Previously, we would only accept address pairs that were ordered in ascending
> >> or descending order, and only strictly sequential loads/stores. In fact, the
> >> instructions that we generate from this should be able to consider any order
> >> of loads or stores (provided that they can be re-ordered). They should also be
> >> able to accept non-sequential loads and stores provided that the two pairs of
> >> addresses are amenable to pairing. The current code is also overly restrictive
> >> on the range of addresses that are accepted, as LDP/STP instructions may take
> >> negative offsets as well as positive ones.
> >>
> >> This patch improves that by allowing us to accept all orders of loads/stores
> >> that are valid, and extending the range that the LDP/STP addresses can reach.
> >
> > OK.
> >
>
> The new test ldp_stp_10.c fails in ILP32 mode:
> FAIL:    gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tw[0-9]+, w[0-9]+,  2
> FAIL:    gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tx[0-9]+, x[0-9]+,  2
>

This is because the register allocation is such that the last load in the sequence clobbers the address register like so:
...
         ldr     w0, [x2, 1600]
         ldr     w1, [x2, 2108]
         ldr     w3, [x2, 1604]
         ldr     w2, [x2, 2112] //<<--- x2 is an address and a destination
...

The checks in aarch64_operands_adjust_ok_for_ldpstp bail out for this case.
I believe it should be safe to form the LDP pairs as long as w2 is loaded by the
second (last) LDP that this optimisation generates and the address does not use
writeback (which is guaranteed in this context).
So this looks like a missed optimisation to me.
Can you please file a bug report?

Thanks,
Kyrill


Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f60e0ad37565b044b3ffe446eebcc0f40f9d99db..4c352854c9b8c31e6e6229e375c1e61117c2b7e4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16935,6 +16935,50 @@  aarch64_swap_ldrstr_operands (rtx* operands, bool load)
     }
 }
 
+/* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
+   comparison between the two.  */
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+		   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+	1 iff offset (X) > offset (Y)
+	0 iff offset (X) == offset (Y)
+	-1 iff offset (X) < offset (Y)  */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx * operands_1 = (const rtx *) x;
+  const rtx * operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (MEM_P (operands_1[0]))
+    mem_1 = operands_1[0];
+  else
+    mem_1 = operands_1[1];
+
+  if (MEM_P (operands_2[0]))
+    mem_2 = operands_2[0];
+  else
+    mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, &base, &offset_1);
+  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store, check if we can merge
    them into ldp/stp by adjusting the offset.  LOAD is true if they
    are load instructions.  MODE is the mode of memory operands.
@@ -16961,7 +17005,7 @@  aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 				       scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -16977,8 +17021,12 @@  aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       mem_4 = operands[7];
       gcc_assert (REG_P (reg_1) && REG_P (reg_2)
 		  && REG_P (reg_3) && REG_P (reg_4));
-      if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-	return false;
+
+      /* Do not attempt to merge the loads if the loads clobber each other.  */
+      for (int i = 0; i < 8; i += 2)
+	for (int j = i + 2; j < 8; j += 2)
+	  if (reg_overlap_mentioned_p (operands[i], operands[j]))
+	    return false;
     }
   else
     {
@@ -17020,32 +17068,34 @@  aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       || !rtx_equal_p (base_3, base_4))
     return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-       || offval_1 != (offval_3 + msize * 2)
-       || offval_1 != (offval_4 + msize * 3))
-      && (offval_4 != (offval_3 + msize)
-	  || offval_4 != (offval_2 + msize * 2)
-	  || offval_4 != (offval_1 + msize * 3)))
+
+  /* Check if the offsets can be put in the right order to do a ldp/stp.  */
+  qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare);
+
+  if (!(offvals[1] == offvals[0] + msize
+	&& offvals[3] == offvals[2] + msize))
     return false;
 
-  /* Check if the addresses are clobbered by load.  */
-  if (load)
-    {
-      if (reg_mentioned_p (reg_1, mem_1)
-	  || reg_mentioned_p (reg_2, mem_2)
-	  || reg_mentioned_p (reg_3, mem_3))
-	return false;
+  /* Check that offsets are within range of each other.  The ldp/stp
+     instructions have 7 bit immediate offsets, so use 0x80.  */
+  if (offvals[2] - offvals[0] >= msize * 0x80)
+    return false;
 
-      /* In increasing order, the last load can clobber the address.  */
-      if (offval_1 > offval_2 && reg_mentioned_p (reg_4, mem_4))
-	return false;
-    }
+  /* The offsets must be aligned with respect to each other.  */
+  if (offvals[0] % msize != offvals[2] % msize)
+    return false;
+
+  /* Check if the addresses are clobbered by load.  */
+  if (load && (reg_mentioned_p (reg_1, mem_1)
+	       || reg_mentioned_p (reg_2, mem_2)
+	       || reg_mentioned_p (reg_3, mem_3)
+	       || reg_mentioned_p (reg_4, mem_4)))
+    return false;
 
   /* If we have SImode and slow unaligned ldp,
      check the alignment to be at least 8 byte. */
@@ -17084,8 +17134,8 @@  aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 }
 
 /* Given OPERANDS of consecutive load/store, this function pairs them
-   into ldp/stp after adjusting the offset.  It depends on the fact
-   that addresses of load/store instructions are in increasing order.
+   into LDP/STP after adjusting the offset.  It depends on the fact
+   that the operands can be sorted so the offsets are correct for STP.
    MODE is the mode of memory operands.  CODE is the rtl operator
    which should be applied to all memory operands, it's SIGN_EXTEND,
    ZERO_EXTEND or UNKNOWN.  */
@@ -17094,100 +17144,109 @@  bool
 aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
 			     scalar_mode mode, RTX_CODE code)
 {
-  rtx base, offset_1, offset_2, t1, t2;
+  rtx base, offset_1, offset_3, t1, t2;
   rtx mem_1, mem_2, mem_3, mem_4;
-  HOST_WIDE_INT off_val, abs_off, adj_off, new_off, stp_off_limit, msize;
-
-  if (load)
-    {
-      mem_1 = operands[1];
-      mem_2 = operands[3];
-    }
-  else
-    {
-      mem_1 = operands[0];
-      mem_2 = operands[2];
-    }
+  rtx temp_operands[8];
+  HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
+		stp_off_upper_limit, stp_off_lower_limit, msize;
 
-  extract_base_offset_in_addr (mem_1, &base, &offset_1);
-  extract_base_offset_in_addr (mem_2, &base, &offset_2);
-  gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
-	      && offset_2 != NULL_RTX);
+  /* We make changes on a copy as we may still bail out.  */
+  for (int i = 0; i < 8; i ++)
+    temp_operands[i] = operands[i];
 
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-    {
-      std::swap (operands[0], operands[6]);
-      std::swap (operands[1], operands[7]);
-      std::swap (operands[2], operands[4]);
-      std::swap (operands[3], operands[5]);
-    }
+  /* Sort the operands.  */
+  qsort (temp_operands, 4, 2 * sizeof (rtx *), aarch64_ldrstr_offset_compare);
 
   if (load)
     {
-      mem_1 = operands[1];
-      mem_2 = operands[3];
-      mem_3 = operands[5];
-      mem_4 = operands[7];
+      mem_1 = temp_operands[1];
+      mem_2 = temp_operands[3];
+      mem_3 = temp_operands[5];
+      mem_4 = temp_operands[7];
     }
   else
     {
-      mem_1 = operands[0];
-      mem_2 = operands[2];
-      mem_3 = operands[4];
-      mem_4 = operands[6];
+      mem_1 = temp_operands[0];
+      mem_2 = temp_operands[2];
+      mem_3 = temp_operands[4];
+      mem_4 = temp_operands[6];
       gcc_assert (code == UNKNOWN);
     }
 
-  /* Extract the offset of the new first address.  */
   extract_base_offset_in_addr (mem_1, &base, &offset_1);
-  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+  extract_base_offset_in_addr (mem_3, &base, &offset_3);
+  gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
+	      && offset_3 != NULL_RTX);
 
-  /* Adjust offset thus it can fit in ldp/stp instruction.  */
+  /* Adjust offset so it can fit in LDP/STP instruction.  */
   msize = GET_MODE_SIZE (mode);
-  stp_off_limit = msize * 0x40;
-  off_val = INTVAL (offset_1);
-  abs_off = (off_val < 0) ? -off_val : off_val;
-  new_off = abs_off % stp_off_limit;
-  adj_off = abs_off - new_off;
+  stp_off_upper_limit = msize * (0x40 - 1);
+  stp_off_lower_limit = - msize * 0x40;
 
-  /* Further adjust to make sure all offsets are OK.  */
-  if ((new_off + msize * 2) >= stp_off_limit)
+  off_val_1 = INTVAL (offset_1);
+  off_val_3 = INTVAL (offset_3);
+
+  /* The base offset is optimally half way between the two STP/LDP offsets.  */
+  if (msize <= 4)
+    base_off = (off_val_1 + off_val_3) / 2;
+  else
+    /* However, due to issues with negative LDP/STP offset generation for
+       larger modes (DF, DI and vector modes), we must not use negative
+       addresses smaller than 9 signed unadjusted bits can store.  This
+       provides the most range in this case.  */
+    base_off = off_val_1;
+
+  /* Adjust the base so that it is aligned with the addresses but still
+     optimal.  */
+  if (base_off % msize != off_val_1 % msize)
+    /* Fix the offset, bearing in mind we want to make it bigger not
+       smaller.  */
+    base_off += (((base_off % msize) - (off_val_1 % msize)) + msize) % msize;
+  else if (msize <= 4)
+    /* The negative range of LDP/STP is one larger than the positive range.  */
+    base_off += msize;
+
+  /* Check if base offset is too big or too small.  We can attempt to resolve
+     this issue by setting it to the maximum value and seeing if the offsets
+     still fit.  */
+  if (base_off >= 0x1000)
     {
-      adj_off += stp_off_limit;
-      new_off -= stp_off_limit;
+      base_off = 0x1000 - 1;
+      /* We must still make sure that the base offset is aligned with respect
+	 to the address.  But it may not be made any bigger.  */
+      base_off -= (((base_off % msize) - (off_val_1 % msize)) + msize) % msize;
     }
 
-  /* Make sure the adjustment can be done with ADD/SUB instructions.  */
-  if (adj_off >= 0x1000)
-    return false;
-
-  if (off_val < 0)
+  /* Likewise for the case where the base is too small.  */
+  if (base_off <= -0x1000)
     {
-      adj_off = -adj_off;
-      new_off = -new_off;
+      base_off = -0x1000 + 1;
+      base_off += (((base_off % msize) - (off_val_1 % msize)) + msize) % msize;
     }
 
-  /* Create new memory references.  */
-  mem_1 = change_address (mem_1, VOIDmode,
-			  plus_constant (DImode, operands[8], new_off));
+  /* Offset of the first STP/LDP.  */
+  new_off_1 = off_val_1 - base_off;
+
+  /* Offset of the second STP/LDP.  */
+  new_off_3 = off_val_3 - base_off;
 
-  /* Check if the adjusted address is OK for ldp/stp.  */
-  if (!aarch64_mem_pair_operand (mem_1, mode))
+  /* The offsets must be within the range of the LDP/STP instructions.  */
+  if (new_off_1 > stp_off_upper_limit || new_off_1 < stp_off_lower_limit
+      || new_off_3 > stp_off_upper_limit || new_off_3 < stp_off_lower_limit)
     return false;
 
-  msize = GET_MODE_SIZE (mode);
-  mem_2 = change_address (mem_2, VOIDmode,
-			  plus_constant (DImode,
-					 operands[8],
-					 new_off + msize));
-  mem_3 = change_address (mem_3, VOIDmode,
-			  plus_constant (DImode,
-					 operands[8],
-					 new_off + msize * 2));
-  mem_4 = change_address (mem_4, VOIDmode,
-			  plus_constant (DImode,
-					 operands[8],
-					 new_off + msize * 3));
+  replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
+						  new_off_1), true);
+  replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
+						  new_off_1 + msize), true);
+  replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
+						  new_off_3), true);
+  replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
+						  new_off_3 + msize), true);
+
+  if (!aarch64_mem_pair_operand (mem_1, mode)
+      || !aarch64_mem_pair_operand (mem_3, mode))
+    return false;
 
   if (code == ZERO_EXTEND)
     {
@@ -17206,21 +17265,29 @@  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
 
   if (load)
     {
+      operands[0] = temp_operands[0];
       operands[1] = mem_1;
+      operands[2] = temp_operands[2];
       operands[3] = mem_2;
+      operands[4] = temp_operands[4];
       operands[5] = mem_3;
+      operands[6] = temp_operands[6];
       operands[7] = mem_4;
     }
   else
     {
       operands[0] = mem_1;
+      operands[1] = temp_operands[1];
       operands[2] = mem_2;
+      operands[3] = temp_operands[3];
       operands[4] = mem_3;
+      operands[5] = temp_operands[5];
       operands[6] = mem_4;
+      operands[7] = temp_operands[7];
     }
 
   /* Emit adjusting instruction.  */
-  emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, adj_off)));
+  emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, base_off)));
   /* Emit ldp/stp instructions.  */
   t1 = gen_rtx_SET (operands[0], operands[1]);
   t2 = gen_rtx_SET (operands[2], operands[3]);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_10.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_10.c
new file mode 100644
index 0000000000000000000000000000000000000000..31f392901d2ca9e9e31cb20735fdf86eb040ee88
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_10.c
@@ -0,0 +1,33 @@ 
+/* { dg-options "-O2" } */
+
+int
+load (int *arr)
+{
+  return arr[527] << 1 + arr[400] << 1 + arr[401] << 1 + arr[528] << 1;
+}
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
+
+float
+load_float (float *arr)
+{
+  return arr[404] + arr[403] + arr[400] + arr[401];
+}
+
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]+, " 2 } } */
+
+long long
+load_long (long long int *arr)
+{
+  return arr[400] << 1 + arr[401] << 1 + arr[403] << 1 + arr[404] << 1;
+}
+
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]+, " 2 } } */
+
+double
+load_double (double *arr)
+{
+  return arr[200] + arr[201] + arr[263] + arr[264];
+}
+
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]+, " 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_11.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_11.c
new file mode 100644
index 0000000000000000000000000000000000000000..73e9fd7161ff9c313d6b4eaf1a3e88f75dc8bb01
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_11.c
@@ -0,0 +1,16 @@ 
+/* { dg-options "-O2" } */
+
+double
+load_one (double *in)
+{
+  return in[400] + in[401] + in[527] + in[528];
+}
+
+double
+load_two (double *in)
+{
+  return in[400] + in[401] + in[464] + in[465];
+}
+
+/* This is expected to fail due to PR 82214.  */
+/* { dg-final { scan-assembler-times "stp\td\[0-9\]+, d\[0-9\]+," 4 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_12.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_12.c
new file mode 100644
index 0000000000000000000000000000000000000000..718e82b53f0ccfd09a19afa26ebdb88654359e33
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_12.c
@@ -0,0 +1,13 @@ 
+/* { dg-options "-O2" } */
+
+void
+store_offset (int *array, int x, int y)
+{
+  array[1085] = x;
+  array[1084] = y;
+
+  array[1086] = y;
+  array[1087] = 5;
+}
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_9.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_9.c
new file mode 100644
index 0000000000000000000000000000000000000000..8f9564595b26b39c5bdc160d396886c57e48f841
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_9.c
@@ -0,0 +1,49 @@ 
+/* { dg-options "-O2" } */
+
+void
+store (int *arr, int x, int y, int z)
+{
+  arr[400] = x;
+  arr[401] = y;
+
+  arr[500] = z;
+  arr[501] = x;
+}
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
+
+void
+store_float (float *arr, float x, float y)
+{
+  arr[404] = x;
+  arr[403] = y;
+
+  arr[400] = x;
+  arr[401] = y;
+}
+
+/* { dg-final { scan-assembler-times "stp\ts\[0-9\]+, s\[0-9\]+, " 2 } } */
+
+void
+store_long (long long int *arr, long long int x, long long int y)
+{
+  arr[400] = x;
+  arr[401] = y;
+
+  arr[403] = y;
+  arr[404] = x;
+}
+
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]+, " 2 } } */
+
+void
+store_double (double *arr, double x, double y)
+{
+  arr[200] = x;
+  arr[201] = y;
+
+  arr[263] = y;
+  arr[264] = x;
+}
+
+/* { dg-final { scan-assembler-times "stp\td\[0-9\]+, d\[0-9\]+, " 2 } } */