aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]

Message ID ZbkWAKenx6eXMx13@arm.com
State New

Commit Message

Alex Coplan Jan. 30, 2024, 3:30 p.m. UTC
Hi,

The PR shows us ICEing due to an unrecognizable TFmode save emitted by
aarch64_process_components.  The problem is that for T{I,F,D}mode we
conservatively require mems to be in range for x-register ldp/stp.  That
is because (at least for TImode) the value can be allocated to both GPRs
and FPRs: in the GPR case the access is an x-register ldp/stp, and in the
FPR case it is a q-register load/store.

As Richard pointed out in the PR, aarch64_get_separate_components
already checks that the offsets are suitable for a single load, so we
just need to choose a mode in aarch64_reg_save_mode that gives the full
q-register range.  In this patch, we choose V16QImode as an alternative
16-byte "bag-of-bits" mode that doesn't have the artificial range
restrictions imposed on T{I,F,D}mode.
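
As a rough illustration of the range difference described above, the
following sketch compares the immediate-offset ranges of the two
addressing forms.  It is not GCC's code: the helper names are
hypothetical, and the ranges are taken from the A64 encodings (7-bit
signed immediate scaled by 8 for an x-register ldp/stp; 12-bit unsigned
immediate scaled by 16, or 9-bit signed unscaled, for a single
q-register load/store).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does OFF fit an LDP/STP of two X registers?
   The immediate is 7 bits, signed, scaled by 8: [-512, 504].  */
static bool
x_pair_offset_ok (int64_t off)
{
  return off % 8 == 0 && off >= -512 && off <= 504;
}

/* Hypothetical helper: does OFF fit a single Q-register load/store?
   Either the scaled LDR/STR form (12 bits, unsigned, scaled by 16:
   [0, 65520]) or the unscaled LDUR/STUR form (9 bits, signed:
   [-256, 255]).  */
static bool
q_single_offset_ok (int64_t off)
{
  bool scaled = off % 16 == 0 && off >= 0 && off <= 65520;
  bool unscaled = off >= -256 && off <= 255;
  return scaled || unscaled;
}

/* Usage: an offset of 512 is too large for the X-register pair form,
   yet it is fine for a single Q-register access.  */
static void
check_example (void)
{
  assert (!x_pair_offset_ok (512));
  assert (q_single_offset_ok (512));
}
```

An offset such as 512 is the kind of case at issue: acceptable for a
single q-register save, but rejected if the mode is held to the
x-register ldp/stp range.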

For T{F,D}mode in GCC 15 I think we could consider relaxing the
restriction imposed in aarch64_classify_address, as AFAIK T{F,D}mode can
only be allocated to FPRs (unlike TImode).  But such a change seems too
invasive to consider for GCC 14 at this stage (let alone backports).

Fortunately the new flexible load/store pair patterns in GCC 14 allow
this mode change to work without further changes.  The backports are
more involved as we need to adjust the load/store pair handling to cater
for V16QImode in a few places.

Note that for the testcase we are relying on the torture options to add
-funroll-loops at -O3 which is necessary to trigger the ICE on trunk
(but not on the 13 branch).

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

	PR target/111677
	* config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
	V16QImode for the full 16-byte FPR saves in the vector PCS case.

gcc/testsuite/ChangeLog:

	PR target/111677
	* gcc.target/aarch64/torture/pr111677.c: New test.

Comments

Richard Sandiford Jan. 31, 2024, 9:33 a.m. UTC | #1
Alex Coplan <alex.coplan@arm.com> writes:
> Hi,
>
> The PR shows us ICEing due to an unrecognizable TFmode save emitted by
> aarch64_process_components.  The problem is that for T{I,F,D}mode we
> conservatively require mems to be in range for x-register ldp/stp.  That
> is because (at least for TImode) the value can be allocated to both GPRs
> and FPRs: in the GPR case the access is an x-register ldp/stp, and in the
> FPR case it is a q-register load/store.
>
> As Richard pointed out in the PR, aarch64_get_separate_components
> already checks that the offsets are suitable for a single load, so we
> just need to choose a mode in aarch64_reg_save_mode that gives the full
> q-register range.  In this patch, we choose V16QImode as an alternative
> 16-byte "bag-of-bits" mode that doesn't have the artificial range
> restrictions imposed on T{I,F,D}mode.
>
> For T{F,D}mode in GCC 15 I think we could consider relaxing the
> restriction imposed in aarch64_classify_address, as AFAIK T{F,D}mode can
> only be allocated to FPRs (unlike TImode).  But such a change seems too
> invasive to consider for GCC 14 at this stage (let alone backports).

GPRs can hold all three, due to the way aarch64_hard_regno_mode_ok
is defined.  (They can also hold individual Advanced SIMD vectors.)

But the ABI says that TFmode is passed in FPRs, so I agree that it
seems better to optimise for the FPR range.  Same for TDmode.
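
For readers less familiar with the modes involved, here is a small
sketch of how two of them arise from C source.  It assumes GCC
targeting aarch64-linux-gnu, where long double has TFmode and __int128
has TImode; the function names are made up for illustration, and the
register assignments in the comments follow the AAPCS64.

```c
#include <assert.h>
#include <stdint.h>

/* On aarch64-linux-gnu, long double is the 128-bit TFmode type; the
   AAPCS64 passes and returns it in an FP/SIMD register (q0 here).  */
long double
scale_quad (long double x)
{
  return x * 2.0L;
}

/* __int128 has TImode; the AAPCS64 passes and returns it in a pair of
   general-purpose registers (x0/x1 here).  */
__int128
widen_mul (int64_t a, int64_t b)
{
  return (__int128) a * b;
}

/* Usage: on that target both results are 16 bytes wide, but they
   travel in different register files.  */
static void
example_calls (void)
{
  assert (scale_quad (1.5L) == 3.0L);
  assert (widen_mul (INT64_C (1) << 40, INT64_C (1) << 40)
	  == ((__int128) 1 << 80));
}
```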

> Fortunately the new flexible load/store pair patterns in GCC 14 allow
> this mode change to work without further changes.  The backports are
> more involved as we need to adjust the load/store pair handling to cater
> for V16QImode in a few places.
>
> Note that for the testcase we are relying on the torture options to add
> -funroll-loops at -O3 which is necessary to trigger the ICE on trunk
> (but not on the 13 branch).
>
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
> 	PR target/111677
> 	* config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
> 	V16QImode for the full 16-byte FPR saves in the vector PCS case.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/111677
> 	* gcc.target/aarch64/torture/pr111677.c: New test.

OK, thanks.

Richard

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a37d47b243e..4556b8dd504 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -2361,7 +2361,7 @@  aarch64_reg_save_mode (unsigned int regno)
       case ARM_PCS_SIMD:
 	/* The vector PCS saves the low 128 bits (which is the full
 	   register on non-SVE targets).  */
-	return TFmode;
+	return V16QImode;
 
       case ARM_PCS_SVE:
 	/* Use vectors of DImode for registers that need frame
diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
new file mode 100644
index 00000000000..6bb640c42c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target fopenmp } */
+/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */
+typedef struct {
+  long size_z;
+  int width;
+} dt_bilateral_t;
+typedef float dt_aligned_pixel_t[4];
+#pragma omp declare simd
+void dt_bilateral_splat(dt_bilateral_t *b) {
+  float *buf;
+  long offsets[8];
+  for (; b;) {
+    int firstrow;
+    for (int j = firstrow; j; j++)
+      for (int i; i < b->width; i++) {
+        dt_aligned_pixel_t contrib;
+        for (int k = 0; k < 4; k++)
+          buf[offsets[k]] += contrib[k];
+      }
+    float *dest;
+    for (int j = (long)b; j; j++) {
+      float *src = (float *)b->size_z;
+      for (int i = 0; i < (long)b; i++)
+        dest[i] += src[i];
+    }
+  }
+}