Message ID | ZcpRRWbmWLngMD3T@arm.com |
---|---|
State | New |
Headers | show |
Series | [12] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677] | expand |
Alex Coplan <alex.coplan@arm.com> writes: > This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch. > The only part of the patch that isn't a straight cherry-pick is due to > the TX iterator lacking TDmode for GCC 12, so this version adjusts > TX_V16QI accordingly. > > Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the > testsuite I saw were in > gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output > "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch > since libhwasan gained the short granule tag feature, I've requested a > backport of the following patch (committed as > r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that > (independent) issue for GCC 12: > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html > > OK for the GCC 12 branch? OK, thanks. Richard > Thanks, > Alex > > -- >8 -- > > The PR shows us ICEing due to an unrecognizable TFmode save emitted by > aarch64_process_components. The problem is that for T{I,F,D}mode we > conservatively require mems to be in range for x-register ldp/stp. That > is because (at least for TImode) it can be allocated to both GPRs and > FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is > a q-register load/store. > > As Richard pointed out in the PR, aarch64_get_separate_components > already checks that the offsets are suitable for a single load, so we > just need to choose a mode in aarch64_reg_save_mode that gives the full > q-register range. In this patch, we choose V16QImode as an alternative > 16-byte "bag-of-bits" mode that doesn't have the artificial range > restrictions imposed on T{I,F,D}mode. > > Unlike for GCC 14 we need additional handling in the load/store pair > code as various cases are not expecting to see V16QImode (particularly > the writeback patterns, but also aarch64_gen_load_pair). > > gcc/ChangeLog: > > PR target/111677 > * config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use > V16QImode for the full 16-byte FPR saves in the vector PCS case. > (aarch64_gen_storewb_pair): Handle V16QImode. > (aarch64_gen_loadwb_pair): Likewise. > (aarch64_gen_load_pair): Likewise. > * config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>): > Rename to ... > (loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to > V16QImode. > (storewb_pair<TX:mode>_<P:mode>): Rename to ... > (storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to > V16QImode. > * config/aarch64/iterators.md (TX_V16QI): New. > > gcc/testsuite/ChangeLog: > > PR target/111677 > * gcc.target/aarch64/torture/pr111677.c: New test. > > (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3) > --- > gcc/config/aarch64/aarch64.cc | 13 ++++++- > gcc/config/aarch64/aarch64.md | 35 ++++++++++--------- > gcc/config/aarch64/iterators.md | 3 ++ > .../gcc.target/aarch64/torture/pr111677.c | 28 +++++++++++++++ > 4 files changed, 61 insertions(+), 18 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 3bccd96a23d..2bbba323770 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno) > case ARM_PCS_SIMD: > /* The vector PCS saves the low 128 bits (which is the full > register on non-SVE targets). */ > - return TFmode; > + return V16QImode; > > case ARM_PCS_SVE: > /* Use vectors of DImode for registers that need frame > @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, > return gen_storewb_pairtf_di (base, base, reg, reg2, > GEN_INT (-adjustment), > GEN_INT (UNITS_PER_VREG - adjustment)); > + case E_V16QImode: > + return gen_storewb_pairv16qi_di (base, base, reg, reg2, > + GEN_INT (-adjustment), > + GEN_INT (UNITS_PER_VREG - adjustment)); > default: > gcc_unreachable (); > } > @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, > case E_TFmode: > return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment), > GEN_INT (UNITS_PER_VREG)); > + case E_V16QImode: > + return gen_loadwb_pairv16qi_di (base, base, reg, reg2, > + GEN_INT (adjustment), > + GEN_INT (UNITS_PER_VREG)); > default: > gcc_unreachable (); > } > @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2, > case E_V4SImode: > return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2); > > + case E_V16QImode: > + return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2); > + > default: > gcc_unreachable (); > } > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index fb100bdf6b3..99f185718c9 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -1874,17 +1874,18 @@ (define_insn "loadwb_pair<GPF:mode>_<P:mode>" > [(set_attr "type" "neon_load1_2reg")] > ) > > -(define_insn "loadwb_pair<TX:mode>_<P:mode>" > +(define_insn "loadwb_pair<TX_V16QI:mode>_<P:mode>" > [(parallel > [(set (match_operand:P 0 "register_operand" "=k") > - (plus:P (match_operand:P 1 "register_operand" "0") > - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > - (set (match_operand:TX 2 "register_operand" "=w") > - (mem:TX (match_dup 1))) > - (set (match_operand:TX 3 "register_operand" "=w") > - (mem:TX (plus:P (match_dup 1) > + (plus:P (match_operand:P 1 "register_operand" "0") > + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > + (set (match_operand:TX_V16QI 2 "register_operand" "=w") > + (mem:TX_V16QI (match_dup 1))) > + (set (match_operand:TX_V16QI 3 "register_operand" "=w") > + (mem:TX_V16QI (plus:P (match_dup 1) > (match_operand:P 5 "const_int_operand" "n"))))])] > - "TARGET_SIMD && INTVAL (operands[5]) == GET_MODE_SIZE (<TX:MODE>mode)" > + "TARGET_SIMD > + && known_eq (INTVAL (operands[5]), GET_MODE_SIZE (<TX_V16QI:MODE>mode))" > "ldp\\t%q2, %q3, [%1], %4" > [(set_attr "type" "neon_ldp_q")] > ) > @@ -1923,20 +1924,20 @@ (define_insn "storewb_pair<GPF:mode>_<P:mode>" > [(set_attr "type" "neon_store1_2reg<q>")] > ) > > -(define_insn "storewb_pair<TX:mode>_<P:mode>" > +(define_insn "storewb_pair<TX_V16QI:mode>_<P:mode>" > [(parallel > [(set (match_operand:P 0 "register_operand" "=&k") > - (plus:P (match_operand:P 1 "register_operand" "0") > - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > - (set (mem:TX (plus:P (match_dup 0) > + (plus:P (match_operand:P 1 "register_operand" "0") > + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > + (set (mem:TX_V16QI (plus:P (match_dup 0) > (match_dup 4))) > - (match_operand:TX 2 "register_operand" "w")) > - (set (mem:TX (plus:P (match_dup 0) > + (match_operand:TX_V16QI 2 "register_operand" "w")) > + (set (mem:TX_V16QI (plus:P (match_dup 0) > (match_operand:P 5 "const_int_operand" "n"))) > - (match_operand:TX 3 "register_operand" "w"))])] > + (match_operand:TX_V16QI 3 "register_operand" "w"))])] > "TARGET_SIMD > - && INTVAL (operands[5]) > - == INTVAL (operands[4]) + GET_MODE_SIZE (<TX:MODE>mode)" > + && known_eq (INTVAL (operands[5]), > + INTVAL (operands[4]) + GET_MODE_SIZE (<TX_V16QI:MODE>mode))" > "stp\\t%q2, %q3, [%0, %4]!" > [(set_attr "type" "neon_stp_q")] > ) > diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md > index 26a840d7fe9..d49e37893df 100644 > --- a/gcc/config/aarch64/iterators.md > +++ b/gcc/config/aarch64/iterators.md > @@ -303,6 +303,9 @@ (define_mode_iterator VS [V2SI V4SI]) > > (define_mode_iterator TX [TI TF]) > > +;; TX plus V16QImode. > +(define_mode_iterator TX_V16QI [TI TF V16QI]) > + > ;; Advanced SIMD opaque structure modes. > (define_mode_iterator VSTRUCT [OI CI XI]) > > diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > new file mode 100644 > index 00000000000..6bb640c42c0 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > @@ -0,0 +1,28 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target fopenmp } */ > +/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */ > +typedef struct { > + long size_z; > + int width; > +} dt_bilateral_t; > +typedef float dt_aligned_pixel_t[4]; > +#pragma omp declare simd > +void dt_bilateral_splat(dt_bilateral_t *b) { > + float *buf; > + long offsets[8]; > + for (; b;) { > + int firstrow; > + for (int j = firstrow; j; j++) > + for (int i; i < b->width; i++) { > + dt_aligned_pixel_t contrib; > + for (int k = 0; k < 4; k++) > + buf[offsets[k]] += contrib[k]; > + } > + float *dest; > + for (int j = (long)b; j; j++) { > + float *src = (float *)b->size_z; > + for (int i = 0; i < (long)b; i++) > + dest[i] += src[i]; > + } > + } > +}
On 14/02/2024 11:18, Richard Sandiford wrote: > Alex Coplan <alex.coplan@arm.com> writes: > > This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch. > > The only part of the patch that isn't a straight cherry-pick is due to > > the TX iterator lacking TDmode for GCC 12, so this version adjusts > > TX_V16QI accordingly. > > > > Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the > > testsuite I saw were in > > gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output > > "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch > > since libhwasan gained the short granule tag feature, I've requested a > > backport of the following patch (committed as > > r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that > > (independent) issue for GCC 12: > > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html > > > > OK for the GCC 12 branch? > > OK, thanks. Thanks. The patch cherry-picks cleanly on the GCC 11 branch, and bootstraps/regtests OK there. Is it OK for GCC 11 too, even though the issue is latent there (at least for the testcase in the patch)? Alex > > Richard > > > Thanks, > > Alex > > > > -- >8 -- > > > > The PR shows us ICEing due to an unrecognizable TFmode save emitted by > > aarch64_process_components. The problem is that for T{I,F,D}mode we > > conservatively require mems to be in range for x-register ldp/stp. That > > is because (at least for TImode) it can be allocated to both GPRs and > > FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is > > a q-register load/store. > > > > As Richard pointed out in the PR, aarch64_get_separate_components > > already checks that the offsets are suitable for a single load, so we > > just need to choose a mode in aarch64_reg_save_mode that gives the full > > q-register range. In this patch, we choose V16QImode as an alternative > > 16-byte "bag-of-bits" mode that doesn't have the artificial range > > restrictions imposed on T{I,F,D}mode. > > > > Unlike for GCC 14 we need additional handling in the load/store pair > > code as various cases are not expecting to see V16QImode (particularly > > the writeback patterns, but also aarch64_gen_load_pair). > > > > gcc/ChangeLog: > > > > PR target/111677 > > * config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use > > V16QImode for the full 16-byte FPR saves in the vector PCS case. > > (aarch64_gen_storewb_pair): Handle V16QImode. > > (aarch64_gen_loadwb_pair): Likewise. > > (aarch64_gen_load_pair): Likewise. > > * config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>): > > Rename to ... > > (loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to > > V16QImode. > > (storewb_pair<TX:mode>_<P:mode>): Rename to ... > > (storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to > > V16QImode. > > * config/aarch64/iterators.md (TX_V16QI): New. > > > > gcc/testsuite/ChangeLog: > > > > PR target/111677 > > * gcc.target/aarch64/torture/pr111677.c: New test. > > > > (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3) > > --- > > gcc/config/aarch64/aarch64.cc | 13 ++++++- > > gcc/config/aarch64/aarch64.md | 35 ++++++++++--------- > > gcc/config/aarch64/iterators.md | 3 ++ > > .../gcc.target/aarch64/torture/pr111677.c | 28 +++++++++++++++ > > 4 files changed, 61 insertions(+), 18 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > > index 3bccd96a23d..2bbba323770 100644 > > --- a/gcc/config/aarch64/aarch64.cc > > +++ b/gcc/config/aarch64/aarch64.cc > > @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno) > > case ARM_PCS_SIMD: > > /* The vector PCS saves the low 128 bits (which is the full > > register on non-SVE targets). */ > > - return TFmode; > > + return V16QImode; > > > > case ARM_PCS_SVE: > > /* Use vectors of DImode for registers that need frame > > @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, > > return gen_storewb_pairtf_di (base, base, reg, reg2, > > GEN_INT (-adjustment), > > GEN_INT (UNITS_PER_VREG - adjustment)); > > + case E_V16QImode: > > + return gen_storewb_pairv16qi_di (base, base, reg, reg2, > > + GEN_INT (-adjustment), > > + GEN_INT (UNITS_PER_VREG - adjustment)); > > default: > > gcc_unreachable (); > > } > > @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, > > case E_TFmode: > > return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment), > > GEN_INT (UNITS_PER_VREG)); > > + case E_V16QImode: > > + return gen_loadwb_pairv16qi_di (base, base, reg, reg2, > > + GEN_INT (adjustment), > > + GEN_INT (UNITS_PER_VREG)); > > default: > > gcc_unreachable (); > > } > > @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2, > > case E_V4SImode: > > return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2); > > > > + case E_V16QImode: > > + return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2); > > + > > default: > > gcc_unreachable (); > > } > > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > > index fb100bdf6b3..99f185718c9 100644 > > --- a/gcc/config/aarch64/aarch64.md > > +++ b/gcc/config/aarch64/aarch64.md > > @@ -1874,17 +1874,18 @@ (define_insn "loadwb_pair<GPF:mode>_<P:mode>" > > [(set_attr "type" "neon_load1_2reg")] > > ) > > > > -(define_insn "loadwb_pair<TX:mode>_<P:mode>" > > +(define_insn "loadwb_pair<TX_V16QI:mode>_<P:mode>" > > [(parallel > > [(set (match_operand:P 0 "register_operand" "=k") > > - (plus:P (match_operand:P 1 "register_operand" "0") > > - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > > - (set (match_operand:TX 2 "register_operand" "=w") > > - (mem:TX (match_dup 1))) > > - (set (match_operand:TX 3 "register_operand" "=w") > > - (mem:TX (plus:P (match_dup 1) > > + (plus:P (match_operand:P 1 "register_operand" "0") > > + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > > + (set (match_operand:TX_V16QI 2 "register_operand" "=w") > > + (mem:TX_V16QI (match_dup 1))) > > + (set (match_operand:TX_V16QI 3 "register_operand" "=w") > > + (mem:TX_V16QI (plus:P (match_dup 1) > > (match_operand:P 5 "const_int_operand" "n"))))])] > > - "TARGET_SIMD && INTVAL (operands[5]) == GET_MODE_SIZE (<TX:MODE>mode)" > > + "TARGET_SIMD > > + && known_eq (INTVAL (operands[5]), GET_MODE_SIZE (<TX_V16QI:MODE>mode))" > > "ldp\\t%q2, %q3, [%1], %4" > > [(set_attr "type" "neon_ldp_q")] > > ) > > @@ -1923,20 +1924,20 @@ (define_insn "storewb_pair<GPF:mode>_<P:mode>" > > [(set_attr "type" "neon_store1_2reg<q>")] > > ) > > > > -(define_insn "storewb_pair<TX:mode>_<P:mode>" > > +(define_insn "storewb_pair<TX_V16QI:mode>_<P:mode>" > > [(parallel > > [(set (match_operand:P 0 "register_operand" "=&k") > > - (plus:P (match_operand:P 1 "register_operand" "0") > > - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > > - (set (mem:TX (plus:P (match_dup 0) > > + (plus:P (match_operand:P 1 "register_operand" "0") > > + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) > > + (set (mem:TX_V16QI (plus:P (match_dup 0) > > (match_dup 4))) > > - (match_operand:TX 2 "register_operand" "w")) > > - (set (mem:TX (plus:P (match_dup 0) > > + (match_operand:TX_V16QI 2 "register_operand" "w")) > > + (set (mem:TX_V16QI (plus:P (match_dup 0) > > (match_operand:P 5 "const_int_operand" "n"))) > > - (match_operand:TX 3 "register_operand" "w"))])] > > + (match_operand:TX_V16QI 3 "register_operand" "w"))])] > > "TARGET_SIMD > > - && INTVAL (operands[5]) > > - == INTVAL (operands[4]) + GET_MODE_SIZE (<TX:MODE>mode)" > > + && known_eq (INTVAL (operands[5]), > > + INTVAL (operands[4]) + GET_MODE_SIZE (<TX_V16QI:MODE>mode))" > > "stp\\t%q2, %q3, [%0, %4]!" > > [(set_attr "type" "neon_stp_q")] > > ) > > diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md > > index 26a840d7fe9..d49e37893df 100644 > > --- a/gcc/config/aarch64/iterators.md > > +++ b/gcc/config/aarch64/iterators.md > > @@ -303,6 +303,9 @@ (define_mode_iterator VS [V2SI V4SI]) > > > > (define_mode_iterator TX [TI TF]) > > > > +;; TX plus V16QImode. > > +(define_mode_iterator TX_V16QI [TI TF V16QI]) > > + > > ;; Advanced SIMD opaque structure modes. > > (define_mode_iterator VSTRUCT [OI CI XI]) > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > > new file mode 100644 > > index 00000000000..6bb640c42c0 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > > @@ -0,0 +1,28 @@ > > +/* { dg-do compile } */ > > +/* { dg-require-effective-target fopenmp } */ > > +/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */ > > +typedef struct { > > + long size_z; > > + int width; > > +} dt_bilateral_t; > > +typedef float dt_aligned_pixel_t[4]; > > +#pragma omp declare simd > > +void dt_bilateral_splat(dt_bilateral_t *b) { > > + float *buf; > > + long offsets[8]; > > + for (; b;) { > > + int firstrow; > > + for (int j = firstrow; j; j++) > > + for (int i; i < b->width; i++) { > > + dt_aligned_pixel_t contrib; > > + for (int k = 0; k < 4; k++) > > + buf[offsets[k]] += contrib[k]; > > + } > > + float *dest; > > + for (int j = (long)b; j; j++) { > > + float *src = (float *)b->size_z; > > + for (int i = 0; i < (long)b; i++) > > + dest[i] += src[i]; > > + } > > + } > > +}
Alex Coplan <alex.coplan@arm.com> writes: > On 14/02/2024 11:18, Richard Sandiford wrote: >> Alex Coplan <alex.coplan@arm.com> writes: >> > This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch. >> > The only part of the patch that isn't a straight cherry-pick is due to >> > the TX iterator lacking TDmode for GCC 12, so this version adjusts >> > TX_V16QI accordingly. >> > >> > Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the >> > testsuite I saw were in >> > gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output >> > "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch >> > since libhwasan gained the short granule tag feature, I've requested a >> > backport of the following patch (committed as >> > r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that >> > (independent) issue for GCC 12: >> > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html >> > >> > OK for the GCC 12 branch? >> >> OK, thanks. > > Thanks. The patch cherry-picks cleanly on the GCC 11 branch, and > bootstraps/regtests OK there. Is it OK for GCC 11 too, even though the > issue is latent there (at least for the testcase in the patch)? Yeah, I think we should apply it there too, since the bug is likely to manifest with other testcases. Richard > Alex > >> >> Richard >> >> > Thanks, >> > Alex >> > >> > -- >8 -- >> > >> > The PR shows us ICEing due to an unrecognizable TFmode save emitted by >> > aarch64_process_components. The problem is that for T{I,F,D}mode we >> > conservatively require mems to be in range for x-register ldp/stp. That >> > is because (at least for TImode) it can be allocated to both GPRs and >> > FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is >> > a q-register load/store. >> > >> > As Richard pointed out in the PR, aarch64_get_separate_components >> > already checks that the offsets are suitable for a single load, so we >> > just need to choose a mode in aarch64_reg_save_mode that gives the full >> > q-register range. In this patch, we choose V16QImode as an alternative >> > 16-byte "bag-of-bits" mode that doesn't have the artificial range >> > restrictions imposed on T{I,F,D}mode. >> > >> > Unlike for GCC 14 we need additional handling in the load/store pair >> > code as various cases are not expecting to see V16QImode (particularly >> > the writeback patterns, but also aarch64_gen_load_pair). >> > >> > gcc/ChangeLog: >> > >> > PR target/111677 >> > * config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use >> > V16QImode for the full 16-byte FPR saves in the vector PCS case. >> > (aarch64_gen_storewb_pair): Handle V16QImode. >> > (aarch64_gen_loadwb_pair): Likewise. >> > (aarch64_gen_load_pair): Likewise. >> > * config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>): >> > Rename to ... >> > (loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to >> > V16QImode. >> > (storewb_pair<TX:mode>_<P:mode>): Rename to ... >> > (storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to >> > V16QImode. >> > * config/aarch64/iterators.md (TX_V16QI): New. >> > >> > gcc/testsuite/ChangeLog: >> > >> > PR target/111677 >> > * gcc.target/aarch64/torture/pr111677.c: New test. >> > >> > (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3) >> > --- >> > gcc/config/aarch64/aarch64.cc | 13 ++++++- >> > gcc/config/aarch64/aarch64.md | 35 ++++++++++--------- >> > gcc/config/aarch64/iterators.md | 3 ++ >> > .../gcc.target/aarch64/torture/pr111677.c | 28 +++++++++++++++ >> > 4 files changed, 61 insertions(+), 18 deletions(-) >> > create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c >> > >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >> > index 3bccd96a23d..2bbba323770 100644 >> > --- a/gcc/config/aarch64/aarch64.cc >> > +++ b/gcc/config/aarch64/aarch64.cc >> > @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno) >> > case ARM_PCS_SIMD: >> > /* The vector PCS saves the low 128 bits (which is the full >> > register on non-SVE targets). */ >> > - return TFmode; >> > + return V16QImode; >> > >> > case ARM_PCS_SVE: >> > /* Use vectors of DImode for registers that need frame >> > @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, >> > return gen_storewb_pairtf_di (base, base, reg, reg2, >> > GEN_INT (-adjustment), >> > GEN_INT (UNITS_PER_VREG - adjustment)); >> > + case E_V16QImode: >> > + return gen_storewb_pairv16qi_di (base, base, reg, reg2, >> > + GEN_INT (-adjustment), >> > + GEN_INT (UNITS_PER_VREG - adjustment)); >> > default: >> > gcc_unreachable (); >> > } >> > @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, >> > case E_TFmode: >> > return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment), >> > GEN_INT (UNITS_PER_VREG)); >> > + case E_V16QImode: >> > + return gen_loadwb_pairv16qi_di (base, base, reg, reg2, >> > + GEN_INT (adjustment), >> > + GEN_INT (UNITS_PER_VREG)); >> > default: >> > gcc_unreachable (); >> > } >> > @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2, >> > case E_V4SImode: >> > return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2); >> > >> > + case E_V16QImode: >> > + return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2); >> > + >> > default: >> > gcc_unreachable (); >> > } >> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md >> > index fb100bdf6b3..99f185718c9 100644 >> > --- a/gcc/config/aarch64/aarch64.md >> > +++ b/gcc/config/aarch64/aarch64.md >> > @@ -1874,17 +1874,18 @@ (define_insn "loadwb_pair<GPF:mode>_<P:mode>" >> > [(set_attr "type" "neon_load1_2reg")] >> > ) >> > >> > -(define_insn "loadwb_pair<TX:mode>_<P:mode>" >> > +(define_insn "loadwb_pair<TX_V16QI:mode>_<P:mode>" >> > [(parallel >> > [(set (match_operand:P 0 "register_operand" "=k") >> > - (plus:P (match_operand:P 1 "register_operand" "0") >> > - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) >> > - (set (match_operand:TX 2 "register_operand" "=w") >> > - (mem:TX (match_dup 1))) >> > - (set (match_operand:TX 3 "register_operand" "=w") >> > - (mem:TX (plus:P (match_dup 1) >> > + (plus:P (match_operand:P 1 "register_operand" "0") >> > + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) >> > + (set (match_operand:TX_V16QI 2 "register_operand" "=w") >> > + (mem:TX_V16QI (match_dup 1))) >> > + (set (match_operand:TX_V16QI 3 "register_operand" "=w") >> > + (mem:TX_V16QI (plus:P (match_dup 1) >> > (match_operand:P 5 "const_int_operand" "n"))))])] >> > - "TARGET_SIMD && INTVAL (operands[5]) == GET_MODE_SIZE (<TX:MODE>mode)" >> > + "TARGET_SIMD >> > + && known_eq (INTVAL (operands[5]), GET_MODE_SIZE (<TX_V16QI:MODE>mode))" >> > "ldp\\t%q2, %q3, [%1], %4" >> > [(set_attr "type" "neon_ldp_q")] >> > ) >> > @@ -1923,20 +1924,20 @@ (define_insn "storewb_pair<GPF:mode>_<P:mode>" >> > [(set_attr "type" "neon_store1_2reg<q>")] >> > ) >> > >> > -(define_insn "storewb_pair<TX:mode>_<P:mode>" >> > +(define_insn "storewb_pair<TX_V16QI:mode>_<P:mode>" >> > [(parallel >> > [(set (match_operand:P 0 "register_operand" "=&k") >> > - (plus:P (match_operand:P 1 "register_operand" "0") >> > - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) >> > - (set (mem:TX (plus:P (match_dup 0) >> > + (plus:P (match_operand:P 1 "register_operand" "0") >> > + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) >> > + (set (mem:TX_V16QI (plus:P (match_dup 0) >> > (match_dup 4))) >> > - (match_operand:TX 2 "register_operand" "w")) >> > - (set (mem:TX (plus:P (match_dup 0) >> > + (match_operand:TX_V16QI 2 "register_operand" "w")) >> > + (set (mem:TX_V16QI (plus:P (match_dup 0) >> > (match_operand:P 5 "const_int_operand" "n"))) >> > - (match_operand:TX 3 "register_operand" "w"))])] >> > + (match_operand:TX_V16QI 3 "register_operand" "w"))])] >> > "TARGET_SIMD >> > - && INTVAL (operands[5]) >> > - == INTVAL (operands[4]) + GET_MODE_SIZE (<TX:MODE>mode)" >> > + && known_eq (INTVAL (operands[5]), >> > + INTVAL (operands[4]) + GET_MODE_SIZE (<TX_V16QI:MODE>mode))" >> > "stp\\t%q2, %q3, [%0, %4]!" >> > [(set_attr "type" "neon_stp_q")] >> > ) >> > diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md >> > index 26a840d7fe9..d49e37893df 100644 >> > --- a/gcc/config/aarch64/iterators.md >> > +++ b/gcc/config/aarch64/iterators.md >> > @@ -303,6 +303,9 @@ (define_mode_iterator VS [V2SI V4SI]) >> > >> > (define_mode_iterator TX [TI TF]) >> > >> > +;; TX plus V16QImode. >> > +(define_mode_iterator TX_V16QI [TI TF V16QI]) >> > + >> > ;; Advanced SIMD opaque structure modes. >> > (define_mode_iterator VSTRUCT [OI CI XI]) >> > >> > diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c >> > new file mode 100644 >> > index 00000000000..6bb640c42c0 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c >> > @@ -0,0 +1,28 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-require-effective-target fopenmp } */ >> > +/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */ >> > +typedef struct { >> > + long size_z; >> > + int width; >> > +} dt_bilateral_t; >> > +typedef float dt_aligned_pixel_t[4]; >> > +#pragma omp declare simd >> > +void dt_bilateral_splat(dt_bilateral_t *b) { >> > + float *buf; >> > + long offsets[8]; >> > + for (; b;) { >> > + int firstrow; >> > + for (int j = firstrow; j; j++) >> > + for (int i; i < b->width; i++) { >> > + dt_aligned_pixel_t contrib; >> > + for (int k = 0; k < 4; k++) >> > + buf[offsets[k]] += contrib[k]; >> > + } >> > + float *dest; >> > + for (int j = (long)b; j; j++) { >> > + float *src = (float *)b->size_z; >> > + for (int i = 0; i < (long)b; i++) >> > + dest[i] += src[i]; >> > + } >> > + } >> > +}
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 3bccd96a23d..2bbba323770 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno) case ARM_PCS_SIMD: /* The vector PCS saves the low 128 bits (which is the full register on non-SVE targets). */ - return TFmode; + return V16QImode; case ARM_PCS_SVE: /* Use vectors of DImode for registers that need frame @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, return gen_storewb_pairtf_di (base, base, reg, reg2, GEN_INT (-adjustment), GEN_INT (UNITS_PER_VREG - adjustment)); + case E_V16QImode: + return gen_storewb_pairv16qi_di (base, base, reg, reg2, + GEN_INT (-adjustment), + GEN_INT (UNITS_PER_VREG - adjustment)); default: gcc_unreachable (); } @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2, case E_TFmode: return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment), GEN_INT (UNITS_PER_VREG)); + case E_V16QImode: + return gen_loadwb_pairv16qi_di (base, base, reg, reg2, + GEN_INT (adjustment), + GEN_INT (UNITS_PER_VREG)); default: gcc_unreachable (); } @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2, case E_V4SImode: return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2); + case E_V16QImode: + return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2); + default: gcc_unreachable (); } diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index fb100bdf6b3..99f185718c9 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -1874,17 +1874,18 @@ (define_insn "loadwb_pair<GPF:mode>_<P:mode>" [(set_attr "type" "neon_load1_2reg")] ) -(define_insn "loadwb_pair<TX:mode>_<P:mode>" +(define_insn "loadwb_pair<TX_V16QI:mode>_<P:mode>" [(parallel [(set (match_operand:P 0 "register_operand" "=k") - (plus:P (match_operand:P 1 "register_operand" "0") - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) - (set (match_operand:TX 2 "register_operand" "=w") - (mem:TX (match_dup 1))) - (set (match_operand:TX 3 "register_operand" "=w") - (mem:TX (plus:P (match_dup 1) + (plus:P (match_operand:P 1 "register_operand" "0") + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) + (set (match_operand:TX_V16QI 2 "register_operand" "=w") + (mem:TX_V16QI (match_dup 1))) + (set (match_operand:TX_V16QI 3 "register_operand" "=w") + (mem:TX_V16QI (plus:P (match_dup 1) (match_operand:P 5 "const_int_operand" "n"))))])] - "TARGET_SIMD && INTVAL (operands[5]) == GET_MODE_SIZE (<TX:MODE>mode)" + "TARGET_SIMD + && known_eq (INTVAL (operands[5]), GET_MODE_SIZE (<TX_V16QI:MODE>mode))" "ldp\\t%q2, %q3, [%1], %4" [(set_attr "type" "neon_ldp_q")] ) @@ -1923,20 +1924,20 @@ (define_insn "storewb_pair<GPF:mode>_<P:mode>" [(set_attr "type" "neon_store1_2reg<q>")] ) -(define_insn "storewb_pair<TX:mode>_<P:mode>" +(define_insn "storewb_pair<TX_V16QI:mode>_<P:mode>" [(parallel [(set (match_operand:P 0 "register_operand" "=&k") - (plus:P (match_operand:P 1 "register_operand" "0") - (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) - (set (mem:TX (plus:P (match_dup 0) + (plus:P (match_operand:P 1 "register_operand" "0") + (match_operand:P 4 "aarch64_mem_pair_offset" "n"))) + (set (mem:TX_V16QI (plus:P (match_dup 0) (match_dup 4))) - (match_operand:TX 2 "register_operand" "w")) - (set (mem:TX (plus:P (match_dup 0) + (match_operand:TX_V16QI 2 "register_operand" "w")) + (set (mem:TX_V16QI (plus:P (match_dup 0) (match_operand:P 5 "const_int_operand" "n"))) - (match_operand:TX 3 "register_operand" "w"))])] + (match_operand:TX_V16QI 3 "register_operand" "w"))])] "TARGET_SIMD - && INTVAL (operands[5]) - == INTVAL (operands[4]) + GET_MODE_SIZE (<TX:MODE>mode)" + && known_eq (INTVAL (operands[5]), + INTVAL (operands[4]) + GET_MODE_SIZE (<TX_V16QI:MODE>mode))" "stp\\t%q2, %q3, [%0, %4]!" [(set_attr "type" "neon_stp_q")] ) diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 26a840d7fe9..d49e37893df 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -303,6 +303,9 @@ (define_mode_iterator VS [V2SI V4SI]) (define_mode_iterator TX [TI TF]) +;; TX plus V16QImode. +(define_mode_iterator TX_V16QI [TI TF V16QI]) + ;; Advanced SIMD opaque structure modes. (define_mode_iterator VSTRUCT [OI CI XI]) diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c new file mode 100644 index 00000000000..6bb640c42c0 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target fopenmp } */ +/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */ +typedef struct { + long size_z; + int width; +} dt_bilateral_t; +typedef float dt_aligned_pixel_t[4]; +#pragma omp declare simd +void dt_bilateral_splat(dt_bilateral_t *b) { + float *buf; + long offsets[8]; + for (; b;) { + int firstrow; + for (int j = firstrow; j; j++) + for (int i; i < b->width; i++) { + dt_aligned_pixel_t contrib; + for (int k = 0; k < 4; k++) + buf[offsets[k]] += contrib[k]; + } + float *dest; + for (int j = (long)b; j; j++) { + float *src = (float *)b->size_z; + for (int i = 0; i < (long)b; i++) + dest[i] += src[i]; + } + } +}