Message ID: 20180123154122.12322-1-siddhesh@sourceware.org
State:      New
Series:     [aarch64,v2] Disable reg offset in quad-word store for Falkor
Hi Siddhesh,

On 23/01/18 15:41, Siddhesh Poyarekar wrote:
> Hi,
>
> Here's v2 of the patch to disable register offset addressing mode for
> stores of 128-bit values on Falkor because they're very costly.
> Differences from the last version:
>
> - Incorporated changes Jim made to his patch earlier that I missed,
>   i.e. adding an extra tuning parameter called
>   SLOW_REGOFFSET_QUADWORD_STORE instead of making it conditional on
>   TUNE_FALKOR.
>
> - Added a new query type ADDR_QUERY_STR to indicate that the queried
>   address is used for a store.  This way I can use it for other
>   scenarios where stores are significantly more expensive than loads,
>   such as pre/post modify addressing modes.
>
> - Incorporated the constraint functionality into
>   aarch64_legitimate_address_p and aarch64_classify_address.
>
> I evaluated the suggestion of using costs to do this but it's not
> possible with the current costs as they do not differentiate between
> loads and stores.  If modifying all passes that use these costs to
> identify loads vs stores is considered OK (ivopts seems to be the
> biggest user) then I can volunteer to do that work for GCC 9 and
> eventually replace this.

I would tend towards making the costs usage more intelligent and
differentiating between loads and stores, but I agree that is definitely
GCC 9 material.  Whether this approach is an acceptable stopgap for GCC 8
is up to the aarch64 maintainers and will depend, among other things, on
the impact it has on generic (non-Falkor) codegen.  A good experiment to
support this approach would be to compile a large codebase (for example
CPU2017) with a non-Falkor -mcpu setting and make sure that there are no
(or minimal) assembly changes.  This would help justify the aarch64.md
constraint changes.

I have a couple of comments on the patch inline.
Thanks,
Kyrill

> ----
>
> On Falkor, because of an idiosyncrasy of how the pipelines are designed, a
> quad-word store using a reg+reg addressing mode is almost twice as slow as an
> add followed by a quad-word store with a single reg addressing mode.  So we
> get better performance if we disallow addressing modes using register offsets
> with quad-word stores.
>
> This patch improves fpspeed by 0.3% and intspeed by 0.22% in CPU2017,
> with omnetpp_s (4.3%) and pop2_s (2.6%) being the biggest winners.
>
> 2018-xx-xx  Jim Wilson  <jim.wilson@linaro.org>
>	    Kugan Vivekanandarajah  <kugan.vivekanandarajah@linaro.org>
>	    Siddhesh Poyarekar  <siddhesh@sourceware.org>
>
> gcc/
>	* gcc/config/aarch64/aarch64-protos.h (aarch64_addr_query_type):
>	New member ADDR_QUERY_STR.
>	* gcc/config/aarch64/aarch64-tuning-flags.def
>	(SLOW_REGOFFSET_QUADWORD_STORE): New.
>	* gcc/config/aarch64/aarch64.c (qdf24xx_tunings): Add
>	SLOW_REGOFFSET_QUADWORD_STORE to tuning flags.
>	(aarch64_classify_address): Avoid register indexing for quad
>	mode stores when SLOW_REGOFFSET_QUADWORD_STORE is set.
>	* gcc/config/aarch64/constraints.md (Uts): New constraint.
>	* gcc/config/aarch64/aarch64.md (movti_aarch64, movtf_aarch64):
>	Use it.
>	* gcc/config/aarch64/aarch64-simd.md (aarch64_simd_mov<mode>):
>	Likewise.

File paths don't need the gcc/ prefix because the ChangeLog file is
already in the gcc/ directory.

> gcc/testsuite/
>	* gcc/testsuite/gcc.target/aarch64/pr82533.c: New test case.

Similarly, you don't need the gcc/testsuite/ prefix.
Also, since you have a bugzilla PR entry please reference it in the
ChangeLog right above the file list:
<tab>PR target/82533

That way when the patch is committed the SVN hooks will pick it up
automagically and update bugzilla.
> ---
>  gcc/config/aarch64/aarch64-protos.h         |  4 ++++
>  gcc/config/aarch64/aarch64-simd.md          |  4 ++--
>  gcc/config/aarch64/aarch64-tuning-flags.def |  4 ++++
>  gcc/config/aarch64/aarch64.c                | 12 +++++++++++-
>  gcc/config/aarch64/aarch64.md               |  6 +++---
>  gcc/config/aarch64/constraints.md           |  7 +++++++
>  gcc/testsuite/gcc.target/aarch64/pr82533.c  | 11 +++++++++++
>  7 files changed, 42 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr82533.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index ef1b0bc8e28..5fedc85f283 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -120,6 +120,9 @@ enum aarch64_symbol_type
>     ADDR_QUERY_LDP_STP
>        Query what is valid for a load/store pair.
>
> +   ADDR_QUERY_STR
> +      Query what is valid for a store.
> +
>     ADDR_QUERY_ANY
>        Query what is valid for at least one memory constraint, which may
>        allow things that "m" doesn't.  For example, the SVE LDR and STR
> @@ -128,6 +131,7 @@ enum aarch64_symbol_type
>  enum aarch64_addr_query_type {
>    ADDR_QUERY_M,
>    ADDR_QUERY_LDP_STP,
> +  ADDR_QUERY_STR,
>    ADDR_QUERY_ANY
>  };
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 3d1f6a01cb7..48d92702723 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -131,9 +131,9 @@
>
>  (define_insn "*aarch64_simd_mov<VQ:mode>"
>    [(set (match_operand:VQ 0 "nonimmediate_operand"
> -		"=w, Umq, m, w, ?r, ?w, ?r, w")
> +		"=w, Umq, Uts, w, ?r, ?w, ?r, w")
> 	(match_operand:VQ 1 "general_operand"
> -		"m, Dz, w, w, w, r, r, Dn"))]
> +		"m,  Dz, w, w, w, r, r, Dn"))]
>    "TARGET_SIMD
>     && (register_operand (operands[0], <MODE>mode)
>         || aarch64_simd_reg_or_zero (operands[1], <MODE>mode))"
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ea9ead234cb..04baf5b6de6 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -41,4 +41,8 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
>     are not considered cheap.  */
>  AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>
> +/* Don't use a register offset in a memory address for a quad-word store.  */
> +AARCH64_EXTRA_TUNING_OPTION ("slow_regoffset_quadword_store",
> +			     SLOW_REGOFFSET_QUADWORD_STORE)
> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 174310c9f1b..6aeeca8673c 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -875,7 +875,7 @@ static const struct tune_params qdf24xx_tunings =
>    2,	/* min_div_recip_mul_df.  */
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE),	/* tune_flags.  */
>    &qdf24xx_prefetch_tune
>  };
>
> @@ -5530,6 +5530,16 @@ aarch64_classify_address (struct aarch64_address_info *info,
> 			|| vec_flags == VEC_ADVSIMD
> 			|| vec_flags == VEC_SVE_DATA));
>
> +  /* Avoid register indexing for 128-bit stores when the
> +     AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE option is set.  */
> +  if (!optimize_size
> +      && type == ADDR_QUERY_STR
> +      && (aarch64_tune_params.extra_tuning_flags
> +	  & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE)
> +      && (mode == TImode || mode == TFmode
> +	  || aarch64_vector_data_mode_p (mode)))
> +    allow_reg_index_p = false;

The aarch64_classify_vector_mode code has been reworked recently for SVE
so I'm not entirely up to date with its logic, but I believe that
"aarch64_classify_vector_mode (mode)" will allow 64-bit vector modes,
which would not be using the 128-bit Q register, so you may be disabling
register indexing for D-register memory stores.

> +
>    /* For SVE, only accept [Rn], [Rn, Rm, LSL #shift] and
>       [Rn, #offset, MUL VL].  */
>    if ((vec_flags & (VEC_SVE_DATA | VEC_SVE_PRED)) != 0
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index a6ecb391309..0c29e707676 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1079,7 +1079,7 @@
>
>  (define_insn "*movti_aarch64"
>    [(set (match_operand:TI 0
> -	 "nonimmediate_operand"  "=   r,w, r,w,r,m,m,w,m")
> +	 "nonimmediate_operand"  "=   r,w, r,w,r,m,m,w,Uts")
> 	(match_operand:TI 1
> 	 "aarch64_movti_operand" " rUti,r, w,w,m,r,Z,m,w"))]
>    "(register_operand (operands[0], TImode)
> @@ -1226,9 +1226,9 @@
>
>  (define_insn "*movtf_aarch64"
>    [(set (match_operand:TF 0
> -	 "nonimmediate_operand" "=w,?&r,w ,?r,w,?w,w,m,?r,m ,m")
> +	 "nonimmediate_operand" "=w,?&r,w ,?r,w,?w,w,Uts,?r,m ,m")
> 	(match_operand:TF 1
> -	 "general_operand"      " w,?r, ?r,w ,Y,Y ,m,w,m ,?r,Y"))]
> +	 "general_operand"      " w,?r, ?r,w ,Y,Y ,m,w ,m ,?r,Y"))]
>    "TARGET_FLOAT && (register_operand (operands[0], TFmode)
>      || aarch64_reg_or_fp_zero (operands[1], TFmode))"
>    "@
> diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
> index 3eb07f11894..2ecab8f1286 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -249,6 +249,13 @@
>        (match_test "aarch64_legitimate_address_p (V2DImode,
> 						 XEXP (op, 0), 1)")))
>
> +(define_memory_constraint "Uts"
> +  "@internal
> +   An address valid for storing a 128-bit AdvSIMD register"
> +  (and (match_code "mem")
> +       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
> +						  1, ADDR_QUERY_STR)")))
> +
>  (define_memory_constraint "Uty"
>    "@internal
>     An address valid for SVE LD1Rs."
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr82533.c b/gcc/testsuite/gcc.target/aarch64/pr82533.c
> new file mode 100644
> index 00000000000..fa28ffac03a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr82533.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mcpu=falkor -O2 -ftree-vectorize" } */
> +
> +void
> +copy (int N, double *c, double *a)
> +{
> +  for (int i = 0; i < N; ++i)
> +    c[i] = a[i];
> +}
> +
> +/* { dg-final { scan-assembler-not "str\tq\[0-9\]+, \\\[x\[0-9\]+, x\[0-9\]+\\\]" } } */
> --
> 2.14.3
>
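[Editorial aside: the Falkor behaviour the patch description refers to can be illustrated with a small C kernel. This is a sketch with made-up names, not code from the patch; the assembly in the comments shows the two candidate addressing modes, per the description above.]

```c
/* Illustrative only: when vectorized for AArch64, the store in this loop
   writes a full 128-bit Q register per iteration.  The tuning flag steers
   Falkor away from the register-offset form of that store:

       str  q0, [x1, x2]       // reg+reg quad-word store: slow on Falkor

   toward an address held in a single base register:

       add  x3, x1, x2
       str  q0, [x3]           // plain base register: faster on Falkor

   The C semantics are identical either way; only the addressing mode of
   the generated store instruction changes.  */
void
copy_block (int n, double *dst, const double *src)
{
  for (int i = 0; i < n; ++i)
    dst[i] = src[i];
}
```

Compiling a loop like this with -mcpu=falkor -O2 -ftree-vectorize is essentially what the new pr82533.c test does: its scan-assembler-not directive checks that no `str q<N>, [x<N>, x<N>]` form survives in the output.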
On Wednesday 24 January 2018 05:50 PM, Kyrill Tkachov wrote:
> I would tend towards making the costs usage more intelligent and
> differentiating between loads and stores, but I agree that is definitely
> GCC 9 material.
> Whether this approach is an acceptable stopgap for GCC 8 is up to the
> aarch64 maintainers and will depend, among other things, on the impact
> it has on generic (non-Falkor) codegen.
> A good experiment to support this approach would be to compile a large
> codebase (for example CPU2017) with a non-Falkor -mcpu setting and make
> sure that there are no (or minimal) assembly changes.
> This would help justify the aarch64.md constraint changes.

Thanks, I'll verify with CPU2017.

> File paths don't need the gcc/ prefix because the ChangeLog file is
> already in the gcc/ directory.
>
>> gcc/testsuite/
>>	* gcc/testsuite/gcc.target/aarch64/pr82533.c: New test case.
>
> Similarly, you don't need the gcc/testsuite/ prefix.
> Also, since you have a bugzilla PR entry please reference it in the
> ChangeLog right above the file list:
> <tab>PR target/82533
>
> That way when the patch is committed the SVN hooks will pick it up
> automagically and update bugzilla.

Ugh, sorry, I was tardy about this - I'll fix it up.

>> @@ -5530,6 +5530,16 @@ aarch64_classify_address (struct aarch64_address_info *info,
>> 			|| vec_flags == VEC_ADVSIMD
>> 			|| vec_flags == VEC_SVE_DATA));
>>
>> +  /* Avoid register indexing for 128-bit stores when the
>> +     AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE option is set.  */
>> +  if (!optimize_size
>> +      && type == ADDR_QUERY_STR
>> +      && (aarch64_tune_params.extra_tuning_flags
>> +	  & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE)
>> +      && (mode == TImode || mode == TFmode
>> +	  || aarch64_vector_data_mode_p (mode)))
>> +    allow_reg_index_p = false;
>
> The aarch64_classify_vector_mode code has been reworked recently for SVE
> so I'm not entirely up to date with its logic, but I believe that
> "aarch64_classify_vector_mode (mode)" will allow 64-bit vector modes,
> which would not be using the 128-bit Q register, so you may be disabling
> register indexing for D-register memory stores.

I'll check this and fix the condition if necessary.

Thanks,
Siddhesh
On Wednesday 24 January 2018 06:29 PM, Siddhesh Poyarekar wrote:
>>> +  /* Avoid register indexing for 128-bit stores when the
>>> +     AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE option is set.  */
>>> +  if (!optimize_size
>>> +      && type == ADDR_QUERY_STR
>>> +      && (aarch64_tune_params.extra_tuning_flags
>>> +	  & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE)
>>> +      && (mode == TImode || mode == TFmode
>>> +	  || aarch64_vector_data_mode_p (mode)))
>>> +    allow_reg_index_p = false;
>>
>> The aarch64_classify_vector_mode code has been reworked recently for SVE
>> so I'm not entirely up to date with its logic, but I believe that
>> "aarch64_classify_vector_mode (mode)" will allow 64-bit vector modes,
>> which would not be using the 128-bit Q register, so you may be disabling
>> register indexing for D-register memory stores.
>
> I'll check this and fix the condition if necessary.

Looking back at the patch I remember why I used
aarch64_vector_data_mode_p; this is to catch the pattern
aarch64_simd_mov<VQ:mode>, which optimizes a 64-bit store pair into a
single quad-word store.  It should not avoid register indexing for any
other vector modes since their patterns won't pass ADDR_QUERY_STR.

In any case, I will be doing the CPU2017 run without -mcpu=falkor, so
I'll report results from that.

Siddhesh
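[Editorial aside: the store-pair case Siddhesh mentions can be seen even in scalar source. This is a sketch with made-up names, not code from the patch, and whether GCC's merging actually fires here depends on the compiler version and flags.]

```c
/* Two adjacent 64-bit stores through p.  The aarch64 backend may merge
   them into a single 128-bit (Q-register) store; if that merged store
   then uses a register-offset address, it is exactly the form the
   Falkor tuning flag wants to avoid, even though no 128-bit type
   appears in the source.  */
void
store_adjacent (double *p, long i, double a, double b)
{
  p[i] = a;
  p[i + 1] = b;
}
```

This is why the condition in aarch64_classify_address keys on vector data modes in addition to TImode and TFmode: the merged store goes through the aarch64_simd_mov<VQ:mode> pattern rather than a TI/TF move.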
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ef1b0bc8e28..5fedc85f283 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -120,6 +120,9 @@ enum aarch64_symbol_type
    ADDR_QUERY_LDP_STP
       Query what is valid for a load/store pair.

+   ADDR_QUERY_STR
+      Query what is valid for a store.
+
    ADDR_QUERY_ANY
       Query what is valid for at least one memory constraint, which may
       allow things that "m" doesn't.  For example, the SVE LDR and STR
@@ -128,6 +131,7 @@ enum aarch64_symbol_type
 enum aarch64_addr_query_type {
   ADDR_QUERY_M,
   ADDR_QUERY_LDP_STP,
+  ADDR_QUERY_STR,
   ADDR_QUERY_ANY
 };

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 3d1f6a01cb7..48d92702723 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -131,9 +131,9 @@

 (define_insn "*aarch64_simd_mov<VQ:mode>"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-		"=w, Umq, m, w, ?r, ?w, ?r, w")
+		"=w, Umq, Uts, w, ?r, ?w, ?r, w")
	(match_operand:VQ 1 "general_operand"
-		"m, Dz, w, w, w, r, r, Dn"))]
+		"m,  Dz, w, w, w, r, r, Dn"))]
   "TARGET_SIMD
    && (register_operand (operands[0], <MODE>mode)
        || aarch64_simd_reg_or_zero (operands[1], <MODE>mode))"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index ea9ead234cb..04baf5b6de6 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -41,4 +41,8 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
    are not considered cheap.  */
 AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)

+/* Don't use a register offset in a memory address for a quad-word store.  */
+AARCH64_EXTRA_TUNING_OPTION ("slow_regoffset_quadword_store",
+			     SLOW_REGOFFSET_QUADWORD_STORE)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 174310c9f1b..6aeeca8673c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -875,7 +875,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE),	/* tune_flags.  */
   &qdf24xx_prefetch_tune
 };

@@ -5530,6 +5530,16 @@ aarch64_classify_address (struct aarch64_address_info *info,
			|| vec_flags == VEC_ADVSIMD
			|| vec_flags == VEC_SVE_DATA));

+  /* Avoid register indexing for 128-bit stores when the
+     AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE option is set.  */
+  if (!optimize_size
+      && type == ADDR_QUERY_STR
+      && (aarch64_tune_params.extra_tuning_flags
+	  & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE)
+      && (mode == TImode || mode == TFmode
+	  || aarch64_vector_data_mode_p (mode)))
+    allow_reg_index_p = false;
+
   /* For SVE, only accept [Rn], [Rn, Rm, LSL #shift] and
      [Rn, #offset, MUL VL].  */
   if ((vec_flags & (VEC_SVE_DATA | VEC_SVE_PRED)) != 0
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index a6ecb391309..0c29e707676 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1079,7 +1079,7 @@

 (define_insn "*movti_aarch64"
   [(set (match_operand:TI 0
-	 "nonimmediate_operand"  "=   r,w, r,w,r,m,m,w,m")
+	 "nonimmediate_operand"  "=   r,w, r,w,r,m,m,w,Uts")
	(match_operand:TI 1
	 "aarch64_movti_operand" " rUti,r, w,w,m,r,Z,m,w"))]
   "(register_operand (operands[0], TImode)
@@ -1226,9 +1226,9 @@

 (define_insn "*movtf_aarch64"
   [(set (match_operand:TF 0
-	 "nonimmediate_operand" "=w,?&r,w ,?r,w,?w,w,m,?r,m ,m")
+	 "nonimmediate_operand" "=w,?&r,w ,?r,w,?w,w,Uts,?r,m ,m")
	(match_operand:TF 1
-	 "general_operand"      " w,?r, ?r,w ,Y,Y ,m,w,m ,?r,Y"))]
+	 "general_operand"      " w,?r, ?r,w ,Y,Y ,m,w ,m ,?r,Y"))]
   "TARGET_FLOAT && (register_operand (operands[0], TFmode)
     || aarch64_reg_or_fp_zero (operands[1], TFmode))"
   "@
diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index 3eb07f11894..2ecab8f1286 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -249,6 +249,13 @@
       (match_test "aarch64_legitimate_address_p (V2DImode,
						 XEXP (op, 0), 1)")))

+(define_memory_constraint "Uts"
+  "@internal
+   An address valid for storing a 128-bit AdvSIMD register"
+  (and (match_code "mem")
+       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
+						  1, ADDR_QUERY_STR)")))
+
 (define_memory_constraint "Uty"
   "@internal
    An address valid for SVE LD1Rs."
diff --git a/gcc/testsuite/gcc.target/aarch64/pr82533.c b/gcc/testsuite/gcc.target/aarch64/pr82533.c
new file mode 100644
index 00000000000..fa28ffac03a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr82533.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-mcpu=falkor -O2 -ftree-vectorize" } */
+
+void
+copy (int N, double *c, double *a)
+{
+  for (int i = 0; i < N; ++i)
+    c[i] = a[i];
+}
+
+/* { dg-final { scan-assembler-not "str\tq\[0-9\]+, \\\[x\[0-9\]+, x\[0-9\]+\\\]" } } */