Message ID: 20170302214419.GT1849@tucnak
State: New
On Thu, Mar 2, 2017 at 10:44 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> Hi!
>
> For -O1 and above, we force all operands and targets of x86 builtins into
> registers during expansion, because we expect the combiner will do its job.
> But for -O0, we try to keep them in MEM if the predicate allows it, we just
> make sure that at most one input operand is a MEM.  We still allow a
> MEM destination and one MEM input operand, which is why we ICE - most
> insns/expanders don't allow that in the predicates and it is fine, but
> various insns/expanders have e.g. nonimmediate_operand on both target
> and some input operand and just have a condition that requires that one
> of those is a register or that both of them are not memory.  The expanders
> don't check the condition though; it is checked only when the insn is being
> recognized in the vregs pass, and that is too late.
> The following patch fixes that by forcing all input operands at -O0 into
> non-memory if the target is memory.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> Or are there any define_insn/define_expand where it is desirable to have
> both input and output operand a MEM (and does it have to be matching)?
> For various scalar binary and unary expanders the backend already uses a helper
> that will force something into memory if dest and src are both memory and
> not rtx_equal_p, but do we have anything like that in anything these two
> builtin expanders emit?

Any insn with matched memory (movlps, movhps and similar) can also
operate with matched register. To my knowledge, all insn patterns are
written in this way, since in the past we had plenty of problems with
matched memory operands.

Also, I really don't remember what benefit brings us to *not* force
input operand to a register at -O0 at expand time and leave to RA to
find matched memory. OTOH, with -O0 we already have so many
unnecessary moves that you get tears in the eyes...
> 2017-03-02  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR target/79807
> 	* config/i386/i386.c (ix86_expand_multi_arg_builtin): If target
> 	is a memory operand, increase num_memory.
> 	(ix86_expand_args_builtin): Likewise.
>
> 	* gcc.target/i386/pr79807.c: New test.

OK.

Thanks,
Uros.

> --- gcc/config/i386/i386.c.jj	2017-03-02 08:08:43.000000000 +0100
> +++ gcc/config/i386/i386.c	2017-03-02 17:58:28.396999033 +0100
> @@ -34249,6 +34249,8 @@ ix86_expand_multi_arg_builtin (enum insn
>  	  || GET_MODE (target) != tmode
>  	  || !insn_data[icode].operand[0].predicate (target, tmode))
>  	target = gen_reg_rtx (tmode);
> +      else if (memory_operand (target, tmode))
> +	num_memory++;
>
>    gcc_assert (nargs <= 4);
>
> @@ -35534,6 +35536,8 @@ ix86_expand_args_builtin (const struct b
>  	  || GET_MODE (target) != tmode
>  	  || !insn_p->operand[0].predicate (target, tmode))
>  	target = gen_reg_rtx (tmode);
> +      else if (memory_operand (target, tmode))
> +	num_memory++;
>        real_target = target;
>      }
>    else
> --- gcc/testsuite/gcc.target/i386/pr79807.c.jj	2017-03-02 18:03:30.032023436 +0100
> +++ gcc/testsuite/gcc.target/i386/pr79807.c	2017-03-02 18:02:53.000000000 +0100
> @@ -0,0 +1,12 @@
> +/* PR target/79807 */
> +/* { dg-do compile } */
> +/* { dg-options "-O0 -mavx -ffloat-store" } */
> +
> +typedef double __v2df __attribute__ ((__vector_size__ (16)));
> +typedef double __v4df __attribute__ ((__vector_size__ (32)));
> +
> +__v2df
> +foo (__v4df x)
> +{
> +  return __builtin_ia32_pd_pd256 (x);
> +}
>
> 	Jakub
On Fri, Mar 03, 2017 at 09:10:03AM +0100, Uros Bizjak wrote:
> > Or are there any define_insn/define_expand where it is desirable to have
> > both input and output operand a MEM (and does it have to be matching)?
> > For various scalar binary and unary expanders the backend already uses a helper
> > that will force something into memory if dest and src are both memory and
> > not rtx_equal_p, but do we have anything like that in anything these two
> > builtin expanders emit?
>
> Any insn with matched memory (movlps, movhps and similar) can also
> operate with matched register. To my knowledge, all insn patterns are
> written in this way, since in the past we had plenty of problems with
> matched memory operands.
>
> Also, I really don't remember what benefit brings us to *not* force
> input operand to a register at -O0 at expand time and leave to RA to
> find matched memory. OTOH, with -O0 we already have so many
> unnecessary moves that you get tears in the eyes...

So, I've tried:
make mddump
for i in `sed -n -e 's/^.*CODE_FOR_\([^,]*\),.*$/\1/p' ../../gcc/config/i386/i386-builtin.def`; do sed -n '/^(define_\(insn\|expand\) ("'$i'"/,/^$/p' tmp-mddump.md; done
and looked for "=.*[mo] in there.
Some insns might want that && !(MEM_P (operands[0]) && MEM_P (operands[1])),
e.g.
(define_insn_and_split "avx512f_<castmode><avxsizesuffix>_<castmode>"
  [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m")
	(unspec:AVX512MODE2P
	  [(match_operand:<ssequartermode> 1 "nonimmediate_operand" "xm,x")]
	  UNSPEC_CAST))]
  "TARGET_AVX512F"
  "#"
  "&& reload_completed"
  [(set (match_dup 0) (match_dup 1))]
as the constraint require that both operands aren't memory, shall I create a
patch for that?  This is the first category below.

The second category is where matching operand is ok, so my patch can
pessimize stuff.  I wonder if we couldn't handle this like:
	  /* If we aren't optimizing, only allow one memory operand to be
	     generated.  */
	  if (memory_operand (op, mode))
	    {
	      const char *const constraint
		= insn_data[icode].operand[i + adjust + 1].constraint;
	      if (optimize
		  || num_memory != 1
		  || !rtx_equal_p (real_target, op))
		num_memory++;
	      /* sse2_movsd allows matching operand.  */
	      else if (icode == CODE_FOR_sse2_movsd)
		;
	      /* Various masked insns allow matching operand.  */
	      else if (insn_data[icode].operand[i + adjust + 1].predicate
		       == vector_move_operand
		       && (strcmp (constraint, "0C") == 0
			   || strcmp (constraint, "0C,0") == 0))
		;
	      else
		num_memory++;
	    }
(though perhaps sse2_movsd is still too simplistic, because it has just
=m v 0 and =o 0 v alternatives, so if i + adjust + 1 is 2, then it is
fine as is, if it is 1, then only if the memory is offsettable; though
perhaps LRA can just turn non-offsettable memory into offsettable one
through secondary reload).

And the last category is with "=m" destination (not allowing anything else)
and (match_dup 0) somewhere in the pattern, I believe those have to be
expanded specially, because otherwise one e.g. wouldn't get MEM_P target
if optimize at all (I think the builtins are supposed to pass pointer to the
result in that case) and the match_dup just doesn't appear as another operand.

missing !(MEM && MEM) ?
=======================
avx512f_pd512_256pd
avx512f_pd512_pd
avx512f_ps512_256ps
avx512f_ps512_ps
avx512f_si512_256si
avx512f_si512_si
avx_pd256_pd
avx_ps256_ps
avx_si256_si
sse_storehps
sse_storelps

valid matching mem
==================
avx512f_ss_truncatev16siv16hi2_mask
avx512f_ss_truncatev16siv16qi2_mask
avx512f_ss_truncatev8div8hi2_mask
avx512f_ss_truncatev8div8si2_mask
avx512f_truncatev16siv16hi2_mask
avx512f_truncatev16siv16qi2_mask
avx512f_truncatev8div8hi2_mask
avx512f_truncatev8div8si2_mask
avx512f_us_truncatev16siv16hi2_mask
avx512f_us_truncatev16siv16qi2_mask
avx512f_us_truncatev8div8hi2_mask
avx512f_us_truncatev8div8si2_mask
avx512f_vcvtps2ph512_mask
avx512vl_ss_truncatev16hiv16qi2_mask
avx512vl_ss_truncatev4div4si2_mask
avx512vl_ss_truncatev8siv8hi2_mask
avx512vl_truncatev16hiv16qi2_mask
avx512vl_truncatev4div4si2_mask
avx512vl_truncatev8siv8hi2_mask
avx512vl_us_truncatev16hiv16qi2_mask
avx512vl_us_truncatev4div4si2_mask
avx512vl_us_truncatev8siv8hi2_mask
sse2_movsd
vcvtps2ph256_mask

match_dup on mem target
=======================
avx512f_ss_truncatev8div16qi2_mask_store
avx512f_storev16sf_mask
avx512f_storev16si_mask
avx512f_storev8df_mask
avx512f_storev8di_mask
avx512f_truncatev8div16qi2_mask_store
avx512f_us_truncatev8div16qi2_mask_store
avx512vl_compressstorev2df_mask
avx512vl_compressstorev2di_mask
avx512vl_compressstorev4df_mask
avx512vl_compressstorev4di_mask
avx512vl_compressstorev4sf_mask
avx512vl_compressstorev4si_mask
avx512vl_compressstorev8sf_mask
avx512vl_compressstorev8si_mask
avx512vl_ss_truncatev2div2hi2_mask_store
avx512vl_ss_truncatev2div2qi2_mask_store
avx512vl_ss_truncatev2div2si2_mask_store
avx512vl_ss_truncatev4div4hi2_mask_store
avx512vl_ss_truncatev4div4qi2_mask_store
avx512vl_ss_truncatev4siv4hi2_mask_store
avx512vl_ss_truncatev4siv4qi2_mask_store
avx512vl_ss_truncatev8siv8qi2_mask_store
avx512vl_storev16hi_mask
avx512vl_storev16qi_mask
avx512vl_storev2df_mask
avx512vl_storev2di_mask
avx512vl_storev32qi_mask
avx512vl_storev4df_mask
avx512vl_storev4di_mask
avx512vl_storev4sf_mask
avx512vl_storev4si_mask
avx512vl_storev8hi_mask
avx512vl_storev8sf_mask
avx512vl_storev8si_mask
avx512vl_truncatev2div2hi2_mask_store
avx512vl_truncatev2div2qi2_mask_store
avx512vl_truncatev2div2si2_mask_store
avx512vl_truncatev4div4hi2_mask_store
avx512vl_truncatev4div4qi2_mask_store
avx512vl_truncatev4siv4hi2_mask_store
avx512vl_truncatev4siv4qi2_mask_store
avx512vl_truncatev8siv8qi2_mask_store
avx512vl_us_truncatev2div2hi2_mask_store
avx512vl_us_truncatev2div2qi2_mask_store
avx512vl_us_truncatev2div2si2_mask_store
avx512vl_us_truncatev4div4hi2_mask_store
avx512vl_us_truncatev4div4qi2_mask_store
avx512vl_us_truncatev4siv4hi2_mask_store
avx512vl_us_truncatev4siv4qi2_mask_store
avx512vl_us_truncatev8siv8qi2_mask_store

	Jakub
On Fri, Mar 3, 2017 at 11:35 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Mar 03, 2017 at 09:10:03AM +0100, Uros Bizjak wrote:
>> > Or are there any define_insn/define_expand where it is desirable to have
>> > both input and output operand a MEM (and does it have to be matching)?
>> > For various scalar binary and unary expanders the backend already uses a helper
>> > that will force something into memory if dest and src are both memory and
>> > not rtx_equal_p, but do we have anything like that in anything these two
>> > builtin expanders emit?
>>
>> Any insn with matched memory (movlps, movhps and similar) can also
>> operate with matched register. To my knowledge, all insn patterns are
>> written in this way, since in the past we had plenty of problems with
>> matched memory operands.
>>
>> Also, I really don't remember what benefit brings us to *not* force
>> input operand to a register at -O0 at expand time and leave to RA to
>> find matched memory. OTOH, with -O0 we already have so many
>> unnecessary moves that you get tears in the eyes...
>
> So, I've tried:
> make mddump
> for i in `sed -n -e 's/^.*CODE_FOR_\([^,]*\),.*$/\1/p' ../../gcc/config/i386/i386-builtin.def`; do sed -n '/^(define_\(insn\|expand\) ("'$i'"/,/^$/p' tmp-mddump.md; done
> and looked for "=.*[mo] in there.
> Some insns might want that && !(MEM_P (operands[0]) && MEM_P (operands[1])),
> e.g.
> (define_insn_and_split "avx512f_<castmode><avxsizesuffix>_<castmode>"
>   [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m")
> 	(unspec:AVX512MODE2P
> 	  [(match_operand:<ssequartermode> 1 "nonimmediate_operand" "xm,x")]
> 	  UNSPEC_CAST))]
>   "TARGET_AVX512F"
>   "#"
>   "&& reload_completed"
>   [(set (match_dup 0) (match_dup 1))]
> as the constraint require that both operands aren't memory, shall I create a
> patch for that?  This is the first category below.

Yes.
Although expander takes care not to generate two memory
references, combine can propagate memory to the other operand,
creating semi-invalid RTX that is later resolved by RA.

> The second category is where matching operand is ok, so my patch can
> pessimize stuff.  I wonder if we couldn't handle this like:
> 	  /* If we aren't optimizing, only allow one memory operand to be
> 	     generated.  */
> 	  if (memory_operand (op, mode))
> 	    {
> 	      const char *const constraint
> 		= insn_data[icode].operand[i + adjust + 1].constraint;
> 	      if (optimize
> 		  || num_memory != 1
> 		  || !rtx_equal_p (real_target, op))
> 		num_memory++;
> 	      /* sse2_movsd allows matching operand.  */
> 	      else if (icode == CODE_FOR_sse2_movsd)
> 		;
> 	      /* Various masked insns allow matching operand.  */
> 	      else if (insn_data[icode].operand[i + adjust + 1].predicate
> 		       == vector_move_operand
> 		       && (strcmp (constraint, "0C") == 0
> 			   || strcmp (constraint, "0C,0") == 0))
> 		;
> 	      else
> 		num_memory++;
> 	    }
> (though perhaps sse2_movsd is still too simplistic, because it has just
> =m v 0 and =o 0 v alternatives, so if i + adjust + 1 is 2, then it is
> fine as is, if it is 1, then only if the memory is offsettable; though
> perhaps LRA can just turn non-offsettable memory into offsettable one
> through secondary reload).

There is similarity with arithmetic operations. Please note that arith
ops fixup their arguments using ix86_expand_binary_operator, and their
insn patterns have to satisfy ix86_binary_operator_ok insn predicate.
However, even using this infrastructure, I have noticed that the
compiler is quite sloppy with memory operands, and we mostly get
matched memory after combine pass (if we are lucky).

Regarding the implementation, I'd rather see another flag for
i386-builtin.def entries, where we can mark if matching memory operand
is allowed. The expander can then look at the flag, and do some
special magic.
> And the last category is with "=m" destination (not allowing anything else)
> and (match_dup 0) somewhere in the pattern, I believe those have to be
> expanded specially, because otherwise one e.g. wouldn't get MEM_P target
> if optimize at all (I think the builtins are supposed to pass pointer to the
> result in that case) and the match_dup just doesn't appear as another operand.

It looks to me that the expander already takes care of matching, where
strictly required. OTOH, a matching memory requirement would also fit in
the proposed implementation with a flag in builtin entries: we can mark if
a matching memory operand is required, not only allowed.

However, as said above, expanding with memory operands is a somehow
neglected area, desperately looking for some love.

Uros.
On Fri, Mar 03, 2017 at 12:18:09PM +0100, Uros Bizjak wrote:
> > as the constraint require that both operands aren't memory, shall I create a
> > patch for that?  This is the first category below.
>
> Yes. Although expander takes care not to generate two memory
> references, combine can propagate memory to the other operand,
> creating semi-invalid RTX that is later resolved by RA.

Ok, will work on a patch.

> > The second category is where matching operand is ok, so my patch can
> > pessimize stuff.  I wonder if we couldn't handle this like:
> > 	  /* If we aren't optimizing, only allow one memory operand to be
> > 	     generated.  */
> > 	  if (memory_operand (op, mode))
> > 	    {
> > 	      const char *const constraint
> > 		= insn_data[icode].operand[i + adjust + 1].constraint;
> > 	      if (optimize
> > 		  || num_memory != 1
> > 		  || !rtx_equal_p (real_target, op))
> > 		num_memory++;
> > 	      /* sse2_movsd allows matching operand.  */
> > 	      else if (icode == CODE_FOR_sse2_movsd)
> > 		;
> > 	      /* Various masked insns allow matching operand.  */
> > 	      else if (insn_data[icode].operand[i + adjust + 1].predicate
> > 		       == vector_move_operand
> > 		       && (strcmp (constraint, "0C") == 0
> > 			   || strcmp (constraint, "0C,0") == 0))
> > 		;
> > 	      else
> > 		num_memory++;
> > 	    }
> > (though perhaps sse2_movsd is still too simplistic, because it has just
> > =m v 0 and =o 0 v alternatives, so if i + adjust + 1 is 2, then it is
> > fine as is, if it is 1, then only if the memory is offsettable; though
> > perhaps LRA can just turn non-offsettable memory into offsettable one
> > through secondary reload).
>
> There is similarity with arithmetic operations. Please note that arith
> ops fixup their arguments using ix86_expand_binary_operator, and their
> insn patterns have to satisfy ix86_binary_operator_ok insn predicate.
> However, even using this infrastructure, I have noticed that the
> compiler is quite sloppy with memory operands, and we mostly get
> matched memory after combine pass (if we are lucky).
>
> Regarding the implementation, I'd rather see another flag for
> i386-builtin.def entries, where we can mark if matching memory operand
> is allowed. The expander can then look at the flag, and do some
> special magic.

I think maintaining such a flag would be a nightmare, I think more insns
will need it in the future and it will be hard to remember.  Maybe it is better
not to do anything about those (i.e. apply my patch as is), after all -O0
is not optimized code.

	Jakub
On Fri, Mar 3, 2017 at 12:30 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> Regarding the implementation, I'd rather see another flag for
>> i386-builtin.def entries, where we can mark if matching memory operand
>> is allowed. The expander can then look at the flag, and do some
>> special magic.
>
> I think maintaining such a flag would be a nightmare, I think more insns
> will need it in the future and it will be hard to remember.  Maybe it is better
> not to do anything about those (i.e. apply my patch as is), after all -O0
> is not optimized code.

Yes, I agree.

Uros.
--- gcc/config/i386/i386.c.jj	2017-03-02 08:08:43.000000000 +0100
+++ gcc/config/i386/i386.c	2017-03-02 17:58:28.396999033 +0100
@@ -34249,6 +34249,8 @@ ix86_expand_multi_arg_builtin (enum insn
 	  || GET_MODE (target) != tmode
 	  || !insn_data[icode].operand[0].predicate (target, tmode))
 	target = gen_reg_rtx (tmode);
+      else if (memory_operand (target, tmode))
+	num_memory++;
 
   gcc_assert (nargs <= 4);
 
@@ -35534,6 +35536,8 @@ ix86_expand_args_builtin (const struct b
 	  || GET_MODE (target) != tmode
 	  || !insn_p->operand[0].predicate (target, tmode))
 	target = gen_reg_rtx (tmode);
+      else if (memory_operand (target, tmode))
+	num_memory++;
       real_target = target;
     }
   else
--- gcc/testsuite/gcc.target/i386/pr79807.c.jj	2017-03-02 18:03:30.032023436 +0100
+++ gcc/testsuite/gcc.target/i386/pr79807.c	2017-03-02 18:02:53.000000000 +0100
@@ -0,0 +1,12 @@
+/* PR target/79807 */
+/* { dg-do compile } */
+/* { dg-options "-O0 -mavx -ffloat-store" } */
+
+typedef double __v2df __attribute__ ((__vector_size__ (16)));
+typedef double __v4df __attribute__ ((__vector_size__ (32)));
+
+__v2df
+foo (__v4df x)
+{
+  return __builtin_ia32_pd_pd256 (x);
+}