Message ID | 68c1aa7d-0a7b-1427-55f8-edc6302f00dc@suse.com |
---|---|
State | New |
Series | x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F |
On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> There's no reason to constrain this to AVX512VL, as the wider operation
> is not usable for more narrow operands only when the possible memory

But this may require more resources (on the AMD znver4 processor a zmm
instruction will also be split into 2 uops, right?), and on some Intel
processors (SKX/CLX) there will be a frequency reduction.

If it needs to be done, it is better guarded with !TARGET_PREFER_AVX256:
at least when the micro-architecture is AVX256_OPTIMAL, or when users
explicitly use -mprefer-vector-width=256, we don't want to produce any
zmm instruction as a surprise. (Although -mprefer-vector-width=256 is
intended for the auto-vectorizer, backend code generation also uses it
in such cases; e.g. *movsf_internal alternative 5 uses zmm only under
TARGET_AVX512F && !TARGET_PREFER_AVX256.)

> source is a non-broadcast one. This way even the scalar copysign<mode>3
> can benefit from the operation being a single-insn one (leaving aside
> moves which the compiler decides to insert for unclear reasons, and
> leaving aside the fact that bcst_mem_operand() is too restrictive for
> broadcast to be embedded right into VPTERNLOG*).
>
> Along with this also request value duplication in
> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> excess space allocation in .rodata.*, filled with zeros which are never
> read.
>
> gcc/
>
> 	* config/i386/i386-expand.cc (ix86_expand_copysign): Request
> 	value duplication by ix86_build_signbit_mask() when AVX512F and
> 	not HFmode.
> 	* config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> 	2-alternative form. Adjust "mode" attribute. Add "enabled"
> 	attribute.
> 	(*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
> 	(*<avx512>_vpternlog<mode>_2): Likewise.
> 	(*<avx512>_vpternlog<mode>_3): Likewise.
> ---
> I guess the underlying pattern, going along the lines of what
> <mask_codefor>one_cmpl<mode>2<mask_name> uses, can be applied elsewhere
> as well.
>
> HFmode could use embedded broadcast too for copysign and alike, but that
> would need to be V2HF -> V8HF (for which I don't think there are any
> existing patterns).
>
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
>    else
>      dest = NULL_RTX;
>    op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
> -  mask = ix86_build_signbit_mask (vmode, 0, 0);
> +  mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode != HFmode, 0);
>
>    if (CONST_DOUBLE_P (operands[1]))
>      {
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12399,11 +12399,11 @@
>     (set_attr "mode" "<sseinsnmode>")])
>
>  (define_insn "*<avx512>_vternlog<mode>_all"
> -  [(set (match_operand:V 0 "register_operand" "=v")
> +  [(set (match_operand:V 0 "register_operand" "=v,v")
>  	(unspec:V
> -	  [(match_operand:V 1 "register_operand" "0")
> -	   (match_operand:V 2 "register_operand" "v")
> -	   (match_operand:V 3 "bcst_vector_operand" "vmBr")
> +	  [(match_operand:V 1 "register_operand" "0,0")
> +	   (match_operand:V 2 "register_operand" "v,v")
> +	   (match_operand:V 3 "bcst_vector_operand" "vBr,m")
>  	   (match_operand:SI 4 "const_0_to_255_operand")]
>  	  UNSPEC_VTERNLOG))]
>    "TARGET_AVX512F
> @@ -12411,10 +12411,22 @@
>       it's not real AVX512FP16 instruction.  */
>    && (GET_MODE_SIZE (GET_MODE_INNER (<MODE>mode)) >= 4
>        || GET_CODE (operands[3]) != VEC_DUPLICATE)"
> -  "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}"
> +{
> +  if (TARGET_AVX512VL)
> +    return "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}";
> +  else
> +    return "vpternlog<ternlogsuffix>\t{%4, %g3, %g2, %g0|%g0, %g2, %g3, %4}";
> +}
>    [(set_attr "type" "sselog")
>     (set_attr "prefix" "evex")
> -   (set_attr "mode" "<sseinsnmode>")])
> +   (set (attr "mode")
> +	(if_then_else (match_test "TARGET_AVX512VL")
> +		      (const_string "<sseinsnmode>")
> +		      (const_string "XI")))
> +   (set (attr "enabled")
> +	(if_then_else (eq_attr "alternative" "1")
> +		      (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> +		      (const_string "*")))])
>
>  ;; There must be lots of other combinations like
>  ;;
> @@ -12443,7 +12455,7 @@
>  	    (any_logic2:V
>  	      (match_operand:V 3 "regmem_or_bitnot_regmem_operand")
>  	      (match_operand:V 4 "regmem_or_bitnot_regmem_operand"))))]
> -  "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> +  "TARGET_AVX512F
>    && ix86_pre_reload_split ()
>    && (rtx_equal_p (STRIP_UNARY (operands[1]),
>  		   STRIP_UNARY (operands[4]))
> @@ -12527,7 +12539,7 @@
>  	      (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
>  	    (match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
>  	  (match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
> -  "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> +  "TARGET_AVX512F
>    && ix86_pre_reload_split ()
>    && (rtx_equal_p (STRIP_UNARY (operands[1]),
>  		   STRIP_UNARY (operands[4]))
> @@ -12610,7 +12622,7 @@
>  	    (match_operand:V 1 "regmem_or_bitnot_regmem_operand")
>  	    (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
>  	  (match_operand:V 3 "regmem_or_bitnot_regmem_operand")))]
> -  "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> +  "TARGET_AVX512F
>    && ix86_pre_reload_split ()"
>    "#"
>    "&& 1"
On 14.06.2023 10:10, Hongtao Liu wrote:
> On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> There's no reason to constrain this to AVX512VL, as the wider operation
>> is not usable for more narrow operands only when the possible memory
> But this may require more resources (on AMD znver4 processor a zmm
> instruction will also be split into 2 uops, right?) And on some intel
> processors(SKX/CLX) there will be frequency reduction.

I'm afraid I don't follow: Largely the same AVX512 code would be
generated when passing -mavx512vl, so how can power/performance
considerations matter here? All I'm doing here (and in a few more
patches I'm still in the process of testing) is relax when AVX512
insns can actually be used (reducing the copying between registers
and/or the number of insns needed). My understanding on the Intel
side is that it only matters whether AVX512 insns are used, not
what vector length they are. You may be right about znver4, though.

Nevertheless I agree ...

> If it needs to be done, it is better guarded with
> !TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
> or users explicitly uses -mprefer-vector-width=256, we don't want to
> produce any zmm instruction for surprise.(Although
> -mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
> codegen also use it under such cases, i.e. in *movsf_internal
> alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)

... that respecting such overrides is probably desirable, so I'll
adjust.

Jan

>> source is a non-broadcast one. This way even the scalar copysign<mode>3
>> can benefit from the operation being a single-insn one (leaving aside
>> moves which the compiler decides to insert for unclear reasons, and
>> leaving aside the fact that bcst_mem_operand() is too restrictive for
>> broadcast to be embedded right into VPTERNLOG*).
>>
>> Along with this also request value duplication in
>> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
>> excess space allocation in .rodata.*, filled with zeros which are never
>> read.
>>
>> gcc/
>>
>> 	* config/i386/i386-expand.cc (ix86_expand_copysign): Request
>> 	value duplication by ix86_build_signbit_mask() when AVX512F and
>> 	not HFmode.
>> 	* config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
>> 	2-alternative form. Adjust "mode" attribute. Add "enabled"
>> 	attribute.
>> 	(*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
>> 	(*<avx512>_vpternlog<mode>_2): Likewise.
>> 	(*<avx512>_vpternlog<mode>_3): Likewise.
On Wed, Jun 14, 2023 at 5:32 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 10:10, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> There's no reason to constrain this to AVX512VL, as the wider operation
> >> is not usable for more narrow operands only when the possible memory
> > But this may require more resources (on AMD znver4 processor a zmm
> > instruction will also be split into 2 uops, right?) And on some intel
> > processors(SKX/CLX) there will be frequency reduction.
>
> I'm afraid I don't follow: Largely the same AVX512 code would be
> generated when passing -mavx512vl, so how can power/performance
> considerations matter here? All I'm doing here (and in a few more

Yes, for -march=*** it is ok, since AVX512VL is included. What your
patch improves is -mavx512f -mno-avx512vl; but for specific option
combinations like -mavx512f -mprefer-vector-width=256 -mno-avx512vl,
your patch will produce zmm instructions, which is not expected.

> patches I'm still in the process of testing) is relax when AVX512
> insns can actually be used (reducing the copying between registers
> and/or the number of insns needed). My understanding on the Intel
> side is that it only matters whether AVX512 insns are used, not

No, vector length matters: ymm/xmm EVEX insns are ok to use, but zmm
insns will cause frequency reduction.

> what vector length they are. You may be right about znver4, though.
>
> Nevertheless I agree ...
>
> > If it needs to be done, it is better guarded with
> > !TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
> > or users explicitly uses -mprefer-vector-width=256, we don't want to
> > produce any zmm instruction for surprise.(Although
> > -mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
> > codegen also use it under such cases, i.e. in *movsf_internal
> > alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)
>
> ... that respecting such overrides is probably desirable, so I'll
> adjust.
>
> Jan
>
> >> source is a non-broadcast one. This way even the scalar copysign<mode>3
> >> can benefit from the operation being a single-insn one (leaving aside
> >> moves which the compiler decides to insert for unclear reasons, and
> >> leaving aside the fact that bcst_mem_operand() is too restrictive for
> >> broadcast to be embedded right into VPTERNLOG*).
> >>
> >> Along with this also request value duplication in
> >> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> >> excess space allocation in .rodata.*, filled with zeros which are never
> >> read.
> >>
> >> gcc/
> >>
> >> 	* config/i386/i386-expand.cc (ix86_expand_copysign): Request
> >> 	value duplication by ix86_build_signbit_mask() when AVX512F and
> >> 	not HFmode.
> >> 	* config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> >> 	2-alternative form. Adjust "mode" attribute. Add "enabled"
> >> 	attribute.
> >> 	(*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
> >> 	(*<avx512>_vpternlog<mode>_2): Likewise.
> >> 	(*<avx512>_vpternlog<mode>_3): Likewise.
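For illustration only, one shape the promised adjustment could take (a hypothetical sketch, not the committed follow-up) is to keep the relaxed TARGET_AVX512F test on the split patterns but permit the zmm fallback only when 512-bit vectors are acceptable:

```
;; Hypothetical sketch: respect -mprefer-vector-width=256 while still
;; relaxing the AVX512VL requirement.  Exact form in the revised patch
;; may differ.
  "TARGET_AVX512F
   && (<MODE_SIZE> == 64 || TARGET_AVX512VL || !TARGET_PREFER_AVX256)
   && ix86_pre_reload_split ()"
```

With such a guard, -mavx512f -mno-avx512vl alone would still get the single-insn VPTERNLOG via a zmm encoding, while adding -mprefer-vector-width=256 (or an AVX256_OPTIMAL tuning) would fall back to the pre-patch behavior.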
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
   else
     dest = NULL_RTX;
   op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
-  mask = ix86_build_signbit_mask (vmode, 0, 0);
+  mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode != HFmode, 0);
 
   if (CONST_DOUBLE_P (operands[1]))
     {
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12399,11 +12399,11 @@
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "*<avx512>_vternlog<mode>_all"
-  [(set (match_operand:V 0 "register_operand" "=v")
+  [(set (match_operand:V 0 "register_operand" "=v,v")
 	(unspec:V
-	  [(match_operand:V 1 "register_operand" "0")
-	   (match_operand:V 2 "register_operand" "v")
-	   (match_operand:V 3 "bcst_vector_operand" "vmBr")
+	  [(match_operand:V 1 "register_operand" "0,0")
+	   (match_operand:V 2 "register_operand" "v,v")
+	   (match_operand:V 3 "bcst_vector_operand" "vBr,m")
 	   (match_operand:SI 4 "const_0_to_255_operand")]
 	  UNSPEC_VTERNLOG))]
   "TARGET_AVX512F
@@ -12411,10 +12411,22 @@
      it's not real AVX512FP16 instruction.  */
   && (GET_MODE_SIZE (GET_MODE_INNER (<MODE>mode)) >= 4
       || GET_CODE (operands[3]) != VEC_DUPLICATE)"
-  "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}"
+{
+  if (TARGET_AVX512VL)
+    return "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}";
+  else
+    return "vpternlog<ternlogsuffix>\t{%4, %g3, %g2, %g0|%g0, %g2, %g3, %4}";
+}
   [(set_attr "type" "sselog")
    (set_attr "prefix" "evex")
-   (set_attr "mode" "<sseinsnmode>")])
+   (set (attr "mode")
+	(if_then_else (match_test "TARGET_AVX512VL")
+		      (const_string "<sseinsnmode>")
+		      (const_string "XI")))
+   (set (attr "enabled")
+	(if_then_else (eq_attr "alternative" "1")
+		      (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+		      (const_string "*")))])
 
 ;; There must be lots of other combinations like
 ;;
@@ -12443,7 +12455,7 @@
 	    (any_logic2:V
 	      (match_operand:V 3 "regmem_or_bitnot_regmem_operand")
 	      (match_operand:V 4 "regmem_or_bitnot_regmem_operand"))))]
-  "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+  "TARGET_AVX512F
   && ix86_pre_reload_split ()
   && (rtx_equal_p (STRIP_UNARY (operands[1]),
		   STRIP_UNARY (operands[4]))
@@ -12527,7 +12539,7 @@
 	      (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
 	    (match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
 	  (match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
-  "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+  "TARGET_AVX512F
   && ix86_pre_reload_split ()
   && (rtx_equal_p (STRIP_UNARY (operands[1]),
		   STRIP_UNARY (operands[4]))
@@ -12610,7 +12622,7 @@
 	    (match_operand:V 1 "regmem_or_bitnot_regmem_operand")
 	    (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
 	  (match_operand:V 3 "regmem_or_bitnot_regmem_operand")))]
-  "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+  "TARGET_AVX512F
   && ix86_pre_reload_split ()"
   "#"
   "&& 1"