Message ID | CAMe9rOryQB=OT=RAff6BkLbbOrx9tDB1by6M_tnuYPBxmXQ5mQ@mail.gmail.com |
---|---|
State | New |
Headers | show |
Series | V2 [PATCH] i386: Add pass_remove_partial_avx_dependency | expand |
On Fri, Oct 19, 2018 at 1:44 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > On 10/18/18, Jan Hubicka <hubicka@ucw.cz> wrote: > >> we need to generate > >> > >> vxorp[ds] %xmmN, %xmmN, %xmmN > >> ... > >> vcvtss2sd f(%rip), %xmmN, %xmmX > >> ... > >> vcvtsi2ss i(%rip), %xmmN, %xmmY > >> > >> to avoid partial XMM register stall. This patch adds a pass to generate > >> a single > >> > >> vxorps %xmmN, %xmmN, %xmmN > >> > >> at function entry, which is shared by all SF and DF conversions, instead > >> of generating one > >> > >> vxorp[ds] %xmmN, %xmmN, %xmmN > >> > >> for each SF/DF conversion. > >> > >> Performance impacts on SPEC CPU 2017 rate with 1 copy using > >> > >> -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops > >> > >> are > >> > >> 1. On Broadwell server: > >> > >> 500.perlbench_r (-0.82%) > >> 502.gcc_r (0.73%) > >> 505.mcf_r (-0.24%) > >> 520.omnetpp_r (-2.22%) > >> 523.xalancbmk_r (-1.47%) > >> 525.x264_r (0.31%) > >> 531.deepsjeng_r (0.27%) > >> 541.leela_r (0.85%) > >> 548.exchange2_r (-0.11%) > >> 557.xz_r (-0.34%) > >> Geomean: (-0.23%) > >> > >> 503.bwaves_r (0.00%) > >> 507.cactuBSSN_r (-1.88%) > >> 508.namd_r (0.00%) > >> 510.parest_r (-0.56%) > >> 511.povray_r (0.49%) > >> 519.lbm_r (-1.28%) > >> 521.wrf_r (-0.28%) > >> 526.blender_r (0.55%) > >> 527.cam4_r (-0.20%) > >> 538.imagick_r (2.52%) > >> 544.nab_r (-0.18%) > >> 549.fotonik3d_r (-0.51%) > >> 554.roms_r (-0.22%) > >> Geomean: (0.00%) > > > > I wonder why the patch seems to have more effect on specint that should not > > care much > > about float<->double conversions? > > These are within noise range. > > >> number of vxorp[ds]: > >> > >> before after difference > >> 14570 4515 -69% > >> > >> OK for trunk? > > > > This looks very nice though. > > > > > + if (v4sf_const0) > > + { > > + /* Generate a single vxorps at function entry and preform df > > + rescan. */ > > + bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb; > > + insn = BB_HEAD (bb); > > + set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); > > + set_insn = emit_insn_after (set, insn); > > + df_insn_rescan (set_insn); > > + df_process_deferred_rescans (); > > + } > > > > It seems suboptimal to place the const0 at the entry of function - if the > > conversoin happens in cold region of function this will just increase > > register > > pressure. I guess right answer would be to look for the postdominance > > frontier > > Did you mean "the nearest common dominator"? > > > of the set of all uses of the zero register? > > > > Here is the updated patch to adds a pass to generate a single > > vxorps %xmmN, %xmmN, %xmmN > > at entry of the nearest common dominator for basic blocks with SF/DF > conversions. OK for trunk? > PING: https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01175.html
> > Did you mean "the nearest common dominator"? If the nearest common dominator appears in the loop while all uses are out of loops, this will result in suboptimal xor placement. In this case you want to split edges out of the loop. In general this is what the LCM framework will do for you if the problem is modelled siimlar way as in mode_swtiching. At entry function mode is "no zero register needed" and all conversions need mode "zero register needed". Mode switching should then do the correct placement decisions (reaching minimal number of executions of xor). Jeff, whan is your optinion on the approach taken by the patch? It seems like a special case of more general issue, but I do not see very elegant way to solve it at least in the GCC 9 horisont, so if the placement is correct we can probalby go either with new pass or making this part of mode swithcing (which is anyway run by x86 backend) Honza > > > of the set of all uses of the zero register? > > > > Here is the updated patch to adds a pass to generate a single > > vxorps %xmmN, %xmmN, %xmmN > > at entry of the nearest common dominator for basic blocks with SF/DF > conversions. OK for trunk? > > Thanks. > > > -- > H.J. > From e2a437f48778ae9586f2038220840ecc41566f69 Mon Sep 17 00:00:00 2001 > From: "H.J. Lu" <hjl.tools@gmail.com> > Date: Wed, 15 Aug 2018 09:58:31 -0700 > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency > > With -mavx, for > > [hjl@gnu-cfl-1 skx-2]$ cat foo.i > extern float f; > extern double d; > extern int i; > > void > foo (void) > { > d = f; > f = i; > } > > we need to generate > > vxorp[ds] %xmmN, %xmmN, %xmmN > ... > vcvtss2sd f(%rip), %xmmN, %xmmX > ... > vcvtsi2ss i(%rip), %xmmN, %xmmY > > to avoid partial XMM register stall. This patch adds a pass to generate > a single > > vxorps %xmmN, %xmmN, %xmmN > > at entry of the nearest common dominator for basic blocks with SF/DF > conversions, instead of generating one > > vxorp[ds] %xmmN, %xmmN, %xmmN > > for each SF/DF conversion. > > Performance impacts on SPEC CPU 2017 rate with 1 copy using > > -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops > > are > > 1. On Broadwell server: > > 500.perlbench_r (-0.82%) > 502.gcc_r (0.73%) > 505.mcf_r (-0.24%) > 520.omnetpp_r (-2.22%) > 523.xalancbmk_r (-1.47%) > 525.x264_r (0.31%) > 531.deepsjeng_r (0.27%) > 541.leela_r (0.85%) > 548.exchange2_r (-0.11%) > 557.xz_r (-0.34%) > Geomean: (-0.23%) > > 503.bwaves_r (0.00%) > 507.cactuBSSN_r (-1.88%) > 508.namd_r (0.00%) > 510.parest_r (-0.56%) > 511.povray_r (0.49%) > 519.lbm_r (-1.28%) > 521.wrf_r (-0.28%) > 526.blender_r (0.55%) > 527.cam4_r (-0.20%) > 538.imagick_r (2.52%) > 544.nab_r (-0.18%) > 549.fotonik3d_r (-0.51%) > 554.roms_r (-0.22%) > Geomean: (0.00%) > > 2. On Skylake client: > > 500.perlbench_r (-0.29%) > 502.gcc_r (-0.36%) > 505.mcf_r (1.77%) > 520.omnetpp_r (-0.26%) > 523.xalancbmk_r (-3.69%) > 525.x264_r (-0.32%) > 531.deepsjeng_r (0.00%) > 541.leela_r (-0.46%) > 548.exchange2_r (0.00%) > 557.xz_r (0.00%) > Geomean: (-0.34%) > > 503.bwaves_r (0.00%) > 507.cactuBSSN_r (-0.56%) > 508.namd_r (0.87%) > 510.parest_r (0.00%) > 511.povray_r (-0.73%) > 519.lbm_r (0.84%) > 521.wrf_r (0.00%) > 526.blender_r (-0.81%) > 527.cam4_r (-0.43%) > 538.imagick_r (2.55%) > 544.nab_r (0.28%) > 549.fotonik3d_r (0.00%) > 554.roms_r (0.32%) > Geomean: (0.12%) > > 3. On Skylake server: > > 500.perlbench_r (-0.55%) > 502.gcc_r (0.69%) > 505.mcf_r (0.00%) > 520.omnetpp_r (-0.33%) > 523.xalancbmk_r (-0.21%) > 525.x264_r (-0.27%) > 531.deepsjeng_r (0.00%) > 541.leela_r (0.00%) > 548.exchange2_r (-0.11%) > 557.xz_r (0.00%) > Geomean: (0.00%) > > 503.bwaves_r (0.58%) > 507.cactuBSSN_r (0.00%) > 508.namd_r (0.00%) > 510.parest_r (0.18%) > 511.povray_r (-0.58%) > 519.lbm_r (0.25%) > 521.wrf_r (0.40%) > 526.blender_r (0.34%) > 527.cam4_r (0.19%) > 538.imagick_r (5.87%) > 544.nab_r (0.17%) > 549.fotonik3d_r (0.00%) > 554.roms_r (0.00%) > Geomean: (0.62%) > > On Skylake client, impacts on 538.imagick_r are > > size before: > > text data bss dec hex filename > 2555577 10876 5576 2572029 273efd imagick_r.exe > > size after: > > text data bss dec hex filename > 2511825 10876 5576 2528277 269415 imagick_r.exe > > number of vxorp[ds]: > > before after difference > 14570 4515 -69% > > gcc/ > > 2018-08-28 H.J. Lu <hongjiu.lu@intel.com> > Sunil K Pandey <sunil.k.pandey@intel.com> > > PR target/87007 > * config/i386/i386-passes.def: Add > pass_remove_partial_avx_dependency. > * config/i386/i386-protos.h > (make_pass_remove_partial_avx_dependency): New. > * config/i386/i386.c (make_pass_remove_partial_avx_dependency): > New function. > (pass_data_remove_partial_avx_dependency): New. > (pass_remove_partial_avx_dependency): Likewise. > (make_pass_remove_partial_avx_dependency): Likewise. > * config/i386/i386.md (SF/DF conversion splitters): Disabled > for TARGET_AVX. > > gcc/testsuite/ > > 2018-08-28 H.J. Lu <hongjiu.lu@intel.com> > Sunil K Pandey <sunil.k.pandey@intel.com> > > PR target/87007 > * gcc.target/i386/pr87007.c: New file. > --- > gcc/config/i386/i386-passes.def | 2 + > gcc/config/i386/i386-protos.h | 2 + > gcc/config/i386/i386.c | 179 ++++++++++++++++++++++++ > gcc/config/i386/i386.md | 9 +- > gcc/testsuite/gcc.target/i386/pr87007.c | 15 ++ > 5 files changed, 204 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr87007.c > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def > index 4a6a95cc2d9..b439f3a9028 100644 > --- a/gcc/config/i386/i386-passes.def > +++ b/gcc/config/i386/i386-passes.def > @@ -31,3 +31,5 @@ along with GCC; see the file COPYING3. If not see > INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */); > > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch); > + > + INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > index d1d59633dc0..1626c9d0d70 100644 > --- a/gcc/config/i386/i386-protos.h > +++ b/gcc/config/i386/i386-protos.h > @@ -358,3 +358,5 @@ class rtl_opt_pass; > extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *); > extern rtl_opt_pass *make_pass_stv (gcc::context *); > extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *); > +extern rtl_opt_pass *make_pass_remove_partial_avx_dependency > + (gcc::context *); > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > index e47603eb5ff..4f0270ebfff 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -2741,6 +2741,185 @@ make_pass_insert_endbranch (gcc::context *ctxt) > return new pass_insert_endbranch (ctxt); > } > > +/* At entry of the nearest common dominator for basic blocks with > + conversions, generate a single > + vxorps %xmmN, %xmmN, %xmmN > + for all > + vcvtss2sd op, %xmmN, %xmmX > + vcvtsd2ss op, %xmmN, %xmmX > + vcvtsi2ss op, %xmmN, %xmmX > + vcvtsi2sd op, %xmmN, %xmmX > + */ > + > +static unsigned int > +remove_partial_avx_dependency (void) > +{ > + timevar_push (TV_MACH_DEP); > + > + calculate_dominance_info (CDI_DOMINATORS); > + df_set_flags (DF_DEFER_INSN_RESCAN); > + df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN); > + df_md_add_problem (); > + df_analyze (); > + > + bitmap_obstack_initialize (NULL); > + bitmap convert_bbs = BITMAP_ALLOC (NULL); > + > + basic_block bb; > + rtx_insn *insn, *set_insn; > + rtx set; > + rtx v4sf_const0 = NULL_RTX; > + > + FOR_EACH_BB_FN (bb, cfun) > + { > + FOR_BB_INSNS (bb, insn) > + { > + if (!NONDEBUG_INSN_P (insn)) > + continue; > + > + set = single_set (insn); > + if (set) > + { > + machine_mode dest_vecmode, dest_mode; > + rtx src = SET_SRC (set); > + rtx dest, vec, zero; > + > + /* Check for conversions to SF or DF. */ > + switch (GET_CODE (src)) > + { > + case FLOAT_TRUNCATE: > + /* DF -> SF. */ > + if (GET_MODE (XEXP (src, 0)) != DFmode) > + continue; > + /* Fall through. */ > + case FLOAT_EXTEND: > + /* SF -> DF. */ > + case FLOAT: > + /* SI -> SF, SI -> DF, DI -> SF, DI -> DF. */ > + dest = SET_DEST (set); > + dest_mode = GET_MODE (dest); > + switch (dest_mode) > + { > + case E_SFmode: > + dest_vecmode = V4SFmode; > + break; > + case E_DFmode: > + dest_vecmode = V2DFmode; > + break; > + default: > + continue; > + } > + > + if (!TARGET_64BIT > + && GET_MODE (XEXP (src, 0)) == DImode) > + continue; > + > + if (!v4sf_const0) > + v4sf_const0 = gen_reg_rtx (V4SFmode); > + > + if (dest_vecmode == V4SFmode) > + zero = v4sf_const0; > + else > + zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0); > + > + /* Change source to vector mode. */ > + src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src); > + src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero, > + GEN_INT (HOST_WIDE_INT_1U)); > + /* Change destination to vector mode. */ > + vec = gen_reg_rtx (dest_vecmode); > + /* Generate a XMM vector SET. */ > + set = gen_rtx_SET (vec, src); > + set_insn = emit_insn_before (set, insn); > + df_insn_rescan (set_insn); > + > + src = gen_rtx_SUBREG (dest_mode, vec, 0); > + set = gen_rtx_SET (dest, src); > + > + /* Drop possible dead definitions. */ > + PATTERN (insn) = set; > + > + INSN_CODE (insn) = -1; > + recog_memoized (insn); > + df_insn_rescan (insn); > + bitmap_set_bit (convert_bbs, bb->index); > + break; > + > + default: > + break; > + } > + } > + } > + } > + > + if (v4sf_const0) > + { > + /* Generate a single vxorps at entry of the nearest common > + dominator for basic blocks with conversions. */ > + bb = nearest_common_dominator_for_set (CDI_DOMINATORS, > + convert_bbs); > + insn = BB_HEAD (bb); > + if (!NONDEBUG_INSN_P (insn)) > + insn = next_nonnote_nondebug_insn (insn); > + set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); > + set_insn = emit_insn_before (set, insn); > + df_insn_rescan (set_insn); > + df_process_deferred_rescans (); > + } > + > + bitmap_obstack_release (NULL); > + BITMAP_FREE (convert_bbs); > + > + timevar_pop (TV_MACH_DEP); > + return 0; > +} > + > +namespace { > + > +const pass_data pass_data_remove_partial_avx_dependency = > +{ > + RTL_PASS, /* type */ > + "rpad", /* name */ > + OPTGROUP_NONE, /* optinfo_flags */ > + TV_MACH_DEP, /* tv_id */ > + 0, /* properties_required */ > + 0, /* properties_provided */ > + 0, /* properties_destroyed */ > + 0, /* todo_flags_start */ > + TODO_df_finish, /* todo_flags_finish */ > +}; > + > +class pass_remove_partial_avx_dependency : public rtl_opt_pass > +{ > +public: > + pass_remove_partial_avx_dependency (gcc::context *ctxt) > + : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt) > + {} > + > + /* opt_pass methods: */ > + virtual bool gate (function *) > + { > + return (TARGET_AVX > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY > + && TARGET_SSE_MATH > + && optimize > + && optimize_function_for_speed_p (cfun)); > + } > + > + virtual unsigned int execute (function *) > + { > + return remove_partial_avx_dependency (); > + } > +}; // class pass_rpad > + > +} // anon namespace > + > +rtl_opt_pass * > +make_pass_remove_partial_avx_dependency (gcc::context *ctxt) > +{ > + return new pass_remove_partial_avx_dependency (ctxt); > +} > + > /* Return true if a red-zone is in use. We can't use red-zone when > there are local indirect jumps, like "indirect_jump" or "tablejump", > which jumps to another place in the function, since "call" in the > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > index 7fb2b144f47..2c7bf0c0419 100644 > --- a/gcc/config/i386/i386.md > +++ b/gcc/config/i386/i386.md > @@ -4426,7 +4426,8 @@ > [(set (match_operand:DF 0 "sse_reg_operand") > (float_extend:DF > (match_operand:SF 1 "nonimmediate_operand")))] > - "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed > + "!TARGET_AVX > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed > && optimize_function_for_speed_p (cfun) > && (!REG_P (operands[1]) > || REGNO (operands[0]) != REGNO (operands[1])) > @@ -4584,7 +4585,8 @@ > [(set (match_operand:SF 0 "sse_reg_operand") > (float_truncate:SF > (match_operand:DF 1 "nonimmediate_operand")))] > - "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed > + "!TARGET_AVX > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed > && optimize_function_for_speed_p (cfun) > && (!REG_P (operands[1]) > || REGNO (operands[0]) != REGNO (operands[1])) > @@ -5088,7 +5090,8 @@ > (define_split > [(set (match_operand:MODEF 0 "sse_reg_operand") > (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))] > - "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed > + "!TARGET_AVX > + && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed > && optimize_function_for_speed_p (cfun) > && (!EXT_REX_SSE_REG_P (operands[0]) > || TARGET_AVX512VL)" > diff --git a/gcc/testsuite/gcc.target/i386/pr87007.c b/gcc/testsuite/gcc.target/i386/pr87007.c > new file mode 100644 > index 00000000000..e6f17ad6dc0 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr87007.c > @@ -0,0 +1,15 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=skylake" } */ > + > +extern float f; > +extern double d; > +extern int i; > + > +void > +foo (void) > +{ > + d = f; > + f = i; > +} > + > +/* { dg-final { scan-assembler-times "vxorp\[ds\]\[^\n\r\]*xmm\[0-9\]" 1 } } */ > -- > 2.17.2 >
On 11/5/18 7:21 AM, Jan Hubicka wrote: >> >> Did you mean "the nearest common dominator"? > > If the nearest common dominator appears in the loop while all uses are > out of loops, this will result in suboptimal xor placement. > In this case you want to split edges out of the loop. > > In general this is what the LCM framework will do for you if the problem > is modelled siimlar way as in mode_swtiching. At entry function mode is > "no zero register needed" and all conversions need mode "zero register > needed". Mode switching should then do the correct placement decisions > (reaching minimal number of executions of xor). > > Jeff, whan is your optinion on the approach taken by the patch? > It seems like a special case of more general issue, but I do not see > very elegant way to solve it at least in the GCC 9 horisont, so if > the placement is correct we can probalby go either with new pass or > making this part of mode swithcing (which is anyway run by x86 backend) So I haven't followed this discussion at all, but did touch on this issue with some patch a month or two ago with a target patch that was trying to avoid the partial stalls. My assumption is that we're trying to find one or more places to initialize the upper half of an avx register so as to avoid partial register stall at existing sites that set the upper half. This sounds like a classic PRE/LCM style problem (of which mode switching is just another variant). A common-dominator approach is closer to a classic GCSE and is going to result is more initializations at sub-optimal points than a PRE/LCM style. The only advantage a common-dominator approach would have that I could think of would be potentially further separating the initialization from the subsequent use points which avoid store-store stalls or somesuch. I doubt this effect would be enough to overcome the inherent advantages of a PRE/LCM approach. ISTM that if we were to scan the RTL noting which instructions set the upper part of the avx register, which instructions have potential stalls and which instructions reset the upper half to an indeterminate state (calls), then we have the local properties. THen we feed that into a traditional LCM solver and we get back the optimal points. The only weirdness is that we don't want to move existing instructions that set the upper bits. Those are essentially fixed. So maybe this is actually better modeled by click's algorithm which has the concept of pinned instructions. But click's algorithm assumes SSA. Ugh. I'd probably have to sit down with it for a while -- it might be possible to handle the fixed instructions using some of the ideas from click (essentially exposing the earliest/latest results from LCM, then picking a point on the dominator path between earliest and latest). jeff
> On 11/5/18 7:21 AM, Jan Hubicka wrote: > >> > >> Did you mean "the nearest common dominator"? > > > > If the nearest common dominator appears in the loop while all uses are > > out of loops, this will result in suboptimal xor placement. > > In this case you want to split edges out of the loop. > > > > In general this is what the LCM framework will do for you if the problem > > is modelled siimlar way as in mode_swtiching. At entry function mode is > > "no zero register needed" and all conversions need mode "zero register > > needed". Mode switching should then do the correct placement decisions > > (reaching minimal number of executions of xor). > > > > Jeff, whan is your optinion on the approach taken by the patch? > > It seems like a special case of more general issue, but I do not see > > very elegant way to solve it at least in the GCC 9 horisont, so if > > the placement is correct we can probalby go either with new pass or > > making this part of mode swithcing (which is anyway run by x86 backend) > So I haven't followed this discussion at all, but did touch on this > issue with some patch a month or two ago with a target patch that was > trying to avoid the partial stalls. > > My assumption is that we're trying to find one or more places to > initialize the upper half of an avx register so as to avoid partial > register stall at existing sites that set the upper half. > > This sounds like a classic PRE/LCM style problem (of which mode > switching is just another variant). A common-dominator approach is > closer to a classic GCSE and is going to result is more initializations > at sub-optimal points than a PRE/LCM style. yes, it is usual code placement problem. It is special case because the zero register is not modified by the conversion (just we need to have zero somewhere). So basically we do not have kills to the zero except for entry block. Honza
On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote: > > > On 11/5/18 7:21 AM, Jan Hubicka wrote: > > >> > > >> Did you mean "the nearest common dominator"? > > > > > > If the nearest common dominator appears in the loop while all uses are > > > out of loops, this will result in suboptimal xor placement. > > > In this case you want to split edges out of the loop. > > > > > > In general this is what the LCM framework will do for you if the problem > > > is modelled siimlar way as in mode_swtiching. At entry function mode is > > > "no zero register needed" and all conversions need mode "zero register > > > needed". Mode switching should then do the correct placement decisions > > > (reaching minimal number of executions of xor). > > > > > > Jeff, whan is your optinion on the approach taken by the patch? > > > It seems like a special case of more general issue, but I do not see > > > very elegant way to solve it at least in the GCC 9 horisont, so if > > > the placement is correct we can probalby go either with new pass or > > > making this part of mode swithcing (which is anyway run by x86 backend) > > So I haven't followed this discussion at all, but did touch on this > > issue with some patch a month or two ago with a target patch that was > > trying to avoid the partial stalls. > > > > My assumption is that we're trying to find one or more places to > > initialize the upper half of an avx register so as to avoid partial > > register stall at existing sites that set the upper half. > > > > This sounds like a classic PRE/LCM style problem (of which mode > > switching is just another variant). A common-dominator approach is > > closer to a classic GCSE and is going to result is more initializations > > at sub-optimal points than a PRE/LCM style. > > yes, it is usual code placement problem. It is special case because the > zero register is not modified by the conversion (just we need to have > zero somewhere). So basically we do not have kills to the zero except > for entry block. > Do you have testcase to show thatf the nearest common dominator in the loop, while all uses areout of loops, leads to suboptimal xor placement?
On 11/28/18 12:48 PM, H.J. Lu wrote: > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote: >> >>> On 11/5/18 7:21 AM, Jan Hubicka wrote: >>>>> >>>>> Did you mean "the nearest common dominator"? >>>> >>>> If the nearest common dominator appears in the loop while all uses are >>>> out of loops, this will result in suboptimal xor placement. >>>> In this case you want to split edges out of the loop. >>>> >>>> In general this is what the LCM framework will do for you if the problem >>>> is modelled siimlar way as in mode_swtiching. At entry function mode is >>>> "no zero register needed" and all conversions need mode "zero register >>>> needed". Mode switching should then do the correct placement decisions >>>> (reaching minimal number of executions of xor). >>>> >>>> Jeff, whan is your optinion on the approach taken by the patch? >>>> It seems like a special case of more general issue, but I do not see >>>> very elegant way to solve it at least in the GCC 9 horisont, so if >>>> the placement is correct we can probalby go either with new pass or >>>> making this part of mode swithcing (which is anyway run by x86 backend) >>> So I haven't followed this discussion at all, but did touch on this >>> issue with some patch a month or two ago with a target patch that was >>> trying to avoid the partial stalls. >>> >>> My assumption is that we're trying to find one or more places to >>> initialize the upper half of an avx register so as to avoid partial >>> register stall at existing sites that set the upper half. >>> >>> This sounds like a classic PRE/LCM style problem (of which mode >>> switching is just another variant). A common-dominator approach is >>> closer to a classic GCSE and is going to result is more initializations >>> at sub-optimal points than a PRE/LCM style. >> >> yes, it is usual code placement problem. It is special case because the >> zero register is not modified by the conversion (just we need to have >> zero somewhere). So basically we do not have kills to the zero except >> for entry block. >> > > Do you have testcase to show thatf the nearest common dominator > in the loop, while all uses areout of loops, leads to suboptimal xor > placement? I don't have a testcase, but it's all but certain nearest common dominator is going to be a suboptimal placement. That's going to create paths where you're going to emit the xor when it's not used. The whole point of the LCM algorithms is they are optimal in terms of expression evaluations. jeff >
> On 11/28/18 12:48 PM, H.J. Lu wrote: > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote: > >> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote: > >>>>> > >>>>> Did you mean "the nearest common dominator"? > >>>> > >>>> If the nearest common dominator appears in the loop while all uses are > >>>> out of loops, this will result in suboptimal xor placement. > >>>> In this case you want to split edges out of the loop. > >>>> > >>>> In general this is what the LCM framework will do for you if the problem > >>>> is modelled siimlar way as in mode_swtiching. At entry function mode is > >>>> "no zero register needed" and all conversions need mode "zero register > >>>> needed". Mode switching should then do the correct placement decisions > >>>> (reaching minimal number of executions of xor). > >>>> > >>>> Jeff, whan is your optinion on the approach taken by the patch? > >>>> It seems like a special case of more general issue, but I do not see > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if > >>>> the placement is correct we can probalby go either with new pass or > >>>> making this part of mode swithcing (which is anyway run by x86 backend) > >>> So I haven't followed this discussion at all, but did touch on this > >>> issue with some patch a month or two ago with a target patch that was > >>> trying to avoid the partial stalls. > >>> > >>> My assumption is that we're trying to find one or more places to > >>> initialize the upper half of an avx register so as to avoid partial > >>> register stall at existing sites that set the upper half. > >>> > >>> This sounds like a classic PRE/LCM style problem (of which mode > >>> switching is just another variant). A common-dominator approach is > >>> closer to a classic GCSE and is going to result is more initializations > >>> at sub-optimal points than a PRE/LCM style. > >> > >> yes, it is usual code placement problem. It is special case because the > >> zero register is not modified by the conversion (just we need to have > >> zero somewhere). So basically we do not have kills to the zero except > >> for entry block. > >> > > > > Do you have testcase to show thatf the nearest common dominator > > in the loop, while all uses areout of loops, leads to suboptimal xor > > placement? > I don't have a testcase, but it's all but certain nearest common > dominator is going to be a suboptimal placement. That's going to create > paths where you're going to emit the xor when it's not used. > > The whole point of the LCM algorithms is they are optimal in terms of > expression evaluations. i think testcase should be something like test() { while (true) { if (cond1) { do_one_conversion; return; } if (cond2) { do_other_conversion; return; } } } Honza > > jeff > > >
On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote: > > On 11/28/18 12:48 PM, H.J. Lu wrote: > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote: > >> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote: > >>>>> > >>>>> Did you mean "the nearest common dominator"? > >>>> > >>>> If the nearest common dominator appears in the loop while all uses are > >>>> out of loops, this will result in suboptimal xor placement. > >>>> In this case you want to split edges out of the loop. > >>>> > >>>> In general this is what the LCM framework will do for you if the problem > >>>> is modelled siimlar way as in mode_swtiching. At entry function mode is > >>>> "no zero register needed" and all conversions need mode "zero register > >>>> needed". Mode switching should then do the correct placement decisions > >>>> (reaching minimal number of executions of xor). > >>>> > >>>> Jeff, whan is your optinion on the approach taken by the patch? > >>>> It seems like a special case of more general issue, but I do not see > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if > >>>> the placement is correct we can probalby go either with new pass or > >>>> making this part of mode swithcing (which is anyway run by x86 backend) > >>> So I haven't followed this discussion at all, but did touch on this > >>> issue with some patch a month or two ago with a target patch that was > >>> trying to avoid the partial stalls. > >>> > >>> My assumption is that we're trying to find one or more places to > >>> initialize the upper half of an avx register so as to avoid partial > >>> register stall at existing sites that set the upper half. > >>> > >>> This sounds like a classic PRE/LCM style problem (of which mode > >>> switching is just another variant). A common-dominator approach is > >>> closer to a classic GCSE and is going to result is more initializations > >>> at sub-optimal points than a PRE/LCM style. > >> > >> yes, it is usual code placement problem. It is special case because the > >> zero register is not modified by the conversion (just we need to have > >> zero somewhere). So basically we do not have kills to the zero except > >> for entry block. > >> > > > > Do you have testcase to show thatf the nearest common dominator > > in the loop, while all uses areout of loops, leads to suboptimal xor > > placement? > I don't have a testcase, but it's all but certain nearest common > dominator is going to be a suboptimal placement. That's going to create > paths where you're going to emit the xor when it's not used. > > The whole point of the LCM algorithms is they are optimal in terms of > expression evaluations. We tried LCM and it didn't work well for this case. LCM places a single VXOR close to the location where it is needed, which can be inside a loop. There is nothing wrong with the LCM algorithms. But this doesn't solve https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007 where VXOR is executed multiple times inside of a function, instead of just once. We are investigating to generate a single VXOR at entry of the nearest dominator for basic blocks with SF/DF conversions, which is in the the fake loop that contains the whole function: bb = nearest_common_dominator_for_set (CDI_DOMINATORS, convert_bbs); while (bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)) bb = get_immediate_dominator (CDI_DOMINATORS, bb->loop_father->header); insn = BB_HEAD (bb); if (!NONDEBUG_INSN_P (insn)) insn = next_nonnote_nondebug_insn (insn); set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); set_insn = emit_insn_before (set, insn);
On Wed, Nov 28, 2018 at 12:21 PM Jan Hubicka <hubicka@ucw.cz> wrote: > > > On 11/28/18 12:48 PM, H.J. Lu wrote: > > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote: > > >> > > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote: > > >>>>> > > >>>>> Did you mean "the nearest common dominator"? > > >>>> > > >>>> If the nearest common dominator appears in the loop while all uses are > > >>>> out of loops, this will result in suboptimal xor placement. > > >>>> In this case you want to split edges out of the loop. > > >>>> > > >>>> In general this is what the LCM framework will do for you if the problem > > >>>> is modelled siimlar way as in mode_swtiching. At entry function mode is > > >>>> "no zero register needed" and all conversions need mode "zero register > > >>>> needed". Mode switching should then do the correct placement decisions > > >>>> (reaching minimal number of executions of xor). > > >>>> > > >>>> Jeff, whan is your optinion on the approach taken by the patch? > > >>>> It seems like a special case of more general issue, but I do not see > > >>>> very elegant way to solve it at least in the GCC 9 horisont, so if > > >>>> the placement is correct we can probalby go either with new pass or > > >>>> making this part of mode swithcing (which is anyway run by x86 backend) > > >>> So I haven't followed this discussion at all, but did touch on this > > >>> issue with some patch a month or two ago with a target patch that was > > >>> trying to avoid the partial stalls. > > >>> > > >>> My assumption is that we're trying to find one or more places to > > >>> initialize the upper half of an avx register so as to avoid partial > > >>> register stall at existing sites that set the upper half. > > >>> > > >>> This sounds like a classic PRE/LCM style problem (of which mode > > >>> switching is just another variant). A common-dominator approach is > > >>> closer to a classic GCSE and is going to result is more initializations > > >>> at sub-optimal points than a PRE/LCM style. > > >> > > >> yes, it is usual code placement problem. It is special case because the > > >> zero register is not modified by the conversion (just we need to have > > >> zero somewhere). So basically we do not have kills to the zero except > > >> for entry block. > > >> > > > > > > Do you have testcase to show thatf the nearest common dominator > > > in the loop, while all uses areout of loops, leads to suboptimal xor > > > placement? > > I don't have a testcase, but it's all but certain nearest common > > dominator is going to be a suboptimal placement. That's going to create > > paths where you're going to emit the xor when it's not used. > > > > The whole point of the LCM algorithms is they are optimal in terms of > > expression evaluations. > > i think testcase should be something like > > test() > { > while (true) > { > if (cond1) > { > do_one_conversion; > return; > } > if (cond2) > { > do_other_conversion; > return; > } > } > } We got [hjl@gnu-cfl-1 pr87007]$ cat test2.i extern float f1[],f2[]; extern int i1[],i2[]; float foo (int k, int n[]) { if (k == 1) return 1; if (k == 4) return 5; for(int i = 0; i != k; i++){ if(n[i] > 100) f1[i] = i1[i]; else f2[i] = i2[i]; } return k; } [hjl@gnu-cfl-1 pr87007]$ make test2.s /export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/ -O2 -mavx -S test2.i [hjl@gnu-cfl-1 pr87007]$ cat test2.s .file "test2.i" .text .p2align 4 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc vmovss .LC0(%rip), %xmm0 cmpl $1, %edi je .L15 vmovss .LC1(%rip), %xmm0 cmpl $4, %edi je .L15 vxorps %xmm0, %xmm0, %xmm0 testl %edi, %edi je .L3 leal -1(%rdi), %ecx xorl %eax, %eax jmp .L6 .p2align 4,,10 .p2align 3 .L17: vcvtsi2ss i1(,%rax,4), %xmm0, %xmm1 leaq 1(%rax), %rdx vmovss %xmm1, f1(,%rax,4) cmpq %rcx, %rax je .L3 .L9: movq %rdx, %rax .L6: cmpl $100, (%rsi,%rax,4) jg .L17 vcvtsi2ss i2(,%rax,4), %xmm0, %xmm1 leaq 1(%rax), %rdx vmovss %xmm1, f2(,%rax,4) cmpq %rcx, %rax jne .L9 .L3: vcvtsi2ss %edi, %xmm0, %xmm0 .L15: ret .cfi_endproc .LFE0: .size foo, .-foo .section .rodata.cst4,"aM",@progbits,4 .align 4 .LC0: .long 1065353216 .align 4 .LC1: .long 1084227584 .ident "GCC: (GNU) 9.0.0 20181230 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-cfl-1 pr87007]$ The placement is optimal.
On 12/30/18 9:50 AM, H.J. Lu wrote: > On Wed, Nov 28, 2018 at 12:21 PM Jan Hubicka <hubicka@ucw.cz> wrote: >> >>> On 11/28/18 12:48 PM, H.J. Lu wrote: >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote: >>>>> >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote: >>>>>>>> >>>>>>>> Did you mean "the nearest common dominator"? >>>>>>> >>>>>>> If the nearest common dominator appears in the loop while all uses are >>>>>>> out of loops, this will result in suboptimal xor placement. >>>>>>> In this case you want to split edges out of the loop. >>>>>>> >>>>>>> In general this is what the LCM framework will do for you if the problem >>>>>>> is modelled siimlar way as in mode_swtiching. At entry function mode is >>>>>>> "no zero register needed" and all conversions need mode "zero register >>>>>>> needed". Mode switching should then do the correct placement decisions >>>>>>> (reaching minimal number of executions of xor). >>>>>>> >>>>>>> Jeff, whan is your optinion on the approach taken by the patch? >>>>>>> It seems like a special case of more general issue, but I do not see >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if >>>>>>> the placement is correct we can probalby go either with new pass or >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend) >>>>>> So I haven't followed this discussion at all, but did touch on this >>>>>> issue with some patch a month or two ago with a target patch that was >>>>>> trying to avoid the partial stalls. >>>>>> >>>>>> My assumption is that we're trying to find one or more places to >>>>>> initialize the upper half of an avx register so as to avoid partial >>>>>> register stall at existing sites that set the upper half. >>>>>> >>>>>> This sounds like a classic PRE/LCM style problem (of which mode >>>>>> switching is just another variant). A common-dominator approach is >>>>>> closer to a classic GCSE and is going to result is more initializations >>>>>> at sub-optimal points than a PRE/LCM style. >>>>> >>>>> yes, it is usual code placement problem. It is special case because the >>>>> zero register is not modified by the conversion (just we need to have >>>>> zero somewhere). So basically we do not have kills to the zero except >>>>> for entry block. >>>>> >>>> >>>> Do you have testcase to show thatf the nearest common dominator >>>> in the loop, while all uses areout of loops, leads to suboptimal xor >>>> placement? >>> I don't have a testcase, but it's all but certain nearest common >>> dominator is going to be a suboptimal placement. That's going to create >>> paths where you're going to emit the xor when it's not used. >>> >>> The whole point of the LCM algorithms is they are optimal in terms of >>> expression evaluations. >> >> i think testcase should be something like >> >> test() >> { >> while (true) >> { >> if (cond1) >> { >> do_one_conversion; >> return; >> } >> if (cond2) >> { >> do_other_conversion; >> return; >> } >> } >> } > > We got > > [hjl@gnu-cfl-1 pr87007]$ cat test2.i > extern float f1[],f2[]; > extern int i1[],i2[]; > float > foo (int k, int n[]) > { > if (k == 1) > return 1; > > if (k == 4) > return 5; > > for(int i = 0; i != k; i++){ > if(n[i] > 100) > f1[i] = i1[i]; > else > f2[i] = i2[i]; > } > > return k; Note the if-if construct within the loop of the pseudo code will generate a notably different cfg than the if-else in your code (which creates a simple diamond). The pseudo code also exits in those cases rather than continuing the loop. The differences in the CFGs you wind up with are critically important. For the first chunk of pseudocode the computationally optimal placement point is just before the conversions. THere's precisely one execution of the xor at runtime no matter how many loop iterations occur. For your example code the optimal placement point is just outside the loop. Essentially both arms of the conditional need the xor. So possible insertion points are just before the loop, just inside the loop or in both arms. Obviously the optimal point is just before the loop. JEff
From e2a437f48778ae9586f2038220840ecc41566f69 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" <hjl.tools@gmail.com> Date: Wed, 15 Aug 2018 09:58:31 -0700 Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency With -mavx, for [hjl@gnu-cfl-1 skx-2]$ cat foo.i extern float f; extern double d; extern int i; void foo (void) { d = f; f = i; } we need to generate vxorp[ds] %xmmN, %xmmN, %xmmN ... vcvtss2sd f(%rip), %xmmN, %xmmX ... vcvtsi2ss i(%rip), %xmmN, %xmmY to avoid partial XMM register stall. This patch adds a pass to generate a single vxorps %xmmN, %xmmN, %xmmN at entry of the nearest common dominator for basic blocks with SF/DF conversions, instead of generating one vxorp[ds] %xmmN, %xmmN, %xmmN for each SF/DF conversion. Performance impacts on SPEC CPU 2017 rate with 1 copy using -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops are 1. On Broadwell server: 500.perlbench_r (-0.82%) 502.gcc_r (0.73%) 505.mcf_r (-0.24%) 520.omnetpp_r (-2.22%) 523.xalancbmk_r (-1.47%) 525.x264_r (0.31%) 531.deepsjeng_r (0.27%) 541.leela_r (0.85%) 548.exchange2_r (-0.11%) 557.xz_r (-0.34%) Geomean: (-0.23%) 503.bwaves_r (0.00%) 507.cactuBSSN_r (-1.88%) 508.namd_r (0.00%) 510.parest_r (-0.56%) 511.povray_r (0.49%) 519.lbm_r (-1.28%) 521.wrf_r (-0.28%) 526.blender_r (0.55%) 527.cam4_r (-0.20%) 538.imagick_r (2.52%) 544.nab_r (-0.18%) 549.fotonik3d_r (-0.51%) 554.roms_r (-0.22%) Geomean: (0.00%) 2. On Skylake client: 500.perlbench_r (-0.29%) 502.gcc_r (-0.36%) 505.mcf_r (1.77%) 520.omnetpp_r (-0.26%) 523.xalancbmk_r (-3.69%) 525.x264_r (-0.32%) 531.deepsjeng_r (0.00%) 541.leela_r (-0.46%) 548.exchange2_r (0.00%) 557.xz_r (0.00%) Geomean: (-0.34%) 503.bwaves_r (0.00%) 507.cactuBSSN_r (-0.56%) 508.namd_r (0.87%) 510.parest_r (0.00%) 511.povray_r (-0.73%) 519.lbm_r (0.84%) 521.wrf_r (0.00%) 526.blender_r (-0.81%) 527.cam4_r (-0.43%) 538.imagick_r (2.55%) 544.nab_r (0.28%) 549.fotonik3d_r (0.00%) 554.roms_r (0.32%) Geomean: (0.12%) 3. On Skylake server: 500.perlbench_r (-0.55%) 502.gcc_r (0.69%) 505.mcf_r (0.00%) 520.omnetpp_r (-0.33%) 523.xalancbmk_r (-0.21%) 525.x264_r (-0.27%) 531.deepsjeng_r (0.00%) 541.leela_r (0.00%) 548.exchange2_r (-0.11%) 557.xz_r (0.00%) Geomean: (0.00%) 503.bwaves_r (0.58%) 507.cactuBSSN_r (0.00%) 508.namd_r (0.00%) 510.parest_r (0.18%) 511.povray_r (-0.58%) 519.lbm_r (0.25%) 521.wrf_r (0.40%) 526.blender_r (0.34%) 527.cam4_r (0.19%) 538.imagick_r (5.87%) 544.nab_r (0.17%) 549.fotonik3d_r (0.00%) 554.roms_r (0.00%) Geomean: (0.62%) On Skylake client, impacts on 538.imagick_r are size before: text data bss dec hex filename 2555577 10876 5576 2572029 273efd imagick_r.exe size after: text data bss dec hex filename 2511825 10876 5576 2528277 269415 imagick_r.exe number of vxorp[ds]: before after difference 14570 4515 -69% gcc/ 2018-08-28 H.J. Lu <hongjiu.lu@intel.com> Sunil K Pandey <sunil.k.pandey@intel.com> PR target/87007 * config/i386/i386-passes.def: Add pass_remove_partial_avx_dependency. * config/i386/i386-protos.h (make_pass_remove_partial_avx_dependency): New. * config/i386/i386.c (make_pass_remove_partial_avx_dependency): New function. (pass_data_remove_partial_avx_dependency): New. (pass_remove_partial_avx_dependency): Likewise. (make_pass_remove_partial_avx_dependency): Likewise. * config/i386/i386.md (SF/DF conversion splitters): Disabled for TARGET_AVX. gcc/testsuite/ 2018-08-28 H.J. Lu <hongjiu.lu@intel.com> Sunil K Pandey <sunil.k.pandey@intel.com> PR target/87007 * gcc.target/i386/pr87007.c: New file. --- gcc/config/i386/i386-passes.def | 2 + gcc/config/i386/i386-protos.h | 2 + gcc/config/i386/i386.c | 179 ++++++++++++++++++++++++ gcc/config/i386/i386.md | 9 +- gcc/testsuite/gcc.target/i386/pr87007.c | 15 ++ 5 files changed, 204 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr87007.c diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def index 4a6a95cc2d9..b439f3a9028 100644 --- a/gcc/config/i386/i386-passes.def +++ b/gcc/config/i386/i386-passes.def @@ -31,3 +31,5 @@ along with GCC; see the file COPYING3. If not see INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */); INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch); + + INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency); diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index d1d59633dc0..1626c9d0d70 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -358,3 +358,5 @@ class rtl_opt_pass; extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *); extern rtl_opt_pass *make_pass_stv (gcc::context *); extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *); +extern rtl_opt_pass *make_pass_remove_partial_avx_dependency + (gcc::context *); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index e47603eb5ff..4f0270ebfff 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2741,6 +2741,185 @@ make_pass_insert_endbranch (gcc::context *ctxt) return new pass_insert_endbranch (ctxt); } +/* At entry of the nearest common dominator for basic blocks with + conversions, generate a single + vxorps %xmmN, %xmmN, %xmmN + for all + vcvtss2sd op, %xmmN, %xmmX + vcvtsd2ss op, %xmmN, %xmmX + vcvtsi2ss op, %xmmN, %xmmX + vcvtsi2sd op, %xmmN, %xmmX + */ + +static unsigned int +remove_partial_avx_dependency (void) +{ + timevar_push (TV_MACH_DEP); + + calculate_dominance_info (CDI_DOMINATORS); + df_set_flags (DF_DEFER_INSN_RESCAN); + df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN); + df_md_add_problem (); + df_analyze (); + + bitmap_obstack_initialize (NULL); + bitmap convert_bbs = BITMAP_ALLOC (NULL); + + basic_block bb; + rtx_insn *insn, *set_insn; + rtx set; + rtx v4sf_const0 = NULL_RTX; + + FOR_EACH_BB_FN (bb, cfun) + { + FOR_BB_INSNS (bb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + set = single_set (insn); + if (set) + { + machine_mode dest_vecmode, dest_mode; + rtx src = SET_SRC (set); + rtx dest, vec, zero; + + /* Check for conversions to SF or DF. */ + switch (GET_CODE (src)) + { + case FLOAT_TRUNCATE: + /* DF -> SF. */ + if (GET_MODE (XEXP (src, 0)) != DFmode) + continue; + /* Fall through. */ + case FLOAT_EXTEND: + /* SF -> DF. */ + case FLOAT: + /* SI -> SF, SI -> DF, DI -> SF, DI -> DF. */ + dest = SET_DEST (set); + dest_mode = GET_MODE (dest); + switch (dest_mode) + { + case E_SFmode: + dest_vecmode = V4SFmode; + break; + case E_DFmode: + dest_vecmode = V2DFmode; + break; + default: + continue; + } + + if (!TARGET_64BIT + && GET_MODE (XEXP (src, 0)) == DImode) + continue; + + if (!v4sf_const0) + v4sf_const0 = gen_reg_rtx (V4SFmode); + + if (dest_vecmode == V4SFmode) + zero = v4sf_const0; + else + zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0); + + /* Change source to vector mode. */ + src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src); + src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero, + GEN_INT (HOST_WIDE_INT_1U)); + /* Change destination to vector mode. */ + vec = gen_reg_rtx (dest_vecmode); + /* Generate a XMM vector SET. */ + set = gen_rtx_SET (vec, src); + set_insn = emit_insn_before (set, insn); + df_insn_rescan (set_insn); + + src = gen_rtx_SUBREG (dest_mode, vec, 0); + set = gen_rtx_SET (dest, src); + + /* Drop possible dead definitions. */ + PATTERN (insn) = set; + + INSN_CODE (insn) = -1; + recog_memoized (insn); + df_insn_rescan (insn); + bitmap_set_bit (convert_bbs, bb->index); + break; + + default: + break; + } + } + } + } + + if (v4sf_const0) + { + /* Generate a single vxorps at entry of the nearest common + dominator for basic blocks with conversions. */ + bb = nearest_common_dominator_for_set (CDI_DOMINATORS, + convert_bbs); + insn = BB_HEAD (bb); + if (!NONDEBUG_INSN_P (insn)) + insn = next_nonnote_nondebug_insn (insn); + set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); + set_insn = emit_insn_before (set, insn); + df_insn_rescan (set_insn); + df_process_deferred_rescans (); + } + + bitmap_obstack_release (NULL); + BITMAP_FREE (convert_bbs); + + timevar_pop (TV_MACH_DEP); + return 0; +} + +namespace { + +const pass_data pass_data_remove_partial_avx_dependency = +{ + RTL_PASS, /* type */ + "rpad", /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_MACH_DEP, /* tv_id */ + 0, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + TODO_df_finish, /* todo_flags_finish */ +}; + +class pass_remove_partial_avx_dependency : public rtl_opt_pass +{ +public: + pass_remove_partial_avx_dependency (gcc::context *ctxt) + : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt) + {} + + /* opt_pass methods: */ + virtual bool gate (function *) + { + return (TARGET_AVX + && TARGET_SSE_PARTIAL_REG_DEPENDENCY + && TARGET_SSE_MATH + && optimize + && optimize_function_for_speed_p (cfun)); + } + + virtual unsigned int execute (function *) + { + return remove_partial_avx_dependency (); + } +}; // class pass_rpad + +} // anon namespace + +rtl_opt_pass * +make_pass_remove_partial_avx_dependency (gcc::context *ctxt) +{ + return new pass_remove_partial_avx_dependency (ctxt); +} + /* Return true if a red-zone is in use. We can't use red-zone when there are local indirect jumps, like "indirect_jump" or "tablejump", which jumps to another place in the function, since "call" in the diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 7fb2b144f47..2c7bf0c0419 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4426,7 +4426,8 @@ [(set (match_operand:DF 0 "sse_reg_operand") (float_extend:DF (match_operand:SF 1 "nonimmediate_operand")))] - "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + "!TARGET_AVX + && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed && optimize_function_for_speed_p (cfun) && (!REG_P (operands[1]) || REGNO (operands[0]) != REGNO (operands[1])) @@ -4584,7 +4585,8 @@ [(set (match_operand:SF 0 "sse_reg_operand") (float_truncate:SF (match_operand:DF 1 "nonimmediate_operand")))] - "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + "!TARGET_AVX + && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed && optimize_function_for_speed_p (cfun) && (!REG_P (operands[1]) || REGNO (operands[0]) != REGNO (operands[1])) @@ -5088,7 +5090,8 @@ (define_split [(set (match_operand:MODEF 0 "sse_reg_operand") (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))] - "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed + "!TARGET_AVX + && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed && optimize_function_for_speed_p (cfun) && (!EXT_REX_SSE_REG_P (operands[0]) || TARGET_AVX512VL)" diff --git a/gcc/testsuite/gcc.target/i386/pr87007.c b/gcc/testsuite/gcc.target/i386/pr87007.c new file mode 100644 index 00000000000..e6f17ad6dc0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr87007.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=skylake" } */ + +extern float f; +extern double d; +extern int i; + +void +foo (void) +{ + d = f; + f = i; +} + +/* { dg-final { scan-assembler-times "vxorp\[ds\]\[^\n\r\]*xmm\[0-9\]" 1 } } */ -- 2.17.2