Message ID | 20130927085640.GD21484@kam.mff.cuni.cz |
---|---|
State | New |
Headers | show |
On Fri, Sep 27, 2013 at 1:56 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > Hi, > this is second part of the generic tuning changes sanityzing the tuning flags. > This patch again is supposed to deal with the "obvious" part only. > I will send separate patch for more changes. > > The flags changed agree on all CPUs considered for generic (and their > optimization manuals) + amdfam10, core2 and Atom SLM. > > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since it > seems like obvious omision (after double checking in optimization manual) and > droped X86_TUNE_FOUR_JUMP_LIMIT for buldozer cores. Implementation of this > feature was always bit weird and its main purpose was to avoid terrible branch > predictor degeneration on the older AMD branch predictors. I benchmarked both > spec2k and 2k6 to verify there are no regression. > > Especially X86_TUNE_REASSOC_FP_TO_PARALLEL seems to bring nice improvements in specfp > benchmarks. > > Bootstrapped/regtested x86_64-linux, will wait for comments and commit it > during weekend. I will be happy to revisit any of the generic tuning if > regressions pop up. > > Overall this patch also brings small code size improvements for smaller > loads/stores and less padding at -O2. Differences are sub 0.1% however. > > Honza > * x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Enable for generic. > (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise. > (X86_TUNE_FOUR_JUMP_LIMIT): Drop for generic and buldozer. > (X86_TUNE_PAD_RETURNS): Drop for newer AMD chips. Can we drop generic on X86_TUNE_PAD_RETURNS? > (X86_TUNE_AVOID_VECTOR_DECODE): Drop for generic. > (X86_TUNE_REASSOC_FP_TO_PARALLEL): Enable for generic.
> On Fri, Sep 27, 2013 at 1:56 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > > Hi, > > this is second part of the generic tuning changes sanityzing the tuning flags. > > This patch again is supposed to deal with the "obvious" part only. > > I will send separate patch for more changes. > > > > The flags changed agree on all CPUs considered for generic (and their > > optimization manuals) + amdfam10, core2 and Atom SLM. > > > > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since it > > seems like obvious omision (after double checking in optimization manual) and > > droped X86_TUNE_FOUR_JUMP_LIMIT for buldozer cores. Implementation of this > > feature was always bit weird and its main purpose was to avoid terrible branch > > predictor degeneration on the older AMD branch predictors. I benchmarked both > > spec2k and 2k6 to verify there are no regression. > > > > Especially X86_TUNE_REASSOC_FP_TO_PARALLEL seems to bring nice improvements in specfp > > benchmarks. > > > > Bootstrapped/regtested x86_64-linux, will wait for comments and commit it > > during weekend. I will be happy to revisit any of the generic tuning if > > regressions pop up. > > > > Overall this patch also brings small code size improvements for smaller > > loads/stores and less padding at -O2. Differences are sub 0.1% however. > > > > Honza > > * x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Enable for generic. > > (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise. > > (X86_TUNE_FOUR_JUMP_LIMIT): Drop for generic and buldozer. > > (X86_TUNE_PAD_RETURNS): Drop for newer AMD chips. > > Can we drop generic on X86_TUNE_PAD_RETURNS? It is on my list for not-so-obvious changes. I tested and removed it from BDVER with intention to drop it from generic. But after furhter testing I lean towards keeping it for some extra time. I tested it on fam10 machines and it causes over 10% regressions on some benchmarks, including bzip and botan (where it is up to 4-fold regression). Missing a return on amdfam10 hardware is bad, because it causes return stack to go out of sync. At the same time I can not really measure benefits for disabling it - the code size cost is very small and runtime cost on non-amdfam10 cores is not important, too, since the function call overhead hide the extra nop quite easily. So I would incline to be apply extra care on this flag and keep it for extra release or two. Most of gcc.opensuse.org testing runs on these and adding random branch mispredictions will trash them. At the related note, would would you think of X86_TUNE_PARTIAL_FLAG_REG_STALL? I benchmarked it on my I5 notebook and it seems to have no measurable effects on spec2k6. I also did some benchmarking of the patch to disable alignments you proposed. Unforutnately I can measure slowdowns on fam10/bdver/and on botan/hand written loops even for core. I am considering to drop the branch target/function alignment and keep only loop alignment, but I did not test this yet. Honza
On Fri, Sep 27, 2013 at 8:36 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Fri, Sep 27, 2013 at 1:56 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> > Hi, >> > this is second part of the generic tuning changes sanityzing the tuning flags. >> > This patch again is supposed to deal with the "obvious" part only. >> > I will send separate patch for more changes. >> > >> > The flags changed agree on all CPUs considered for generic (and their >> > optimization manuals) + amdfam10, core2 and Atom SLM. >> > >> > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since it >> > seems like obvious omision (after double checking in optimization manual) and >> > droped X86_TUNE_FOUR_JUMP_LIMIT for buldozer cores. Implementation of this >> > feature was always bit weird and its main purpose was to avoid terrible branch >> > predictor degeneration on the older AMD branch predictors. I benchmarked both >> > spec2k and 2k6 to verify there are no regression. >> > >> > Especially X86_TUNE_REASSOC_FP_TO_PARALLEL seems to bring nice improvements in specfp >> > benchmarks. >> > >> > Bootstrapped/regtested x86_64-linux, will wait for comments and commit it >> > during weekend. I will be happy to revisit any of the generic tuning if >> > regressions pop up. >> > >> > Overall this patch also brings small code size improvements for smaller >> > loads/stores and less padding at -O2. Differences are sub 0.1% however. >> > >> > Honza >> > * x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Enable for generic. >> > (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise. >> > (X86_TUNE_FOUR_JUMP_LIMIT): Drop for generic and buldozer. >> > (X86_TUNE_PAD_RETURNS): Drop for newer AMD chips. >> >> Can we drop generic on X86_TUNE_PAD_RETURNS? > It is on my list for not-so-obvious changes. I tested and removed it from > BDVER with intention to drop it from generic. But after furhter testing I lean > towards keeping it for some extra time. > > I tested it on fam10 machines and it causes over 10% regressions on some > benchmarks, including bzip and botan (where it is up to 4-fold regression). > Missing a return on amdfam10 hardware is bad, because it causes return stack to > go out of sync. At the same time I can not really measure benefits for > disabling it - the code size cost is very small and runtime cost on > non-amdfam10 cores is not important, too, since the function call overhead hide > the extra nop quite easily. I see. > So I would incline to be apply extra care on this flag and keep it for extra > release or two. Most of gcc.opensuse.org testing runs on these and adding > random branch mispredictions will trash them. > > At the related note, would would you think of X86_TUNE_PARTIAL_FLAG_REG_STALL? > I benchmarked it on my I5 notebook and it seems to have no measurable effects > on spec2k6. > > I also did some benchmarking of the patch to disable alignments you proposed. > Unforutnately I can measure slowdowns on fam10/bdver/and on botan/hand written > loops even for core. I am not surprised about hand written loops. Have you tried SPEC CPU rate? > I am considering to drop the branch target/function alignment and keep only loop > alignment, but I did not test this yet. > > Honza
Jan Hubicka <hubicka@ucw.cz> writes: > > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since it > seems like obvious omision (after double checking in optimization manual) and > droped X86_TUNE_FOUR_JUMP_LIMIT for buldozer cores. When tuning for Intel SandyBridge+ it would be actually interesting to use a 32 byte window instead of 16 bytes. The decoded icache has a 3 jump limit per 32byte. So if K8 support is dropped from generic could just change it over to 32 bytes there? -Andi
On Fri, Sep 27, 2013 at 8:46 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >> So I would incline to be apply extra care on this flag and keep it for extra >> release or two. Most of gcc.opensuse.org testing runs on these and adding >> random branch mispredictions will trash them. >> >> At the related note, would would you think of X86_TUNE_PARTIAL_FLAG_REG_STALL? >> I benchmarked it on my I5 notebook and it seems to have no measurable effects >> on spec2k6. >> >> I also did some benchmarking of the patch to disable alignments you proposed. >> Unforutnately I can measure slowdowns on fam10/bdver/and on botan/hand written >> loops even for core. > > I am not surprised about hand written loops. Have you > tried SPEC CPU rate? > >> I am considering to drop the branch target/function alignment and keep only loop >> alignment, but I did not test this yet. >> This sounds a good idea. I will give it a try on Intel processors.
Index: config/i386/x86-tune.def =================================================================== --- config/i386/x86-tune.def (revision 202966) +++ config/i386/x86-tune.def (working copy) @@ -115,9 +115,9 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPEN m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMDFAM10 | m_BDVER | m_GENERIC) DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL, "sse_unaligned_load_optimal", - m_COREI7 | m_AMDFAM10 | m_BDVER | m_BTVER | m_SLM) + m_COREI7 | m_AMDFAM10 | m_BDVER | m_BTVER | m_SLM | m_GENERIC) DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal", - m_COREI7 | m_BDVER | m_SLM) + m_COREI7 | m_BDVER | m_BTVER | m_SLM | m_GENERIC) DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal", m_BDVER) /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies @@ -146,8 +146,7 @@ DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSION /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more than 4 branch instructions in the 16 byte window. */ DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit", - m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE - | m_GENERIC) + m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_ATHLON_K8 | m_AMDFAM10) DEF_TUNE (X86_TUNE_SCHEDULE, "schedule", m_PENT | m_PPRO | m_CORE_ALL | m_ATOM | m_SLM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC) @@ -156,13 +155,13 @@ DEF_TUNE (X86_TUNE_USE_BT, "use_bt", DEF_TUNE (X86_TUNE_USE_INCDEC, "use_incdec", ~(m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_GENERIC)) DEF_TUNE (X86_TUNE_PAD_RETURNS, "pad_returns", - m_AMD_MULTIPLE | m_GENERIC) + m_ATHLON_K8 | m_AMDFAM10 | | m_GENERIC) DEF_TUNE (X86_TUNE_PAD_SHORT_FUNCTION, "pad_short_function", m_ATOM) DEF_TUNE (X86_TUNE_EXT_80387_CONSTANTS, "ext_80387_constants", m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC) DEF_TUNE (X86_TUNE_AVOID_VECTOR_DECODE, "avoid_vector_decode", - m_K8 | m_GENERIC) + m_K8) /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode and SImode multiply, but 386 and 486 do HImode multiply faster. */ DEF_TUNE (X86_TUNE_PROMOTE_HIMODE_IMUL, "promote_himode_imul", @@ -217,7 +216,7 @@ DEF_TUNE (X86_TUNE_REASSOC_INT_TO_PARALL /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations during reassociation of fp computation. */ DEF_TUNE (X86_TUNE_REASSOC_FP_TO_PARALLEL, "reassoc_fp_to_parallel", - m_ATOM | m_SLM | m_HASWELL | m_BDVER1 | m_BDVER2) + m_ATOM | m_SLM | m_HASWELL | m_BDVER1 | m_BDVER2 | m_GENERIC) /* X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE regs instead of memory. */ DEF_TUNE (X86_TUNE_GENERAL_REGS_SSE_SPILL, "general_regs_sse_spill",