Message ID | 20180109102606.GA59414@kam.mff.cuni.cz |
---|---|
State | New |
Headers | show |
Series | Disable autogeneration of gather instructions on Ryzen and generic | expand |
On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > Hi, > gather instructions are rather hard to implement in hardware and except for > skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the > degree I did not find real world loop where gather would help on Zen. > This patch simply adds a knob to disable its autogeneration (builtin still > works). I have considered two alternatives > 1) tune this with x86-tune-costs because gather is still profitable than > scalar code if we do very expensive operations (such as sequence of divides) > on the values gathered/scattered > 2) implement expansion of gathers into primitive instructions. This is faster > as semantics of gather is bit weird and not fully needed for our vectorizer > > I did not have luck to get any good results out of 1 alone as cost model is not > very realistic. 1+2 probably makes sense but we can do this incrementally as > that would make most sense to be implemented generically in vectorizer on the > top of this change. Note that the vectorizer gives up on loops with gathers with no target support for gathers. It could simply open-code the gather though (and properly cost that open-coded variant), that's probably the way to go here. Richard. > Given that gather is problematic even on skylake+ as shown in the PR (which has > most optimized implementation of it) it is good to have a knob to control its > codegen at first place. > > I have also disabled gathers for generic. This is because its use causes > some two-digit regression on zen for spec2k17 while there are no measurable > benefits on Intel. Note that this affects only > -march=<somethig supporting gather> -mtune=generic > as by default we do not use AVX2. > > Bootstrapped/regtested x86_64-linux, plan to commit it later today if there > are no complains. > > Honza > > PR target/81616 > * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER. > * i386.h (TARGET_USE_GATHER): Define. > * x86-tune.def (X86_TUNE_USE_GATHER): New. > Index: config/i386/i386.c > =================================================================== > --- config/i386/i386.c (revision 256369) > +++ config/i386/i386.c (working copy) > @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre > bool si; > enum ix86_builtins code; > > - if (! TARGET_AVX2) > + if (! TARGET_AVX2 || !TARGET_USE_GATHER) > return NULL_TREE; > > if ((TREE_CODE (index_type) != INTEGER_TYPE > Index: config/i386/i386.h > =================================================================== > --- config/i386/i386.h (revision 256369) > +++ config/i386/i386.h (working copy) > @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[ > ix86_tune_features[X86_TUNE_SLOW_PSHUFB] > #define TARGET_AVOID_4BYTE_PREFIXES \ > ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES] > +#define TARGET_USE_GATHER \ > + ix86_tune_features[X86_TUNE_USE_GATHER] > #define TARGET_FUSE_CMP_AND_BRANCH_32 \ > ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32] > #define TARGET_FUSE_CMP_AND_BRANCH_64 \ > Index: config/i386/x86-tune.def > =================================================================== > --- config/i386/x86-tune.def (revision 256369) > +++ config/i386/x86-tune.def (working copy) > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps > DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", > m_SILVERMONT | m_INTEL) > > +/* X86_TUNE_USE_GATHER: Use gather instructions. */ > +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather", > + ~(m_ZNVER1 | m_GENERIC)) > + > /*****************************************************************************/ > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ > /*****************************************************************************/
> On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > > Hi, > > gather instructions are rather hard to implement in hardware and except for > > skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the > > degree I did not find real world loop where gather would help on Zen. > > This patch simply adds a knob to disable its autogeneration (builtin still > > works). I have considered two alternatives > > 1) tune this with x86-tune-costs because gather is still profitable than > > scalar code if we do very expensive operations (such as sequence of divides) > > on the values gathered/scattered > > 2) implement expansion of gathers into primitive instructions. This is faster > > as semantics of gather is bit weird and not fully needed for our vectorizer > > > > I did not have luck to get any good results out of 1 alone as cost model is not > > very realistic. 1+2 probably makes sense but we can do this incrementally as > > that would make most sense to be implemented generically in vectorizer on the > > top of this change. > > Note that the vectorizer gives up on loops with gathers with no target > support for > gathers. It could simply open-code the gather though (and properly cost that > open-coded variant), that's probably the way to go here. Agreed, that is whay i meant by 2. Generic code has all info to implement gathers/scatters by hand when they are not supported and estimate cost of the implementation. This is what I meant by the paragraph above. I tried to look into it about month ago but got bit lost in vectorizer internals :( Honza > > Richard. > > > Given that gather is problematic even on skylake+ as shown in the PR (which has > > most optimized implementation of it) it is good to have a knob to control its > > codegen at first place. > > > > I have also disabled gathers for generic. This is because its use causes > > some two-digit regression on zen for spec2k17 while there are no measurable > > benefits on Intel. Note that this affects only > > -march=<somethig supporting gather> -mtune=generic > > as by default we do not use AVX2. > > > > Bootstrapped/regtested x86_64-linux, plan to commit it later today if there > > are no complains. > > > > Honza > > > > PR target/81616 > > * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER. > > * i386.h (TARGET_USE_GATHER): Define. > > * x86-tune.def (X86_TUNE_USE_GATHER): New. > > Index: config/i386/i386.c > > =================================================================== > > --- config/i386/i386.c (revision 256369) > > +++ config/i386/i386.c (working copy) > > @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre > > bool si; > > enum ix86_builtins code; > > > > - if (! TARGET_AVX2) > > + if (! TARGET_AVX2 || !TARGET_USE_GATHER) > > return NULL_TREE; > > > > if ((TREE_CODE (index_type) != INTEGER_TYPE > > Index: config/i386/i386.h > > =================================================================== > > --- config/i386/i386.h (revision 256369) > > +++ config/i386/i386.h (working copy) > > @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[ > > ix86_tune_features[X86_TUNE_SLOW_PSHUFB] > > #define TARGET_AVOID_4BYTE_PREFIXES \ > > ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES] > > +#define TARGET_USE_GATHER \ > > + ix86_tune_features[X86_TUNE_USE_GATHER] > > #define TARGET_FUSE_CMP_AND_BRANCH_32 \ > > ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32] > > #define TARGET_FUSE_CMP_AND_BRANCH_64 \ > > Index: config/i386/x86-tune.def > > =================================================================== > > --- config/i386/x86-tune.def (revision 256369) > > +++ config/i386/x86-tune.def (working copy) > > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps > > DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", > > m_SILVERMONT | m_INTEL) > > > > +/* X86_TUNE_USE_GATHER: Use gather instructions. */ > > +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather", > > + ~(m_ZNVER1 | m_GENERIC)) > > + > > /*****************************************************************************/ > > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ > > /*****************************************************************************/
On Tue, Jan 9, 2018 at 11:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> > Hi, >> > gather instructions are rather hard to implement in hardware and except for >> > skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the >> > degree I did not find real world loop where gather would help on Zen. >> > This patch simply adds a knob to disable its autogeneration (builtin still >> > works). I have considered two alternatives >> > 1) tune this with x86-tune-costs because gather is still profitable than >> > scalar code if we do very expensive operations (such as sequence of divides) >> > on the values gathered/scattered >> > 2) implement expansion of gathers into primitive instructions. This is faster >> > as semantics of gather is bit weird and not fully needed for our vectorizer >> > >> > I did not have luck to get any good results out of 1 alone as cost model is not >> > very realistic. 1+2 probably makes sense but we can do this incrementally as >> > that would make most sense to be implemented generically in vectorizer on the >> > top of this change. >> >> Note that the vectorizer gives up on loops with gathers with no target >> support for >> gathers. It could simply open-code the gather though (and properly cost that >> open-coded variant), that's probably the way to go here. > > Agreed, that is whay i meant by 2. Generic code has all info to implement gathers/scatters > by hand when they are not supported and estimate cost of the implementation. > This is what I meant by the paragraph above. I tried to look into it about month > ago but got bit lost in vectorizer internals :( Yeah, I think the only bits that cannot be done open-coded in an efficient way is the masking. Richard. > Honza >> >> Richard. >> >> > Given that gather is problematic even on skylake+ as shown in the PR (which has >> > most optimized implementation of it) it is good to have a knob to control its >> > codegen at first place. >> > >> > I have also disabled gathers for generic. This is because its use causes >> > some two-digit regression on zen for spec2k17 while there are no measurable >> > benefits on Intel. Note that this affects only >> > -march=<somethig supporting gather> -mtune=generic >> > as by default we do not use AVX2. >> > >> > Bootstrapped/regtested x86_64-linux, plan to commit it later today if there >> > are no complains. >> > >> > Honza >> > >> > PR target/81616 >> > * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER. >> > * i386.h (TARGET_USE_GATHER): Define. >> > * x86-tune.def (X86_TUNE_USE_GATHER): New. >> > Index: config/i386/i386.c >> > =================================================================== >> > --- config/i386/i386.c (revision 256369) >> > +++ config/i386/i386.c (working copy) >> > @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre >> > bool si; >> > enum ix86_builtins code; >> > >> > - if (! TARGET_AVX2) >> > + if (! TARGET_AVX2 || !TARGET_USE_GATHER) >> > return NULL_TREE; >> > >> > if ((TREE_CODE (index_type) != INTEGER_TYPE >> > Index: config/i386/i386.h >> > =================================================================== >> > --- config/i386/i386.h (revision 256369) >> > +++ config/i386/i386.h (working copy) >> > @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[ >> > ix86_tune_features[X86_TUNE_SLOW_PSHUFB] >> > #define TARGET_AVOID_4BYTE_PREFIXES \ >> > ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES] >> > +#define TARGET_USE_GATHER \ >> > + ix86_tune_features[X86_TUNE_USE_GATHER] >> > #define TARGET_FUSE_CMP_AND_BRANCH_32 \ >> > ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32] >> > #define TARGET_FUSE_CMP_AND_BRANCH_64 \ >> > Index: config/i386/x86-tune.def >> > =================================================================== >> > --- config/i386/x86-tune.def (revision 256369) >> > +++ config/i386/x86-tune.def (working copy) >> > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps >> > DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", >> > m_SILVERMONT | m_INTEL) >> > >> > +/* X86_TUNE_USE_GATHER: Use gather instructions. */ >> > +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather", >> > + ~(m_ZNVER1 | m_GENERIC)) >> > + >> > /*****************************************************************************/ >> > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ >> > /*****************************************************************************/
On 01/09/2018 11:28 AM, Richard Biener wrote: > Note that the vectorizer gives up on loops with gathers with no target > support for > gathers. It could simply open-code the gather though (and properly cost that > open-coded variant), that's probably the way to go here. Man, I wish I had made this comment a few years ago, when Jakub Jelinek implemented support for gather in vectorization *of our loops*. Before that, I had only seen programmers do it for our code by rewriting the Fortran into something that was completely unreadable.
Index: config/i386/i386.c =================================================================== --- config/i386/i386.c (revision 256369) +++ config/i386/i386.c (working copy) @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre bool si; enum ix86_builtins code; - if (! TARGET_AVX2) + if (! TARGET_AVX2 || !TARGET_USE_GATHER) return NULL_TREE; if ((TREE_CODE (index_type) != INTEGER_TYPE Index: config/i386/i386.h =================================================================== --- config/i386/i386.h (revision 256369) +++ config/i386/i386.h (working copy) @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[ ix86_tune_features[X86_TUNE_SLOW_PSHUFB] #define TARGET_AVOID_4BYTE_PREFIXES \ ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES] +#define TARGET_USE_GATHER \ + ix86_tune_features[X86_TUNE_USE_GATHER] #define TARGET_FUSE_CMP_AND_BRANCH_32 \ ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32] #define TARGET_FUSE_CMP_AND_BRANCH_64 \ Index: config/i386/x86-tune.def =================================================================== --- config/i386/x86-tune.def (revision 256369) +++ config/i386/x86-tune.def (working copy) @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", m_SILVERMONT | m_INTEL) +/* X86_TUNE_USE_GATHER: Use gather instructions. */ +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather", + ~(m_ZNVER1 | m_GENERIC)) + /*****************************************************************************/ /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ /*****************************************************************************/