diff mbox series

Disable autogeneration of gather instructions on Ryzen and generic

Message ID 20180109102606.GA59414@kam.mff.cuni.cz
State New
Headers show
Series Disable autogeneration of gather instructions on Ryzen and generic | expand

Commit Message

Jan Hubicka Jan. 9, 2018, 10:26 a.m. UTC
Hi,
gather instructions are rather hard to implement in hardware and except for
skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the
degree I did not find real world loop where gather would help on Zen.
This patch simply adds a knob to disable its autogeneration (builtin still
works). I have considered two alternatives
 1) tune this with x86-tune-costs because gather is still profitable than
    scalar code if we do very expensive operations (such as sequence of divides)
    on the values gathered/scattered
 2) implement expansion of gathers into primitive instructions.  This is faster
    as semantics of gather is bit weird and not fully needed for our vectorizer

I did not have luck to get any good results out of 1 alone as cost model is not
very realistic.  1+2 probably makes sense but we can do this incrementally as
that would make most sense to be implemented generically in vectorizer on the
top of this change.

Given that gather is problematic even on skylake+ as shown in the PR (which has
most optimized implementation of it) it is good to have a knob to control its
codegen at first place.

I have also disabled gathers for generic.  This is because its use causes
some two-digit regression on zen for spec2k17 while there are no measurable
benefits on Intel.  Note that this affects only
-march=<somethig supporting gather> -mtune=generic
as by default we do not use AVX2.

Bootstrapped/regtested x86_64-linux, plan to commit it later today if there
are no complains.

Honza

	PR target/81616
	* i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
	* i386.h (TARGET_USE_GATHER): Define.
	* x86-tune.def (X86_TUNE_USE_GATHER): New.

Comments

Richard Biener Jan. 9, 2018, 10:28 a.m. UTC | #1
On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> gather instructions are rather hard to implement in hardware and except for
> skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the
> degree I did not find real world loop where gather would help on Zen.
> This patch simply adds a knob to disable its autogeneration (builtin still
> works). I have considered two alternatives
>  1) tune this with x86-tune-costs because gather is still profitable than
>     scalar code if we do very expensive operations (such as sequence of divides)
>     on the values gathered/scattered
>  2) implement expansion of gathers into primitive instructions.  This is faster
>     as semantics of gather is bit weird and not fully needed for our vectorizer
>
> I did not have luck to get any good results out of 1 alone as cost model is not
> very realistic.  1+2 probably makes sense but we can do this incrementally as
> that would make most sense to be implemented generically in vectorizer on the
> top of this change.

Note that the vectorizer gives up on loops with gathers with no target
support for
gathers.  It could simply open-code the gather though (and properly cost that
open-coded variant), that's probably the way to go here.

Richard.

> Given that gather is problematic even on skylake+ as shown in the PR (which has
> most optimized implementation of it) it is good to have a knob to control its
> codegen at first place.
>
> I have also disabled gathers for generic.  This is because its use causes
> some two-digit regression on zen for spec2k17 while there are no measurable
> benefits on Intel.  Note that this affects only
> -march=<somethig supporting gather> -mtune=generic
> as by default we do not use AVX2.
>
> Bootstrapped/regtested x86_64-linux, plan to commit it later today if there
> are no complains.
>
> Honza
>
>         PR target/81616
>         * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
>         * i386.h (TARGET_USE_GATHER): Define.
>         * x86-tune.def (X86_TUNE_USE_GATHER): New.
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 256369)
> +++ config/i386/i386.c  (working copy)
> @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre
>    bool si;
>    enum ix86_builtins code;
>
> -  if (! TARGET_AVX2)
> +  if (! TARGET_AVX2 || !TARGET_USE_GATHER)
>      return NULL_TREE;
>
>    if ((TREE_CODE (index_type) != INTEGER_TYPE
> Index: config/i386/i386.h
> ===================================================================
> --- config/i386/i386.h  (revision 256369)
> +++ config/i386/i386.h  (working copy)
> @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[
>         ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
>  #define TARGET_AVOID_4BYTE_PREFIXES \
>         ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
> +#define TARGET_USE_GATHER \
> +       ix86_tune_features[X86_TUNE_USE_GATHER]
>  #define TARGET_FUSE_CMP_AND_BRANCH_32 \
>         ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
>  #define TARGET_FUSE_CMP_AND_BRANCH_64 \
> Index: config/i386/x86-tune.def
> ===================================================================
> --- config/i386/x86-tune.def    (revision 256369)
> +++ config/i386/x86-tune.def    (working copy)
> @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
>  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>            m_SILVERMONT | m_INTEL)
>
> +/* X86_TUNE_USE_GATHER: Use gather instructions.  */
> +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
> +          ~(m_ZNVER1 | m_GENERIC))
> +
>  /*****************************************************************************/
>  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
>  /*****************************************************************************/
Jan Hubicka Jan. 9, 2018, 10:58 a.m. UTC | #2
> On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > gather instructions are rather hard to implement in hardware and except for
> > skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the
> > degree I did not find real world loop where gather would help on Zen.
> > This patch simply adds a knob to disable its autogeneration (builtin still
> > works). I have considered two alternatives
> >  1) tune this with x86-tune-costs because gather is still profitable than
> >     scalar code if we do very expensive operations (such as sequence of divides)
> >     on the values gathered/scattered
> >  2) implement expansion of gathers into primitive instructions.  This is faster
> >     as semantics of gather is bit weird and not fully needed for our vectorizer
> >
> > I did not have luck to get any good results out of 1 alone as cost model is not
> > very realistic.  1+2 probably makes sense but we can do this incrementally as
> > that would make most sense to be implemented generically in vectorizer on the
> > top of this change.
> 
> Note that the vectorizer gives up on loops with gathers with no target
> support for
> gathers.  It could simply open-code the gather though (and properly cost that
> open-coded variant), that's probably the way to go here.

Agreed, that is whay i meant by 2.  Generic code has all info to implement gathers/scatters
by hand when they are not supported and estimate cost of the implementation.
This is what I meant by the paragraph above.  I tried to look into it about month
ago but got bit lost in vectorizer internals :(

Honza
> 
> Richard.
> 
> > Given that gather is problematic even on skylake+ as shown in the PR (which has
> > most optimized implementation of it) it is good to have a knob to control its
> > codegen at first place.
> >
> > I have also disabled gathers for generic.  This is because its use causes
> > some two-digit regression on zen for spec2k17 while there are no measurable
> > benefits on Intel.  Note that this affects only
> > -march=<somethig supporting gather> -mtune=generic
> > as by default we do not use AVX2.
> >
> > Bootstrapped/regtested x86_64-linux, plan to commit it later today if there
> > are no complains.
> >
> > Honza
> >
> >         PR target/81616
> >         * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
> >         * i386.h (TARGET_USE_GATHER): Define.
> >         * x86-tune.def (X86_TUNE_USE_GATHER): New.
> > Index: config/i386/i386.c
> > ===================================================================
> > --- config/i386/i386.c  (revision 256369)
> > +++ config/i386/i386.c  (working copy)
> > @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre
> >    bool si;
> >    enum ix86_builtins code;
> >
> > -  if (! TARGET_AVX2)
> > +  if (! TARGET_AVX2 || !TARGET_USE_GATHER)
> >      return NULL_TREE;
> >
> >    if ((TREE_CODE (index_type) != INTEGER_TYPE
> > Index: config/i386/i386.h
> > ===================================================================
> > --- config/i386/i386.h  (revision 256369)
> > +++ config/i386/i386.h  (working copy)
> > @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[
> >         ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
> >  #define TARGET_AVOID_4BYTE_PREFIXES \
> >         ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
> > +#define TARGET_USE_GATHER \
> > +       ix86_tune_features[X86_TUNE_USE_GATHER]
> >  #define TARGET_FUSE_CMP_AND_BRANCH_32 \
> >         ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
> >  #define TARGET_FUSE_CMP_AND_BRANCH_64 \
> > Index: config/i386/x86-tune.def
> > ===================================================================
> > --- config/i386/x86-tune.def    (revision 256369)
> > +++ config/i386/x86-tune.def    (working copy)
> > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
> >  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
> >            m_SILVERMONT | m_INTEL)
> >
> > +/* X86_TUNE_USE_GATHER: Use gather instructions.  */
> > +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
> > +          ~(m_ZNVER1 | m_GENERIC))
> > +
> >  /*****************************************************************************/
> >  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
> >  /*****************************************************************************/
Richard Biener Jan. 9, 2018, 11:30 a.m. UTC | #3
On Tue, Jan 9, 2018 at 11:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Hi,
>> > gather instructions are rather hard to implement in hardware and except for
>> > skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the
>> > degree I did not find real world loop where gather would help on Zen.
>> > This patch simply adds a knob to disable its autogeneration (builtin still
>> > works). I have considered two alternatives
>> >  1) tune this with x86-tune-costs because gather is still profitable than
>> >     scalar code if we do very expensive operations (such as sequence of divides)
>> >     on the values gathered/scattered
>> >  2) implement expansion of gathers into primitive instructions.  This is faster
>> >     as semantics of gather is bit weird and not fully needed for our vectorizer
>> >
>> > I did not have luck to get any good results out of 1 alone as cost model is not
>> > very realistic.  1+2 probably makes sense but we can do this incrementally as
>> > that would make most sense to be implemented generically in vectorizer on the
>> > top of this change.
>>
>> Note that the vectorizer gives up on loops with gathers with no target
>> support for
>> gathers.  It could simply open-code the gather though (and properly cost that
>> open-coded variant), that's probably the way to go here.
>
> Agreed, that is whay i meant by 2.  Generic code has all info to implement gathers/scatters
> by hand when they are not supported and estimate cost of the implementation.
> This is what I meant by the paragraph above.  I tried to look into it about month
> ago but got bit lost in vectorizer internals :(

Yeah, I think the only bits that cannot be done open-coded in an
efficient way is
the masking.

Richard.

> Honza
>>
>> Richard.
>>
>> > Given that gather is problematic even on skylake+ as shown in the PR (which has
>> > most optimized implementation of it) it is good to have a knob to control its
>> > codegen at first place.
>> >
>> > I have also disabled gathers for generic.  This is because its use causes
>> > some two-digit regression on zen for spec2k17 while there are no measurable
>> > benefits on Intel.  Note that this affects only
>> > -march=<somethig supporting gather> -mtune=generic
>> > as by default we do not use AVX2.
>> >
>> > Bootstrapped/regtested x86_64-linux, plan to commit it later today if there
>> > are no complains.
>> >
>> > Honza
>> >
>> >         PR target/81616
>> >         * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
>> >         * i386.h (TARGET_USE_GATHER): Define.
>> >         * x86-tune.def (X86_TUNE_USE_GATHER): New.
>> > Index: config/i386/i386.c
>> > ===================================================================
>> > --- config/i386/i386.c  (revision 256369)
>> > +++ config/i386/i386.c  (working copy)
>> > @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre
>> >    bool si;
>> >    enum ix86_builtins code;
>> >
>> > -  if (! TARGET_AVX2)
>> > +  if (! TARGET_AVX2 || !TARGET_USE_GATHER)
>> >      return NULL_TREE;
>> >
>> >    if ((TREE_CODE (index_type) != INTEGER_TYPE
>> > Index: config/i386/i386.h
>> > ===================================================================
>> > --- config/i386/i386.h  (revision 256369)
>> > +++ config/i386/i386.h  (working copy)
>> > @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[
>> >         ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
>> >  #define TARGET_AVOID_4BYTE_PREFIXES \
>> >         ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
>> > +#define TARGET_USE_GATHER \
>> > +       ix86_tune_features[X86_TUNE_USE_GATHER]
>> >  #define TARGET_FUSE_CMP_AND_BRANCH_32 \
>> >         ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
>> >  #define TARGET_FUSE_CMP_AND_BRANCH_64 \
>> > Index: config/i386/x86-tune.def
>> > ===================================================================
>> > --- config/i386/x86-tune.def    (revision 256369)
>> > +++ config/i386/x86-tune.def    (working copy)
>> > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
>> >  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>> >            m_SILVERMONT | m_INTEL)
>> >
>> > +/* X86_TUNE_USE_GATHER: Use gather instructions.  */
>> > +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
>> > +          ~(m_ZNVER1 | m_GENERIC))
>> > +
>> >  /*****************************************************************************/
>> >  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
>> >  /*****************************************************************************/
Toon Moene Jan. 11, 2018, 8:46 p.m. UTC | #4
On 01/09/2018 11:28 AM, Richard Biener wrote:

> Note that the vectorizer gives up on loops with gathers with no target
> support for
> gathers.  It could simply open-code the gather though (and properly cost that
> open-coded variant), that's probably the way to go here.

Man, I wish I had made this comment a few years ago, when Jakub Jelinek 
implemented support for gather in vectorization *of our loops*.

Before that, I had only seen programmers do it for our code by rewriting 
the Fortran into something that was completely unreadable.
diff mbox series

Patch

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 256369)
+++ config/i386/i386.c	(working copy)
@@ -38233,7 +38233,7 @@  ix86_vectorize_builtin_gather (const_tre
   bool si;
   enum ix86_builtins code;
 
-  if (! TARGET_AVX2)
+  if (! TARGET_AVX2 || !TARGET_USE_GATHER)
     return NULL_TREE;
 
   if ((TREE_CODE (index_type) != INTEGER_TYPE
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 256369)
+++ config/i386/i386.h	(working copy)
@@ -498,6 +498,8 @@  extern unsigned char ix86_tune_features[
 	ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
 #define TARGET_AVOID_4BYTE_PREFIXES \
 	ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
+#define TARGET_USE_GATHER \
+	ix86_tune_features[X86_TUNE_USE_GATHER]
 #define TARGET_FUSE_CMP_AND_BRANCH_32 \
 	ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
 #define TARGET_FUSE_CMP_AND_BRANCH_64 \
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def	(revision 256369)
+++ config/i386/x86-tune.def	(working copy)
@@ -399,6 +399,10 @@  DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
           m_SILVERMONT | m_INTEL)
 
+/* X86_TUNE_USE_GATHER: Use gather instructions.  */
+DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
+          ~(m_ZNVER1 | m_GENERIC))
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/