
[14/25] Disable inefficient vectorization of elementwise loads/stores.

Message ID fb85f5cc96463b1a779cd4f874dff269960b40a3.1536144068.git.ams@codesourcery.com
State New
Series AMD GCN Port

Commit Message

Andrew Stubbs Sept. 5, 2018, 11:50 a.m. UTC
If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
blows away the register file and produces horrible code.

This patch simply disallows elementwise loads for such large vectors.  Is there
a better way to disable this in the middle-end?
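
For illustration only: a runtime-stride loop of roughly this shape is one
way to end up with VMAT_ELEMENTWISE, since no contiguous vector load covers
the accesses (the function and constants below are hypothetical):

/* With an unknown stride the loads are classified as elementwise; on a
   64-lane target the current generic expansion becomes 64 scalar loads
   plus 64 element inserts per vector iteration.  */
void
scale_strided (float *dst, const float *src, int stride, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i * stride] * 2.0f;
}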

2018-09-05  Julian Brown  <julian@codesourcery.com>

	gcc/
	* tree-vect-stmts.c (get_load_store_type): Don't use VMAT_ELEMENTWISE
	loads/stores with many-element (>=64) vectors.
---
 gcc/tree-vect-stmts.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Richard Sandiford Sept. 17, 2018, 9:14 a.m. UTC | #1
<ams@codesourcery.com> writes:
> If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
> blows away the register file and produces horrible code.

Do all the registers really need to be live at once, or is it "just" bad
scheduling?  I'd have expected the initial rtl to load each element and
then insert it immediately, so that the number of insertions doesn't
directly affect register pressure.

> This patch simply disallows elementwise loads for such large vectors.  Is there
> a better way to disable this in the middle-end?

Do you ever want elementwise accesses for GCN?  If not, it might be
better to disable them in the target's cost model.
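
For illustration, a sketch of what that could look like in a hook with the
port's TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST signature (not the
actual port code; the cost value and the 64-lane threshold are assumptions):

int
gcn_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
			tree vectype, int misalign ATTRIBUTE_UNUSED)
{
  /* Building a wide vector one element at a time needs a scalar load,
     a lane move and a v_writelane per element, so make vec_construct
     expensive enough that the elementwise strategy loses.  */
  if (type_of_cost == vec_construct && vectype
      && maybe_ge (TYPE_VECTOR_SUBPARTS (vectype), 64U))
    return 100;

  /* Otherwise keep the port's permissive default.  */
  return 1;
}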

Thanks,
Richard

>
> 2018-09-05  Julian Brown  <julian@codesourcery.com>
>
> 	gcc/
> 	* tree-vect-stmts.c (get_load_store_type): Don't use VMAT_ELEMENTWISE
> 	loads/stores with many-element (>=64) vectors.
> ---
>  gcc/tree-vect-stmts.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 8875201..a333991 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -2452,6 +2452,26 @@ get_load_store_type (stmt_vec_info stmt_info, tree vectype, bool slp,
>  	*memory_access_type = VMAT_CONTIGUOUS;
>      }
>  
> +  /* FIXME: Element-wise accesses can be extremely expensive if we have a
> +     large number of elements to deal with (e.g. 64 for AMD GCN) using the
> +     current generic code expansion.  Until an efficient code sequence is
> +     supported for affected targets instead, don't attempt vectorization for
> +     VMAT_ELEMENTWISE at all.  */
> +  if (*memory_access_type == VMAT_ELEMENTWISE)
> +    {
> +      poly_uint64 nelements = TYPE_VECTOR_SUBPARTS (vectype);
> +
> +      if (maybe_ge (nelements, 64))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +	      "too many elements (%u) for elementwise accesses\n",
> +	      (unsigned) nelements.to_constant ());
> +
> +	  return false;
> +	}
> +    }
> +
>    if ((*memory_access_type == VMAT_ELEMENTWISE
>         || *memory_access_type == VMAT_STRIDED_SLP)
>        && !nunits.is_constant ())
Andrew Stubbs Sept. 17, 2018, 9:39 a.m. UTC | #2
On 17/09/18 10:14, Richard Sandiford wrote:
> <ams@codesourcery.com> writes:
>> If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
>> blows away the register file and produces horrible code.
> 
> Do all the registers really need to be live at once, or is it "just" bad
> scheduling?  I'd have expected the initial rtl to load each element and
> then insert it immediately, so that the number of insertions doesn't
> directly affect register pressure.

They don't need to be live at once, architecturally speaking, but that's 
the way it happened.  No doubt there is another solution to fix it, but 
it's not a use case I believe we want to spend time optimizing.

Actually, I've not tested what happens without this in GCC 9, so that's 
probably worth checking, but I'd still be concerned about it blowing up 
on real code somewhere.

>> This patch simply disallows elementwise loads for such large vectors.  Is there
>> a better way to disable this in the middle-end?
> 
> Do you ever want elementwise accesses for GCN?  If not, it might be
> better to disable them in the target's cost model.

The hardware is perfectly capable of extracting or setting vector 
elements, but given that it can do full gather/scatter from arbitrary 
addresses it's not something we want to do in general.

A normal scalar load will use a vector register (lane 0). The value then 
has to be moved to a scalar register, and only then can v_writelane 
insert it into the final destination.

Alternatively you could use a mask_load to load the value directly to 
the correct lane, but I don't believe that's something GCC does.

Andrew
Richard Sandiford Sept. 17, 2018, 11:43 a.m. UTC | #3
Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 10:14, Richard Sandiford wrote:
>> <ams@codesourcery.com> writes:
>>> If the autovectorizer tries to load a GCN 64-lane vector elementwise then it
>>> blows away the register file and produces horrible code.
>> 
>> Do all the registers really need to be live at once, or is it "just" bad
>> scheduling?  I'd have expected the initial rtl to load each element and
>> then insert it immediately, so that the number of insertions doesn't
>> directly affect register pressure.
>
> They don't need to be live at once, architecturally speaking, but that's 
> the way it happened.  No doubt there is another solution to fix it, but 
> it's not a use case I believe we want to spend time optimizing.
>
> Actually, I've not tested what happens without this in GCC 9, so that's 
> probably worth checking, but I'd still be concerned about it blowing up 
> on real code somewhere.
>
>>> This patch simply disallows elementwise loads for such large vectors.
>>> Is there
>>> a better way to disable this in the middle-end?
>> 
>> Do you ever want elementwise accesses for GCN?  If not, it might be
>> better to disable them in the target's cost model.
>
> The hardware is perfectly capable of extracting or setting vector 
> elements, but given that it can do full gather/scatter from arbitrary 
> addresses it's not something we want to do in general.
>
> A normal scalar load will use a vector register (lane 0). The value then 
> has to be moved to a scalar register, and only then can v_writelane 
> insert it into the final destination.

OK, sounds like the cost of vec_construct is too low then.  But looking
at the port, I see you have:

/* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */

int
gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
			tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
{
  /* Always vectorize.  */
  return 1;
}

which short-circuits the cost-model altogether.  Isn't that part
of the problem?

Richard

>
> Alternatively you could use a mask_load to load the value directly to 
> the correct lane, but I don't believe that's something GCC does.
>
> Andrew
Andrew Stubbs Sept. 17, 2018, 12:40 p.m. UTC | #4
On 17/09/18 12:43, Richard Sandiford wrote:
> OK, sounds like the cost of vec_construct is too low then.  But looking
> at the port, I see you have:
> 
> /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
> 
> int
> gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
> 			tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
> {
>    /* Always vectorize.  */
>    return 1;
> }
> 
> which short-circuits the cost-model altogether.  Isn't that part
> of the problem?

Well, it's possible that that's a little simplistic. ;-)

Although, actually the elementwise issue predates the existence of 
gcn_vectorization_cost, and the default does appear to penalize 
vec_construct somewhat.

Actually, the default definition doesn't seem to do much besides 
increase vec_construct, so I'm not sure now why I needed to change it? 
Hmm, more experiments to do.

Thanks for the pointer.

Andrew
Richard Biener Sept. 20, 2018, 12:56 p.m. UTC | #5
On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 17/09/18 12:43, Richard Sandiford wrote:
> > OK, sounds like the cost of vec_construct is too low then.  But looking
> > at the port, I see you have:
> >
> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
> >
> > int
> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
> > {
> >    /* Always vectorize.  */
> >    return 1;
> > }
> >
> > which short-circuits the cost-model altogether.  Isn't that part
> > of the problem?
>
> Well, it's possible that that's a little simplistic. ;-)
>
> Although, actually the elementwise issue predates the existence of
> gcn_vectorization_cost, and the default does appear to penalize
> vec_construct somewhat.
>
> Actually, the default definition doesn't seem to do much besides
> increase vec_construct, so I'm not sure now why I needed to change it?
> Hmm, more experiments to do.
>
> Thanks for the pointer.

Btw, we do not consider to use gather/scatter for VMAT_ELEMENTWISE,
that's a missed "optimization" quite possibly because gather/scatter is so
expensive on x86.  Thus the vectorizer should consider this and use the
cheaper alternative according to the cost model (which you of course should
fill with sensible values...).

Richard.

> Andrew
Richard Sandiford Sept. 20, 2018, 1:40 p.m. UTC | #6
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 17/09/18 12:43, Richard Sandiford wrote:
>> > OK, sounds like the cost of vec_construct is too low then.  But looking
>> > at the port, I see you have:
>> >
>> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
>> >
>> > int
>> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
>> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
>> > {
>> >    /* Always vectorize.  */
>> >    return 1;
>> > }
>> >
>> > which short-circuits the cost-model altogether.  Isn't that part
>> > of the problem?
>>
>> Well, it's possible that that's a little simplistic. ;-)
>>
>> Although, actually the elementwise issue predates the existence of
>> gcn_vectorization_cost, and the default does appear to penalize
>> vec_construct somewhat.
>>
>> Actually, the default definition doesn't seem to do much besides
>> increase vec_construct, so I'm not sure now why I needed to change it?
>> Hmm, more experiments to do.
>>
>> Thanks for the pointer.
>
> Btw, we do not consider to use gather/scatter for VMAT_ELEMENTWISE,
> that's a missed "optimization" quite possibly because gather/scatter is so
> expensive on x86.  Thus the vectorizer should consider this and use the
> cheaper alternative according to the cost model (which you of course should
> fill with sensible values...).

Do you mean it this way round, or that it doesn't consider using
VMAT_ELEMENTWISE for natural gather/scatter accesses?  We do use
VMAT_GATHER_SCATTER instead of VMAT_ELEMENTWISE where possible for SVE,
but that relies on implementing the new optabs instead of using the old
built-in-based interface, so it doesn't work for x86 yet.

I guess we might need some way of selecting between the two if
the costs of gather and scatter are context-dependent in some way.
But if gather/scatter is always more expensive than VMAT_ELEMENTWISE
for certain modes then it's probably better not to define the optabs
for those modes.

Thanks,
Richard
Richard Biener Sept. 20, 2018, 2:11 p.m. UTC | #7
On Thu, Sep 20, 2018 at 3:40 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
> >> On 17/09/18 12:43, Richard Sandiford wrote:
> >> > OK, sounds like the cost of vec_construct is too low then.  But looking
> >> > at the port, I see you have:
> >> >
> >> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
> >> >
> >> > int
> >> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
> >> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
> >> > {
> >> >    /* Always vectorize.  */
> >> >    return 1;
> >> > }
> >> >
> >> > which short-circuits the cost-model altogether.  Isn't that part
> >> > of the problem?
> >>
> >> Well, it's possible that that's a little simplistic. ;-)
> >>
> >> Although, actually the elementwise issue predates the existence of
> >> gcn_vectorization_cost, and the default does appear to penalize
> >> vec_construct somewhat.
> >>
> >> Actually, the default definition doesn't seem to do much besides
> >> increase vec_construct, so I'm not sure now why I needed to change it?
> >> Hmm, more experiments to do.
> >>
> >> Thanks for the pointer.
> >
> > Btw, we do not consider to use gather/scatter for VMAT_ELEMENTWISE,
> > that's a missed "optimization" quite possibly because gather/scatter is so
> > expensive on x86.  Thus the vectorizer should consider this and use the
> > cheaper alternative according to the cost model (which you of course should
> > fill with sensible values...).
>
> Do you mean it this way round, or that it doesn't consider using
> VMAT_ELEMENTWISE for natural gather/scatter accesses?  We do use
> VMAT_GATHER_SCATTER instead of VMAT_ELEMENTWISE where possible for SVE,
> but that relies on implementing the new optabs instead of using the old
> built-in-based interface, so it doesn't work for x86 yet.
>
> I guess we might need some way of selecting between the two if
> the costs of gather and scatter are context-dependent in some way.
> But if gather/scatter is always more expensive than VMAT_ELEMENTWISE
> for certain modes then it's probably better not to define the optabs
> for those modes.

I think we can't vectorize true gathers (indexed from memory loads) w/o
gather yet, right?  So I really was thinking of implementing VMAT_ELEMENTWISE
(invariant stride) and VMAT_STRIDED_SLP by composing the appropriate
index vector with a splat and multiplication and using a gather.  I think that's
not yet implemented?
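
As an illustrative sketch of that idea using GCC's generic vector
extensions (not existing vectorizer code; the 8-lane factor is arbitrary
and the scalar loop merely stands in for the target's real gather
instruction):

typedef int v8si __attribute__ ((vector_size (32)));
typedef float v8sf __attribute__ ((vector_size (32)));

/* For an invariant-stride access src[i * stride], build the lane indexes
   with a splat and a multiply, then issue one gather instead of VF
   elementwise loads.  */
static inline v8sf
strided_load_as_gather (const float *base, int stride)
{
  v8si step = { 0, 1, 2, 3, 4, 5, 6, 7 };
  v8si splat = { stride, stride, stride, stride,
                 stride, stride, stride, stride };
  v8si idx = step * splat;

  v8sf result = { 0, 0, 0, 0, 0, 0, 0, 0 };
  for (int lane = 0; lane < 8; lane++)
    result[lane] = base[idx[lane]];   /* stands in for the gather.  */
  return result;
}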

But yes, vectorizing gathers as detected by dataref analysis w/o native gather
support would also be interesting.  We can do that by doing elementwise
loads and either loading the indexes elementwise as well or decomposing the
vector of indexes (depending on how that vector is computed).

Richard.

>
> Thanks,
> Richard
Richard Sandiford Sept. 20, 2018, 2:21 p.m. UTC | #8
Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Sep 20, 2018 at 3:40 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Mon, Sep 17, 2018 at 2:40 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>> >> On 17/09/18 12:43, Richard Sandiford wrote:
>> >> > OK, sounds like the cost of vec_construct is too low then.  But looking
>> >> > at the port, I see you have:
>> >> >
>> >> > /* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST.  */
>> >> >
>> >> > int
>> >> > gcn_vectorization_cost (enum vect_cost_for_stmt ARG_UNUSED (type_of_cost),
>> >> >                       tree ARG_UNUSED (vectype), int ARG_UNUSED (misalign))
>> >> > {
>> >> >    /* Always vectorize.  */
>> >> >    return 1;
>> >> > }
>> >> >
>> >> > which short-circuits the cost-model altogether.  Isn't that part
>> >> > of the problem?
>> >>
>> >> Well, it's possible that that's a little simplistic. ;-)
>> >>
>> >> Although, actually the elementwise issue predates the existence of
>> >> gcn_vectorization_cost, and the default does appear to penalize
>> >> vec_construct somewhat.
>> >>
>> >> Actually, the default definition doesn't seem to do much besides
>> >> increase vec_construct, so I'm not sure now why I needed to change it?
>> >> Hmm, more experiments to do.
>> >>
>> >> Thanks for the pointer.
>> >
>> > Btw, we do not consider to use gather/scatter for VMAT_ELEMENTWISE,
>> > that's a missed "optimization" quite possibly because gather/scatter is so
>> > expensive on x86.  Thus the vectorizer should consider this and use the
>> > cheaper alternative according to the cost model (which you of course should
>> > fill with sensible values...).
>>
>> Do you mean it this way round, or that it doesn't consider using
>> VMAT_ELEMENTWISE for natural gather/scatter accesses?  We do use
>> VMAT_GATHER_SCATTER instead of VMAT_ELEMENTWISE where possible for SVE,
>> but that relies on implementing the new optabs instead of using the old
>> built-in-based interface, so it doesn't work for x86 yet.
>>
>> I guess we might need some way of selecting between the two if
>> the costs of gather and scatter are context-dependent in some way.
>> But if gather/scatter is always more expensive than VMAT_ELEMENTWISE
>> for certain modes then it's probably better not to define the optabs
>> for those modes.
>
> I think we can't vectorize true gathers (indexed from memory loads) w/o
> gather yet, right?

Right.

> So I really was thinking of implementing VMAT_ELEMENTWISE (invariant
> stride) and VMAT_STRIDED_SLP by composing the appropriate index vector
> with a splat and multiplication and using a gather.  I think that's
> not yet implemented?

For SVE we use:

      /* As a last resort, trying using a gather load or scatter store.

	 ??? Although the code can handle all group sizes correctly,
	 it probably isn't a win to use separate strided accesses based
	 on nearby locations.  Or, even if it's a win over scalar code,
	 it might not be a win over vectorizing at a lower VF, if that
	 allows us to use contiguous accesses.  */
      if (*memory_access_type == VMAT_ELEMENTWISE
	  && single_element_p
	  && loop_vinfo
	  && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
						 masked_p, gs_info))
	*memory_access_type = VMAT_GATHER_SCATTER;

in get_group_load_store_type.  This only works when the target defines
gather/scatter using optabs rather than built-ins.

But yeah, no VMAT_STRIDED_SLP support yet.  That would be good
to have...

Richard

Patch

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 8875201..a333991 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2452,6 +2452,26 @@  get_load_store_type (stmt_vec_info stmt_info, tree vectype, bool slp,
 	*memory_access_type = VMAT_CONTIGUOUS;
     }
 
+  /* FIXME: Element-wise accesses can be extremely expensive if we have a
+     large number of elements to deal with (e.g. 64 for AMD GCN) using the
+     current generic code expansion.  Until an efficient code sequence is
+     supported for affected targets instead, don't attempt vectorization for
+     VMAT_ELEMENTWISE at all.  */
+  if (*memory_access_type == VMAT_ELEMENTWISE)
+    {
+      poly_uint64 nelements = TYPE_VECTOR_SUBPARTS (vectype);
+
+      if (maybe_ge (nelements, 64))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+	      "too many elements (%u) for elementwise accesses\n",
+	      (unsigned) nelements.to_constant ());
+
+	  return false;
+	}
+    }
+
   if ((*memory_access_type == VMAT_ELEMENTWISE
        || *memory_access_type == VMAT_STRIDED_SLP)
       && !nunits.is_constant ())