
[13/25] Create TARGET_DISABLE_CURRENT_VECTOR_SIZE

Message ID 85da45b3271492b67b7f2a6f9474ce7a153e200c.1536144068.git.ams@codesourcery.com
State New
Series AMD GCN Port

Commit Message

Andrew Stubbs Sept. 5, 2018, 11:50 a.m. UTC
This feature probably ought to be reworked as a proper target hook, but I would
like to know if this is the correct solution to the problem first.

The problem is that GCN vectors have a fixed number of elements (64) and the
vector size varies with element size.  E.g. V64QI is 64 bytes and V64SI is 256
bytes.
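
To illustrate the scaling using GCC's generic vector extensions (the
typedef names here are just for illustration; only the byte sizes matter):

  typedef char  v64qi __attribute__ ((vector_size (64)));   /* V64QI:  64 x 1 =  64 bytes */
  typedef short v64hi __attribute__ ((vector_size (128)));  /* V64HI:  64 x 2 = 128 bytes */
  typedef int   v64si __attribute__ ((vector_size (256)));  /* V64SI:  64 x 4 = 256 bytes */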

This is a problem because GCC has an assumption that a) vector registers are
fixed size, and b) if there are multiple vector sizes you want to pick one size
and stick with it for the whole function.

This is a problem in various places, but mostly it's not fatal. However,
get_vectype_for_scalar_type caches the vector size for the first type it
encounters and then tries to apply that to all subsequent vectors, which
completely destroys vectorization.  The caching feature appears to be an
attempt to cope with AVX having a different vector size to other x86 vector
options.
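
As a sketch of the failure mode, consider a (hypothetical) loop mixing
element sizes:

  char *c;
  int *p;
  for (int i = 0; i < n; ++i)
    p[i] = c[i];

If the char access is seen first, the cache records a 64-byte vector size
(V64QI); the subsequent request for a 64-byte int vector would need a
16-lane mode, which GCN doesn't have, so the lookup fails with "no vector
type".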

This patch simply disables the cache so that it must ask the backend for the
preferred mode for every type.

2018-09-05  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* tree-vect-stmts.c (get_vectype_for_scalar_type): Implement
	TARGET_DISABLE_CURRENT_VECTOR_SIZE.
---
 gcc/tree-vect-stmts.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Richard Sandiford Sept. 17, 2018, 7:28 p.m. UTC | #1
<ams@codesourcery.com> writes:
> This feature probably ought to be reworked as a proper target hook, but I would
> like to know if this is the correct solution to the problem first.
>
> The problem is that GCN vectors have a fixed number of elements (64) and the
> vector size varies with element size.  E.g. V64QI is 64 bytes and V64SI is 256
> bytes.
>
> This is a problem because GCC has an assumption that a) vector registers are
> fixed size, and b) if there are multiple vector sizes you want to pick one size
> and stick with it for the whole function.

The whole of the current vectorisation region rather than the whole function,
but yeah, this is a fundamental assumption with the current autovec code.
It's something that would be really good to fix...

> This is a problem in various places, but mostly it's not fatal. However,
> get_vectype_for_scalar_type caches the vector size for the first type it
> encounters and then tries to apply that to all subsequent vectors, which
> completely destroys vectorization.  The caching feature appears to be an
> attempt to cope with AVX having a different vector size to other x86 vector
> options.
>
> This patch simply disables the cache so that it must ask the backend for the
> preferred mode for every type.

TBH I'm surprised this works.  Obviously it does, otherwise you wouldn't
have posted it, but it seems like an accident.  Various parts of the
vectoriser query current_vector_size and expect it to be stable for
the current choice of vector size.

The underlying problem also affects (at least) base AArch64, SVE and x86_64.
We try to choose vector types on the fly based only on the type of a given
scalar value, but in reality, the type we want for a 32-bit element (say)
often depends on whether the vectorisation region also has smaller or
larger elements.  And in general we only know that after
vect_mark_stmts_to_be_vectorized, but we want to know the vector types
earlier, such as in pattern recognition and while building SLP trees.
It's a bit of a chicken-and-egg problem...

Richard
Andrew Stubbs Sept. 18, 2018, 8:43 a.m. UTC | #2
On 17/09/18 20:28, Richard Sandiford wrote:
>> This patch simply disables the cache so that it must ask the backend for the
>> preferred mode for every type.
> 
> TBH I'm surprised this works.  Obviously it does, otherwise you wouldn't
> have posted it, but it seems like an accident.  Various parts of the
> vectoriser query current_vector_size and expect it to be stable for
> the current choice of vector size.

Indeed, this is why this remains only a half-baked patch: I wasn't 
confident it was the correct or whole solution.

It works insofar as it fixes the immediate problem that I saw -- "no 
vector type" -- and makes a bunch of vect.exp testcases happy.

It's quite possible that something else is unhappy with this.

> The underlying problem also affects (at least) base AArch64, SVE and x86_64.
> We try to choose vector types on the fly based only on the type of a given
> scalar value, but in reality, the type we want for a 32-bit element (say)
> often depends on whether the vectorisation region also has smaller or
> larger elements.  And in general we only know that after
> vect_mark_stmts_to_be_vectorized, but we want to know the vector types
> earlier, such as in pattern recognition and while building SLP trees.
> It's a bit of a chicken-and-egg problem...

I don't understand why the number of bits in a vector is the key 
information here?

It would make sense if you were to say that the number of elements has 
to be fixed in a given region, because obviously that's tied to loop 
strides and such, but why the size?

It seems like there is an architecture where you don't want to mix 
instruction types (SSE vs. AVX?) and that makes sense for that 
architecture, but if that's the case then we need to be able to turn it 
off for other architectures.

For GCN, vectors are fully maskable, so we almost want such 
considerations to be completely ignored.  We basically want it to act 
like it can have any size vector it likes, up to 64 elements.

Andrew
Richard Sandiford Sept. 18, 2018, 11:21 a.m. UTC | #3
Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 20:28, Richard Sandiford wrote:
>>> This patch simply disables the cache so that it must ask the backend for the
>>> preferred mode for every type.
>> 
>> TBH I'm surprised this works.  Obviously it does, otherwise you wouldn't
>> have posted it, but it seems like an accident.  Various parts of the
>> vectoriser query current_vector_size and expect it to be stable for
>> the current choice of vector size.
>
> Indeed, this is why this remains only a half-baked patch: I wasn't 
> confident it was the correct or whole solution.
>
> It works insofar as it fixes the immediate problem that I saw -- "no 
> vector type" -- and makes a bunch of vect.exp testcases happy.
>
> It's quite possible that something else is unhappy with this.
>
>> The underlying problem also affects (at least) base AArch64, SVE and x86_64.
>> We try to choose vector types on the fly based only on the type of a given
>> scalar value, but in reality, the type we want for a 32-bit element (say)
>> often depends on whether the vectorisation region also has smaller or
>> larger elements.  And in general we only know that after
>> vect_mark_stmts_to_be_vectorized, but we want to know the vector types
>> earlier, such as in pattern recognition and while building SLP trees.
>> It's a bit of a chicken-and-egg problem...
>
> I don't understand why the number of bits in a vector is the key 
> information here?

Arguably it shouldn't be, and it's really just a proxy for the vector
(sub)architecture.  But this is "should be" vs. "is" :-)

> It would make sense if you were to say that the number of elements has 
> to be fixed in a given region, because obviously that's tied to loop 
> strides and such, but why the size?
>
> It seems like there is an architecture where you don't want to mix 
> instruction types (SSE vs. AVX?) and that makes sense for that 
> architecture, but if that's the case then we need to be able to turn it 
> off for other architectures.

It's not about trying to avoid mixing vector sizes: from what Jakub
said earlier in the year, even x86 wants to do that (but can't yet).
The idea is instead to try the available possibilities.

E.g. for AArch64 we want to try SVE, 128-bit Advanced SIMD and
64-bit Advanced SIMD.  With something like:

  int *ip;
  short *sp;
  for (int i = 0; i < n; ++i)
    ip[i] = sp[i];

there are three valid choices for Advanced SIMD:

(1) use 1 128-bit vector of sp and 2 128-bit vectors of ip
(2) use 1 64-bit vector of sp and 2 64-bit vectors of ip
(3) use 1 64-bit vector of sp and 1 128-bit vector of ip

At the moment we only try (1) and (2), but in practice, (3) should be
better than (2) in most cases.  I guess in some ways trying all three
would be best, but if we only try two, trying (1) and (3) is better
than trying (1) and (2).
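
(In vector-type terms, assuming the usual Advanced SIMD modes: (1) reads
one V8HI of sp and unpacks it into two V4SIs, (2) reads one V4HI and
unpacks it into two V2SIs, while (3) reads one V4HI and widens it into a
single V4SI, so the lane counts match without any packing.)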

For:

  for (int i = 0; i < n; ++i)
    ip[i] += 1;

there are two valid choices for Advanced SIMD:

(4) use 1 128-bit vector of ip
(5) use 1 64-bit vector of ip

The problem for the current autovec set-up is that the ip type for
64-bit Advanced SIMD varies between (3) and (5): for (3) it's a
128-bit vector type and for (5) it's a 64-bit vector type.
So the type we want for a given vector subarchitecture is partly
determined by the other types in the region: it isn't simply a
function of the subarchitecture and the element type.

This is why the current autovec code only supports (1), (2),
(4) and (5).  And I think this is essentially the same limitation
that you're hitting.

> For GCN, vectors are fully maskable, so we almost want such 
> considerations to be completely ignored.  We basically want it to act 
> like it can have any size vector it likes, up to 64 elements.

SVE is similar.  But even for SVE there's an equivalent trade-off
between (1) and (3):

(1') use 1 fully-populated vector for sp and 2 fully-populated
     vectors for ip
(3') use 1 half-populated vector for sp and 1 fully-populated
     vector for ip

Which is best for more complicated examples depends on the balance
between ip-based work and sp-based work.  The packing and unpacking
in (1') has a cost, but it would pay off if there was much more
sp work than ip work, since in that case (3') would spend most
of its time operating on partially-populated vectors.

Would the same be useful for GCN, or do you basically always
want a VF of 64?

None of this is a fundamental restriction in theory.  It's just
something that needs to be fixed.

One approach would be to get the loop vectoriser to iterate over the
number of lanes the target supports instead of all possible vector
sizes.  The problem is that on its own this would mean trying 4
lane counts even on targets with a single supported vector size.
So we'd need to do something a bit smarter...

Richard
Andrew Stubbs Sept. 18, 2018, 8:22 p.m. UTC | #4
On 18/09/18 12:21, Richard Sandiford wrote:
> Would the same be useful for GCN, or do you basically always
> want a VF of 64?

Always 64; the vector size varies between 512-bit and 4096-bit, as needed.

> None of this is a fundamental restriction in theory.  It's just
> something that needs to be fixed.
> 
> One approach would be to get the loop vectoriser to iterate over the
> number of lanes the target supports instead of all possible vector
> sizes.  The problem is that on its own this would mean trying 4
> lane counts even on targets with a single supported vector size.
> So we'd need to do something a bit smarter...

Yeah, that sounds like an interesting project, but way more than I think 
I need. Basically, we don't need to iterate over anything; there's only 
one option for each mode.

For the purposes of this patch, might it be enough to track down all the 
places that use the current_vector_size and fix them up, somehow?

Obviously, I'm not sure what that means just yet ...

Andrew
Richard Biener Sept. 19, 2018, 1:45 p.m. UTC | #5
On Tue, Sep 18, 2018 at 10:22 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 18/09/18 12:21, Richard Sandiford wrote:
> > Would the same be useful for GCN, or do you basically always
> > want a VF of 64?
>
> Always 64; the vector size varies between 512-bit and 4096-bit, as needed.
>
> > None of this is a fundamental restriction in theory.  It's just
> > something that needs to be fixed.
> >
> > One approach would be to get the loop vectoriser to iterate over the
> > number of lanes the target supports instead of all possible vector
> > sizes.  The problem is that on its own this would mean trying 4
> > lane counts even on targets with a single supported vector size.
> > So we'd need to do something a bit smarter...
>
> Yeah, that sounds like an interesting project, but way more than I think
> I need. Basically, we don't need to iterate over anything; there's only
> one option for each mode.
>
> For the purposes of this patch, might it be enough to track down all the
> places that use the current_vector_size and fix them up, somehow?
>
> Obviously, I'm not sure what that means just yet ...

I think the only part that wants a "fixed" size is the code iterating over
vector sizes.  All the rest of the code simply wants to commit to
a specific vector type for each DEF - to match the ISAs we've faced
so far, the approach is simply to choose the vector type of size
current_vector_size with the proper element type.

I've long wanted to fix that part so that we actually commit to vector types
later and compute the DEF vector type of a stmt by looking at the vector
types of the USEs and the operation.

So I guess the current_vector_size thing isn't too hard to get rid of; what
you'd end up with would be using that size when deciding on vector
types for loads (where there are no USEs with vector types, so for example
this would not apply to gathers).

So I'd say you want to refactor get_same_sized_vectype uses and
make the size argument to get_vectype_for_scalar_type_and_size
a hint only.

Richard.

>
> Andrew
Andrew Stubbs Sept. 28, 2018, 12:47 p.m. UTC | #6
On 19/09/18 14:45, Richard Biener wrote:
> So I guess the current_vector_size thing isn't too hard to get rid of; what
> you'd end up with would be using that size when deciding on vector
> types for loads (where there are no USEs with vector types, so for example
> this would not apply to gathers).

I've finally got back to looking at this ...

My patch works because current_vector_size is only referenced in two 
places. One is passed to get_vectype_for_scalar_type_and_size, and that 
function simply calls targetm.vectorize.preferred_simd_mode when the 
requested size is zero. The other is passed to build_truth_vector_type, 
which only uses it to call targetm.vectorize.get_mask_mode, and the GCN 
backend ignores the size parameter because it only has one option. 
Presumably other backends would object to a zero size mask.

So, as I said originally, the effect is that leaving current_vector_size 
zeroed means "always ask the backend".
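
In code terms that's roughly (paraphrasing the tree-vect-stmts.c of this
era, not an exact quote):

  /* Inside get_vectype_for_scalar_type_and_size:  */
  if (known_eq (size, 0U))
    simd_mode = targetm.vectorize.preferred_simd_mode (inner_mode);
  else if (!multiple_p (size, nbytes, &nunits)
           || !mode_for_vector (inner_mode, nunits).exists (&simd_mode))
    return NULL_TREE;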

Pretty much everything else chains off of those places using 
get_same_sized_vectype, so ignoring current_vector_size is safe on GCN, 
and might even be safe on other architectures?

> So I'd say you want to refactor get_same_sized_vectype uses and
> make the size argument to get_vectype_for_scalar_type_and_size
> a hint only.

I've looked through the uses of get_same_sized_vectype and I've come to 
the conclusion that many of them really mean it.

For example, vectorizable_bswap tries to reinterpret a vector register 
as a byte vector so that it can permute it. This is an optimization that 
won't work on GCN (because the vector registers don't work like that), 
but seems like a valid use of the vector size characteristic of other 
architectures.

For another example, vectorizable_conversion is targeting the 
vec_pack_trunc patterns, and therefore really does want to specify the 
types. Again, this isn't something we want to do on GCN (a regular trunc 
pattern with a vector mode will work fine).

However, vectorizable_operation seems to use it to try to match the 
input and output types to the same vector unit (i.e. vector size); at 
least that's my interpretation. It returns "not vectorizable" if the 
input and output vectors have different numbers of elements. For most 
operators the lhs and rhs types will be the same, so we're all good, but 
I imagine that this code will prevent TRUNC being vectorized on GCN 
because the "same size" vector doesn't exist, and it doesn't check if 
there's a vector with the same number of elements (I've not actually 
tried that, yet, and there may be extra magic elsewhere for that case, 
but YSWIM).

I don't think changing this case to a new "get_same_length_vectype" 
would be appropriate for many architectures, so I'm not sure what to do 
here?

We could fix this with new target hooks, perhaps?

TARGET_VECTORIZE_REINTERPRET_VECTOR (vectype_in, scalartype_out)

   Returns a new vectype (or mode) that uses the same vector register as
   vectype_in, but has elements of scalartype_out.

   The default implementation would be get_same_sized_vectype.

   GCN would just return NULL, because you can't do that kind of
   optimization.

TARGET_VECTORIZE_COMPATIBLE_VECTOR (opcode, vectype_in, scalartype_out)

   Returns a new vectype (or mode) that has the right number of elements
   for the opcode (i.e. the same number, or 2x for packed opcodes), and
   elements of scalartype_out.  The backend might choose a different
   vector size, but promises that hardware can do the operation (i.e.
   it's not mixing vector units).

   The default implementation would be get_same_sized_vectype, for
   backward compatibility.

   GCN would simply return V64xx according to scalartype_out, and NULL
   for unsupported opcodes.
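
For the defaults, a minimal sketch (relying on the existing helper's
signature, tree get_same_sized_vectype (tree, tree); the hook and function
names here are only proposals at this point):

  static tree
  default_reinterpret_vector (tree vectype_in, tree scalartype_out)
  {
    /* Fall back to today's same-size behaviour.  */
    return get_same_sized_vectype (scalartype_out, vectype_in);
  }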

Of course, none of this addresses the question of which vector size to 
choose in the first place. I've not figured out how it might ever start 
with a type other than the "preferred SIMD mode", yet.

Thoughts?

Andrew
Richard Biener Oct. 1, 2018, 8:04 a.m. UTC | #7
On Fri, Sep 28, 2018 at 2:47 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 19/09/18 14:45, Richard Biener wrote:
> > So I guess the current_vector_size thing isn't too hard to get rid of; what
> > you'd end up with would be using that size when deciding on vector
> > types for loads (where there are no USEs with vector types, so for example
> > this would not apply to gathers).
>
> I've finally got back to looking at this ...
>
> My patch works because current_vector_size is only referenced in two
> places. One is passed to get_vectype_for_scalar_type_and_size, and that
> function simply calls targetm.vectorize.preferred_simd_mode when the
> requested size is zero. The other is passed to build_truth_vector_type,
> which only uses it to call targetm.vectorize.get_mask_mode, and the GCN
> backend ignores the size parameter because it only has one option.
> Presumably other backends would object to a zero size mask.
>
> So, as I said originally, the effect is that leaving current_vector_size
> zeroed means "always ask the backend".

Yes.

> Pretty much everything else chains off of those places using
> get_same_sized_vectype, so ignoring current_vector_size is safe on GCN,
> and might even be safe on other architectures?

Other architectures really only use it when there's a choice, like
choosing between V4SI, V8SI and V16SI on x86_64.  current_vector_size
was introduced to be able to "iterate" over supported ISAs and let the
vectorizer decide which one to use in the end (SSE vs. AVX vs. AVX512).

The value of zero is simply to give the target another chance to set its
preferred value based on the first call.  I'd call that a bit awkward (*)

For architectures that only have a single "vector size" this variable
is really spurious and whether it is zero or non-zero doesn't make a difference.
Apart from your architecture of course where non-zero doesn't work ;)

(*) so one possibility would be to forgo the special value of zero
("auto-detect") and thus not change current_vector_size in
get_vectype_for_scalar_type at all.  For targets that report support for
multiple vector sizes, set current_vector_size to the preferred one in the
loop over vector sizes; for targets that do not, simply keep it at zero.
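
Roughly, inside the loop over vector sizes in vect_analyze_loop (a sketch
against the autovectorize_vector_sizes interface of this era; the exact
placement is hypothetical):

  auto_vector_sizes vector_sizes;
  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes);
  if (vector_sizes.length () > 1)
    /* Multi-size targets commit to the size currently being tried.  */
    current_vector_size = vector_sizes[next_size];
  else
    /* Single-size (or GCN-like) targets keep "always ask the backend".  */
    current_vector_size = 0;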

> > So I'd say you want to refactor get_same_sized_vectype uses and
> > make the size argument to get_vectype_for_scalar_type_and_size
> > a hint only.
>
> I've looked through the uses of get_same_sized_vectype and I've come to
> the conclusion that many of them really mean it.
>
> For example, vectorizable_bswap tries to reinterpret a vector register
> as a byte vector so that it can permute it. This is an optimization that
> won't work on GCN (because the vector registers don't work like that),
> but seems like a valid use of the vector size characteristic of other
> architectures.

True.

> For another example, vectorizable_conversion is targeting the
> vec_pack_trunc patterns, and therefore really does want to specify the
> types. Again, this isn't something we want to do on GCN (a regular trunc
> pattern with a vector mode will work fine).
>
> However, vectorizable_operation seems to use it to try to match the
> input and output types to the same vector unit (i.e. vector size); at
> least that's my interpretation. It returns "not vectorizable" if the
> input and output vectors have different numbers of elements. For most
> operators the lhs and rhs types will be the same, so we're all good, but
> I imagine that this code will prevent TRUNC being vectorized on GCN
> because the "same size" vector doesn't exist, and it doesn't check if
> there's a vector with the same number of elements (I've not actually
> tried that, yet, and there may be extra magic elsewhere for that case,
> but YSWIM).

Yeah, we don't have a get_vector_type_for_scalar_type_and_nelems
which would probably be semantically better in many places.
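
A minimal sketch of such a helper, assuming the GCC-internal APIs of this
era (the name comes from the discussion; no such function exists yet):

  static tree
  get_vector_type_for_scalar_type_and_nelems (tree scalar_type,
                                              poly_uint64 nelems)
  {
    scalar_mode inner_mode = SCALAR_TYPE_MODE (scalar_type);
    machine_mode simd_mode;
    if (!mode_for_vector (inner_mode, nelems).exists (&simd_mode)
        || !VECTOR_MODE_P (simd_mode))
      return NULL_TREE;
    return build_vector_type_for_mode (scalar_type, simd_mode);
  }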

> I don't think changing this case to a new "get_same_length_vectype"
> would be appropriate for many architectures, so I'm not sure what to do
> here?
>
> We could fix this with new target hooks, perhaps?
>
> TARGET_VECTORIZE_REINTERPRET_VECTOR (vectype_in, scalartype_out)
>
>    Returns a new vectype (or mode) that uses the same vector register as
>    vectype_in, but has elements of scalartype_out.
>
>    The default implementation would be get_same_sized_vectype.
>
>    GCN would just return NULL, because you can't do that kind of
>    optimization.
>
> TARGET_VECTORIZE_COMPATIBLE_VECTOR (opcode, vectype_in, scalartype_out)
>
>    Returns a new vectype (or mode) that has the right number of elements
>    for the opcode (i.e. the same number, or 2x for packed opcodes), and
>    elements of scalartype_out.  The backend might choose a different
>    vector size, but promises that hardware can do the operation (i.e.
>    it's not mixing vector units).
>
>    The default implementation would be get_same_sized_vectype, for
>    backward compatibility.
>
>    GCN would simply return V64xx according to scalartype_out, and NULL
>    for unsupported opcodes.

I don't like putting too much of the burden on the target here, given that
the vectorizer should know what kind of constraints it has: it implements
the vectorization on GIMPLE, which has IL constraints that are to be met -
we just need to ask for vector types with the appropriate constraints
rather than using same-size everywhere.

> Of course, none of this addresses the question of which vector size to
> choose in the first place.

See above for a suggestion.

> I've not figured out how it might ever start
> with a type other than the "preferred SIMD mode", yet.

In practically all cases vect_analyze_data_refs calling
get_vectype_for_scalar_type on a load will be the one nailing down
current_vector_size (if zero).  I also cannot quickly think of a case
where that would differ from "preferred SIMD mode" unless the target
simply lies to us here ;)

So, would a current_vector_size re-org as outlined above help you?  I agree
leaving it at zero should work unless there's code in the vectorizer that
is simply wrong.  Addressing some GCN issues with
get_vectype_for_scalar_type_and_nunits would also be OK with me (if that
works).

Thanks,
Richard.

> Thoughts?
>
> Andrew

Patch

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 607a2bd..8875201 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -9945,9 +9945,12 @@  get_vectype_for_scalar_type (tree scalar_type)
   tree vectype;
   vectype = get_vectype_for_scalar_type_and_size (scalar_type,
 						  current_vector_size);
+/* FIXME: use a proper target hook or macro.  */
+#ifndef TARGET_DISABLE_CURRENT_VECTOR_SIZE
   if (vectype
       && known_eq (current_vector_size, 0U))
     current_vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+#endif
   return vectype;
 }
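
Given the #ifndef above, a backend opting in to this (half-baked) scheme
would simply define the macro in its target headers, e.g.:

  /* In the backend's target header (hypothetical usage).  */
  #define TARGET_DISABLE_CURRENT_VECTOR_SIZE 1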