[0/6,RFC] Relax single-vector-size restriction

Message ID n083282q-0soo-43p0-4qr7-0qos15o313p8@fhfr.qr

Richard Biener Dec. 13, 2023, 12:30 p.m. UTC
I've been asked to look into how to best relax the current restriction
of the vectorizer that it prefers to use a single vector size throughout
loop vectorization.  That size is determined by the preferred_simd_mode
hook and, for other-than-first iterations, by the autovectorize_vector_modes
hook.
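To make that coupling concrete, here is a standalone toy sketch (plain C,
not vectorizer code; the 64/32/16 byte candidates are roughly what an
AVX512-capable x86 target advertises): with a single vector size, every
vector type's lane count follows from its element size, so in a loop
mixing 1-byte and 8-byte elements the wider type is stuck with ncopies > 1.

#include <stdio.h>

int
main (void)
{
  /* Candidate vector sizes in bytes, largest first.  */
  const int sizes[3] = { 64, 32, 16 };

  for (int i = 0; i < 3; ++i)
    {
      int size = sizes[i];
      int qi_lanes = size / 1;   /* lanes for a 1-byte element */
      int df_lanes = size / 8;   /* lanes for an 8-byte element */
      /* With a single size the VF ends up following the narrowest element
         and the wider element needs multiple vector stmts per iteration.  */
      printf ("%2d bytes: %2d x QI, %d x DF, DF ncopies = %d\n",
              size, qi_lanes, df_lanes, qi_lanes / df_lanes);
    }
  return 0;
}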

The target does have some leeway with its related_mode hook, as can be
seen in the aarch64 backend which has a "hack" preferring
"1 128-bit vector instead of 2 64-bit vectors" (for ADVSIMD).
Incidentally that hack allows it to vectorize gcc.dg/vect/pr65947-7.c,
which uses a condition reduction that is generally unhappy about the
ncopies > 1 case.

The first roadblock you hit when trying to relax things is that we
assign vector types very early - during data reference analysis and
pattern matching, and then for the rest of the stmts as part of
determining the vectorization factor.

The patch series starts pushing that back (with some exceptions - it's
a proof-of-concept), trying to get us to the point of determining
the vectorization factor first and only after that assigning vector
types (with that VF as one of the constraints).  In particular the
patch tries to avoid altering the VF choice while we're still iterating
over the SIMD modes (iterating over { VF, mode } pairs, where 'mode'
would be VLA or VLS, might be a future improvement).

Apart from gcc.dg/vect/pr65947-7.c, which I'd like to see vectorized on
x86_64, there is a motivational testcase like

double x[1024];
char y[1024];
void foo ()
{
  for (int i = 0; i < 16; ++i)
    {
      x[i] = i;
      y[i] = i;
    }
}

where the iteration domain constrains the VF and we currently end
up vectorizing this with SSE vectors, causing 8 vector stores to x[]
even when AVX2 or AVX512 widths would be available there.
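The arithmetic behind those numbers, as a toy sketch (mine, not
vectorizer output): the 16-iteration loop caps the VF at 16, and the
number of vector stores per array is just the VF divided by the lane
count of the vector type used for that array.

#include <stdio.h>

/* Number of vector stores needed for one array when the loop is
   vectorized with the given VF and a vector type of 'lanes' lanes.  */
static int
stores_needed (int vf, int lanes)
{
  return vf / lanes;
}

int
main (void)
{
  const int vf = 16;   /* capped by the 16-iteration loop */

  /* Single-size (SSE) choice: V16QI for y[], V2DF for x[].  */
  printf ("SSE only:     y[] %d store(s), x[] %d store(s)\n",
          stores_needed (vf, 16), stores_needed (vf, 2));

  /* Mixed choice: V16QI for y[], but V8DF (AVX512) for x[].  */
  printf ("mixed widths: y[] %d store(s), x[] %d store(s)\n",
          stores_needed (vf, 16), stores_needed (vf, 8));
  return 0;
}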

After a lot of different experiments I finally settled on the following
minimal-impact solution: after determining the VF we assign vector
types, but allow modes larger than the currently active vector mode,
up to the size of the preferred_simd_mode, when that stays within
the constraint imposed by the VF.  For the second example above on x86
with -march=znver4 we then fail to vectorize with V64QImode
(AVX512, the preferred_simd_mode) and with V32QImode (AVX2) because
of the low iteration count, but we succeed with V16QImode (SSE, as
with current GCC) and are then able to choose V8DFmode for the accesses
to x[] (AVX512, the preferred_simd_mode).  The condition reduction
case works in a similar way - with just SSE we succeed with V4HImode
but use V4SImode for the condition, keeping ncopies == 1 and making
the vectorizer happy.
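Roughly, the selection rule amounts to the following toy sketch (not
the actual patch code): starting from the vector size the current mode
iteration succeeded with, allow any larger size up to the
preferred_simd_mode size as long as the lane count still stays within,
and divides, the already-fixed VF.

#include <stdio.h>

/* Toy version of the rule: given the fixed VF, the element size in
   bytes, the vector size the current mode iteration is using and the
   size of the preferred SIMD mode, return the vector size (in bytes)
   to use for this particular access.  */
static int
pick_vector_size (int vf, int elem_size, int current_size, int preferred_size)
{
  int best = current_size;
  for (int size = current_size; size <= preferred_size; size *= 2)
    {
      int lanes = size / elem_size;
      /* Only sizes whose lane count stays within (and divides) the
         already-fixed VF are acceptable.  */
      if (lanes <= vf && vf % lanes == 0)
        best = size;
    }
  return best;
}

int
main (void)
{
  /* znver4-like setup: preferred size 64 bytes, but the mode iteration
     only succeeded with 16-byte vectors and VF == 16.  */
  printf ("char accesses:   %d-byte vectors\n",
          pick_vector_size (16, 1, 16, 64));
  printf ("double accesses: %d-byte vectors\n",
          pick_vector_size (16, 8, 16, 64));
  return 0;
}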

The patch series prototypes this for non-SLP loop vectorization
(because the testcases above do not use SLP) - the prototype doesn't
pass testing and I won't pursue this further until we get rid of
the non-SLP path.

The series starts with some cleanups that might still be applicable,
reducing calls to get_vectype_for_scalar_type where the
vector types should be known already (all of the constant/external
def kinds will go away with SLP-only anyway).  Then, because my first
attempt was to vary the VF, the series makes LOOP_VINFO_VECT_FACTOR
an rvalue to make sure we do not rely on its value anywhere before
it is really final.
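(The "rvalue" part is just making the accessor yield a value instead of
the underlying field; a minimal illustration of the idea with
hypothetical names, not the actual change:)

#include <assert.h>

struct loop_info { int vectorization_factor; };

/* lvalue accessor: both reads and (accidental) writes compile.  */
#define VF_LVALUE(L) ((L)->vectorization_factor)
/* rvalue accessor: the unary '+' makes assignment through it an error.  */
#define VF_RVALUE(L) (+(L)->vectorization_factor)

int
main (void)
{
  struct loop_info li = { 4 };
  assert (VF_RVALUE (&li) == 4);   /* reading still works */
  /* VF_RVALUE (&li) = 8;  -- would no longer compile */
  VF_LVALUE (&li) = 8;             /* the old way allowed this */
  assert (VF_LVALUE (&li) == 8);
  return 0;
}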

Gathers/scatters also complicate matters right now since we're
analyzing them very early (and that analysis needs a vector type),
but the actual offset def we need to mark relevant is tightly
coupled with the vector type chosen for it (and what the target
actually supports).  That's going to be tricky.  I also noticed
that we might no longer need the gather/scatter pattern support
as SLP can handle them without the IFNs(?).  There is some general
API cleanup wrt unsigned vs. poly-uint, and finally the last
patch in the series defers setting STMT_VINFO_VECTYPE (with
exceptions, as said above) and has a cobbled-up loop to assign
vector types after the VF is determined, using the scheme
described above.

There are complications around mask types, so the patch goes one
step further and makes vectorizable_operation determine
the vector type of the def from the vector types of the
operands.  I think that in the end we want to "force" as few
vector types as possible and perform upward/downward propagation
from within vectorizable_*, which would need a new mode of
operation for this (figure out either output or input vector types
from what is present, possibly signaling DEFER and queuing
either the uses of the output or the fixed inputs for further
processing).
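To make the deferral idea a bit more concrete, here is a minimal
standalone toy (only the downward direction, and nothing like the real
vectorizable_* interfaces): each statement either derives its result
type from already-known operand types or signals DEFER and is retried
once other statements have filled in more types.

#include <stdio.h>

enum status { RESOLVED, DEFER };

/* Toy SSA values; a vector type of 0 means "not determined yet".  */
static int vectype[4] = { 16, 16, 0, 0 };

/* Toy statements: result = op (op1, op2), referring to value numbers.  */
struct toy_stmt { int result, op1, op2; };
static struct toy_stmt stmts[2] = {
  { 3, 2, 2 },   /* _3 = op (_2, _2) - operand type unknown at first */
  { 2, 0, 1 },   /* _2 = op (_0, _1) - operand types already fixed */
};

/* Toy version of the proposed mode of operation: derive the result
   type from the operand types if those are known, otherwise ask to be
   queued again.  */
static enum status
resolve (struct toy_stmt *s)
{
  if (vectype[s->op1] && vectype[s->op2])
    {
      vectype[s->result] = vectype[s->op1];
      return RESOLVED;
    }
  return DEFER;
}

int
main (void)
{
  int pending = 2;
  /* Worklist: keep iterating over still-unresolved statements.  */
  while (pending)
    {
      int progress = 0;
      for (int i = 0; i < 2; ++i)
        if (vectype[stmts[i].result] == 0 && resolve (&stmts[i]) == RESOLVED)
          {
            progress = 1;
            pending--;
          }
      if (!progress)
        break;   /* no forward progress would mean vectorization fails */
    }
  for (int i = 0; i < 4; ++i)
    printf ("_%d: vector type size %d\n", i, vectype[i]);
  return 0;
}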

I'd like to get some feedback on the way I chose to wire the
new flexibility into the existing mode iteration and whether
that's sound for both SVE and NEON, or whether any of you have
concerns about it or ideas for how to exploit such
flexibility differently.

Thanks,
Richard.