Message ID | mpt1romm94o.fsf@arm.com
---|---
State | New
Series | vect: Tweak vect_better_loop_vinfo_p handling of variable VFs
On Fri, Apr 17, 2020 at 5:21 PM Richard Sandiford <richard.sandiford@arm.com> wrote:
>
> This patch fixes a large lmbench performance regression with
> 128-bit SVE, compiled in length-agnostic mode.
>
> vect_better_loop_vinfo_p (new in GCC 10) tries to estimate whether
> a new loop_vinfo is cheaper than a previous one, with an in-built
> preference for the old one.  For variable VF it prefers the old
> loop_vinfo if it is cheaper for at least one VF.  However, we have
> no idea how likely that VF is in practice.
>
> Another extreme would be to do what most of the rest of the
> vectoriser does, and rely solely on the constant estimated VF.
> But as noted in the comment, this means that a one-unit cost
> difference would be enough to pick the new loop_vinfo,
> despite the target generally preferring the old loop_vinfo
> where possible.  The cost model just isn't accurate enough
> for that to produce good results as things stand: there might
> not be any practical benefit to the new loop_vinfo at the
> estimated VF, and it would be significantly worse for higher VFs.
>
> The patch instead goes for a hacky compromise: make sure that the new
> loop_vinfo is also no worse than the old loop_vinfo at double the
> estimated VF.  For all but trivial loops, this ensures that the
> new loop_vinfo is only chosen if it is better than the old one
> by a non-trivial amount at the estimated VF.  It also avoids
> putting too much faith in the VF estimate.
>
> I realise this isn't great, but it's supposed to be a conservative fix
> suitable for stage 4.  The only affected testcases are the ones for
> pr89007-*.c, where Advanced SIMD is indeed preferred for 128-bit SVE
> and is no worse for 256-bit SVE.
>
> Part of the problem here is that if the new loop_vinfo is better,
> we discard the old one and never consider using it even as an
> epilogue loop.  This means that if we choose Advanced SIMD over SVE,
> we're much more likely to have left-over scalar elements.
>
> Another is that the estimate provided by estimated_poly_value might have
> different probabilities attached.  E.g. when tuning for a particular core,
> the estimate is probably accurate, but when tuning for generic code,
> the estimate is more of a guess.  Relying solely on the estimate is
> probably correct for the former but not for the latter.
>
> Hopefully those are things that we could tackle in GCC 11.
>
> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
> OK to install?

OK.

Richard.

> Richard
>
>
> 2020-04-17  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
> 	* tree-vect-loop.c (vect_better_loop_vinfo_p): If old_loop_vinfo
> 	has a variable VF, prefer new_loop_vinfo if it is cheaper for the
> 	estimated VF and is no worse at double the estimated VF.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/sve/cost_model_8.c: New test.
> 	* gcc.target/aarch64/sve/cost_model_9.c: Likewise.
> 	* gcc.target/aarch64/sve/pr89007-1.c: Add -msve-vector-bits=512.
> 	* gcc.target/aarch64/sve/pr89007-2.c: Likewise.
> ---
>  .../gcc.target/aarch64/sve/cost_model_8.c | 12 +++++++
>  .../gcc.target/aarch64/sve/cost_model_9.c | 13 ++++++++
>  .../gcc.target/aarch64/sve/pr89007-1.c    |  2 +-
>  .../gcc.target/aarch64/sve/pr89007-2.c    |  2 +-
>  gcc/tree-vect-loop.c                      | 31 ++++++++++++++++++-
>  5 files changed, 57 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 265bcfdc5af..b6c3faeae51 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2414,7 +2414,36 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>    poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
>                               * poly_widest_int (new_vf));
>    if (maybe_lt (rel_old, rel_new))
> -    return false;
> +    {
> +      /* When old_loop_vinfo uses a variable vectorization factor,
> +         we know that it has a lower cost for at least one runtime VF.
> +         However, we don't know how likely that VF is.
> +
> +         One option would be to compare the costs for the estimated VFs.
> +         The problem is that that can put too much pressure on the cost
> +         model.  E.g. if the estimated VF is also the lowest possible VF,
> +         and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
> +         for the estimated VF, we'd then choose new_loop_vinfo even
> +         though (a) new_loop_vinfo might not actually be better than
> +         old_loop_vinfo for that VF and (b) it would be significantly
> +         worse at larger VFs.
> +
> +         Here we go for a hacky compromise: pick new_loop_vinfo if it is
> +         no more expensive than old_loop_vinfo even after doubling the
> +         estimated old_loop_vinfo VF.  For all but trivial loops, this
> +         ensures that we only pick new_loop_vinfo if it is significantly
> +         better than old_loop_vinfo at the estimated VF.  */
> +      if (rel_new.is_constant ())
> +        return false;
> +
> +      HOST_WIDE_INT new_estimated_vf = estimated_poly_value (new_vf);
> +      HOST_WIDE_INT old_estimated_vf = estimated_poly_value (old_vf);
> +      widest_int estimated_rel_new = (new_loop_vinfo->vec_inside_cost
> +                                      * widest_int (old_estimated_vf));
> +      widest_int estimated_rel_old = (old_loop_vinfo->vec_inside_cost
> +                                      * widest_int (new_estimated_vf));
> +      return estimated_rel_new * 2 <= estimated_rel_old;
> +    }
>    if (known_lt (rel_new, rel_old))
>      return true;
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
> new file mode 100644
> index 00000000000..80c3a23e18a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
> @@ -0,0 +1,12 @@
> +/* { dg-options "-O3 -msve-vector-bits=scalable" } */
> +
> +void
> +vset (int *restrict dst, int *restrict src, int count)
> +{
> +  for (int i = 0; i < count; ++i)
> +#pragma GCC unroll 4
> +    for (int j = 0; j < 4; ++j)
> +      *dst++ = 1;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tst1w\tz} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
> new file mode 100644
> index 00000000000..e7a1bac3c83
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
> @@ -0,0 +1,13 @@
> +/* { dg-options "-O3 -msve-vector-bits=scalable" } */
> +
> +void
> +vset (int *restrict dst, int *restrict src, int count)
> +{
> +  for (int i = 0; i < count; ++i)
> +#pragma GCC unroll 8
> +    for (int j = 0; j < 8; ++j)
> +      *dst++ = 1;
> +}
> +
> +/* { dg-final { scan-assembler-not {\tst1w\tz} } } */
> +/* { dg-final { scan-assembler-times {\tstp\tq} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> index af4aff4ec6d..ff9550c9109 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do assemble { target aarch64_asm_sve_ok } } */
> -/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
> +/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve -msve-vector-bits=512 --save-temps" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>
>  #define N 1024
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> index 2ccdd0d353e..da345fe8bd6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do assemble { target aarch64_asm_sve_ok } } */
> -/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
> +/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve -msve-vector-bits=512 --save-temps" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>
>  #define N 1024