Message ID | 5BE565CE.5000709@foss.arm.com |
---|---|
State | New |
Headers | show |
Series | [cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64 | expand |
On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > Hi all, > > In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > > fully_peel_me: > mov x1, 5 > ptrue p1.d, all > whilelo p0.d, xzr, x1 > ld1d z0.d, p0/z, [x0] > fadd z0.d, z0.d, z0.d > st1d z0.d, p0, [x0] > cntd x2 > addvl x3, x0, #1 > whilelo p0.d, x2, x1 > beq .L1 > ld1d z0.d, p0/z, [x0, #1, mul vl] > fadd z0.d, z0.d, z0.d > st1d z0.d, p0, [x3] > cntw x2 > incb x0, all, mul #2 > whilelo p0.d, x2, x1 > beq .L1 > ld1d z0.d, p0/z, [x0] > fadd z0.d, z0.d, z0.d > st1d z0.d, p0, [x0] > .L1: > ret > > In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > and hurts icache performance. > > This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > the branches. > > So for the above testcase we generate now: > fully_peel_me: > mov x2, 5 > mov x3, x2 > mov x1, 0 > whilelo p0.d, xzr, x2 > ptrue p1.d, all > .L2: > ld1d z0.d, p0/z, [x0, x1, lsl 3] > fadd z0.d, z0.d, z0.d > st1d z0.d, p0, [x0, x1, lsl 3] > incd x1 > whilelo p0.d, x1, x3 > bne .L2 > ret > > Not perfect still, but it's preferable to the original code. > The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > (until other target people experiment with it and set it, if appropriate). > > Bootstrapped and tested on aarch64-none-linux-gnu. > Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > > Ok for trunk? Hum. Why introduce a new --param and not simply key on flag_peel_loops instead? That is enabled by default at -O3 and with FDO but you of course can control that in your targets post-option-processing hook. It might also make sense to have more fine-grained control for this and allow a target to say whether it wants to peel a specific loop or not when the middle-end thinks that would be profitable. Richard. > Thanks, > Kyrill > > > 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > * params.def (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY): Define. > * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Use above to > disable unrolling on unknown iteration count. > * config/aarch64/aarch64.c (aarch64_override_options_internal): Set > PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY to 1. > * doc/invoke.texi (--param unroll-known-loop-iterations-only): > Document. > > 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > * gcc.target/aarch64/sve/unroll-1.c: New test. >
On 09/11/18 12:18, Richard Biener wrote: > On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> >> Hi all, >> >> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: >> >> fully_peel_me: >> mov x1, 5 >> ptrue p1.d, all >> whilelo p0.d, xzr, x1 >> ld1d z0.d, p0/z, [x0] >> fadd z0.d, z0.d, z0.d >> st1d z0.d, p0, [x0] >> cntd x2 >> addvl x3, x0, #1 >> whilelo p0.d, x2, x1 >> beq .L1 >> ld1d z0.d, p0/z, [x0, #1, mul vl] >> fadd z0.d, z0.d, z0.d >> st1d z0.d, p0, [x3] >> cntw x2 >> incb x0, all, mul #2 >> whilelo p0.d, x2, x1 >> beq .L1 >> ld1d z0.d, p0/z, [x0] >> fadd z0.d, z0.d, z0.d >> st1d z0.d, p0, [x0] >> .L1: >> ret >> >> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. >> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size >> and hurts icache performance. >> >> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration >> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some >> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate >> the branches. >> >> So for the above testcase we generate now: >> fully_peel_me: >> mov x2, 5 >> mov x3, x2 >> mov x1, 0 >> whilelo p0.d, xzr, x2 >> ptrue p1.d, all >> .L2: >> ld1d z0.d, p0/z, [x0, x1, lsl 3] >> fadd z0.d, z0.d, z0.d >> st1d z0.d, p0, [x0, x1, lsl 3] >> incd x1 >> whilelo p0.d, x1, x3 >> bne .L2 >> ret >> >> Not perfect still, but it's preferable to the original code. >> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged >> (until other target people experiment with it and set it, if appropriate). >> >> Bootstrapped and tested on aarch64-none-linux-gnu. >> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. >> >> Ok for trunk? > > Hum. Why introduce a new --param and not simply key on > flag_peel_loops instead? That is > enabled by default at -O3 and with FDO but you of course can control > that in your targets > post-option-processing hook. You mean like this? It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) But I suppose it's a reasonable change. > > It might also make sense to have more fine-grained control for this > and allow a target > to say whether it wants to peel a specific loop or not when the > middle-end thinks that > would be profitable. Can be worth looking at as a follow-up. Do you envisage the target analysing the gimple statements of the loop to figure out its cost? Thanks, Kyrill 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll loop when number of iterations is not known and flag_peel_loops is in effect. 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> * gcc.target/aarch64/sve/unroll-1.c: New test. diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c new file mode 100644 index 0000000000000000000000000000000000000000..7f53d20cbf8e18a4389b86c037f56f024bac22a5 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O3" } */ + +/* Check that simple loop is not fully unrolled. */ + +void +fully_peel_me (double *x) +{ + for (int i = 0; i < 5; i++) + x[i] = x[i] * 2; +} + +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[.+]\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[.+\]\n} 1 } } */ diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c index c2953059fb9218d4bc4cf12fe9277a552b4a04bd..daeddb384254775d6482ed5580e8262e4b26a87f 100644 --- a/gcc/tree-ssa-loop-ivcanon.c +++ b/gcc/tree-ssa-loop-ivcanon.c @@ -883,6 +883,16 @@ try_unroll_loop_completely (struct loop *loop, loop->num); return false; } + else if (TREE_CODE (niter) == SCEV_NOT_KNOWN + && flag_peel_loops) + { + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "Not unrolling loop %d: " + "exact number of iterations not known " + "(-fpeel-loops).\n", + loop->num); + return false; + } } initialize_original_copy_tables ();
On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > On 09/11/18 12:18, Richard Biener wrote: > > On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > > <kyrylo.tkachov@foss.arm.com> wrote: > >> > >> Hi all, > >> > >> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > >> > >> fully_peel_me: > >> mov x1, 5 > >> ptrue p1.d, all > >> whilelo p0.d, xzr, x1 > >> ld1d z0.d, p0/z, [x0] > >> fadd z0.d, z0.d, z0.d > >> st1d z0.d, p0, [x0] > >> cntd x2 > >> addvl x3, x0, #1 > >> whilelo p0.d, x2, x1 > >> beq .L1 > >> ld1d z0.d, p0/z, [x0, #1, mul vl] > >> fadd z0.d, z0.d, z0.d > >> st1d z0.d, p0, [x3] > >> cntw x2 > >> incb x0, all, mul #2 > >> whilelo p0.d, x2, x1 > >> beq .L1 > >> ld1d z0.d, p0/z, [x0] > >> fadd z0.d, z0.d, z0.d > >> st1d z0.d, p0, [x0] > >> .L1: > >> ret > >> > >> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > >> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > >> and hurts icache performance. > >> > >> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > >> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > >> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > >> the branches. > >> > >> So for the above testcase we generate now: > >> fully_peel_me: > >> mov x2, 5 > >> mov x3, x2 > >> mov x1, 0 > >> whilelo p0.d, xzr, x2 > >> ptrue p1.d, all > >> .L2: > >> ld1d z0.d, p0/z, [x0, x1, lsl 3] > >> fadd z0.d, z0.d, z0.d > >> st1d z0.d, p0, [x0, x1, lsl 3] > >> incd x1 > >> whilelo p0.d, x1, x3 > >> bne .L2 > >> ret > >> > >> Not perfect still, but it's preferable to the original code. > >> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > >> (until other target people experiment with it and set it, if appropriate). > >> > >> Bootstrapped and tested on aarch64-none-linux-gnu. > >> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > >> > >> Ok for trunk? > > > > Hum. Why introduce a new --param and not simply key on > > flag_peel_loops instead? That is > > enabled by default at -O3 and with FDO but you of course can control > > that in your targets > > post-option-processing hook. > > You mean like this? > It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) > But I suppose it's a reasonable change. No, that change is backward. What I said is that peeling is already conditional on flag_peel_loops and that is enabled by -O3. So you want to disable flag_peel_loops for SVE instead in the target. > > > > It might also make sense to have more fine-grained control for this > > and allow a target > > to say whether it wants to peel a specific loop or not when the > > middle-end thinks that > > would be profitable. > > Can be worth looking at as a follow-up. Do you envisage the target analysing > the gimple statements of the loop to figure out its cost? Kind-of. Sth like bool targetm.peel_loop (struct loop *); I have no idea whether you can easily detect a SVE vectorized loop though. Maybe there's always a special IV or so (the mask?) Richard. > Thanks, > Kyrill > > > 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll > loop when number of iterations is not known and flag_peel_loops is in > effect. > > 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > * gcc.target/aarch64/sve/unroll-1.c: New test. >
On 12/11/18 14:10, Richard Biener wrote: > On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> On 09/11/18 12:18, Richard Biener wrote: >>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov >>> <kyrylo.tkachov@foss.arm.com> wrote: >>>> Hi all, >>>> >>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: >>>> >>>> fully_peel_me: >>>> mov x1, 5 >>>> ptrue p1.d, all >>>> whilelo p0.d, xzr, x1 >>>> ld1d z0.d, p0/z, [x0] >>>> fadd z0.d, z0.d, z0.d >>>> st1d z0.d, p0, [x0] >>>> cntd x2 >>>> addvl x3, x0, #1 >>>> whilelo p0.d, x2, x1 >>>> beq .L1 >>>> ld1d z0.d, p0/z, [x0, #1, mul vl] >>>> fadd z0.d, z0.d, z0.d >>>> st1d z0.d, p0, [x3] >>>> cntw x2 >>>> incb x0, all, mul #2 >>>> whilelo p0.d, x2, x1 >>>> beq .L1 >>>> ld1d z0.d, p0/z, [x0] >>>> fadd z0.d, z0.d, z0.d >>>> st1d z0.d, p0, [x0] >>>> .L1: >>>> ret >>>> >>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. >>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size >>>> and hurts icache performance. >>>> >>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration >>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some >>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate >>>> the branches. >>>> >>>> So for the above testcase we generate now: >>>> fully_peel_me: >>>> mov x2, 5 >>>> mov x3, x2 >>>> mov x1, 0 >>>> whilelo p0.d, xzr, x2 >>>> ptrue p1.d, all >>>> .L2: >>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] >>>> fadd z0.d, z0.d, z0.d >>>> st1d z0.d, p0, [x0, x1, lsl 3] >>>> incd x1 >>>> whilelo p0.d, x1, x3 >>>> bne .L2 >>>> ret >>>> >>>> Not perfect still, but it's preferable to the original code. >>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged >>>> (until other target people experiment with it and set it, if appropriate). >>>> >>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. >>>> >>>> Ok for trunk? >>> Hum. Why introduce a new --param and not simply key on >>> flag_peel_loops instead? That is >>> enabled by default at -O3 and with FDO but you of course can control >>> that in your targets >>> post-option-processing hook. >> You mean like this? >> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) >> But I suppose it's a reasonable change. > No, that change is backward. What I said is that peeling is already > conditional on > flag_peel_loops and that is enabled by -O3. So you want to disable > flag_peel_loops for > SVE instead in the target. Sorry, I got confused by the similarly named functions. I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? Thanks, Kyrill >>> It might also make sense to have more fine-grained control for this >>> and allow a target >>> to say whether it wants to peel a specific loop or not when the >>> middle-end thinks that >>> would be profitable. >> Can be worth looking at as a follow-up. Do you envisage the target analysing >> the gimple statements of the loop to figure out its cost? > Kind-of. Sth like > > bool targetm.peel_loop (struct loop *); > > I have no idea whether you can easily detect a SVE vectorized loop though. > Maybe there's always a special IV or so (the mask?) > > Richard. > >> Thanks, >> Kyrill >> >> >> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >> >> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll >> loop when number of iterations is not known and flag_peel_loops is in >> effect. >> >> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >> >> * gcc.target/aarch64/sve/unroll-1.c: New test. >>
On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > > On 12/11/18 14:10, Richard Biener wrote: > > On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > > <kyrylo.tkachov@foss.arm.com> wrote: > >> On 09/11/18 12:18, Richard Biener wrote: > >>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > >>> <kyrylo.tkachov@foss.arm.com> wrote: > >>>> Hi all, > >>>> > >>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > >>>> > >>>> fully_peel_me: > >>>> mov x1, 5 > >>>> ptrue p1.d, all > >>>> whilelo p0.d, xzr, x1 > >>>> ld1d z0.d, p0/z, [x0] > >>>> fadd z0.d, z0.d, z0.d > >>>> st1d z0.d, p0, [x0] > >>>> cntd x2 > >>>> addvl x3, x0, #1 > >>>> whilelo p0.d, x2, x1 > >>>> beq .L1 > >>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > >>>> fadd z0.d, z0.d, z0.d > >>>> st1d z0.d, p0, [x3] > >>>> cntw x2 > >>>> incb x0, all, mul #2 > >>>> whilelo p0.d, x2, x1 > >>>> beq .L1 > >>>> ld1d z0.d, p0/z, [x0] > >>>> fadd z0.d, z0.d, z0.d > >>>> st1d z0.d, p0, [x0] > >>>> .L1: > >>>> ret > >>>> > >>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > >>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > >>>> and hurts icache performance. > >>>> > >>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > >>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > >>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > >>>> the branches. > >>>> > >>>> So for the above testcase we generate now: > >>>> fully_peel_me: > >>>> mov x2, 5 > >>>> mov x3, x2 > >>>> mov x1, 0 > >>>> whilelo p0.d, xzr, x2 > >>>> ptrue p1.d, all > >>>> .L2: > >>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > >>>> fadd z0.d, z0.d, z0.d > >>>> st1d z0.d, p0, [x0, x1, lsl 3] > >>>> incd x1 > >>>> whilelo p0.d, x1, x3 > >>>> bne .L2 > >>>> ret > >>>> > >>>> Not perfect still, but it's preferable to the original code. > >>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > >>>> (until other target people experiment with it and set it, if appropriate). > >>>> > >>>> Bootstrapped and tested on aarch64-none-linux-gnu. > >>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > >>>> > >>>> Ok for trunk? > >>> Hum. Why introduce a new --param and not simply key on > >>> flag_peel_loops instead? That is > >>> enabled by default at -O3 and with FDO but you of course can control > >>> that in your targets > >>> post-option-processing hook. > >> You mean like this? > >> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) > >> But I suppose it's a reasonable change. > > No, that change is backward. What I said is that peeling is already > > conditional on > > flag_peel_loops and that is enabled by -O3. So you want to disable > > flag_peel_loops for > > SVE instead in the target. > > Sorry, I got confused by the similarly named functions. > I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass > (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. Well, peeling gets disabled. From your patch I see you want to disable "unrolling" when the number of loop iteration is not constant. That is called peeling where we need to emit the loop exit test N times. Did you check your testcases with -fno-peel-loops? > try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. > Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? > > Thanks, > Kyrill > > >>> It might also make sense to have more fine-grained control for this > >>> and allow a target > >>> to say whether it wants to peel a specific loop or not when the > >>> middle-end thinks that > >>> would be profitable. > >> Can be worth looking at as a follow-up. Do you envisage the target analysing > >> the gimple statements of the loop to figure out its cost? > > Kind-of. Sth like > > > > bool targetm.peel_loop (struct loop *); > > > > I have no idea whether you can easily detect a SVE vectorized loop though. > > Maybe there's always a special IV or so (the mask?) > > > > Richard. > > > >> Thanks, > >> Kyrill > >> > >> > >> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > >> > >> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll > >> loop when number of iterations is not known and flag_peel_loops is in > >> effect. > >> > >> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > >> > >> * gcc.target/aarch64/sve/unroll-1.c: New test. > >> >
Hi Richard, On 13/11/18 08:24, Richard Biener wrote: > On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> >> On 12/11/18 14:10, Richard Biener wrote: >>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov >>> <kyrylo.tkachov@foss.arm.com> wrote: >>>> On 09/11/18 12:18, Richard Biener wrote: >>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov >>>>> <kyrylo.tkachov@foss.arm.com> wrote: >>>>>> Hi all, >>>>>> >>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: >>>>>> >>>>>> fully_peel_me: >>>>>> mov x1, 5 >>>>>> ptrue p1.d, all >>>>>> whilelo p0.d, xzr, x1 >>>>>> ld1d z0.d, p0/z, [x0] >>>>>> fadd z0.d, z0.d, z0.d >>>>>> st1d z0.d, p0, [x0] >>>>>> cntd x2 >>>>>> addvl x3, x0, #1 >>>>>> whilelo p0.d, x2, x1 >>>>>> beq .L1 >>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] >>>>>> fadd z0.d, z0.d, z0.d >>>>>> st1d z0.d, p0, [x3] >>>>>> cntw x2 >>>>>> incb x0, all, mul #2 >>>>>> whilelo p0.d, x2, x1 >>>>>> beq .L1 >>>>>> ld1d z0.d, p0/z, [x0] >>>>>> fadd z0.d, z0.d, z0.d >>>>>> st1d z0.d, p0, [x0] >>>>>> .L1: >>>>>> ret >>>>>> >>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. >>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size >>>>>> and hurts icache performance. >>>>>> >>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration >>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some >>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate >>>>>> the branches. >>>>>> >>>>>> So for the above testcase we generate now: >>>>>> fully_peel_me: >>>>>> mov x2, 5 >>>>>> mov x3, x2 >>>>>> mov x1, 0 >>>>>> whilelo p0.d, xzr, x2 >>>>>> ptrue p1.d, all >>>>>> .L2: >>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] >>>>>> fadd z0.d, z0.d, z0.d >>>>>> st1d z0.d, p0, [x0, x1, lsl 3] >>>>>> incd x1 >>>>>> whilelo p0.d, x1, x3 >>>>>> bne .L2 >>>>>> ret >>>>>> >>>>>> Not perfect still, but it's preferable to the original code. >>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged >>>>>> (until other target people experiment with it and set it, if appropriate). >>>>>> >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. >>>>>> >>>>>> Ok for trunk? >>>>> Hum. Why introduce a new --param and not simply key on >>>>> flag_peel_loops instead? That is >>>>> enabled by default at -O3 and with FDO but you of course can control >>>>> that in your targets >>>>> post-option-processing hook. >>>> You mean like this? >>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) >>>> But I suppose it's a reasonable change. >>> No, that change is backward. What I said is that peeling is already >>> conditional on >>> flag_peel_loops and that is enabled by -O3. So you want to disable >>> flag_peel_loops for >>> SVE instead in the target. >> Sorry, I got confused by the similarly named functions. >> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass >> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. > Well, peeling gets disabled. From your patch I see you want to > disable "unrolling" when > the number of loop iteration is not constant. That is called peeling > where we need to > emit the loop exit test N times. > > Did you check your testcases with -fno-peel-loops? -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) can be called through two paths, only one of which is gated on flag_peel_loops. Thanks, Kyrill >> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. >> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? >> >> Thanks, >> Kyrill >> >>>>> It might also make sense to have more fine-grained control for this >>>>> and allow a target >>>>> to say whether it wants to peel a specific loop or not when the >>>>> middle-end thinks that >>>>> would be profitable. >>>> Can be worth looking at as a follow-up. Do you envisage the target analysing >>>> the gimple statements of the loop to figure out its cost? >>> Kind-of. Sth like >>> >>> bool targetm.peel_loop (struct loop *); >>> >>> I have no idea whether you can easily detect a SVE vectorized loop though. >>> Maybe there's always a special IV or so (the mask?) >>> >>> Richard. >>> >>>> Thanks, >>>> Kyrill >>>> >>>> >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >>>> >>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll >>>> loop when number of iterations is not known and flag_peel_loops is in >>>> effect. >>>> >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >>>> >>>> * gcc.target/aarch64/sve/unroll-1.c: New test. >>>>
On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > Hi Richard, > > On 13/11/18 08:24, Richard Biener wrote: > > On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > > <kyrylo.tkachov@foss.arm.com> wrote: > >> > >> On 12/11/18 14:10, Richard Biener wrote: > >>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > >>> <kyrylo.tkachov@foss.arm.com> wrote: > >>>> On 09/11/18 12:18, Richard Biener wrote: > >>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > >>>>> <kyrylo.tkachov@foss.arm.com> wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > >>>>>> > >>>>>> fully_peel_me: > >>>>>> mov x1, 5 > >>>>>> ptrue p1.d, all > >>>>>> whilelo p0.d, xzr, x1 > >>>>>> ld1d z0.d, p0/z, [x0] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x0] > >>>>>> cntd x2 > >>>>>> addvl x3, x0, #1 > >>>>>> whilelo p0.d, x2, x1 > >>>>>> beq .L1 > >>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x3] > >>>>>> cntw x2 > >>>>>> incb x0, all, mul #2 > >>>>>> whilelo p0.d, x2, x1 > >>>>>> beq .L1 > >>>>>> ld1d z0.d, p0/z, [x0] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x0] > >>>>>> .L1: > >>>>>> ret > >>>>>> > >>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > >>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > >>>>>> and hurts icache performance. > >>>>>> > >>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > >>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > >>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > >>>>>> the branches. > >>>>>> > >>>>>> So for the above testcase we generate now: > >>>>>> fully_peel_me: > >>>>>> mov x2, 5 > >>>>>> mov x3, x2 > >>>>>> mov x1, 0 > >>>>>> whilelo p0.d, xzr, x2 > >>>>>> ptrue p1.d, all > >>>>>> .L2: > >>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x0, x1, lsl 3] > >>>>>> incd x1 > >>>>>> whilelo p0.d, x1, x3 > >>>>>> bne .L2 > >>>>>> ret > >>>>>> > >>>>>> Not perfect still, but it's preferable to the original code. > >>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > >>>>>> (until other target people experiment with it and set it, if appropriate). > >>>>>> > >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. > >>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > >>>>>> > >>>>>> Ok for trunk? > >>>>> Hum. Why introduce a new --param and not simply key on > >>>>> flag_peel_loops instead? That is > >>>>> enabled by default at -O3 and with FDO but you of course can control > >>>>> that in your targets > >>>>> post-option-processing hook. > >>>> You mean like this? > >>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) > >>>> But I suppose it's a reasonable change. > >>> No, that change is backward. What I said is that peeling is already > >>> conditional on > >>> flag_peel_loops and that is enabled by -O3. So you want to disable > >>> flag_peel_loops for > >>> SVE instead in the target. > >> Sorry, I got confused by the similarly named functions. > >> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass > >> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. > > Well, peeling gets disabled. From your patch I see you want to > > disable "unrolling" when > > the number of loop iteration is not constant. That is called peeling > > where we need to > > emit the loop exit test N times. > > > > Did you check your testcases with -fno-peel-loops? > > -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) > can be called through two paths, only one of which is gated on flag_peel_loops. I don't see the obvious here so I have to either sit down with a non-SVE specific testcase showing this, or I am misunderstanding the actual transform that you want to avoid. allow_peel is false when called from canonicalize_induction_variables. There's the slight chance that UL_NO_GROWTH lets through cases - is your case one of that? That is, does the following help? Index: gcc/tree-ssa-loop-ivcanon.c =================================================================== --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) +++ gcc/tree-ssa-loop-ivcanon.c (working copy) @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop exit = NULL; /* See if we can improve our estimate by using recorded loop bounds. */ - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) + if ((allow_peel || maxiter == 0) && maxiter >= 0 && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) { IIRC I allowed that case when adding allow_peel simply because it avoided some testsuite regressions. This means you eventually want to work on the size estimate of SVE style loops? Richard. > Thanks, > Kyrill > > > >> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. > >> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? > >> > >> Thanks, > >> Kyrill > >> > >>>>> It might also make sense to have more fine-grained control for this > >>>>> and allow a target > >>>>> to say whether it wants to peel a specific loop or not when the > >>>>> middle-end thinks that > >>>>> would be profitable. > >>>> Can be worth looking at as a follow-up. Do you envisage the target analysing > >>>> the gimple statements of the loop to figure out its cost? > >>> Kind-of. Sth like > >>> > >>> bool targetm.peel_loop (struct loop *); > >>> > >>> I have no idea whether you can easily detect a SVE vectorized loop though. > >>> Maybe there's always a special IV or so (the mask?) > >>> > >>> Richard. > >>> > >>>> Thanks, > >>>> Kyrill > >>>> > >>>> > >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > >>>> > >>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll > >>>> loop when number of iterations is not known and flag_peel_loops is in > >>>> effect. > >>>> > >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > >>>> > >>>> * gcc.target/aarch64/sve/unroll-1.c: New test. > >>>> >
On 13/11/18 09:28, Richard Biener wrote: > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> Hi Richard, >> >> On 13/11/18 08:24, Richard Biener wrote: >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov >>> <kyrylo.tkachov@foss.arm.com> wrote: >>>> On 12/11/18 14:10, Richard Biener wrote: >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov >>>>> <kyrylo.tkachov@foss.arm.com> wrote: >>>>>> On 09/11/18 12:18, Richard Biener wrote: >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov >>>>>>> <kyrylo.tkachov@foss.arm.com> wrote: >>>>>>>> Hi all, >>>>>>>> >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: >>>>>>>> >>>>>>>> fully_peel_me: >>>>>>>> mov x1, 5 >>>>>>>> ptrue p1.d, all >>>>>>>> whilelo p0.d, xzr, x1 >>>>>>>> ld1d z0.d, p0/z, [x0] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x0] >>>>>>>> cntd x2 >>>>>>>> addvl x3, x0, #1 >>>>>>>> whilelo p0.d, x2, x1 >>>>>>>> beq .L1 >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x3] >>>>>>>> cntw x2 >>>>>>>> incb x0, all, mul #2 >>>>>>>> whilelo p0.d, x2, x1 >>>>>>>> beq .L1 >>>>>>>> ld1d z0.d, p0/z, [x0] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x0] >>>>>>>> .L1: >>>>>>>> ret >>>>>>>> >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size >>>>>>>> and hurts icache performance. >>>>>>>> >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate >>>>>>>> the branches. >>>>>>>> >>>>>>>> So for the above testcase we generate now: >>>>>>>> fully_peel_me: >>>>>>>> mov x2, 5 >>>>>>>> mov x3, x2 >>>>>>>> mov x1, 0 >>>>>>>> whilelo p0.d, xzr, x2 >>>>>>>> ptrue p1.d, all >>>>>>>> .L2: >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3] >>>>>>>> incd x1 >>>>>>>> whilelo p0.d, x1, x3 >>>>>>>> bne .L2 >>>>>>>> ret >>>>>>>> >>>>>>>> Not perfect still, but it's preferable to the original code. >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged >>>>>>>> (until other target people experiment with it and set it, if appropriate). >>>>>>>> >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. >>>>>>>> >>>>>>>> Ok for trunk? >>>>>>> Hum. Why introduce a new --param and not simply key on >>>>>>> flag_peel_loops instead? That is >>>>>>> enabled by default at -O3 and with FDO but you of course can control >>>>>>> that in your targets >>>>>>> post-option-processing hook. >>>>>> You mean like this? >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) >>>>>> But I suppose it's a reasonable change. >>>>> No, that change is backward. What I said is that peeling is already >>>>> conditional on >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable >>>>> flag_peel_loops for >>>>> SVE instead in the target. >>>> Sorry, I got confused by the similarly named functions. >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. >>> Well, peeling gets disabled. From your patch I see you want to >>> disable "unrolling" when >>> the number of loop iteration is not constant. That is called peeling >>> where we need to >>> emit the loop exit test N times. >>> >>> Did you check your testcases with -fno-peel-loops? >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) >> can be called through two paths, only one of which is gated on flag_peel_loops. > I don't see the obvious here so I have to either sit down with a > non-SVE specific testcase > showing this, or I am misunderstanding the actual transform that you > want to avoid. > allow_peel is false when called from canonicalize_induction_variables. > There's the slight > chance that UL_NO_GROWTH lets through cases - is your case one of > that? That is, > does the following help? > > Index: gcc/tree-ssa-loop-ivcanon.c > =================================================================== > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > exit = NULL; > > /* See if we can improve our estimate by using recorded loop bounds. */ > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > + if ((allow_peel || maxiter == 0) > && maxiter >= 0 > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > { > > IIRC I allowed that case when adding allow_peel simply because it avoided some > testsuite regressions. This means you eventually want to work on the > size estimate > of SVE style loops? This doesn't help. Sorry if we're talking over each other here, I'm not very familiar with this area :( For this loop: for (int i = 0; i < 5; i++) x[i] = x[i] * 2; For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch. This gets fully unrolled as: ldp q2, q1, [x0] ldr d0, [x0, 32] fadd v2.2d, v2.2d, v2.2d fadd v1.2d, v1.2d, v1.2d fadd d0, d0, d0 stp q2, q1, [x0] str d0, [x0, 32] For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. Note that the number of iterations in SVE is still "constant", just not known at compile-time. In this case peeling doesn't eliminate any branches and only serves to bloat code size. Kyrill > Richard. > >> Thanks, >> Kyrill >> >> >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? >>>> >>>> Thanks, >>>> Kyrill >>>> >>>>>>> It might also make sense to have more fine-grained control for this >>>>>>> and allow a target >>>>>>> to say whether it wants to peel a specific loop or not when the >>>>>>> middle-end thinks that >>>>>>> would be profitable. >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing >>>>>> the gimple statements of the loop to figure out its cost? >>>>> Kind-of. Sth like >>>>> >>>>> bool targetm.peel_loop (struct loop *); >>>>> >>>>> I have no idea whether you can easily detect a SVE vectorized loop though. >>>>> Maybe there's always a special IV or so (the mask?) >>>>> >>>>> Richard. >>>>> >>>>>> Thanks, >>>>>> Kyrill >>>>>> >>>>>> >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >>>>>> >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll >>>>>> loop when number of iterations is not known and flag_peel_loops is in >>>>>> effect. >>>>>> >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >>>>>> >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test. >>>>>>
On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > > On 13/11/18 09:28, Richard Biener wrote: > > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov > > <kyrylo.tkachov@foss.arm.com> wrote: > >> Hi Richard, > >> > >> On 13/11/18 08:24, Richard Biener wrote: > >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > >>> <kyrylo.tkachov@foss.arm.com> wrote: > >>>> On 12/11/18 14:10, Richard Biener wrote: > >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > >>>>> <kyrylo.tkachov@foss.arm.com> wrote: > >>>>>> On 09/11/18 12:18, Richard Biener wrote: > >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > >>>>>>> <kyrylo.tkachov@foss.arm.com> wrote: > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > >>>>>>>> > >>>>>>>> fully_peel_me: > >>>>>>>> mov x1, 5 > >>>>>>>> ptrue p1.d, all > >>>>>>>> whilelo p0.d, xzr, x1 > >>>>>>>> ld1d z0.d, p0/z, [x0] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x0] > >>>>>>>> cntd x2 > >>>>>>>> addvl x3, x0, #1 > >>>>>>>> whilelo p0.d, x2, x1 > >>>>>>>> beq .L1 > >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x3] > >>>>>>>> cntw x2 > >>>>>>>> incb x0, all, mul #2 > >>>>>>>> whilelo p0.d, x2, x1 > >>>>>>>> beq .L1 > >>>>>>>> ld1d z0.d, p0/z, [x0] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x0] > >>>>>>>> .L1: > >>>>>>>> ret > >>>>>>>> > >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > >>>>>>>> and hurts icache performance. > >>>>>>>> > >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > >>>>>>>> the branches. > >>>>>>>> > >>>>>>>> So for the above testcase we generate now: > >>>>>>>> fully_peel_me: > >>>>>>>> mov x2, 5 > >>>>>>>> mov x3, x2 > >>>>>>>> mov x1, 0 > >>>>>>>> whilelo p0.d, xzr, x2 > >>>>>>>> ptrue p1.d, all > >>>>>>>> .L2: > >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3] > >>>>>>>> incd x1 > >>>>>>>> whilelo p0.d, x1, x3 > >>>>>>>> bne .L2 > >>>>>>>> ret > >>>>>>>> > >>>>>>>> Not perfect still, but it's preferable to the original code. > >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > >>>>>>>> (until other target people experiment with it and set it, if appropriate). > >>>>>>>> > >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. > >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > >>>>>>>> > >>>>>>>> Ok for trunk? > >>>>>>> Hum. Why introduce a new --param and not simply key on > >>>>>>> flag_peel_loops instead? That is > >>>>>>> enabled by default at -O3 and with FDO but you of course can control > >>>>>>> that in your targets > >>>>>>> post-option-processing hook. > >>>>>> You mean like this? > >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) > >>>>>> But I suppose it's a reasonable change. > >>>>> No, that change is backward. What I said is that peeling is already > >>>>> conditional on > >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable > >>>>> flag_peel_loops for > >>>>> SVE instead in the target. > >>>> Sorry, I got confused by the similarly named functions. > >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass > >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. > >>> Well, peeling gets disabled. From your patch I see you want to > >>> disable "unrolling" when > >>> the number of loop iteration is not constant. That is called peeling > >>> where we need to > >>> emit the loop exit test N times. > >>> > >>> Did you check your testcases with -fno-peel-loops? > >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) > >> can be called through two paths, only one of which is gated on flag_peel_loops. > > I don't see the obvious here so I have to either sit down with a > > non-SVE specific testcase > > showing this, or I am misunderstanding the actual transform that you > > want to avoid. > > allow_peel is false when called from canonicalize_induction_variables. > > There's the slight > > chance that UL_NO_GROWTH lets through cases - is your case one of > > that? That is, > > does the following help? > > > > Index: gcc/tree-ssa-loop-ivcanon.c > > =================================================================== > > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) > > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > > exit = NULL; > > > > /* See if we can improve our estimate by using recorded loop bounds. */ > > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > > + if ((allow_peel || maxiter == 0) > > && maxiter >= 0 > > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > > { > > > > IIRC I allowed that case when adding allow_peel simply because it avoided some > > testsuite regressions. This means you eventually want to work on the > > size estimate > > of SVE style loops? > > This doesn't help. > > Sorry if we're talking over each other here, I'm not very familiar with this area :( > For this loop: > for (int i = 0; i < 5; i++) > x[i] = x[i] * 2; > > For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch. > This gets fully unrolled as: > ldp q2, q1, [x0] > ldr d0, [x0, 32] > fadd v2.2d, v2.2d, v2.2d > fadd v1.2d, v1.2d, v1.2d > fadd d0, d0, d0 > stp q2, q1, [x0] > str d0, [x0, 32] > > For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process > with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. > Note that the number of iterations in SVE is still "constant", just not known at compile-time. > In this case peeling doesn't eliminate any branches and only serves to bloat code size. So I sat down with a cross and indeed for the late unrolling pass we simply pass in allow_peel == true given that try_unroll_loop_completely also performs peeling (and not just try_peel_loops which guards itself with flag_peel_loops). That means instead of the above the below fixes this (with -fno-peel-loops): Index: gcc/tree-ssa-loop-ivcanon.c =================================================================== --- gcc/tree-ssa-loop-ivcanon.c (revision 266072) +++ gcc/tree-ssa-loop-ivcanon.c (working copy) @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop exit = NULL; /* See if we can improve our estimate by using recorded loop bounds. */ - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) + if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH) && maxiter >= 0 && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) { there might be quite some testsuite fallout since flag_peel_loops is only enabled at -O3+, but one has to double-check. As said, a per-loop target control whether loop peeling is desirable would be an improvement I guess (apart from generally disabling peeling for aarch64). I suppose you can benchmark that together with the above fix. Richard. > Kyrill > > > Richard. > > > >> Thanks, > >> Kyrill > >> > >> > >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. > >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? > >>>> > >>>> Thanks, > >>>> Kyrill > >>>> > >>>>>>> It might also make sense to have more fine-grained control for this > >>>>>>> and allow a target > >>>>>>> to say whether it wants to peel a specific loop or not when the > >>>>>>> middle-end thinks that > >>>>>>> would be profitable. > >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing > >>>>>> the gimple statements of the loop to figure out its cost? > >>>>> Kind-of. Sth like > >>>>> > >>>>> bool targetm.peel_loop (struct loop *); > >>>>> > >>>>> I have no idea whether you can easily detect a SVE vectorized loop though. > >>>>> Maybe there's always a special IV or so (the mask?) > >>>>> > >>>>> Richard. > >>>>> > >>>>>> Thanks, > >>>>>> Kyrill > >>>>>> > >>>>>> > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > >>>>>> > >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll > >>>>>> loop when number of iterations is not known and flag_peel_loops is in > >>>>>> effect. > >>>>>> > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > >>>>>> > >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test. > >>>>>> >
On Tue, Nov 13, 2018 at 3:33 PM Richard Biener <richard.guenther@gmail.com> wrote: > > On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: > > > > > > On 13/11/18 09:28, Richard Biener wrote: > > > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov > > > <kyrylo.tkachov@foss.arm.com> wrote: > > >> Hi Richard, > > >> > > >> On 13/11/18 08:24, Richard Biener wrote: > > >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > > >>> <kyrylo.tkachov@foss.arm.com> wrote: > > >>>> On 12/11/18 14:10, Richard Biener wrote: > > >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > > >>>>> <kyrylo.tkachov@foss.arm.com> wrote: > > >>>>>> On 09/11/18 12:18, Richard Biener wrote: > > >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > > >>>>>>> <kyrylo.tkachov@foss.arm.com> wrote: > > >>>>>>>> Hi all, > > >>>>>>>> > > >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > > >>>>>>>> > > >>>>>>>> fully_peel_me: > > >>>>>>>> mov x1, 5 > > >>>>>>>> ptrue p1.d, all > > >>>>>>>> whilelo p0.d, xzr, x1 > > >>>>>>>> ld1d z0.d, p0/z, [x0] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x0] > > >>>>>>>> cntd x2 > > >>>>>>>> addvl x3, x0, #1 > > >>>>>>>> whilelo p0.d, x2, x1 > > >>>>>>>> beq .L1 > > >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x3] > > >>>>>>>> cntw x2 > > >>>>>>>> incb x0, all, mul #2 > > >>>>>>>> whilelo p0.d, x2, x1 > > >>>>>>>> beq .L1 > > >>>>>>>> ld1d z0.d, p0/z, [x0] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x0] > > >>>>>>>> .L1: > > >>>>>>>> ret > > >>>>>>>> > > >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > > >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > > >>>>>>>> and hurts icache performance. > > >>>>>>>> > > >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > > >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > > >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > > >>>>>>>> the branches. > > >>>>>>>> > > >>>>>>>> So for the above testcase we generate now: > > >>>>>>>> fully_peel_me: > > >>>>>>>> mov x2, 5 > > >>>>>>>> mov x3, x2 > > >>>>>>>> mov x1, 0 > > >>>>>>>> whilelo p0.d, xzr, x2 > > >>>>>>>> ptrue p1.d, all > > >>>>>>>> .L2: > > >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3] > > >>>>>>>> incd x1 > > >>>>>>>> whilelo p0.d, x1, x3 > > >>>>>>>> bne .L2 > > >>>>>>>> ret > > >>>>>>>> > > >>>>>>>> Not perfect still, but it's preferable to the original code. > > >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > > >>>>>>>> (until other target people experiment with it and set it, if appropriate). > > >>>>>>>> > > >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. > > >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > > >>>>>>>> > > >>>>>>>> Ok for trunk? > > >>>>>>> Hum. Why introduce a new --param and not simply key on > > >>>>>>> flag_peel_loops instead? That is > > >>>>>>> enabled by default at -O3 and with FDO but you of course can control > > >>>>>>> that in your targets > > >>>>>>> post-option-processing hook. > > >>>>>> You mean like this? > > >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) > > >>>>>> But I suppose it's a reasonable change. > > >>>>> No, that change is backward. What I said is that peeling is already > > >>>>> conditional on > > >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable > > >>>>> flag_peel_loops for > > >>>>> SVE instead in the target. > > >>>> Sorry, I got confused by the similarly named functions. > > >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass > > >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. > > >>> Well, peeling gets disabled. From your patch I see you want to > > >>> disable "unrolling" when > > >>> the number of loop iteration is not constant. That is called peeling > > >>> where we need to > > >>> emit the loop exit test N times. > > >>> > > >>> Did you check your testcases with -fno-peel-loops? > > >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) > > >> can be called through two paths, only one of which is gated on flag_peel_loops. > > > I don't see the obvious here so I have to either sit down with a > > > non-SVE specific testcase > > > showing this, or I am misunderstanding the actual transform that you > > > want to avoid. > > > allow_peel is false when called from canonicalize_induction_variables. > > > There's the slight > > > chance that UL_NO_GROWTH lets through cases - is your case one of > > > that? That is, > > > does the following help? > > > > > > Index: gcc/tree-ssa-loop-ivcanon.c > > > =================================================================== > > > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) > > > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > > > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > > > exit = NULL; > > > > > > /* See if we can improve our estimate by using recorded loop bounds. */ > > > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > > > + if ((allow_peel || maxiter == 0) > > > && maxiter >= 0 > > > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > > > { > > > > > > IIRC I allowed that case when adding allow_peel simply because it avoided some > > > testsuite regressions. This means you eventually want to work on the > > > size estimate > > > of SVE style loops? > > > > This doesn't help. > > > > Sorry if we're talking over each other here, I'm not very familiar with this area :( > > For this loop: > > for (int i = 0; i < 5; i++) > > x[i] = x[i] * 2; > > > > For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch. > > This gets fully unrolled as: > > ldp q2, q1, [x0] > > ldr d0, [x0, 32] > > fadd v2.2d, v2.2d, v2.2d > > fadd v1.2d, v1.2d, v1.2d > > fadd d0, d0, d0 > > stp q2, q1, [x0] > > str d0, [x0, 32] > > > > For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process > > with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. > > Note that the number of iterations in SVE is still "constant", just not known at compile-time. > > In this case peeling doesn't eliminate any branches and only serves to bloat code size. > > So I sat down with a cross and indeed for the late unrolling pass we > simply pass in allow_peel == true > given that try_unroll_loop_completely also performs peeling (and not > just try_peel_loops which guards > itself with flag_peel_loops). That means instead of the above the > below fixes this (with -fno-peel-loops): > > Index: gcc/tree-ssa-loop-ivcanon.c > =================================================================== > --- gcc/tree-ssa-loop-ivcanon.c (revision 266072) > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > exit = NULL; > > /* See if we can improve our estimate by using recorded loop bounds. */ > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > + if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH) > && maxiter >= 0 > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > { > > there might be quite some testsuite fallout since flag_peel_loops is > only enabled at -O3+, > but one has to double-check. As said, a per-loop target control > whether loop peeling is > desirable would be an improvement I guess (apart from generally > disabling peeling for aarch64). > I suppose you can benchmark that together with the above fix. Oh - and I completely forgot about loop->unroll which the vectorizer could set (for SVE loops) to 1 which disables any unrolling. Richard. > Richard. > > > Kyrill > > > > > Richard. > > > > > >> Thanks, > > >> Kyrill > > >> > > >> > > >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. > > >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? > > >>>> > > >>>> Thanks, > > >>>> Kyrill > > >>>> > > >>>>>>> It might also make sense to have more fine-grained control for this > > >>>>>>> and allow a target > > >>>>>>> to say whether it wants to peel a specific loop or not when the > > >>>>>>> middle-end thinks that > > >>>>>>> would be profitable. > > >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing > > >>>>>> the gimple statements of the loop to figure out its cost? > > >>>>> Kind-of. Sth like > > >>>>> > > >>>>> bool targetm.peel_loop (struct loop *); > > >>>>> > > >>>>> I have no idea whether you can easily detect a SVE vectorized loop though. > > >>>>> Maybe there's always a special IV or so (the mask?) > > >>>>> > > >>>>> Richard. > > >>>>> > > >>>>>> Thanks, > > >>>>>> Kyrill > > >>>>>> > > >>>>>> > > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > >>>>>> > > >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll > > >>>>>> loop when number of iterations is not known and flag_peel_loops is in > > >>>>>> effect. > > >>>>>> > > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > >>>>>> > > >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test. > > >>>>>> > >
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index f0e54eda80656829528c018357dde2e1e87f6ebd..34d08a075221fd4c098e9b5e8fabd8fe3948d285 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -10993,6 +10993,12 @@ aarch64_override_options_internal (struct gcc_options *opts) opts->x_param_values, global_options_set.x_param_values); + /* Don't unroll loops where the exact iteration count is not known at + compile-time. */ + maybe_set_param_value (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY, 1, + opts->x_param_values, + global_options_set.x_param_values); + /* If the user hasn't changed it via configure then set the default to 64 KB for the backend. */ maybe_set_param_value (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE, diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 802cc642453aef2d2c516bcbda22246252ec87c1..74e2aeda27d718264188761cf522d6c9f8025e07 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10732,6 +10732,9 @@ The maximum number of branches on the hot path through the peeled sequence. @item max-completely-peeled-insns The maximum number of insns of a completely peeled loop. +@item unroll-known-loop-iterations-only +Only completely unroll loops where the iteration count is known. + @item max-completely-peel-times The maximum number of iterations of a loop to be suitable for complete peeling. diff --git a/gcc/params.def b/gcc/params.def index 4a5f2042dac72bb457488ac8bc35d09df94c929c..07946552232058cee41303e81ed694f7f0bb615e 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -344,6 +344,11 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, "The maximum depth of a loop nest we completely peel.", 8, 0, 0) +DEFPARAM(PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY, + "unroll-known-loop-iterations-only", + "Only completely unroll loops where the iteration count is known", + 0, 0, 1) + /* The maximum number of insns of an unswitched loop. */ DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, "max-unswitch-insns", diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c new file mode 100644 index 0000000000000000000000000000000000000000..7f53d20cbf8e18a4389b86c037f56f024bac22a5 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O3" } */ + +/* Check that simple loop is not fully unrolled. */ + +void +fully_peel_me (double *x) +{ + for (int i = 0; i < 5; i++) + x[i] = x[i] * 2; +} + +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[.+]\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[.+\]\n} 1 } } */ diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c index eeae2a8c54af14e58970d1797c92ecc86ac0523c..a67800fe8807ba003c05c3d8bdd820cc8df93e57 100644 --- a/gcc/tree-ssa-loop-ivcanon.c +++ b/gcc/tree-ssa-loop-ivcanon.c @@ -883,6 +883,17 @@ try_unroll_loop_completely (struct loop *loop, loop->num); return false; } + else if (TREE_CODE (niter) == SCEV_NOT_KNOWN + && PARAM_VALUE (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY) + == 1) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Not unrolling loop %d: " + "exact number of iterations not known " + "(--param unroll-known-loop-iterations-only).\n", + loop->num); + return false; + } } initialize_original_copy_tables ();