Message ID | mptk1gwzeoy.fsf@arm.com |
---|---|
State | New |
Headers | show |
Series | Fix a case in which the vector cost model was ignored | expand |
On Mon, Mar 18, 2019 at 10:59 AM Richard Sandiford <richard.sandiford@arm.com> wrote: > > This patch fixes a case in which we vectorised something with a > fully-predicated loop even after the cost model had rejected it. > E.g. the loop in the testcase has the costs: > > Vector inside of loop cost: 27 > Vector prologue cost: 0 > Vector epilogue cost: 0 > Scalar iteration cost: 7 > Scalar outside cost: 6 > Vector outside cost: 0 > prologue iterations: 0 > epilogue iterations: 0 > > and we can see that the loop executes at most three times, but we > decided to vectorise it anyway. > > (The costs here are equal for three iterations, but the same thing > happens even when the vector code is strictly more expensive.) > > The problem is the handling of "/VF" in: > > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: > SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, > PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations > SOC = scalar outside cost for run time cost model check. */ > > We treat the "/VF" as truncating, but for fully-predicated loops, it's > closer to a ceil division, since fractional iterations are handled by a > full iteration with some predicate bits set to false. > > The easiest fix seemed to be to calculate the minimum number of vector > iterations first, then use that to calculate the minimum number of scalar > iterations. > > Calculating the minimum number of vector iterations might make sense for > unpredicated loops too, since calculating the scalar niters directly > doesn't take into account the fact that the VIC multiple has to be an > integer. But the handling of PL_ITERS and EP_ITERS for unpredicated > loops is a bit hand-wavy anyway, so maybe vagueness here cancels out > vagueness there? Well, their estimate if we do not know them statically is. If we statically know pl/ep iters the formulat should match, no? Hopefully for GCC 10 we can even fix the case in PR89754. > Either way, changing this for unpredicated loops would be much too > invasive for stage 4, so the patch keeps it specific to fully-predicated > loops (i.e. SVE) for now. There's no functional change for other targets. > > Tested on aarch64-linux-gnu with and without SVE, and on x86_64-linux-gnu. > This is a regression introduced by the original cost model patches for > fully-predicated loops, so OK for GCC 9? OK. Thanks, Richard. > Thanks, > Richard > > > 2019-03-18 Richard Sandiford <richard.sandiford@arm.com> > > gcc/ > * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the > calculation of the minimum number of scalar iterations for > fully-predicated loops. > > gcc/testsuite/ > * gcc.target/aarch64/sve/cost_model_1.c: New test. > > Index: gcc/tree-vect-loop.c > =================================================================== > --- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000 > +++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000 > @@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: > - SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > + SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, > - PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations > + NPEEL = prologue iterations + epilogue iterations, > SOC = scalar outside cost for run time cost model check. */ > > - if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost) > + int saving_per_viter = (scalar_single_iter_cost * assumed_vf > + - vec_inside_cost); > + if (saving_per_viter <= 0) > + { > + if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) > + warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, > + "vectorization did not happen for a simd loop"); > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "cost model: the vector iteration cost = %d " > + "divided by the scalar iteration cost = %d " > + "is greater or equal to the vectorization factor = %d" > + ".\n", > + vec_inside_cost, scalar_single_iter_cost, assumed_vf); > + *ret_min_profitable_niters = -1; > + *ret_min_profitable_estimate = -1; > + return; > + } > + > + /* ??? The "if" arm is written to handle all cases; see below for what > + we would do for !LOOP_VINFO_FULLY_MASKED_P. */ > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + /* Rewriting the condition above in terms of the number of > + vector iterations (vniters) rather than the number of > + scalar iterations (niters) gives: > + > + SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC > + > + <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC > + > + For integer N, X and Y when X > 0: > + > + N * X > Y <==> N >= (Y /[floor] X) + 1. */ > + int outside_overhead = (vec_outside_cost > + - scalar_single_iter_cost * peel_iters_prologue > + - scalar_single_iter_cost * peel_iters_epilogue > + - scalar_outside_cost); > + /* We're only interested in cases that require at least one > + vector iteration. */ > + int min_vec_niters = 1; > + if (outside_overhead > 0) > + min_vec_niters = outside_overhead / saving_per_viter + 1; > + > + if (dump_enabled_p ()) > + dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", > + min_vec_niters); > + > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + /* Now that we know the minimum number of vector iterations, > + find the minimum niters for which the scalar cost is larger: > + > + SIC * niters > VIC * vniters + VOC - SOC > + > + We know that the minimum niters is no more than > + vniters * VF + NPEEL, but it might be (and often is) less > + than that if a partial vector iteration is cheaper than the > + equivalent scalar code. */ > + int threshold = (vec_inside_cost * min_vec_niters > + + vec_outside_cost > + - scalar_outside_cost); > + if (threshold <= 0) > + min_profitable_iters = 1; > + else > + min_profitable_iters = threshold / scalar_single_iter_cost + 1; > + } > + else > + /* Convert the number of vector iterations into a number of > + scalar iterations. */ > + min_profitable_iters = (min_vec_niters * assumed_vf > + + peel_iters_prologue > + + peel_iters_epilogue); > + } > + else > { > min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) > * assumed_vf > @@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop > min_profitable_iters = 0; > else > { > - min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf) > - - vec_inside_cost); > + min_profitable_iters /= saving_per_viter; > > if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters) > <= (((int) vec_inside_cost * min_profitable_iters) > @@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop > min_profitable_iters++; > } > } > - /* vector version will never be profitable. */ > - else > - { > - if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) > - warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, > - "vectorization did not happen for a simd loop"); > - > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "cost model: the vector iteration cost = %d " > - "divided by the scalar iteration cost = %d " > - "is greater or equal to the vectorization factor = %d" > - ".\n", > - vec_inside_cost, scalar_single_iter_cost, assumed_vf); > - *ret_min_profitable_niters = -1; > - *ret_min_profitable_estimate = -1; > - return; > - } > > if (dump_enabled_p ()) > dump_printf (MSG_NOTE, > @@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop > > Non-vectorized variant is SIC * niters and it must win over vector > variant on the expected loop trip count. The following condition must hold true: > - SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC */ > + SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC */ > > if (vec_outside_cost <= 0) > min_profitable_estimate = 0; > + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + /* This is a repeat of the code above, but with + SOC rather > + than - SOC. */ > + int outside_overhead = (vec_outside_cost > + - scalar_single_iter_cost * peel_iters_prologue > + - scalar_single_iter_cost * peel_iters_epilogue > + + scalar_outside_cost); > + int min_vec_niters = 1; > + if (outside_overhead > 0) > + min_vec_niters = outside_overhead / saving_per_viter + 1; > + > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + int threshold = (vec_inside_cost * min_vec_niters > + + vec_outside_cost > + + scalar_outside_cost); > + min_profitable_estimate = threshold / scalar_single_iter_cost + 1; > + } > + else > + min_profitable_estimate = (min_vec_niters * assumed_vf > + + peel_iters_prologue > + + peel_iters_epilogue); > + } > else > { > min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost) > Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c > =================================================================== > --- /dev/null 2019-03-08 11:40:14.606883727 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18 09:55:03.257194326 +0000 > @@ -0,0 +1,12 @@ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > + > +void > +f (unsigned int *restrict x, unsigned int *restrict y, > + unsigned char *restrict z, unsigned int n) > +{ > + for (unsigned int i = 0; i < n % 4; ++i) > + x[i] = x[i] + y[i] + z[i]; > +} > + > +/* { dg-final { scan-tree-dump "not vectorized: estimated iteration count too small" vect } } */ > +/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */
Richard Biener <richard.guenther@gmail.com> writes: > On Mon, Mar 18, 2019 at 10:59 AM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> This patch fixes a case in which we vectorised something with a >> fully-predicated loop even after the cost model had rejected it. >> E.g. the loop in the testcase has the costs: >> >> Vector inside of loop cost: 27 >> Vector prologue cost: 0 >> Vector epilogue cost: 0 >> Scalar iteration cost: 7 >> Scalar outside cost: 6 >> Vector outside cost: 0 >> prologue iterations: 0 >> epilogue iterations: 0 >> >> and we can see that the loop executes at most three times, but we >> decided to vectorise it anyway. >> >> (The costs here are equal for three iterations, but the same thing >> happens even when the vector code is strictly more expensive.) >> >> The problem is the handling of "/VF" in: >> >> /* Calculate number of iterations required to make the vector version >> profitable, relative to the loop bodies only. The following condition >> must hold true: >> SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC >> where >> SIC = scalar iteration cost, VIC = vector iteration cost, >> VOC = vector outside cost, VF = vectorization factor, >> PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations >> SOC = scalar outside cost for run time cost model check. */ >> >> We treat the "/VF" as truncating, but for fully-predicated loops, it's >> closer to a ceil division, since fractional iterations are handled by a >> full iteration with some predicate bits set to false. >> >> The easiest fix seemed to be to calculate the minimum number of vector >> iterations first, then use that to calculate the minimum number of scalar >> iterations. >> >> Calculating the minimum number of vector iterations might make sense for >> unpredicated loops too, since calculating the scalar niters directly >> doesn't take into account the fact that the VIC multiple has to be an >> integer. But the handling of PL_ITERS and EP_ITERS for unpredicated >> loops is a bit hand-wavy anyway, so maybe vagueness here cancels out >> vagueness there? > > Well, their estimate if we do not know them statically is. If we statically > know pl/ep iters the formulat should match, no? Yeah, true. I don't see how much we gain by trying to estimate the number of peels for the runtime threshold. The NPEEL-dependent component of VOC is really just SIC * NPEEL, which makes sense given that each peeled iteration is just a normal scalar iteration. VOC also includes extra base overhead on top of that, but once the number of scalar iterations is big enough for the inner-loop saving to compensate for the base overhead, each extra peel counts equally against both sides. So I think we could estimate the minimum number of vector (rather than scalar) iterations without taking the number of peels into account for VOC, just the potential presence of peeling. We could then add a worst-case amount of prologue peeling, an estimate (as in the patch) or the actual runtime amount (which would mean calculating the value even when the vector loop isn't used). > Hopefully for GCC 10 we can even fix the case in PR89754. Yeah, would be nice. :-) Would also be good to be able to compare inner and outer loop vectorisation side-by-side and pick whichever's best. We want that for SVE to avoid e.g. using outer-loop vectorisation for a 3-iteration outer loop and a many-iteration inner loop. Thanks, Richard
On 18 March 2019 10:58:53 CET, Richard Sandiford <richard.sandiford@arm.com> wrote: >This patch fixes a case in which we vectorised something with a >fully-predicated loop even after the cost model had rejected it. >E.g. the loop in the testcase has the costs: > > Vector inside of loop cost: 27 > Vector prologue cost: 0 > Vector epilogue cost: 0 > Scalar iteration cost: 7 > Scalar outside cost: 6 > Vector outside cost: 0 > prologue iterations: 0 > epilogue iterations: 0 > >and we can see that the loop executes at most three times, but we >decided to vectorise it anyway. > >(The costs here are equal for three iterations, but the same thing >happens even when the vector code is strictly more expensive.) > >The problem is the handling of "/VF" in: > > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: > SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, > PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations > SOC = scalar outside cost for run time cost model check. */ > >We treat the "/VF" as truncating, but for fully-predicated loops, it's >closer to a ceil division, since fractional iterations are handled by a >full iteration with some predicate bits set to false. > >The easiest fix seemed to be to calculate the minimum number of vector >iterations first, then use that to calculate the minimum number of >scalar >iterations. > >Calculating the minimum number of vector iterations might make sense >for >unpredicated loops too, since calculating the scalar niters directly >doesn't take into account the fact that the VIC multiple has to be an >integer. But the handling of PL_ITERS and EP_ITERS for unpredicated >loops is a bit hand-wavy anyway, so maybe vagueness here cancels out >vagueness there? > >Either way, changing this for unpredicated loops would be much too >invasive for stage 4, so the patch keeps it specific to >fully-predicated >loops (i.e. SVE) for now. There's no functional change for other >targets. > >Tested on aarch64-linux-gnu with and without SVE, and on >x86_64-linux-gnu. >This is a regression introduced by the original cost model patches for >fully-predicated loops, so OK for GCC 9? > >Thanks, >Richard > > >2019-03-18 Richard Sandiford <richard.sandiford@arm.com> > >gcc/ > * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the > calculation of the minimum number of scalar iterations for > fully-predicated loops. > >gcc/testsuite/ > * gcc.target/aarch64/sve/cost_model_1.c: New test. > >Index: gcc/tree-vect-loop.c >=================================================================== >--- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000 >+++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000 >@@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: >- SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC >+ SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, >- PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations >+ NPEEL = prologue iterations + epilogue iterations, > SOC = scalar outside cost for run time cost model check. */ > >- if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost) >+ int saving_per_viter = (scalar_single_iter_cost * assumed_vf >+ - vec_inside_cost); >+ if (saving_per_viter <= 0) >+ { >+ if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) >+ warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, >+ "vectorization did not happen for a simd loop"); >+ >+ if (dump_enabled_p ()) >+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >+ "cost model: the vector iteration cost = %d " >+ "divided by the scalar iteration cost = %d " >+ "is greater or equal to the vectorization factor = %d" >+ ".\n", >+ vec_inside_cost, scalar_single_iter_cost, assumed_vf); >+ *ret_min_profitable_niters = -1; >+ *ret_min_profitable_estimate = -1; >+ return; >+ } >+ >+ /* ??? The "if" arm is written to handle all cases; see below for >what >+ we would do for !LOOP_VINFO_FULLY_MASKED_P. */ >+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >+ { The condition above seems to contain... >+ /* Rewriting the condition above in terms of the number of >+ vector iterations (vniters) rather than the number of >+ scalar iterations (niters) gives: >+ >+ SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC >+ >+ <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC >+ >+ For integer N, X and Y when X > 0: >+ >+ N * X > Y <==> N >= (Y /[floor] X) + 1. */ >+ int outside_overhead = (vec_outside_cost >+ - scalar_single_iter_cost * peel_iters_prologue >+ - scalar_single_iter_cost * peel_iters_epilogue >+ - scalar_outside_cost); >+ /* We're only interested in cases that require at least one >+ vector iteration. */ >+ int min_vec_niters = 1; >+ if (outside_overhead > 0) >+ min_vec_niters = outside_overhead / saving_per_viter + 1; >+ >+ if (dump_enabled_p ()) >+ dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", >+ min_vec_niters); >+ >+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >+ { ... this identical condition, AFAICS? So this second conditions else arm should be dead, shouldn't it? thanks, >+ /* Now that we know the minimum number of vector iterations, >+ find the minimum niters for which the scalar cost is larger: >+ >+ SIC * niters > VIC * vniters + VOC - SOC >+ >+ We know that the minimum niters is no more than >+ vniters * VF + NPEEL, but it might be (and often is) less >+ than that if a partial vector iteration is cheaper than the >+ equivalent scalar code. */ >+ int threshold = (vec_inside_cost * min_vec_niters >+ + vec_outside_cost >+ - scalar_outside_cost); >+ if (threshold <= 0) >+ min_profitable_iters = 1; >+ else >+ min_profitable_iters = threshold / scalar_single_iter_cost + 1; >+ } >+ else >+ /* Convert the number of vector iterations into a number of >+ scalar iterations. */ >+ min_profitable_iters = (min_vec_niters * assumed_vf >+ + peel_iters_prologue >+ + peel_iters_epilogue); >+ } >+ else > { > min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) > * assumed_vf >@@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop > min_profitable_iters = 0; > else > { >- min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf) >- - vec_inside_cost); >+ min_profitable_iters /= saving_per_viter; > > if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters) > <= (((int) vec_inside_cost * min_profitable_iters) >@@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop > min_profitable_iters++; > } > } >- /* vector version will never be profitable. */ >- else >- { >- if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) >- warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, >- "vectorization did not happen for a simd loop"); >- >- if (dump_enabled_p ()) >- dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >- "cost model: the vector iteration cost = %d " >- "divided by the scalar iteration cost = %d " >- "is greater or equal to the vectorization factor = %d" >- ".\n", >- vec_inside_cost, scalar_single_iter_cost, assumed_vf); >- *ret_min_profitable_niters = -1; >- *ret_min_profitable_estimate = -1; >- return; >- } > > if (dump_enabled_p ()) > dump_printf (MSG_NOTE, >@@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop > > Non-vectorized variant is SIC * niters and it must win over vector >variant on the expected loop trip count. The following condition must >hold true: >- SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC >*/ >+ SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC */ > > if (vec_outside_cost <= 0) > min_profitable_estimate = 0; >+ else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >+ { >+ /* This is a repeat of the code above, but with + SOC rather >+ than - SOC. */ >+ int outside_overhead = (vec_outside_cost >+ - scalar_single_iter_cost * peel_iters_prologue >+ - scalar_single_iter_cost * peel_iters_epilogue >+ + scalar_outside_cost); >+ int min_vec_niters = 1; >+ if (outside_overhead > 0) >+ min_vec_niters = outside_overhead / saving_per_viter + 1; >+ >+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >+ { >+ int threshold = (vec_inside_cost * min_vec_niters >+ + vec_outside_cost >+ + scalar_outside_cost); >+ min_profitable_estimate = threshold / scalar_single_iter_cost + 1; >+ } >+ else >+ min_profitable_estimate = (min_vec_niters * assumed_vf >+ + peel_iters_prologue >+ + peel_iters_epilogue); >+ } > else > { > min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost) >Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c >=================================================================== >--- /dev/null 2019-03-08 11:40:14.606883727 +0000 >+++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18 >09:55:03.257194326 +0000 >@@ -0,0 +1,12 @@ >+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ >+ >+void >+f (unsigned int *restrict x, unsigned int *restrict y, >+ unsigned char *restrict z, unsigned int n) >+{ >+ for (unsigned int i = 0; i < n % 4; ++i) >+ x[i] = x[i] + y[i] + z[i]; >+} >+ >+/* { dg-final { scan-tree-dump "not vectorized: estimated iteration >count too small" vect } } */ >+/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */
Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> writes: > On 18 March 2019 10:58:53 CET, Richard Sandiford <richard.sandiford@arm.com> wrote: >>This patch fixes a case in which we vectorised something with a >>fully-predicated loop even after the cost model had rejected it. >>E.g. the loop in the testcase has the costs: >> >> Vector inside of loop cost: 27 >> Vector prologue cost: 0 >> Vector epilogue cost: 0 >> Scalar iteration cost: 7 >> Scalar outside cost: 6 >> Vector outside cost: 0 >> prologue iterations: 0 >> epilogue iterations: 0 >> >>and we can see that the loop executes at most three times, but we >>decided to vectorise it anyway. >> >>(The costs here are equal for three iterations, but the same thing >>happens even when the vector code is strictly more expensive.) >> >>The problem is the handling of "/VF" in: >> >> /* Calculate number of iterations required to make the vector version >> profitable, relative to the loop bodies only. The following condition >> must hold true: >> SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC >> where >> SIC = scalar iteration cost, VIC = vector iteration cost, >> VOC = vector outside cost, VF = vectorization factor, >> PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations >> SOC = scalar outside cost for run time cost model check. */ >> >>We treat the "/VF" as truncating, but for fully-predicated loops, it's >>closer to a ceil division, since fractional iterations are handled by a >>full iteration with some predicate bits set to false. >> >>The easiest fix seemed to be to calculate the minimum number of vector >>iterations first, then use that to calculate the minimum number of >>scalar >>iterations. >> >>Calculating the minimum number of vector iterations might make sense >>for >>unpredicated loops too, since calculating the scalar niters directly >>doesn't take into account the fact that the VIC multiple has to be an >>integer. But the handling of PL_ITERS and EP_ITERS for unpredicated >>loops is a bit hand-wavy anyway, so maybe vagueness here cancels out >>vagueness there? >> >>Either way, changing this for unpredicated loops would be much too >>invasive for stage 4, so the patch keeps it specific to >>fully-predicated >>loops (i.e. SVE) for now. There's no functional change for other >>targets. >> >>Tested on aarch64-linux-gnu with and without SVE, and on >>x86_64-linux-gnu. >>This is a regression introduced by the original cost model patches for >>fully-predicated loops, so OK for GCC 9? >> >>Thanks, >>Richard >> >> >>2019-03-18 Richard Sandiford <richard.sandiford@arm.com> >> >>gcc/ >> * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the >> calculation of the minimum number of scalar iterations for >> fully-predicated loops. >> >>gcc/testsuite/ >> * gcc.target/aarch64/sve/cost_model_1.c: New test. >> >>Index: gcc/tree-vect-loop.c >>=================================================================== >>--- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000 >>+++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000 >>@@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop >> /* Calculate number of iterations required to make the vector version >> profitable, relative to the loop bodies only. The following condition >> must hold true: >>- SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC >>+ SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC >> where >> SIC = scalar iteration cost, VIC = vector iteration cost, >> VOC = vector outside cost, VF = vectorization factor, >>- PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations >>+ NPEEL = prologue iterations + epilogue iterations, >> SOC = scalar outside cost for run time cost model check. */ >> >>- if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost) >>+ int saving_per_viter = (scalar_single_iter_cost * assumed_vf >>+ - vec_inside_cost); >>+ if (saving_per_viter <= 0) >>+ { >>+ if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) >>+ warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, >>+ "vectorization did not happen for a simd loop"); >>+ >>+ if (dump_enabled_p ()) >>+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >>+ "cost model: the vector iteration cost = %d " >>+ "divided by the scalar iteration cost = %d " >>+ "is greater or equal to the vectorization factor = %d" >>+ ".\n", >>+ vec_inside_cost, scalar_single_iter_cost, assumed_vf); >>+ *ret_min_profitable_niters = -1; >>+ *ret_min_profitable_estimate = -1; >>+ return; >>+ } >>+ >>+ /* ??? The "if" arm is written to handle all cases; see below for >>what >>+ we would do for !LOOP_VINFO_FULLY_MASKED_P. */ >>+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >>+ { > > The condition above seems to contain... > > >>+ /* Rewriting the condition above in terms of the number of >>+ vector iterations (vniters) rather than the number of >>+ scalar iterations (niters) gives: >>+ >>+ SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC >>+ >>+ <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC >>+ >>+ For integer N, X and Y when X > 0: >>+ >>+ N * X > Y <==> N >= (Y /[floor] X) + 1. */ >>+ int outside_overhead = (vec_outside_cost >>+ - scalar_single_iter_cost * peel_iters_prologue >>+ - scalar_single_iter_cost * peel_iters_epilogue >>+ - scalar_outside_cost); >>+ /* We're only interested in cases that require at least one >>+ vector iteration. */ >>+ int min_vec_niters = 1; >>+ if (outside_overhead > 0) >>+ min_vec_niters = outside_overhead / saving_per_viter + 1; >>+ >>+ if (dump_enabled_p ()) >>+ dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", >>+ min_vec_niters); >>+ >>+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >>+ { > > ... this identical condition, AFAICS? > So this second conditions else arm should be dead, shouldn't it? Yeah, that's what: /* ??? The "if" arm is written to handle all cases; see below for what we would do for !LOOP_VINFO_FULLY_MASKED_P. */ was trying to say. Like I mentioned in the covering note, in principle the approach of calculating the minimum number of vector iterations should work for all cases, and we might want to consider doing that for stage 1. I wanted to show what the !LOOP_VINFO_FULLY_MASKED_P code would look like if we did that. I'd wondered about putting the inner else in an #if 0 or comment instead, but this way makes it easier to experiment with. Thanks, Richard
On Tue, 19 Mar 2019 08:47:49 +0000 Richard Sandiford <richard.sandiford@arm.com> wrote: > > ... this identical condition, AFAICS? > > So this second conditions else arm should be dead, shouldn't it? > > Yeah, that's what: > > /* ??? The "if" arm is written to handle all cases; see below for what > we would do for !LOOP_VINFO_FULLY_MASKED_P. */ Ok, i somehow managed not to correlate this comment to the inner if's else. It's been late. Sorry for the noise! thanks, > > was trying to say. Like I mentioned in the covering note, in principle > the approach of calculating the minimum number of vector iterations > should work for all cases, and we might want to consider doing that > for stage 1. I wanted to show what the !LOOP_VINFO_FULLY_MASKED_P > code would look like if we did that. > > I'd wondered about putting the inner else in an #if 0 or comment instead, > but this way makes it easier to experiment with. > > Thanks, > Richard
Index: gcc/tree-vect-loop.c =================================================================== --- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000 +++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000 @@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop /* Calculate number of iterations required to make the vector version profitable, relative to the loop bodies only. The following condition must hold true: - SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC where SIC = scalar iteration cost, VIC = vector iteration cost, VOC = vector outside cost, VF = vectorization factor, - PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations + NPEEL = prologue iterations + epilogue iterations, SOC = scalar outside cost for run time cost model check. */ - if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost) + int saving_per_viter = (scalar_single_iter_cost * assumed_vf + - vec_inside_cost); + if (saving_per_viter <= 0) + { + if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) + warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, + "vectorization did not happen for a simd loop"); + + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "cost model: the vector iteration cost = %d " + "divided by the scalar iteration cost = %d " + "is greater or equal to the vectorization factor = %d" + ".\n", + vec_inside_cost, scalar_single_iter_cost, assumed_vf); + *ret_min_profitable_niters = -1; + *ret_min_profitable_estimate = -1; + return; + } + + /* ??? The "if" arm is written to handle all cases; see below for what + we would do for !LOOP_VINFO_FULLY_MASKED_P. */ + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + /* Rewriting the condition above in terms of the number of + vector iterations (vniters) rather than the number of + scalar iterations (niters) gives: + + SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC + + <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC + + For integer N, X and Y when X > 0: + + N * X > Y <==> N >= (Y /[floor] X) + 1. */ + int outside_overhead = (vec_outside_cost + - scalar_single_iter_cost * peel_iters_prologue + - scalar_single_iter_cost * peel_iters_epilogue + - scalar_outside_cost); + /* We're only interested in cases that require at least one + vector iteration. */ + int min_vec_niters = 1; + if (outside_overhead > 0) + min_vec_niters = outside_overhead / saving_per_viter + 1; + + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", + min_vec_niters); + + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + /* Now that we know the minimum number of vector iterations, + find the minimum niters for which the scalar cost is larger: + + SIC * niters > VIC * vniters + VOC - SOC + + We know that the minimum niters is no more than + vniters * VF + NPEEL, but it might be (and often is) less + than that if a partial vector iteration is cheaper than the + equivalent scalar code. */ + int threshold = (vec_inside_cost * min_vec_niters + + vec_outside_cost + - scalar_outside_cost); + if (threshold <= 0) + min_profitable_iters = 1; + else + min_profitable_iters = threshold / scalar_single_iter_cost + 1; + } + else + /* Convert the number of vector iterations into a number of + scalar iterations. */ + min_profitable_iters = (min_vec_niters * assumed_vf + + peel_iters_prologue + + peel_iters_epilogue); + } + else { min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * assumed_vf @@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop min_profitable_iters = 0; else { - min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf) - - vec_inside_cost); + min_profitable_iters /= saving_per_viter; if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters) <= (((int) vec_inside_cost * min_profitable_iters) @@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop min_profitable_iters++; } } - /* vector version will never be profitable. */ - else - { - if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) - warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, - "vectorization did not happen for a simd loop"); - - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "cost model: the vector iteration cost = %d " - "divided by the scalar iteration cost = %d " - "is greater or equal to the vectorization factor = %d" - ".\n", - vec_inside_cost, scalar_single_iter_cost, assumed_vf); - *ret_min_profitable_niters = -1; - *ret_min_profitable_estimate = -1; - return; - } if (dump_enabled_p ()) dump_printf (MSG_NOTE, @@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop Non-vectorized variant is SIC * niters and it must win over vector variant on the expected loop trip count. The following condition must hold true: - SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC */ + SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC */ if (vec_outside_cost <= 0) min_profitable_estimate = 0; + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + /* This is a repeat of the code above, but with + SOC rather + than - SOC. */ + int outside_overhead = (vec_outside_cost + - scalar_single_iter_cost * peel_iters_prologue + - scalar_single_iter_cost * peel_iters_epilogue + + scalar_outside_cost); + int min_vec_niters = 1; + if (outside_overhead > 0) + min_vec_niters = outside_overhead / saving_per_viter + 1; + + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + int threshold = (vec_inside_cost * min_vec_niters + + vec_outside_cost + + scalar_outside_cost); + min_profitable_estimate = threshold / scalar_single_iter_cost + 1; + } + else + min_profitable_estimate = (min_vec_niters * assumed_vf + + peel_iters_prologue + + peel_iters_epilogue); + } else { min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost) Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c =================================================================== --- /dev/null 2019-03-08 11:40:14.606883727 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18 09:55:03.257194326 +0000 @@ -0,0 +1,12 @@ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ + +void +f (unsigned int *restrict x, unsigned int *restrict y, + unsigned char *restrict z, unsigned int n) +{ + for (unsigned int i = 0; i < n % 4; ++i) + x[i] = x[i] + y[i] + z[i]; +} + +/* { dg-final { scan-tree-dump "not vectorized: estimated iteration count too small" vect } } */ +/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */