[vect] Use main loop's thresholds and vectorization factor to narrow upper_bound of epilogue

Message ID 5bb6a2ad-2096-b888-6f31-331de44a2ef7@arm.com
State New
Series [vect] Use main loop's thresholds and vectorization factor to narrow upper_bound of epilogue

Commit Message

Andre Vieira (lists) May 24, 2021, 6:17 a.m. UTC
Hi,

When vectorizing with --param vect-partial-vector-usage=1, the vectorizer
uses an unpredicated (all-true predicate for SVE) main loop and a
predicated tail loop. As implemented, both loops seem to re-use the same
vector mode, which means the tail loop isn't an actual loop: it executes
only one iteration.

This patch uses the conditions under which the epilogue loop is entered
to derive a potentially more restrictive upper bound on its iterations.
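
To make the arithmetic concrete, here is a minimal sketch of the narrowing
with plain integers. It is only a model: the real code works on poly_uint64
values, and the function and parameter names below are hypothetical, not
GCC's. The assumption is that the epilogue sees fewer scalar iterations than
the largest of the main loop's VF, cost-model threshold and versioning
threshold.

```c
/* Ceiling division for unsigned values; assumes d != 0.  */
static unsigned long
div_ceil (unsigned long n, unsigned long d)
{
  return n / d + (n % d != 0);
}

static unsigned long
max_ul (unsigned long a, unsigned long b)
{
  return a > b ? a : b;
}

/* Hypothetical integer-only model of the narrowing done in the patch.
   All VFs are assumed to be >= 1.  Returns the (possibly narrowed)
   latch-count upper bound for the epilogue loop.  */
static unsigned long
narrow_epilogue_latch_bound (unsigned long main_vf,
			     unsigned long cost_threshold,
			     unsigned long versioning_threshold,
			     unsigned long epilogue_vf,
			     unsigned long current_latch_bound)
{
  /* Largest iteration count that can still end up in the epilogue.  */
  unsigned long main_iters = max_ul (main_vf, cost_threshold);
  main_iters = max_ul (main_iters, versioning_threshold);

  /* Vector iterations needed to cover that many scalar iterations.  */
  unsigned long bound = div_ceil (main_iters, epilogue_vf);

  /* "- 1" converts an iteration count to a latch count.  */
  unsigned long narrowed = bound - 1;
  return narrowed < current_latch_bound ? narrowed : current_latch_bound;
}
```

With a main VF of 8, an epilogue VF of 8 and both thresholds at 0, the
function returns 0, i.e. the latch is never taken and the epilogue runs a
single iteration, which is the SVE case described above.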

Regression tested on aarch64-linux-gnu and also ran the testsuite using 
'--param vect-partial-vector-usage=1' detecting no ICEs and no execution 
failures.

Would be good to have this tested for PPC too as I believe they are the 
main users of the --param vect-partial-vector-usage=1 option. Can 
someone help me test (and maybe even benchmark?) this on a PPC target?

Kind regards,
Andre

gcc/ChangeLog:

         * tree-vect-loop.c (vect_transform_loop): Use main loop's various
         thresholds to narrow the upper bound on epilogue iterations.

gcc/testsuite/ChangeLog:

         * gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.

Comments

Kewen.Lin May 24, 2021, 7:21 a.m. UTC | #1
Hi Andre,

on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:
> Hi,
> 
> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses an unpredicated (all-true predicate for SVE) main loop and a predicated tail loop. The way this was implemented seems to mean it re-uses the same vector-mode for both loops, which means the tail loop isn't an actual loop but only executes one iteration.
> 
> This patch uses the knowledge of the conditions to enter an epilogue loop to help come up with a potentially more restrictive upper bound.
> 
> Regression tested on aarch64-linux-gnu and also ran the testsuite using '--param vect-partial-vector-usage=1' detecting no ICEs and no execution failures.
> 
> Would be good to have this tested for PPC too as I believe they are the main users of the --param vect-partial-vector-usage=1 option. Can someone help me test (and maybe even benchmark?) this on a PPC target?
> 


Thanks for doing this!  I can test it on Power10, which enables this parameter
by default, and also evaluate its impact on SPEC2017 (Ofast/unroll).

Do you have any preference for the baseline commit?  I'll use r12-0 if that's fine.

BR,
Kewen

Richard Sandiford May 24, 2021, 10:30 a.m. UTC | #2
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer 
> uses an unpredicated (all-true predicate for SVE) main loop and a 
> predicated tail loop. The way this was implemented seems to mean it 
> re-uses the same vector-mode for both loops, which means the tail loop 
> isn't an actual loop but only executes one iteration.
>
> This patch uses the knowledge of the conditions to enter an epilogue 
> loop to help come up with a potentially more restrictive upper bound.
>
> Regression tested on aarch64-linux-gnu and also ran the testsuite using 
> '--param vect-partial-vector-usage=1' detecting no ICEs and no execution 
> failures.
>
> Would be good to have this tested for PPC too as I believe they are the 
> main users of the --param vect-partial-vector-usage=1 option. Can 
> someone help me test (and maybe even benchmark?) this on a PPC target?
>
> Kind regards,
> Andre

LGTM.  OK if no objections and if the Power testing comes back clean.

Thanks,
Richard

> gcc/ChangeLog:
>
>          * tree-vect-loop.c (vect_transform_loop): Use main loop's various
>          thresholds to narrow the upper bound on epilogue iterations.
>
> gcc/testsuite/ChangeLog:
>
>          * gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..a03229eb55585f637ebd5288fb4c00f8f921d44c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +void
> +foo (short * __restrict__ a, short * __restrict__ b, short * __restrict__ c, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +    c[i] = a[i] + b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+\.h, wzr, [xw][0-9]+} 1 } } */
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 3e973e774af8f9205be893e01ad9263281116885..81e9c5cc42415a0a92b765bc46640105670c4e6b 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -9723,12 +9723,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>    /* In these calculations the "- 1" converts loop iteration counts
>       back to latch counts.  */
>    if (loop->any_upper_bound)
> -    loop->nb_iterations_upper_bound
> -      = (final_iter_may_be_partial
> -	 ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
> -			  lowest_vf) - 1
> -	 : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
> -			   lowest_vf) - 1);
> +    {
> +      loop_vec_info main_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
> +      loop->nb_iterations_upper_bound
> +	= (final_iter_may_be_partial
> +	   ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
> +			    lowest_vf) - 1
> +	   : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
> +			     lowest_vf) - 1);
> +      if (main_vinfo)
> +	{
> +	  unsigned int bound;
> +	  poly_uint64 main_iters
> +	    = upper_bound (LOOP_VINFO_VECT_FACTOR (main_vinfo),
> +			   LOOP_VINFO_COST_MODEL_THRESHOLD (main_vinfo));
> +	  main_iters
> +	    = upper_bound (main_iters,
> +			   LOOP_VINFO_VERSIONING_THRESHOLD (main_vinfo));
> +	  if (can_div_away_from_zero_p (main_iters,
> +					LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> +					&bound))
> +	    loop->nb_iterations_upper_bound
> +	      = wi::umin ((widest_int) (bound - 1),
> +			  loop->nb_iterations_upper_bound);
> +	}
> +    }
>    if (loop->any_likely_upper_bound)
>      loop->nb_iterations_likely_upper_bound
>        = (final_iter_may_be_partial

Kewen.Lin May 25, 2021, 8:42 a.m. UTC | #3
on 2021/5/24 3:21 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Andre,
> 
> on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:
>> Hi,
>>
>> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses an unpredicated (all-true predicate for SVE) main loop and a predicated tail loop. The way this was implemented seems to mean it re-uses the same vector-mode for both loops, which means the tail loop isn't an actual loop but only executes one iteration.
>>
>> This patch uses the knowledge of the conditions to enter an epilogue loop to help come up with a potentially more restrictive upper bound.
>>
>> Regression tested on aarch64-linux-gnu and also ran the testsuite using '--param vect-partial-vector-usage=1' detecting no ICEs and no execution failures.
>>
>> Would be good to have this tested for PPC too as I believe they are the main users of the --param vect-partial-vector-usage=1 option. Can someone help me test (and maybe even benchmark?) this on a PPC target?
>>
> 
> 
> Thanks for doing this!  I can test it on Power10 which enables this parameter
> by default, also evaluate its impact on SPEC2017 Ofast/unroll.
> 

Bootstrapped/regtested on powerpc64le-linux-gnu Power10.
The SPEC2017 run showed no notable improvement or degradation.

BR,
Kewen

Andre Vieira (lists) June 3, 2021, 12:42 p.m. UTC | #4
Thank you Kewen!!

I will apply this now.

BR,
Andre

On 25/05/2021 09:42, Kewen.Lin wrote:
> on 2021/5/24 3:21 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi Andre,
>>
>> on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:
>>> Hi,
>>>
>>> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses an unpredicated (all-true predicate for SVE) main loop and a predicated tail loop. The way this was implemented seems to mean it re-uses the same vector-mode for both loops, which means the tail loop isn't an actual loop but only executes one iteration.
>>>
>>> This patch uses the knowledge of the conditions to enter an epilogue loop to help come up with a potentially more restrictive upper bound.
>>>
>>> Regression tested on aarch64-linux-gnu and also ran the testsuite using '--param vect-partial-vector-usage=1' detecting no ICEs and no execution failures.
>>>
>>> Would be good to have this tested for PPC too as I believe they are the main users of the --param vect-partial-vector-usage=1 option. Can someone help me test (and maybe even benchmark?) this on a PPC target?
>>>
>>
>> Thanks for doing this!  I can test it on Power10 which enables this parameter
>> by default, also evaluate its impact on SPEC2017 Ofast/unroll.
>>
> Bootstrapped/regtested on powerpc64le-linux-gnu Power10.
> SPEC2017 run didn't show any remarkable improvement/degradation.
>
> BR,
> Kewen

Patch

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
new file mode 100644
index 0000000000000000000000000000000000000000..a03229eb55585f637ebd5288fb4c00f8f921d44c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
+
+void
+foo (short * __restrict__ a, short * __restrict__ b, short * __restrict__ c, int n)
+{
+  for (int i = 0; i < n; ++i)
+    c[i] = a[i] + b[i];
+}
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+\.h, wzr, [xw][0-9]+} 1 } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 3e973e774af8f9205be893e01ad9263281116885..81e9c5cc42415a0a92b765bc46640105670c4e6b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -9723,12 +9723,31 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* In these calculations the "- 1" converts loop iteration counts
      back to latch counts.  */
   if (loop->any_upper_bound)
-    loop->nb_iterations_upper_bound
-      = (final_iter_may_be_partial
-	 ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
-			  lowest_vf) - 1
-	 : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
-			   lowest_vf) - 1);
+    {
+      loop_vec_info main_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+      loop->nb_iterations_upper_bound
+	= (final_iter_may_be_partial
+	   ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
+			    lowest_vf) - 1
+	   : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
+			     lowest_vf) - 1);
+      if (main_vinfo)
+	{
+	  unsigned int bound;
+	  poly_uint64 main_iters
+	    = upper_bound (LOOP_VINFO_VECT_FACTOR (main_vinfo),
+			   LOOP_VINFO_COST_MODEL_THRESHOLD (main_vinfo));
+	  main_iters
+	    = upper_bound (main_iters,
+			   LOOP_VINFO_VERSIONING_THRESHOLD (main_vinfo));
+	  if (can_div_away_from_zero_p (main_iters,
+					LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+					&bound))
+	    loop->nb_iterations_upper_bound
+	      = wi::umin ((widest_int) (bound - 1),
+			  loop->nb_iterations_upper_bound);
+	}
+    }
   if (loop->any_likely_upper_bound)
     loop->nb_iterations_likely_upper_bound
       = (final_iter_may_be_partial
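
For readers unfamiliar with the "- 1" mentioned in the context comment of the
hunk above, the following is a rough integer-only sketch of how a scalar latch
bound might be turned into a vector latch bound. The names are hypothetical,
and the hunk does not show how bias_for_lowest is computed, so the bias is
simply a parameter here; a value of 1 (turning a latch count into an iteration
count) is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the latch-bound conversion.  SCALAR_LATCH_BOUND
   is the scalar loop's latch-count upper bound, BIAS is added to turn it
   into an iteration count (assumed, not taken from the patch), and VF is
   the vectorization factor (>= 1).  The rounded iteration count is
   assumed to be at least 1 so that the final "- 1" cannot wrap.  */
static unsigned long
vector_latch_bound (unsigned long scalar_latch_bound, unsigned long bias,
		    unsigned long vf, bool final_iter_may_be_partial)
{
  unsigned long iters = scalar_latch_bound + bias;

  /* A partial final vector iteration still counts as an iteration, so
     round up in that case and down otherwise.  */
  unsigned long vec_iters = final_iter_may_be_partial
			    ? iters / vf + (iters % vf != 0)
			    : iters / vf;

  /* "- 1" converts the iteration count back to a latch count.  */
  return vec_iters - 1;
}
```

For a scalar latch bound of 99 (100 iterations), a bias of 1 and a VF of 8,
this gives 12 when the final iteration may be partial and 11 otherwise.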