diff mbox series

middle-end vect: adjust loop upper bounds when peeling for gaps and early break [PR114403]

Message ID patch-18385-tamar@arm.com
State New
Headers show
Series middle-end vect: adjust loop upper bounds when peeling for gaps and early break [PR114403] | expand

Commit Message

Tamar Christina April 4, 2024, 4:15 p.m. UTC
Hi All,

The report shows that we end up in a situation where the code has been peeled
for gaps and we have an early break.

The code for peeling for gaps assume that a scalar loop needs to perform at
least one iteration.  However this doesn't take into account early break where
the scalar loop may not need to be executed.

That the early break loop can be partial is not accounted for in this scenario.
loop partiality is normally handled by setting bias_for_lowest to 1, but when
peeling for gaps we end up with 0, which when the loop upper bounds are
calculated means that a partial loop iteration loses the final partial iter:

Analyzing # of iterations of loop 1
  exit condition [8, + , 18446744073709551615] != 0
  bounds on difference of bases: -8 ... -8
  result:
    # of iterations 8, bounded by 8

and a VF=4 calculating:

Loop 1 iterates at most 1 times.
Loop 1 likely iterates at most 1 times.
Analyzing # of iterations of loop 1
  exit condition [1, + , 1](no_overflow) < bnd.5505_39
  bounds on difference of bases: 0 ... 4611686018427387902
Matching expression match.pd:2011, generic-match-8.cc:27
Applying pattern match.pd:2067, generic-match-1.cc:4813
  result:
    # of iterations bnd.5505_39 + 18446744073709551615, bounded by 4611686018427387902
Estimating sizes for loop 1
...
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_312 < bnd.5505_39)
   Exit condition will be eliminated in last copy.
size: 24-3, last_iteration: 24-5
  Loop size: 24
  Estimated size after unrolling: 26
;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.

upper bound should be 2 not 1.

This patch forced the bias_for_lowest to be 1 even when peeling for gaps.

I have however not been able to write a standalone reproducer for this so I have
no tests but bootstrap and LLVM build fine now.

The testcase:

#define COUNT 9
#define SIZE COUNT * 4
#define TYPE unsigned long

TYPE x[SIZE], y[SIZE];

void __attribute__((noipa))
loop (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
    {
      if (x[i * 4] > val || x[i * 4 + 1] > val)
        return;
      x[i * 4] = y[i * 2] + 1;
      x[i * 4 + 1] = y[i * 2] + 2;
      x[i * 4 + 2] = y[i * 2 + 1] + 3;
      x[i * 4 + 3] = y[i * 2 + 1] + 4;
    }
}

does perform the peeling for gaps and early beak, however it creates a hybrid
loop which works fine. adjusting the indices to non linear also works. So I'd
like to submit the fix and work on a testcase separately if needed.

Bootstrapped Regtested on x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/114403
	* tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
	peeling for gaps and early break.

---




--

Comments

Richard Biener April 5, 2024, 7:07 a.m. UTC | #1
On Thu, 4 Apr 2024, Tamar Christina wrote:

> Hi All,
> 
> The report shows that we end up in a situation where the code has been peeled
> for gaps and we have an early break.
> 
> The code for peeling for gaps assume that a scalar loop needs to perform at
> least one iteration.  However this doesn't take into account early break where
> the scalar loop may not need to be executed.

But we always re-start the vector iteration where the early break happens?

> That the early break loop can be partial is not accounted for in this scenario.
> loop partiality is normally handled by setting bias_for_lowest to 1, but when
> peeling for gaps we end up with 0, which when the loop upper bounds are
> calculated means that a partial loop iteration loses the final partial iter:
> 
> Analyzing # of iterations of loop 1
>   exit condition [8, + , 18446744073709551615] != 0
>   bounds on difference of bases: -8 ... -8
>   result:
>     # of iterations 8, bounded by 8
> 
> and a VF=4 calculating:
> 
> Loop 1 iterates at most 1 times.
> Loop 1 likely iterates at most 1 times.
> Analyzing # of iterations of loop 1
>   exit condition [1, + , 1](no_overflow) < bnd.5505_39
>   bounds on difference of bases: 0 ... 4611686018427387902
> Matching expression match.pd:2011, generic-match-8.cc:27
> Applying pattern match.pd:2067, generic-match-1.cc:4813
>   result:
>     # of iterations bnd.5505_39 + 18446744073709551615, bounded by 4611686018427387902
> Estimating sizes for loop 1
> ...
>    Induction variable computation will be folded away.
>   size:   2 if (ivtmp_312 < bnd.5505_39)
>    Exit condition will be eliminated in last copy.
> size: 24-3, last_iteration: 24-5
>   Loop size: 24
>   Estimated size after unrolling: 26
> ;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.
> 
> upper bound should be 2 not 1.

Why?  This means the vector loop will iterate once (thus the body
executed twice), isn't that correct?  Peeling for gaps means the
main IV will exit the loop in time.

> 
> This patch forced the bias_for_lowest to be 1 even when peeling for gaps.

(*)

> I have however not been able to write a standalone reproducer for this so I have
> no tests but bootstrap and LLVM build fine now.
> 
> The testcase:
> 
> #define COUNT 9
> #define SIZE COUNT * 4
> #define TYPE unsigned long
> 
> TYPE x[SIZE], y[SIZE];
> 
> void __attribute__((noipa))
> loop (TYPE val)
> {
>   for (int i = 0; i < COUNT; ++i)
>     {
>       if (x[i * 4] > val || x[i * 4 + 1] > val)
>         return;
>       x[i * 4] = y[i * 2] + 1;
>       x[i * 4 + 1] = y[i * 2] + 2;
>       x[i * 4 + 2] = y[i * 2 + 1] + 3;
>       x[i * 4 + 3] = y[i * 2 + 1] + 4;
>     }
> }
> 
> does perform the peeling for gaps and early beak, however it creates a hybrid
> loop which works fine. adjusting the indices to non linear also works. So I'd
> like to submit the fix and work on a testcase separately if needed.

You can have peeling for gaps without SLP by doing interleaving.

#define COUNT 9
#define TYPE unsigned long

TYPE x[COUNT], y[COUNT*2];

void __attribute__((noipa))
loop (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
    { 
      if (x[i] > val)
        return;
      x[i] = y[i * 2];
   }
}

gets me partial vectors and peeling for gaps with -O3 -march=armv8.2-a+sve 
-fno-vect-cost-model (with cost modeling we choose ADVSIMD).  Does
this reproduce the issue?

Richard.


> Bootstrapped Regtested on x86_64-pc-linux-gnu no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/114403
> 	* tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
> 	peeling for gaps and early break.
> 
> ---
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 4375ebdcb493a90fd0501cbb4b07466077b525c3..bf1bb9b005c68fbb13ee1b1279424865b237245a 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -12139,7 +12139,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>    /* The minimum number of iterations performed by the epilogue.  This
>       is 1 when peeling for gaps because we always need a final scalar
>       iteration.  */
> -  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
> +  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +			   && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo) ? 1 : 0;

(*) This adjusts min_epilogue_iters though and honestly the whole code
looks like a mess.  I'm quoting a bit more here:

>    /* +1 to convert latch counts to loop iteration counts,
>       -min_epilogue_iters to remove iterations that cannot be performed
>         by the vector code.  */
  int bias_for_lowest = 1 - min_epilogue_iters;
  int bias_for_assumed = bias_for_lowest;

The variable names and comments now have nothing to do with the
actual magic we compute into them.

I think it would be an improvement to disentangle this a bit like
doing

   /* +1 to convert latch counts to loop iteration counts.  */
   int bias_for_lowest = 1;
   /* Comment, explain why peeling for gaps isn't relevant.  */
   if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     bias_for_lowest += ... ?
   /* When peeling for gaps we have at least one scalar iteration in
      the epilog.  */
   else if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
     bias_for_lowest -= 1;
   int bias_for_assumed = bias_for_lowest;

I'm still not convinced that you need to "ignore" peeling for gaps
when doing early exit vectorization?

Richard.
diff mbox series

Patch

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 4375ebdcb493a90fd0501cbb4b07466077b525c3..bf1bb9b005c68fbb13ee1b1279424865b237245a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12139,7 +12139,8 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */
-  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
+  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+			   && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo) ? 1 : 0;
   /* +1 to convert latch counts to loop iteration counts,
      -min_epilogue_iters to remove iterations that cannot be performed
        by the vector code.  */