
[1/2,vect] PR 88915: Vectorize epilogues when versioning loops

Message ID 385547e6-abbd-3633-ad69-d4fb6e604c97@arm.com
State New

Commit Message

Andre Vieira (lists) Aug. 23, 2019, 4:50 p.m. UTC
Hi,

This patch is an improvement on my last RFC.  As you pointed out, we can 
do the vectorization analysis of the epilogues before doing the 
transformation, using the same approach as used by openmp simd.  I have 
not yet incorporated the cost tweaks for vectorizing the epilogue; I 
would like to do this in a subsequent patch, to make it easier to test 
the differences.

I currently disable the vectorization of epilogues when versioning for 
iterations.  This is simply because I do not completely understand how 
the assumptions are created and couldn't determine whether using 
skip_vectors with this would work.  If you don't think it is a problem 
or have a testcase to show it works, I would gladly look at it.

Bootstrapped this and the next patch on x86_64 and 
aarch64-unknown-linux-gnu, with no regressions (after test changes in 
next patch).

gcc/ChangeLog:
2019-08-23  Andre Vieira  <andre.simoesdiasvieira@arm.com>

          PR 88915
          * gentype.c (main): Add poly_uint64 type to generator.
          * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
          whether we vectorize epilogue loops.
          (vect_analyze_loop): Idem.
          (vect_transform_loop): Pass decision to vectorize epilogues
          to vect_do_peeling.
          * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
          when doing loop versioning if we decided to vectorize epilogues.
          (vect_loop_versioning): Moved decision to check_profitability
          based on cost model.
          * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
          vect_analyze_loop, vect_transform_loop): Update declarations.
          * tree-vectorizer.c: Include params.h
          (try_vectorize_loop_1): Initialize vect_epilogues_nomask
          to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
          and vect_transform_loop.  Also make sure vectorizing epilogues
          does not count towards number of vectorized loops.

Comments

Richard Biener Aug. 26, 2019, 12:41 p.m. UTC | #1
On Fri, 23 Aug 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> This patch is an improvement on my last RFC.  As you pointed out, we can do
> the vectorization analysis of the epilogues before doing the transformation,
> using the same approach as used by openmp simd.  I have not yet incorporated
> the cost tweaks for vectorizing the epilogue, I would like to do this in a
> subsequent patch, to make it easier to test the differences.
> 
> I currently disable the vectorization of epilogues when versioning for
> iterations.  This is simply because I do not completely understand how the
> assumptions are created and couldn't determine whether using skip_vectors with
> this would work.  If you don't think it is a problem or have a testcase to
> show it work I would gladly look at it.

I don't think there's any problem.  Basically the versioning condition
is if (can_we_compute_niter), most of the time it is an extra
condition from niter analysis under which niter is for example zero.
This should also be the same with all vector sizes.

-               delete loop_vinfo;
+               {
+                 /* Set versioning threshold of the original LOOP_VINFO based
+                    on the last vectorization of the epilog.  */
+                 LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo)
+                   = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+                 delete loop_vinfo;
+               }

I'm not sure this works reliably since the order we process vector
sizes is under target control and not necessarily decreasing.  I think
you want to keep track of the minimum here?  Preferably separately
I guess.
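
A minimal sketch of the "keep track of the minimum" idea, kept independent of GCC internals.  The names candidate_analysis and min_versioning_threshold are hypothetical, and plain unsigned stands in for poly_uint64; the point is only that the result must not depend on the order in which the target lists its vector sizes:

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the outcome of analysing the loop with one
// vector size.  This is not a GCC type.
struct candidate_analysis
{
  unsigned vector_size;          // in bytes, listed in target-defined order
  bool vectorizable;             // did the analysis succeed for this size?
  unsigned versioning_threshold; // iterations needed to take the vector path
};

// Return the lowest versioning threshold among the successful analyses,
// or an empty optional if no candidate could be vectorized.
std::optional<unsigned>
min_versioning_threshold (const std::vector<candidate_analysis> &candidates)
{
  std::optional<unsigned> lowest;
  for (const candidate_analysis &c : candidates)
    {
      if (!c.vectorizable)
        continue;
      lowest = lowest ? std::min (*lowest, c.versioning_threshold)
                      : c.versioning_threshold;
    }
  return lowest;
}
```

Taking the threshold of whichever size happened to be analysed last would give the same answer only if the sizes were visited in decreasing order, which is the assumption questioned above.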

From what I see vect_analyze_loop_2 doesn't need vect_epilogues_nomask
and thus it doesn't change throughout the iteration.

       else
-       delete loop_vinfo;
+       {
+         /* Disable epilog vectorization if we can't determine the epilogs can
+            be vectorized.  */
+         *vect_epilogues_nomask &= vectorized_loops > 1;
+         delete loop_vinfo;
+       }

and this is a bit premature and instead it should be done
just before returning success?  Maybe also storing the
epilogue vector sizes we can handle in the loop_vinfo,
thereby representing !vect_epilogues_nomask if there are no
such sizes which also means that

@@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,

   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-                                new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops" count.
+       */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+                                  new_loop, loop_vinfo, NULL, NULL);
+    }


can be optimized to not re-check all smaller sizes (but even assert
re-analysis succeeds to the original result for the actual transform).
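
For readers unfamiliar with the hunk quoted above, the following sketch models only its counter handling: the recursive call for the epilogue receives a scratch copy of the counter, so the epilogue is never added to the caller's total.  loop_model and try_vectorize are hypothetical names, and the sketch assumes every loop vectorizes successfully:

```cpp
#include <cassert>

// Hypothetical model of a loop that may have an epilogue loop attached.
struct loop_model
{
  loop_model *epilogue;
};

// Pretend to vectorize LOOP and count it in *NUM_VECTORIZED_LOOPS.
// Epilogues are vectorized too, but against a scratch counter, so they
// do not show up in the caller's count.
bool
try_vectorize (loop_model *loop, unsigned *num_vectorized_loops)
{
  if (!loop)
    return false;

  ++*num_vectorized_loops;

  if (loop->epilogue)
    {
      unsigned dont_count = *num_vectorized_loops;
      try_vectorize (loop->epilogue, &dont_count);
    }
  return true;
}
```

With a main loop followed by two chained epilogues, the caller's counter ends at 1.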

Otherwise this looks reasonable to me.

Thanks,
Richard.

> 
> Bootstrapped this and the next patch on x86_64 and aarch64-unknown-linux-gnu,
> with no regressions (after test changes in next patch).
> 
> gcc/ChangeLog:
> 2019-08-23  Andre Vieira  <andre.simoesdiasvieira@arm.com>
> 
>          PR 88915
>          * gentype.c (main): Add poly_uint64 type to generator.
>          * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
>          whether we vectorize epilogue loops.
>          (vect_analyze_loop): Idem.
>          (vect_transform_loop): Pass decision to vectorize epilogues
>          to vect_do_peeling.
>          * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
>          when doing loop versioning if we decided to vectorize epilogues.
>          (vect-loop_versioning): Moved decision to check_profitability
>          based on cost model.
>          * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
>          vect_analyze_loop, vect_transform_loop): Update declarations.
>          * tree-vectorizer.c: Include params.h
>          (try_vectorize_loop_1): Initialize vect_epilogues_nomask
>          to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
>          and vect_transform_loop.  Also make sure vectorizing epilogues
>          does not count towards number of vectorized loops.
> 
>
Jeff Law Sept. 4, 2019, 3:34 p.m. UTC | #2
On 8/23/19 10:50 AM, Andre Vieira (lists) wrote:
> Hi,
> 
> This patch is an improvement on my last RFC.  As you pointed out, we can
> do the vectorization analysis of the epilogues before doing the
> transformation, using the same approach as used by openmp simd.  I have
> not yet incorporated the cost tweaks for vectorizing the epilogue, I
> would like to do this in a subsequent patch, to make it easier to test
> the differences.
> 
> I currently disable the vectorization of epilogues when versioning for
> iterations.  This is simply because I do not completely understand how
> the assumptions are created and couldn't determine whether using
> skip_vectors with this would work.  If you don't think it is a problem
> or have a testcase to show it work I would gladly look at it.
> 
> Bootstrapped this and the next patch on x86_64 and
> aarch64-unknown-linux-gnu, with no regressions (after test changes in
> next patch).
> 
> gcc/ChangeLog:
> 2019-08-23  Andre Vieira  <andre.simoesdiasvieira@arm.com>
> 
>          PR 88915
>          * gentype.c (main): Add poly_uint64 type to generator.
>          * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
>          whether we vectorize epilogue loops.
>          (vect_analyze_loop): Idem.
>          (vect_transform_loop): Pass decision to vectorize epilogues
>          to vect_do_peeling.
>          * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
>          when doing loop versioning if we decided to vectorize epilogues.
>          (vect-loop_versioning): Moved decision to check_profitability
>          based on cost model.
>          * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
>          vect_analyze_loop, vect_transform_loop): Update declarations.
>          * tree-vectorizer.c: Include params.h
>          (try_vectorize_loop_1): Initialize vect_epilogues_nomask
>          to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
>          and vect_transform_loop.  Also make sure vectorizing epilogues
>          does not count towards number of vectorized loops.
Nit.  In several places you use "epilog", proper spelling is "epilogue".



> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
>  
>    /* Epilogue of vectorized loop must be vectorized too.  */
>    if (new_loop)
> -    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
> -				 new_loop, loop_vinfo, NULL, NULL);
> +    {
> +      /* Don't include vectorized epilogues in the "vectorized loops" count.
> +       */
> +      unsigned dont_count = *num_vectorized_loops;
> +      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
> +				   new_loop, loop_vinfo, NULL, NULL);
> +    }
Nit.  Don't wrap a comment with just the closing */ on its own line.
Instead, wrap before "count" so that the closing */ does not end up on a line by itself.

This is fine for the trunk after fixing those nits.

jeff
Andre Vieira (lists) Sept. 18, 2019, 11:11 a.m. UTC | #3
Hi Richard,

As I mentioned in the IRC channel, this is my current work in progress 
patch. It currently ICE's when vectorizing 
gcc/testsuite/gcc.c-torture/execute/nestfunc-2.c with '-O3' and '--param 
vect-epilogues-nomask=1'.

It ICE's because the epilogue loop (after if conversion) and main loop 
(before vectorization) are not the same, there are a bunch of extra BBs 
and the epilogue loop seems to need some cleaning up too.

Let me know if you see a way around this issue.

Cheers,
Andre
Andre Vieira (lists) Oct. 8, 2019, 1:16 p.m. UTC | #4
Hi Richard,

As I mentioned in the IRC channel, I managed to get "most" of the 
regression testsuite working for x86_64 (avx512) and aarch64.

On x86_64 I get a failure that I can't explain, was hoping you might be 
able to have a look with me:
"PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"

vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first 
iteration of the main loop.  The tree dumps look alright, but I do 
notice the stack usage seems to change between --param 
vect-epilogues-nomask={0,1}.

Am I forgetting to update some field that may later affect the amount of 
stack being used? I am confused; it could very well be that I am missing 
something obvious, as I am not too familiar with x86's ISA. I will try to 
investigate further.

This patch needs further clean-up and more comments (or comment 
updates), but I thought I'd share current state to see if you can help 
me unblock.

Cheers,
Andre
Richard Biener Oct. 9, 2019, 8:54 a.m. UTC | #5
On Tue, 8 Oct 2019, Andre Vieira (lists) wrote:

> Hi Richard,
> 
> As I mentioned in the IRC channel, I managed to get "most" of the regression
> testsuite working for x86_64 (avx512) and aarch64.
> 
> On x86_64 I get a failure that I can't explain, was hoping you might be able
> to have a look with me:
> "PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"
> 
> vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first
> iteration of the main loop.  The tree dumps look alright, but I do notice the
> stack usage seems to change between --param vect-epilogue-nomask={0,1}.

So the issue is that we have

=> 0x0000000000400778 <+72>:    vmovdqa64 %zmm1,-0x40(%rax)

but the memory accessed is not appropriately aligned.  The vectorizer
sets DECL_USER_ALIGN on the stack local but somehow later it lowers
it to 256:

Old value = 640
New value = 576
ensure_base_align (dr_info=0x526f788) at /tmp/trunk/gcc/tree-vect-stmts.c:6294
6294              DECL_USER_ALIGN (base_decl) = 1;
(gdb) l
6289          if (decl_in_symtab_p (base_decl))
6290            symtab_node::get (base_decl)->increase_alignment (align_base_to);
6291          else
6292            {
6293              SET_DECL_ALIGN (base_decl, align_base_to);
6294              DECL_USER_ALIGN (base_decl) = 1;
6295            }

this means vectorizing the epilogue modifies the DRs, in particular
the base alignment?

> Am I missing to update some field that may later lead to the amount of stack
> being used? I am confused, it could very well be that I am missing something
> obvious, I am not too familiar with x86's ISA. I will try to investigate
> further.
> 
> This patch needs further clean-up and more comments (or comment updates), but
> I thought I'd share current state to see if you can help me unblock.
> 
> Cheers,
> Andre
>
Andre Vieira (lists) Oct. 10, 2019, 1:50 p.m. UTC | #6
Hi,

After all the discussions and respins I now believe this patch is close 
to what we envisioned.

This patch achieves two things when vect-epilogues-nomask=1:
1) It analyzes the original loop for each supported vector size and 
saves this analysis per loop, as well as the vector sizes we know we can 
vectorize the loop for.
2) When loop versioning it uses the 'skip_vector' code path to vectorize 
the epilogue, and uses the lowest versioning threshold of the main 
loop and its epilogues.

As a side effect of this patch I also changed ensure_base_align to only 
update the alignment if the new alignment is higher than the current one. 
This function already did that if the object was a symbol; now it 
behaves this way for any object.
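
The guard in the tree-vect-stmts.c hunk quoted later in this thread (else if (DECL_ALIGN (base_decl) < align_base_to)) can be modeled outside GCC as follows.  decl_model and ensure_align_at_least are hypothetical names used only for illustration; the sketch shows that a later, smaller request (for example from an epilogue using narrower vectors) leaves a larger alignment untouched:

```cpp
#include <cassert>

// Hypothetical stand-in for a declaration that is not in the symbol table.
struct decl_model
{
  unsigned align_bits; // current alignment in bits
  bool user_align;     // mirrors the role of DECL_USER_ALIGN
};

// Raise DECL's alignment to WANTED_BITS.  An alignment that is already
// at least WANTED_BITS is left alone, so requests can arrive in any order.
void
ensure_align_at_least (decl_model &decl, unsigned wanted_bits)
{
  if (decl.align_bits < wanted_bits)
    {
      decl.align_bits = wanted_bits;
      decl.user_align = true;
    }
}
```

Without the comparison, the last request would win, which matches the symptom discussed earlier in the thread where an aligned 512-bit store faulted on an under-aligned stack local.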

I bootstrapped this patch with both vect-epilogues-nomask turned on and 
off on x86_64 (AVX512) and aarch64.  Regression tests looked good.

Is this OK for trunk?

gcc/ChangeLog:
2019-10-10  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * cfgloop.h (loop): Add epilogue_vsizes member.
     * cfgloop.c (flow_loop_free): Release epilogue_vsizes.
     (alloc_loop): Initialize epilogue_vsizes.
     * gentype.c (main): Add poly_uint64 type and vector_sizes to
     generator.
     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
     (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (determine_peel_for_niter): New. Outlined code to re-use in two
     places.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (replace_ops): New helper function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and update necessary information.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     (vect_loop_versioning): Moved decision to check_profitability
     based on cost model.
     * tree-vect-stmts.c (ensure_base_align): Only update alignment
     if new alignment is higher.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member.
     (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters,
     vect_update_inits_of_drs, determine_peel_for_niter,
     vect_analyze_loop): Add or update declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise analyse
     epilogue separately.



Cheers,
Andre
Richard Biener Oct. 11, 2019, 12:57 p.m. UTC | #7
On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> After all the discussions and respins I now believe this patch is close to
> what we envisioned.
> 
> This patch achieves two things when vect-epilogues-nomask=1:
> 1) It analyzes the original loop for each supported vector size and saves this
> analysis per loop, as well as the vector sizes we know we can vectorize the
> loop for.
> 2) When loop versioning it uses the 'skip_vector' code path to vectorize the
> epilogue, and uses the lowest versioning threshold between the main and
> epilogue's.
> 
> As side effects of this patch I also changed ensure_base_align to only update
> the alignment if the new alignment is lower than the current one.  This
> function already did that if the object was a symbol, now it behaves this way
> for any object.
> 
> I bootstrapped this patch with both vect-epilogues-nomask turned on and off on
> x86_64 (AVX512) and aarch64.  Regression tests looked good.
> 
> Is this OK for trunk?

+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };

please don't enlarge struct loop, instead track this somewhere
in the vectorizer (in loop_vinfo?  I see you already have
epilogue_vinfos there - so the loop_vinfo simply lacks 
convenient access to the vector_size?)  I don't see any
use that could be trivially adjusted to look at a loop_vinfo
member instead.

For the vect_update_inits_of_drs this means that we'd possibly
do less CSE.  Not sure if really an issue.

You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
LOOP_VINFO_EPILOGUE_P.

@@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   else
     niters_prolog = build_int_cst (type, 0);
     
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    { 
...
+       vect_epilogues = false;
+    }
+

I don't understand what all this does - it clearly needs a comment.
Maybe the overall comment of the function should be amended with
an overview of how we handle [multiple] epilogue loop vectorization?

+
+      if (epilogue_any_upper_bound && prolog_peeling >= 0)
+       {
+         epilog->any_upper_bound = true;
+         epilog->nb_iterations_upper_bound = eiters + 1;
+       }
+

comment missing.  How can prolog_peeling be < 0?  We likely
didn't set the upper bound because we don't know it in the
case we skipped the vector loop (skip_vector)?  So make sure
to not introduce wrong-code issues here - maybe do this
optimization as followup?

 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-                     unsigned int th, bool check_profitability,
-                     poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 { 
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +3151,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);

-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
                             build_int_cst (TREE_TYPE (scalar_loop_iters),
                                            th - 1));

split out this refactoring - preapproved.

@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }

-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+    estimated_niter = estimated_stmt_executions_int (loop);
   if (estimated_niter == -1)
     estimated_niter = likely_max_stmt_executions_int (loop);
   if (estimated_niter != -1

it's clearer if the old code is completely in a else {} path
even though vect_vf_for_cost - 1 should never be -1.
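
A sketch of the suggested shape, with the pre-existing fallback chain entirely inside the else branch.  estimate_niter and its parameters are hypothetical stand-ins for the GCC queries in the hunk above, with -1 meaning "unknown" as it does there:

```cpp
#include <cassert>

// Hypothetical model of the iteration estimate used for costing.
// IS_EPILOGUE plays the role of LOOP_VINFO_ORIG_LOOP_INFO being set,
// MAIN_VF the main loop's vectorization factor, and ESTIMATED and
// LIKELY_MAX the two profile-based queries (-1 when unknown).
long
estimate_niter (bool is_epilogue, long main_vf, long estimated,
                long likely_max)
{
  long niter;
  if (is_epilogue)
    /* The patch estimates the epilogue trip count as the main loop's
       VF minus one.  */
    niter = main_vf - 1;
  else
    {
      /* Old behaviour, untouched: first estimate, then likely maximum.  */
      niter = estimated;
      if (niter == -1)
        niter = likely_max;
    }
  return niter;
}
```

Written this way, an epilogue never falls through to the profile-based queries, which makes the control flow match the intent even though main_vf - 1 is not expected to be -1.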

+/* Decides whether we need to create an epilogue loop to handle
+   remaining scalar iterations and sets PEELING_FOR_NITERS accordingly.  */
+      
+void                  
+determine_peel_for_niter (loop_vec_info loop_vinfo)
+{   
+  

extra vertical space

+  unsigned HOST_WIDE_INT const_vf;
+  HOST_WIDE_INT max_niter 

if it's a 1:1 copy outlined then split it out - preapproved
(so further reviews get smaller patches ;))  I'd add a
LOOP_VINFO_PEELING_FOR_NITER () = false as final else
since that's what we do by default?

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))

not sure why we need to do this for epilogues?

+
+      /*  Use the same condition as vect_transform_loop to decide when to use
+         the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+         && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+         && ordered_p (th, niters_th))
+       niters_th = ordered_max (poly_uint64 (th), niters_th);

that's an independent change, right?  Please split out, it's
pre-approved if it tests OK separately.

+static tree
+replace_ops (tree op, hash_map<tree, tree> &mapping)
+{

I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree
comes up first (not a 1:1 match but hints at a possible tree
sharing issue in your variant).

@@ -8497,11 +8588,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_NOTE, vect_location,
-                        "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+       if (dump_enabled_p ())
+         dump_printf_loc (MSG_NOTE, vect_location,
+                          "Profitability threshold is %d loop iterations.\n",
+                          th);
+       check_profitability = true;
     }

   /* Make sure there exists a single-predecessor exit bb.  Do this before

obvious (separate)

+  tree advance;
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
                              &step_vector, &niters_vector_mult_vf, th,
-                             check_profitability, niters_no_overflow);
+                             check_profitability, niters_no_overflow,
+                             &advance);
+
+  if (epilogue)
+    {
+      basic_block *orig_bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
...

please record this in vect_do_peeling itself and store the
orig_stmts/drs/etc. in the epilogue loop_vinfo and ...

+      /* We are done vectorizing the main loop, so now we update the epilogues
+        stmt_vec_info's.  At the same time we set the gimple UID of each
+        statement in the epilogue, as these are used to look them up in the
+        epilogues loop_vec_info later.  We also keep track of what
...

split this out to a new function.  I wonder why you need to record
the DRs, are they not available via ->datarefs and lookup_dr ()?

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 601a6f55fbff388c89f88d994e790aebf2bf960e..201549da6c0cbae0797a23ae1b8967b9895505e9 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -6288,7 +6288,7 @@ ensure_base_align (dr_vec_info *dr_info)

       if (decl_in_symtab_p (base_decl))
        symtab_node::get (base_decl)->increase_alignment (align_base_to);
-      else
+      else if (DECL_ALIGN (base_decl) < align_base_to)
        {
          SET_DECL_ALIGN (base_decl, align_base_to);
           DECL_USER_ALIGN (base_decl) = 1;

split out - preapproved.

Still have to go over the main loop doing the analysis/transform.

Thanks, it looks really promising (albeit expectedly ugly due to
the data rewriting).

Richard.


> gcc/ChangeLog:
> 2019-10-10  Andre Vieira  <andre.simoesdiasvieira@arm.com>
> 
>     PR 88915
>     * cfgloop.h (loop): Add epilogue_vsizes member.
>     * cfgloop.c (flow_loop_free): Release epilogue_vsizes.
>     (alloc_loop): Initialize epilogue_vsizes.
>     * gentype.c (main): Add poly_uint64 type and vector_sizes to
>     generator.
>     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
>     (_loop_vec_info): Initialize epilogue_vinfos.
>     (~_loop_vec_info): Release epilogue_vinfos.
>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
>     number of iterations of epilogue.
>     (determine_peel_for_niter): New. Outlined code to re-use in two
>     places.
>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
>     versioning threshold needed for main loop.
>     (vect_analyze_loop): Likewise.
>     (replace_ops): New helper function.
>     (vect_transform_loop): When vectorizing epilogues re-use analysis done
>     on main loop and update necessary information.
>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
>     stmts on loop preheader edge.
>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
>     we decided to vectorize epilogues.  Update epilogues NITERS and
>     construct ADVANCE to update epilogues data references where needed.
>     (vect_loop_versioning): Moved decision to check_profitability
>     based on cost model.
>     * tree-vect-stmts.c (ensure_base_align): Only update alignment
>     if new alignment is lower.
>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member.
>     (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters,
>     vect_update_inits_of_drs, determine_peel_for_niter,
>     vect_analyze_loop): Add or update declarations.
>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
>     create loop_vec_info's for epilogues when available.  Otherwise analyse
>     epilogue separately.
> 
> 
> 
> Cheers,
> Andre
>
Andre Vieira (lists) Oct. 22, 2019, 12:48 p.m. UTC | #8
Hi Richi,

See inline responses to your comments.

On 11/10/2019 13:57, Richard Biener wrote:
> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:
> 
>> Hi,
>>

> 
> +
> +  /* Keep track of vector sizes we know we can vectorize the epilogue
> with.  */
> +  vector_sizes epilogue_vsizes;
>   };
> 
> please don't enlarge struct loop, instead track this somewhere
> in the vectorizer (in loop_vinfo?  I see you already have
> epilogue_vinfos there - so the loop_vinfo simply lacks
> convenient access to the vector_size?)  I don't see any
> use that could be trivially adjusted to look at a loop_vinfo
> member instead.

Done.
> 
> For the vect_update_inits_of_drs this means that we'd possibly
> do less CSE.  Not sure if really an issue.

CSE of what exactly? You are afraid we are repeating a calculation here 
we have done elsewhere before?

> 
> You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
> LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
> LOOP_VINFO_EPILOGUE_P.

I checked, and the places where I use LOOP_VINFO_ORIG_LOOP_INFO are ones 
where I then use the resulting loop info.  If there are cases you feel 
strongly about let me know.
> 
> @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
>     else
>       niters_prolog = build_int_cst (type, 0);
>       
> +  loop_vec_info epilogue_vinfo = NULL;
> +  if (vect_epilogues)
> +    {
> ...
> +       vect_epilogues = false;
> +    }
> +
> 
> I don't understand what all this does - it clearly needs a comment.
> Maybe the overall comment of the function should be amended with
> an overview of how we handle [multiple] epilogue loop vectorization?

I added more comments both here and on top of the function.  Hopefully 
it is a bit clearer now, but it might need some tweaking.

> 
> +
> +      if (epilogue_any_upper_bound && prolog_peeling >= 0)
> +       {
> +         epilog->any_upper_bound = true;
> +         epilog->nb_iterations_upper_bound = eiters + 1;
> +       }
> +
> 
> comment missing.  How can prolog_peeling be < 0?  We likely
> didn't set the upper bound because we don't know it in the
> case we skipped the vector loop (skip_vector)?  So make sure
> to not introduce wrong-code issues here - maybe do this
> optimization as followup?
> 

So the reason for this code wasn't so much an optimization as it was for 
correctness.  But I was mistaken, the failure I was seeing without this 
code was not because of this code, but rather being hidden by it. The 
problem I was seeing was that a prolog was being created using the 
original loop copy, rather than the scalar loop, leading to MASK_LOAD 
and MASK_STORE being left in the scalar prolog, leading to expand ICEs. 
I have fixed that issue by making sure the SCALAR_LOOP is used for 
prolog peeling and either the loop copy or SCALAR loop for epilogue 
peeling depending on whether we will be vectorizing said epilogue.


> @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info
> loop_vinfo)
>         return 0;
>       }
> 
> -  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
> +  HOST_WIDE_INT estimated_niter = -1;
> +
> +  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
> +    estimated_niter
> +      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
> +  if (estimated_niter == -1)
> +    estimated_niter = estimated_stmt_executions_int (loop);
>     if (estimated_niter == -1)
>       estimated_niter = likely_max_stmt_executions_int (loop);
>     if (estimated_niter != -1
> 
> it's clearer if the old code is completely in a else {} path
> even though vect_vf_for_cost - 1 should never be -1.
> 
Done for the == -1 cases, need to keep the != -1 outside of course.
> -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
> +  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
> +      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
> +         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))
> 
> not sure why we need to do this for epilogues?
> 

This is because we want to compute the versioning threshold for 
epilogues such that we can use the minimum versioning threshold when 
versioning the main loop.  The reason we need to ask the 
original main loop is partially because of code in 
'vect_analyze_data_ref_dependences' that chooses to not do DR dependence 
analysis and thus never fills LOOP_VINFO_MAY_ALIAS_DDRS for the 
epilogues loop_vinfo and as a consequence LOOP_VINFO_COMP_ALIAS_DDRS is 
always 0.

The piece of code is preceded by this comment:
   /* For epilogues we either have no aliases or alias versioning
      was applied to original loop.  Therefore we may just get max_vf
      using VF of original loop.  */

I have added some comments to make it clearer.
> 
> +static tree
> +replace_ops (tree op, hash_map<tree, tree> &mapping)
> +{
> 
> I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree
> comes up first (not a 1:1 match but hints at a possible tree
> sharing issue in your variant).
> 

The reason I couldn't use simplify_replace_tree is because I didn't know 
what the "OLD" value is at the time I want to call it.  Basically I want 
to check whether an SSA name is a key in MAPPING and if so replace it 
with the corresponding VALUE.

I have changed simplify_replace_tree such that valueize can take a 
context parameter. I replaced one use of replace_ops with it and the 
other I specialized as I found that it was always a MEM_REF and we 
needed to replace the address it was dereferencing.
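
As a rough illustration of the "valueize with a context parameter" idea described above, here is a minimal standalone sketch.  It does not use GCC's real types: `name_t` stands in for `tree`, `std::unordered_map` stands in for `hash_map<tree, tree>`, and `replace_name` is a hypothetical stand-in for the single-operand case of simplify_replace_tree.  Only the shape of the callback (value plus `void *` context) follows the description.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for GCC's `tree` (an SSA name here).
using name_t = std::string;

// Valueize callback that also receives an opaque context pointer.
typedef name_t (*valueize_fn) (const name_t &, void *);

// Look T up in the mapping passed through CONTEXT; return the mapped
// value if T is a key, otherwise return T unchanged.
name_t
find_in_mapping (const name_t &t, void *context)
{
  auto *mapping
    = static_cast<std::unordered_map<name_t, name_t> *> (context);
  auto it = mapping->find (t);
  return it != mapping->end () ? it->second : t;
}

// Hypothetical replacement routine: no "OLD" value is needed up front,
// the callback decides per operand whether a replacement applies.
name_t
replace_name (const name_t &t, valueize_fn valueize, void *context)
{
  assert (valueize == nullptr || context != nullptr);
  return valueize ? valueize (t, context) : t;
}
```

Passing the map through the context pointer means the caller does not need to know which names will be replaced before making the call; names that are not keys are returned unchanged.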

> 
> +  tree advance;
>     epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1,
> &niters_vector,
>                                &step_vector, &niters_vector_mult_vf, th,
> -                             check_profitability, niters_no_overflow);
> +                             check_profitability, niters_no_overflow,
> +                             &advance);
> +
> +  if (epilogue)
> +    {
> +      basic_block *orig_bbs = get_loop_body (loop);
> +      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
> ...
> 
> orig_stmts/drs/etc. in the epilogue loop_vinfo and ...
> 
> +      /* We are done vectorizing the main loop, so now we update the
> epilogues
> +        stmt_vec_info's.  At the same time we set the gimple UID of each
> +        statement in the epilogue, as these are used to look them up in
> the
> +        epilogues loop_vec_info later.  We also keep track of what
> ...
> 
> split this out to a new function.  I wonder why you need to record
> the DRs, are they not available via ->datarefs and lookup_dr ()?

lookup_dr may no longer work at this point. I found that for some memory 
accesses by the time I got to this point, the DR_STMT of the 
data_reference pointed to a scalar statement that no longer existed and 
the lookup_dr to that data reference ICEs.  I can't make this update 
before we transform the loop because the data references are shared, so 
I decided to capture the dr_vec_info's instead. Apparently we don't ever 
do a lookup_dr past this point, which I must admit is surprising.

> Still have to go over the main loop doing the analysis/transform.
> 
> Thanks, it looks really promising (albeit exepectedly ugly due to
> the data rewriting).
> 

Yeah, though I feel like now that I have put it away into functions it 
makes it look cleaner.  That vect_transform_loop function was getting 
too big!

Is this OK for trunk?

gcc/ChangeLog:
2019-10-22  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * gentype.c (main): Add poly_uint64 type and vector_sizes to
     generator.
     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
     and make the valueize function pointer also take a void pointer.
     * tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
     around vn_valueize, to call it without a context.
     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
     (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (find_in_mapping): New helper function.
     (update_epilogue_loop_vinfo): New function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and call update_epilogue_loop_vinfo to update it.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos,
     epilogue_vsizes and update_epilogue_vinfo members.
     (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS,
      LOOP_VINFO_EPILOGUE_SIZES): Define MACROs.
     (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs,
      determine_peel_for_niter, vect_analyze_loop): Add or update
      declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise
     analyse epilogue separately.
Richard Biener Oct. 22, 2019, 1:56 p.m. UTC | #9
On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:

> Hi Richi,
> 
> See inline responses to your comments.
> 
> On 11/10/2019 13:57, Richard Biener wrote:
> > On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:
> > 
> >> Hi,
> >>
> 
> > 
> > +
> > +  /* Keep track of vector sizes we know we can vectorize the epilogue
> > with.  */
> > +  vector_sizes epilogue_vsizes;
> >   };
> > 
> > please don't enlarge struct loop, instead track this somewhere
> > in the vectorizer (in loop_vinfo?  I see you already have
> > epilogue_vinfos there - so the loop_vinfo simply lacks
> > convenient access to the vector_size?)  I don't see any
> > use that could be trivially adjusted to look at a loop_vinfo
> > member instead.
> 
> Done.
> > 
> > For the vect_update_inits_of_drs this means that we'd possibly
> > do less CSE.  Not sure if really an issue.
> 
> CSE of what exactly? You are afraid we are repeating a calculation here we
> have done elsewhere before?

All uses of those inits now possibly get the expression instead of
just the SSA name we inserted code for once.  But as said, we'll see.

> > 
> > You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
> > LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
> > LOOP_VINFO_EPILOGUE_P.
> 
> I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I
> then use the resulting loop info.  If there are cases you feel strongly about
> let me know.

Not too strongly, no.

> > 
> > @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> > niters, tree nitersm1,
> >     else
> >       niters_prolog = build_int_cst (type, 0);
> >       
> > +  loop_vec_info epilogue_vinfo = NULL;
> > +  if (vect_epilogues)
> > +    {
> > ...
> > +       vect_epilogues = false;
> > +    }
> > +
> > 
> > I don't understand what all this does - it clearly needs a comment.
> > Maybe the overall comment of the function should be amended with
> > an overview of how we handle [multiple] epilogue loop vectorization?
> 
> I added more comments both here and on top of the function.  Hopefully it is a
> bit clearer now, but it might need some tweaking.
> 
> > 
> > +
> > +      if (epilogue_any_upper_bound && prolog_peeling >= 0)
> > +       {
> > +         epilog->any_upper_bound = true;
> > +         epilog->nb_iterations_upper_bound = eiters + 1;
> > +       }
> > +
> > 
> > comment missing.  How can prolog_peeling be < 0?  We likely
> > didn't set the upper bound because we don't know it in the
> > case we skipped the vector loop (skip_vector)?  So make sure
> > to not introduce wrong-code issues here - maybe do this
> > optimization as followup?
> > 
> 
> So the reason for this code wasn't so much an optimization as it was for
> correctness.  But I was mistaken, the failure I was seeing without this code
> was not because of this code, but rather being hidden by it. The problem I was
> seeing was that a prolog was being created using the original loop copy,
> rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in
> the scalar prolog, leading to expand ICEs. I have fixed that issue by making
> sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or
> SCALAR loop for epilogue peeling depending on whether we will be vectorizing
> said epilogue.

OK.

> 
> > @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info
> > loop_vinfo)
> >         return 0;
> >       }
> > 
> > -  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
> > +  HOST_WIDE_INT estimated_niter = -1;
> > +
> > +  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
> > +    estimated_niter
> > +      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
> > +  if (estimated_niter == -1)
> > +    estimated_niter = estimated_stmt_executions_int (loop);
> >     if (estimated_niter == -1)
> >       estimated_niter = likely_max_stmt_executions_int (loop);
> >     if (estimated_niter != -1
> > 
> > it's clearer if the old code is completely in a else {} path
> > even though vect_vf_for_cost - 1 should never be -1.
> > 
> Done for the == -1 cases, need to keep the != -1 outside of course.
> > -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
> > +  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > +      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
> > +         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))
> > 
> > not sure why we need to do this for epilouges?
> > 
> 
> This is because we want to compute the versioning threshold for epilogues such
> that we can use the minimum versioning threshold when versioning the main
> loop.  The reason we need to ask we need to ask the original main loop is
> partially because of code in 'vect_analyze_data_ref_dependences' that chooses
> to not do DR dependence analysis and thus never fills
> LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence
> LOOP_VINFO_COMP_ALIAS_DDRS is always 0.
> 
> The piece of code is preceded by this comment:
>   /* For epilogues we either have no aliases or alias versioning
>      was applied to original loop.  Therefore we may just get max_vf
>      using VF of original loop.  */
> 
> I have added some comments to make it clearer.
> > 
> > +static tree
> > +replace_ops (tree op, hash_map<tree, tree> &mapping)
> > +{
> > 
> > I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree
> > comes up first (not a 1:1 match but hints at a possible tree
> > sharing issue in your variant).
> > 
> 
> The reason I couldn't use simplify_replace_tree is because I didn't know what
> the "OLD" value is at the time I want to call it.  Basically I want to check
> whether an SSA name is a key in MAPPING and if so replace it with the
> corresponding VALUE.
> 
> I have changed simplify_replace_tree such that valueize can take a context
> parameter. I replaced one use of replace_ops with it and the other I
> specialized as I found that it was always a MEM_REF and we needed to replace
> the address it was dereferencing.
> 
> > 
> > +  tree advance;
> >     epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1,
> > &niters_vector,
> >                                &step_vector, &niters_vector_mult_vf, th,
> > -                             check_profitability, niters_no_overflow);
> > +                             check_profitability, niters_no_overflow,
> > +                             &advance);
> > +
> > +  if (epilogue)
> > +    {
> > +      basic_block *orig_bbs = get_loop_body (loop);
> > +      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
> > ...
> > 
> > orig_stmts/drs/etc. in the epilogue loop_vinfo and ...
> > 
> > +      /* We are done vectorizing the main loop, so now we update the
> > epilogues
> > +        stmt_vec_info's.  At the same time we set the gimple UID of each
> > +        statement in the epilogue, as these are used to look them up in
> > the
> > +        epilogues loop_vec_info later.  We also keep track of what
> > ...
> > 
> > split this out to a new function.  I wonder why you need to record
> > the DRs, are they not available via ->datarefs and lookup_dr ()?
> 
> lookup_dr may no longer work at this point. I found that for some memory
> accesses by the time I got to this point, the DR_STMT of the data_reference
> pointed to a scalar statement that no longer existed and the lookup_dr to that
> data reference ICE's.  I can't make this update before we transform the loop
> because the data references are shared, so I decided to capture the
> dr_vec_info's instead. Apparently we don't ever do a lookup_dr past this
> point, which I must admit is surprising.

OK, as long as this fixup code is well isolated we can see how to
make it prettier later ;)  But yes, we have some vectorizer transforms
that remove old stmts (bad).  At least that's true for stores, we
could probably delay actual (scalar) stmt removal until the whole
series of loop + epilogue vectorization is finished.

As said, let's try as followup.

> > Still have to go over the main loop doing the analysis/transform.
> > 
> > Thanks, it looks really promising (albeit exepectedly ugly due to
> > the data rewriting).
> > 
> 
> Yeah, though I feel like now that I have put it away into functions it makes
> it look cleaner.  That vect_transform_loop function was getting too big!
> 
> Is this OK for trunk?

You probably no longer need the gentype.c hunk.

+}
+
+static void
+update_epilogue_loop_vinfo (class loop *epilogue, tree advance)

function comment missing

+
+
+  /* We are done vectorizing the main loop, so now we update the epilogues

too much vertical space.

+  /* We are done vectorizing the main loop, so now we update the epilogues
+     stmt_vec_info's.  At the same time we set the gimple UID of each
+     statement in the epilogue, as these are used to look them up in the
+     epilogues loop_vec_info later.  We also keep track of what
+     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+     need updating and we construct a mapping between variables defined in
+     the main loop and their corresponding names in epilogue.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)

so for the following code I wonder if you can make use of the
fact that loop copying also copies UIDs, so you should be able
to match stmts via their UIDs and get at the other loop infos
stmt_info by the copy loop stmt UID.

I wonder why you need no modification for the SLP tree?

Otherwise the patch looks OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 2019-10-22  Andre Vieira  <andre.simoesdiasvieira@arm.com>
> 
>     PR 88915
>     * gentype.c (main): Add poly_uint64 type and vector_sizes to
>     generator.
>     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
>     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
>     and make the valueize function pointer also take a void pointer.
>     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
>     around vn_valueize, to call it without a context.
>     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
>     * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
>     (_loop_vec_info): Initialize epilogue_vinfos.
>     (~_loop_vec_info): Release epilogue_vinfos.
>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
>     number of iterations of epilogue.
>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
>     versioning threshold needed for main loop.
>     (vect_analyze_loop): Likewise.
>     (find_in_mapping): New helper function.
>     (update_epilogue_loop_vinfo): New function.
>     (vect_transform_loop): When vectorizing epilogues re-use analysis done
>     on main loop and call update_epilogue_loop_vinfo to update it.
>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
>     stmts on loop preheader edge.
>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
>     we decided to vectorize epilogues.  Update epilogues NITERS and
>     construct ADVANCE to update epilogues data references where needed.
>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos,
>     epilogue_vsizes and update_epilogue_vinfo members.
>     (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS,
>     LOOP_VINFO_EPILOGUE_SIZES): Define MACROs.
>     (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs,
>      determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
>     created loop_vec_info's for epilogues when available.  Otherwise
>     analyse epilogue separately.
>
Richard Sandiford Oct. 22, 2019, 5:52 p.m. UTC | #10
Thanks for doing this.  Hope this message doesn't cover too much old
ground or duplicate too much...

"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    else
>      niters_prolog = build_int_cst (type, 0);
>  
> +  loop_vec_info epilogue_vinfo = NULL;
> +  if (vect_epilogues)
> +    {
> +      /* Take the next epilogue_vinfo to vectorize for.  */
> +      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
> +      loop_vinfo->epilogue_vinfos.ordered_remove (0);
> +
> +      /* Don't vectorize epilogues if this is not the most inner loop or if
> +	 the epilogue may need peeling for alignment as the vectorizer doesn't
> +	 know how to handle these situations properly yet.  */
> +      if (loop->inner != NULL
> +	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
> +	vect_epilogues = false;
> +
> +    }

Nit: excess blank line before "}".  Sorry if this was discussed before,
but what's the reason for delaying the check for "loop->inner" to
this point, rather than doing it in vect_analyze_loop?

> +
> +  tree niters_vector_mult_vf;
> +  unsigned int lowest_vf = constant_lower_bound (vf);
> +  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
> +     on niters already ajusted for the iterations of the prologue.  */

Pre-existing typo: adjusted.  But...

> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +      && known_eq (vf, lowest_vf))
> +    {
> +      loop_vec_info orig_loop_vinfo;
> +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> +	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
> +      else
> +	orig_loop_vinfo = loop_vinfo;
> +      vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
> +      unsigned next_size = 0;
> +      unsigned HOST_WIDE_INT eiters
> +	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
> +	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> +
> +      if (prolog_peeling > 0)
> +	eiters -= prolog_peeling;

...is that comment still true?  We're now subtracting the peeling
amount here.

Might be worth asserting prolog_peeling >= 0, just to emphasise
that we can't get here for variable peeling amounts, and then subtract
prolog_peeling unconditionally (assuming that's the right thing to do).
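
To make the suggestion concrete, here is a small standalone sketch of the arithmetic with the assert and the unconditional subtraction.  It assumes that subtracting the prologue amount is indeed the right thing to do (the open question above) and uses plain integers instead of the LOOP_VINFO_* accessors; `epilogue_iters` is a hypothetical name, not something in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of scalar iterations left for the epilogue.
//   niters          known scalar iteration count (assumed NOT yet adjusted
//                   for the prologue)
//   prolog_peeling  constant number of peeled prologue iterations
//   gaps            1 if peeling for gaps, else 0
//   vf              constant vectorization factor
std::uint64_t
epilogue_iters (std::uint64_t niters, std::int64_t prolog_peeling,
		std::uint64_t gaps, std::uint64_t vf)
{
  // Variable peeling amounts (negative values) cannot reach this point.
  assert (prolog_peeling >= 0);
  assert (vf > 0);
  std::uint64_t peel = static_cast<std::uint64_t> (prolog_peeling);
  assert (niters >= gaps + peel);

  // Subtract unconditionally; for prolog_peeling == 0 this is a no-op.
  std::uint64_t eiters = niters - gaps - peel;
  return eiters % vf + gaps;
}
```

For example, 100 iterations with 3 peeled prologue iterations, no gap peeling and VF 8 leave 97 % 8 = 1 iteration for the epilogue.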

> +      eiters
> +	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
> +
> +      unsigned int ratio;
> +      while (next_size < vector_sizes.length ()
> +	     && !(constant_multiple_p (current_vector_size,
> +				       vector_sizes[next_size], &ratio)
> +		  && eiters >= lowest_vf / ratio))
> +	next_size += 1;
> +
> +      if (next_size == vector_sizes.length ())
> +	vect_epilogues = false;
> +    }
> +
>    /* Prolog loop may be skipped.  */
>    bool skip_prolog = (prolog_peeling != 0);
>    /* Skip to epilog if scalar loop may be preferred.  It's only needed
> -     when we peel for epilog loop and when it hasn't been checked with
> -     loop versioning.  */
> +     when we peel for epilog loop or when we loop version.  */
>    bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>  		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
>  				  bound_prolog + bound_epilog)
> -		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
> +		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> +			 || vect_epilogues));

The comment update looks wrong here: without epilogues, we don't need
the skip when loop versioning, because loop versioning ensures that we
have at least one vector iteration.

(I think "it" was supposed to mean "skipping to the epilogue" rather
than the epilogue loop itself, in case that's the confusion.)

It'd be good to mention the epilogue condition in the comment too.
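
For reference, the condition as it stands after the patch can be written as a plain boolean function.  This is only a sketch: `need_skip_vector` is a hypothetical name, the poly_int comparison `maybe_lt` is replaced by `<` on constants, and `bound` stands for bound_prolog + bound_epilog.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical restatement of the skip_vector condition in the hunk above.
bool
need_skip_vector (bool niters_known, std::uint64_t niters,
		  std::uint64_t bound, bool requires_versioning,
		  bool vect_epilogues)
{
  assert (!niters_known || bound > 0);
  if (niters_known)
    // Known iteration count: skip only if there are too few iterations.
    return niters < bound;
  // Unknown iteration count: versioning alone already guarantees at least
  // one vector iteration, so the skip is only needed without versioning
  // or when epilogues are vectorized.
  return !requires_versioning || vect_epilogues;
}
```

Written this way, the versioned case without vectorized epilogues is the only one with an unknown iteration count that needs no skip, which is the case the comment should single out.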

> @@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  
>    dump_user_location_t loop_loc = find_loop_location (loop);
>    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> +  if (vect_epilogues)
> +    /* Make sure to set the epilogue's epilogue scalar loop, such that we can
> +       we can use the original scalar loop as remaining epilogue if
> +       necessary.  */

Double "we can".

> +    LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
> +      = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> +
>    if (prolog_peeling)
>      {
>        e = loop_preheader_edge (loop);
> @@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  			   "loop can't be duplicated to exit edge.\n");
>  	  gcc_unreachable ();
>  	}
> -      /* Peel epilog and put it on exit edge of loop.  */
> -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
> +      /* Peel epilog and put it on exit edge of loop.  If we are vectorizing
> +	 said epilog then we should use a copy of the main loop as a starting
> +	 point.  This loop may have been already had some preliminary

s/been//

> +	 transformations to allow for more optimal vectorizationg, for example

typo: vectorizationg

> +	 if-conversion.  If we are not vectorizing the epilog then we should
> +	 use the scalar loop as the transformations mentioned above make less
> +	 or no sense when not vectorizing.  */
> +      epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
>        if (!epilog)
>  	{
>  	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
>  			   "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n");
>  	  gcc_unreachable ();
>  	}
> +
>        epilog->force_vectorize = false;
>        slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
>  
> [...]
> @@ -2699,10 +2774,163 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>        adjust_vec_debug_stmts ();
>        scev_reset ();
>      }
> +
> +  if (vect_epilogues)
> +    {
> +      epilog->aux = epilogue_vinfo;
> +      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
> +
> +      loop_constraint_clear (epilog, LOOP_C_INFINITE);
> +
> +      /* We now must calculate the number of iterations for our epilogue.  */
> +      tree cond_niters, niters;
> +
> +      /* Depending on whether we peel for gaps we take niters or niters - 1,
> +	 we will refer to this as N - G, where N and G are the NITERS and
> +	 GAP for the original loop.  */
> +      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +	? LOOP_VINFO_NITERSM1 (loop_vinfo)
> +	: LOOP_VINFO_NITERS (loop_vinfo);
> +
> +      /* Here we build a vector factorization mask:
> +	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */
> +      tree vf_mask = build_int_cst (TREE_TYPE (niters),
> +				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> +      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
> +			     vf_mask,
> +			     build_one_cst (TREE_TYPE (vf_mask)));
> +      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
> +
> +      /* Here we calculate:
> +	 niters = N - ((N-G) & ~(VF -1)) */
> +      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
> +			    LOOP_VINFO_NITERS (loop_vinfo),
> +			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
> +					 niters,
> +					 vf_mask));

Might be a daft question, sorry, but why does this need to be so
complicated?  Couldn't we just use the final value of the main loop's
IV to calculate how many iterations are left?

The current code wouldn't for example work for non-power-of-2 SVE vectors.
vect_set_loop_condition_unmasked is structured to cope with that case
(in length-agnostic mode only), even when an epilogue is needed.
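
The power-of-2 dependence can be seen with plain integers.  The sketch below compares the masked form from the hunk with a division-based reference; both function names are hypothetical and the VF of 12 is used purely as an example of a non-power-of-2 value.

```cpp
#include <cassert>
#include <cstdint>

// N - ((N - G) & ~(VF - 1)): the form built in the hunk above.
// Only rounds down to a multiple of VF when VF is a power of 2.
std::uint64_t
epilogue_niters_mask (std::uint64_t n, std::uint64_t g, std::uint64_t vf)
{
  assert (vf > 0 && n >= g);
  return n - ((n - g) & ~(vf - 1));
}

// N - ((N - G) / VF) * VF: reference that rounds down for any VF.
std::uint64_t
epilogue_niters_div (std::uint64_t n, std::uint64_t g, std::uint64_t vf)
{
  assert (vf > 0 && n >= g);
  return n - ((n - g) / vf) * vf;
}
```

With N = 100 and G = 0, both forms give 4 for VF = 8, but for VF = 12 the masked form gives 0 while the reference gives 4, so the epilogue would be given the wrong iteration count.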

> [...]
> -  return epilog;
> +  if (vect_epilogues)
> +    {
> +      basic_block *bbs = get_loop_body (loop);
> +      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilog);
> +
> +      LOOP_VINFO_UP_STMTS (epilogue_vinfo).create (0);
> +      LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).create (0);
> +      LOOP_VINFO_UP_DRS (epilogue_vinfo).create (0);
> +
> +      gimple_stmt_iterator gsi;
> +      gphi_iterator phi_gsi;
> +      gimple *stmt;
> +      stmt_vec_info stmt_vinfo;
> +      dr_vec_info *dr_vinfo;
> +
> +      /* The stmt_vec_info's of the epilogue were constructed for the main loop
> +	 and need to be updated to refer to the cloned variables used in the
> +	 epilogue loop.  We do this by assuming the original main loop and the
> +	 epilogue loop are identical (aside the different SSA names).  This
> +	 means we assume we can go through each BB in the loop and each STMT in
> +	 each BB and map them 1:1, replacing the STMT_VINFO_STMT of each
> +	 stmt_vec_info in the epilogue's loop_vec_info.  Here we only keep
> +	 track of the original state of the main loop, before vectorization.
> +	 After vectorization we proceed to update the epilogue's stmt_vec_infos
> +	 information.  We also update the references in PATTERN_DEF_SEQ's,
> +	 RELATED_STMT's and data_references.  Mainly the latter has to be
> +	 updated after we are done vectorizing the main loop, as the
> +	 data_references are shared between main and epilogue.  */
> +      for (unsigned i = 0; i < loop->num_nodes; ++i)
> +	{
> +	  for (phi_gsi = gsi_start_phis (bbs[i]);
> +	       !gsi_end_p (phi_gsi); gsi_next (&phi_gsi))
> +	    LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (phi_gsi.phi ());
> +	  for (gsi = gsi_start_bb (bbs[i]);
> +	       !gsi_end_p (gsi); gsi_next (&gsi))
> +	    {
> +	      stmt = gsi_stmt (gsi);
> +	      LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (stmt);
> +	      stmt_vinfo  = epilogue_vinfo->lookup_stmt (stmt);

Nit: double space before "=".

> +	      if (stmt_vinfo != NULL
> +		  && stmt_vinfo->dr_aux.stmt == stmt_vinfo)
> +		{
> +		  dr_vinfo = STMT_VINFO_DR_INFO (stmt_vinfo);
> +		  /* Data references pointing to gather loads and scatter stores
> +		     require special treatment because the address computation
> +		     happens in a different gimple node, pointed to by DR_REF.
> +		     In contrast to normal loads and stores where we only need
> +		     to update the offset of the data reference.  */
> +		  if (STMT_VINFO_GATHER_SCATTER_P (dr_vinfo->stmt))
> +		    LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).safe_push (dr_vinfo);
> +		  LOOP_VINFO_UP_DRS (epilogue_vinfo).safe_push (dr_vinfo);
> +		}
> +	    }
> +	}
> +    }
> +
> +  return vect_epilogues ? epilog : NULL;
>  }
>  
>  /* Function vect_create_cond_for_niters_checks.
> [...]
> @@ -2151,8 +2176,18 @@ start_over:
>    /* During peeling, we need to check if number of loop iterations is
>       enough for both peeled prolog loop and vector loop.  This check
>       can be merged along with threshold check of loop versioning, so
> -     increase threshold for this case if necessary.  */
> -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
> +     increase threshold for this case if necessary.
> +
> +     If we are analyzing an epilogue we still want to check what it's

s/it's/its/

> +     versioning threshold would be.  If we decide to vectorize the epilogues we
> +     will want to use the lowest versioning threshold of all epilogues and main
> +     loop.  This will enable us to enter a vectorized epilogue even when
> +     versioning the loop.  We can't simply check whether the epilogue requires
> +     versioning though since we may have skipped some versioning checks when
> +     analyzing the epilogue. For instance, checks for alias versioning will be

Nit: should be two spaces after ".".

> +     skipped when dealing with epilogues as we assume we already checked them
> +     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */
> +  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
>      {
>        poly_uint64 niters_th = 0;
>        unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> @@ -2307,14 +2342,8 @@ again:
>     be vectorized.  */
>  opt_loop_vec_info
>  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
> -		   vec_info_shared *shared)
> +		   vec_info_shared *shared, vector_sizes vector_sizes)
>  {
> -  auto_vector_sizes vector_sizes;
> -
> -  /* Autodetect first vector size we try.  */
> -  current_vector_size = 0;
> -  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
> -						loop->simdlen != 0);
>    unsigned int next_size = 0;
>  
>    DUMP_VECT_SCOPE ("analyze_loop_nest");
> @@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
>    poly_uint64 autodetected_vector_size = 0;
>    opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
>    poly_uint64 first_vector_size = 0;
> +  poly_uint64 lowest_th = 0;
> +  unsigned vectorized_loops = 0;
> +  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
>    while (1)
>      {
>        /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
> @@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
>  
>        if (orig_loop_vinfo)
>  	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
> +      else if (vect_epilogues && first_loop_vinfo)
> +	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
>  
>        opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
>        if (res)
>  	{
>  	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
> +	  vectorized_loops++;
>  
> -	  if (loop->simdlen
> -	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> -			   (unsigned HOST_WIDE_INT) loop->simdlen))
> +	  if ((loop->simdlen
> +	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> +			    (unsigned HOST_WIDE_INT) loop->simdlen))
> +	      || vect_epilogues)
>  	    {
>  	      if (first_loop_vinfo == NULL)
>  		{
>  		  first_loop_vinfo = loop_vinfo;
> +		  lowest_th
> +		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
>  		  first_vector_size = current_vector_size;
>  		  loop->aux = NULL;
>  		}
>  	      else
> -		delete loop_vinfo;
> +		{
> +		  /* Keep track of vector sizes that we know we can vectorize
> +		     the epilogue with.  */
> +		  if (vect_epilogues)
> +		    {
> +		      loop->aux = NULL;
> +		      first_loop_vinfo->epilogue_vsizes.reserve (1);
> +		      first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);
> +		      first_loop_vinfo->epilogue_vinfos.reserve (1);
> +		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);

I've messed you around, sorry, but the patches I committed this weekend
mean we now store the vector size in the loop_vinfo.  It'd be good to
avoid a separate epilogue_vsizes array if possible.

> +		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
> +		      poly_uint64 th
> +			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> +		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> +				  || maybe_ne (lowest_th, 0U));
> +		      /* Keep track of the known smallest versioning
> +			 threshold.  */
> +		      if (ordered_p (lowest_th, th))
> +			lowest_th = ordered_min (lowest_th, th);
> +		    }
> +		  else
> +		    delete loop_vinfo;
> +		}
>  	    }
>  	  else
>  	    {
> @@ -2408,6 +2468,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
>  		  dump_dec (MSG_NOTE, current_vector_size);
>  		  dump_printf (MSG_NOTE, "\n");
>  		}
> +	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
> +
>  	      return first_loop_vinfo;
>  	    }
>  	  else
> @@ -8128,6 +8190,188 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
>      *seen_store = stmt_info;
>  }
>  
> +/* Helper function to pass to simplify_replace_tree to enable replacing tree's
> +   in the hash_map with its corresponding values.  */
> +static tree
> +find_in_mapping (tree t, void *context)
> +{
> +  hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context;
> +
> +  tree *value = mapping->get (t);
> +  return value ? *value : t;
> +}
> +
> +static void
> +update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
> +{
> +  loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
> +  auto_vec<stmt_vec_info> pattern_worklist, related_worklist;
> +  hash_map<tree,tree> mapping;
> +  gimple *orig_stmt, *new_stmt;
> +  gimple_stmt_iterator epilogue_gsi;
> +  gphi_iterator epilogue_phi_gsi;
> +  stmt_vec_info stmt_vinfo = NULL, related_vinfo;
> +  basic_block *epilogue_bbs = get_loop_body (epilogue);
> +
> +  LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
> +
> +  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
> +
> +
> +  /* We are done vectorizing the main loop, so now we update the epilogues
> +     stmt_vec_info's.  At the same time we set the gimple UID of each

"epilogue's stmt_vec_infos"

> +     statement in the epilogue, as these are used to look them up in the
> +     epilogues loop_vec_info later.  We also keep track of what

epilogue's

> +     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might

PATTERN_DEF_SEQs and RELATED_STMTs

> +     need updating and we construct a mapping between variables defined in
> +     the main loop and their corresponding names in epilogue.  */
> +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
> +    {
> +      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
> +	   !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
> +	{
> +	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
> +	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
> +	  new_stmt = epilogue_phi_gsi.phi ();
> +
> +	  stmt_vinfo
> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);

Nit: fits one line.

> +
> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
> +
> +	  mapping.put (gimple_phi_result (orig_stmt),
> +			gimple_phi_result (new_stmt));

Nit: indented too far.

> +
> +	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
> +	    pattern_worklist.safe_push (stmt_vinfo);
> +
> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
> +	  while (related_vinfo && related_vinfo != stmt_vinfo)
> +	    {
> +	      related_worklist.safe_push (related_vinfo);
> +	      /* Set BB such that the assert in
> +		'get_initial_def_for_reduction' is able to determine that
> +		the BB of the related stmt is inside this loop.  */
> +	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
> +			     gimple_bb (new_stmt));
> +	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
> +	    }
> +	}
> +
> +      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
> +	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
> +	{
> +	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
> +	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
> +	  new_stmt = gsi_stmt (epilogue_gsi);
> +
> +	  stmt_vinfo
> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);

Fits on one line.

> +
> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
> +
> +	  if (is_gimple_assign (orig_stmt))
> +	    {
> +	      gcc_assert (is_gimple_assign (new_stmt));
> +	      mapping.put (gimple_assign_lhs (orig_stmt),
> +			  gimple_assign_lhs (new_stmt));
> +	    }

Why just assigns?  Don't we need to handle calls too?

Maybe just use gimple_get_lhs here.

> +
> +	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
> +	    pattern_worklist.safe_push (stmt_vinfo);
> +
> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
> +	  while (related_vinfo && related_vinfo != stmt_vinfo)
> +	    {
> +	      related_worklist.safe_push (related_vinfo);
> +	      /* Set BB such that the assert in
> +		'get_initial_def_for_reduction' is able to determine that
> +		the BB of the related stmt is inside this loop.  */
> +	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
> +			     gimple_bb (new_stmt));
> +	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
> +	    }
> +	}
> +      gcc_assert (LOOP_VINFO_UP_STMTS (epilogue_vinfo).length () == 0);
> +    }
> +
> +  /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the

PATTERN_DEF_SEQs

> +     original main loop and thus need to be updated to refer to the cloned
> +     variables used in the epilogue.  */
> +  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
> +    {
> +      gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);
> +      tree *new_op;
> +
> +      while (seq)
> +	{
> +	  for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
> +	    {
> +	      tree op = gimple_op (seq, j);
> +	      if ((new_op = mapping.get(op)))
> +		gimple_set_op (seq, j, *new_op);
> +	      else
> +		{
> +		  op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
> +					 &find_in_mapping, &mapping);
> +		  gimple_set_op (seq, j, op);
> +		}
> +	    }
> +	  seq = seq->next;
> +	}
> +    }
> +
> +  /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be

as above

> +     updated.  */
> +  for (unsigned i = 0; i < related_worklist.length (); ++i)
> +    {
> +      tree *new_t;
> +      gimple * stmt = STMT_VINFO_STMT (related_worklist[i]);
> +      for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
> +	if ((new_t = mapping.get(gimple_op (stmt, j))))

These days I think:

	if (tree *new_t = mapping.get(gimple_op (stmt, j)))

is preferred.
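
For reference, a minimal sketch of that idiom outside GCC.  It assumes
std::unordered_map and a hypothetical lookup () helper as stand-ins for
hash_map<tree, tree> and its get () method; none of these names come from
the patch.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the SSA-name mapping.  The patch uses
// hash_map<tree, tree>, whose get () returns a pointer or NULL.
using name_map = std::unordered_map<std::string, std::string>;

// Return a pointer to the mapped value, or nullptr when KEY is absent.
static const std::string *
lookup (const name_map &mapping, const std::string &key)
{
  auto it = mapping.find (key);
  return it == mapping.end () ? nullptr : &it->second;
}

// Replace every operand that has an entry in MAPPING; leave others alone.
static void
remap_operands (std::vector<std::string> &ops, const name_map &mapping)
{
  for (std::string &op : ops)
    // The pointer is declared in the condition, so it is only in scope
    // where it is known to be non-null.
    if (const std::string *new_op = lookup (mapping, op))
      op = *new_op;
}
```

The point is only scoping: the variable no longer needs a separate
declaration that outlives the test.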

> +	  gimple_set_op (stmt, j, *new_t);
> +    }
> +
> +  tree *new_op;
> +  /* Data references for gather loads and scatter stores do not use the
> +     updated offset we set using ADVANCE.  Instead we have to make sure the
> +     reference in the data references point to the corresponding copy of
> +     the original in the epilogue.  */
> +  for (unsigned i = 0; i < LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).length (); ++i)
> +    {
> +      dr_vec_info *dr_vinfo = LOOP_VINFO_UP_GT_DRS (epilogue_vinfo)[i];
> +      data_reference *dr = dr_vinfo->dr;
> +      gcc_assert (dr);
> +      gcc_assert (TREE_CODE (DR_REF (dr)) == MEM_REF);
> +      new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0));
> +
> +      if (new_op)

Likewise:

      if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0)))

here.

> +	{
> +	  DR_REF (dr) = unshare_expr (DR_REF (dr));
> +	  TREE_OPERAND (DR_REF (dr), 0) = *new_op;
> +	  DR_STMT (dr_vinfo->dr) = SSA_NAME_DEF_STMT (*new_op);
> +	}
> +    }
> +
> +  /* The vector size of the epilogue is smaller than that of the main loop
> +     so the alignment is either the same or lower. This means the dr will
> +     thus by definition be aligned.  */
> +  for (unsigned i = 0; i < LOOP_VINFO_UP_DRS (epilogue_vinfo).length (); ++i)
> +    LOOP_VINFO_UP_DRS (epilogue_vinfo)[i]->base_misaligned = false;
> +
> +
> +  LOOP_VINFO_UP_STMTS (epilogue_vinfo).release ();
> +  LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).release ();
> +  LOOP_VINFO_UP_DRS (epilogue_vinfo).release ();
> +
> +  epilogue_vinfo->shared->datarefs_copy.release ();
> +  epilogue_vinfo->shared->save_datarefs ();
> +}
> +
> +
>  /* Function vect_transform_loop.
>  
>     The analysis phase has determined that the loop is vectorizable.
> [...]
> @@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
>  		 LOCATION_FILE (vect_location.get_location_t ()),
>  		 LOCATION_LINE (vect_location.get_location_t ()));
>  
> +  /* If this is an epilogue, we already know what vector sizes we will use for
> +     vectorization as the analyzis was part of the main vectorized loop.  Use
> +     these instead of going through all vector sizes again.  */
> +  if (orig_loop_vinfo
> +      && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())
> +    {
> +      vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
> +      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
> +      current_vector_size = vector_sizes[0];
> +    }
> +  else
> +    {
> +      /* Autodetect first vector size we try.  */
> +      current_vector_size = 0;
> +
> +      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
> +						    loop->simdlen != 0);
> +      vector_sizes = auto_vector_sizes;
> +    }
> +
>    /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
> -  opt_loop_vec_info loop_vinfo
> -    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
> -  loop->aux = loop_vinfo;
> +  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
> +  if (loop_vec_info_for_loop (loop))
> +    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
> +  else
> +    {
> +      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
> +      loop->aux = loop_vinfo;
> +    }

I don't really understand what this is doing for the epilogue case.
Do we call vect_analyze_loop again?  Are vector_sizes[1:] significant
for epilogues?

Thanks,
Richard
Andre Vieira (lists) Oct. 25, 2019, 4:18 p.m. UTC | #11
On 22/10/2019 14:56, Richard Biener wrote:
> On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:
> 
>> Hi Richi,
>>
>> See inline responses to your comments.
>>
>> On 11/10/2019 13:57, Richard Biener wrote:
>>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:
>>>
>>>> Hi,
>>>>
>>
>>>
>>> +
>>> +  /* Keep track of vector sizes we know we can vectorize the epilogue
>>> with.  */
>>> +  vector_sizes epilogue_vsizes;
>>>    };
>>>
>>> please don't enlarge struct loop, instead track this somewhere
>>> in the vectorizer (in loop_vinfo?  I see you already have
>>> epilogue_vinfos there - so the loop_vinfo simply lacks
>>> convenient access to the vector_size?)  I don't see any
>>> use that could be trivially adjusted to look at a loop_vinfo
>>> member instead.
>>
>> Done.
>>>
>>> For the vect_update_inits_of_drs this means that we'd possibly
>>> do less CSE.  Not sure if really an issue.
>>
>> CSE of what exactly? You are afraid we are repeating a calculation here we
>> have done elsewhere before?
> 
> All uses of those inits now possibly get the expression instead of
> just the SSA name we inserted code for once.  But as said, we'll see.
> 

This code changed after some comments from Richard Sandiford.

> +  /* We are done vectorizing the main loop, so now we update the
> epilogues
> +     stmt_vec_info's.  At the same time we set the gimple UID of each
> +     statement in the epilogue, as these are used to look them up in the
> +     epilogues loop_vec_info later.  We also keep track of what
> +     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
> +     need updating and we construct a mapping between variables defined
> in
> +     the main loop and their corresponding names in epilogue.  */
> +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
> 
> so for the following code I wonder if you can make use of the
> fact that loop copying also copies UIDs, so you should be able
> to match stmts via their UIDs and get at the other loop infos
> stmt_info by the copy loop stmt UID.
> 
> I wonder why you need no modification for the SLP tree?
> 
I checked with Tamar and the SLP tree works with the position of 
operands and not SSA_NAMES.  So we should be fine.
Andre Vieira (lists) Oct. 25, 2019, 4:18 p.m. UTC | #12
On 22/10/2019 18:52, Richard Sandiford wrote:
> Thanks for doing this.  Hope this message doesn't cover too much old
> ground or duplicate too much...
> 
> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>> @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>     else
>>       niters_prolog = build_int_cst (type, 0);
>>   
>> +  loop_vec_info epilogue_vinfo = NULL;
>> +  if (vect_epilogues)
>> +    {
>> +      /* Take the next epilogue_vinfo to vectorize for.  */
>> +      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
>> +      loop_vinfo->epilogue_vinfos.ordered_remove (0);
>> +
>> +      /* Don't vectorize epilogues if this is not the most inner loop or if
>> +	 the epilogue may need peeling for alignment as the vectorizer doesn't
>> +	 know how to handle these situations properly yet.  */
>> +      if (loop->inner != NULL
>> +	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
>> +	vect_epilogues = false;
>> +
>> +    }
> 
> Nit: excess blank line before "}".  Sorry if this was discussed before,
> but what's the reason for delaying the check for "loop->inner" to
> this point, rather than doing it in vect_analyze_loop?

Done.
> 
>> +
>> +  tree niters_vector_mult_vf;
>> +  unsigned int lowest_vf = constant_lower_bound (vf);
>> +  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
>> +     on niters already ajusted for the iterations of the prologue.  */
> 
> Pre-existing typo: adjusted.  But...
> 
>> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>> +      && known_eq (vf, lowest_vf))
>> +    {
>> +      loop_vec_info orig_loop_vinfo;
>> +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> +	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
>> +      else
>> +	orig_loop_vinfo = loop_vinfo;
>> +      vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
>> +      unsigned next_size = 0;
>> +      unsigned HOST_WIDE_INT eiters
>> +	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> +	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
>> +
>> +      if (prolog_peeling > 0)
>> +	eiters -= prolog_peeling;
> 
> ...is that comment still true?  We're now subtracting the peeling
> amount here.

It is not, "adjusted" the comment ;)

> Might be worth asserting prolog_peeling >= 0, just to emphasise
> that we can't get here for variable peeling amounts, and then subtract
> prolog_peeling unconditionally (assuming that's the right thing to do).
> 
Can't assert, as LOOP_VINFO_NITERS_KNOWN_P can be true even with 
prolog_peeling < 0: we still know the constant number of scalar 
iterations, we just don't know how many vector iterations will be 
performed due to the runtime peeling.  I will, however, not reject 
vectorizing the epilogue when we don't know how much we are peeling.
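
To make the arithmetic in the hunk below concrete, here is a simplified 
sketch for the fully constant case.  It assumes plain integers instead of 
poly_ints, a known non-negative peeling amount, and a list of candidate 
VFs instead of the vector-size ratio test; the function names are 
hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Scalar iterations left over for the epilogue.  Assumes
// NITERS >= GAPS + PROLOG_PEELING and VF > 0.
static unsigned long
epilogue_iters (unsigned long niters, unsigned long prolog_peeling,
                unsigned long vf, unsigned long gaps)
{
  // Iterations available to the main vector loop after the prologue and
  // after holding back GAPS iterations for the scalar tail.
  unsigned long eiters = niters - gaps - prolog_peeling;
  // Whatever the main loop cannot consume in whole vectors, plus the gap.
  return eiters % vf + gaps;
}

// Index of the first candidate VF that the leftover iterations can fill at
// least once, or candidate_vfs.size () if no candidate fits.
static std::size_t
first_usable_vf (unsigned long eiters,
                 const std::vector<unsigned long> &candidate_vfs)
{
  for (std::size_t i = 0; i < candidate_vfs.size (); ++i)
    if (eiters >= candidate_vfs[i])
      return i;
  return candidate_vfs.size ();
}
```

When no candidate fits, the result equals the list length, which is the 
case where the hunk sets vect_epilogues to false.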
>> +      eiters
>> +	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
>> +
>> +      unsigned int ratio;
>> +      while (next_size < vector_sizes.length ()
>> +	     && !(constant_multiple_p (current_vector_size,
>> +				       vector_sizes[next_size], &ratio)
>> +		  && eiters >= lowest_vf / ratio))
>> +	next_size += 1;
>> +
>> +      if (next_size == vector_sizes.length ())
>> +	vect_epilogues = false;
>> +    }
>> +
>>     /* Prolog loop may be skipped.  */
>>     bool skip_prolog = (prolog_peeling != 0);
>>     /* Skip to epilog if scalar loop may be preferred.  It's only needed
>> -     when we peel for epilog loop and when it hasn't been checked with
>> -     loop versioning.  */
>> +     when we peel for epilog loop or when we loop version.  */
>>     bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>   		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
>>   				  bound_prolog + bound_epilog)
>> -		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
>> +		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
>> +			 || vect_epilogues));
> 
> The comment update looks wrong here: without epilogues, we don't need
> the skip when loop versioning, because loop versioning ensures that we
> have at least one vector iteration.
> 
> (I think "it" was supposed to mean "skipping to the epilogue" rather
> than the epilogue loop itself, in case that's the confusion.)
> 
> It'd be good to mention the epilogue condition in the comment too.
> 

Rewrote comment, hopefully this now better reflects reality.

>> +
>> +  if (vect_epilogues)
>> +    {
>> +      epilog->aux = epilogue_vinfo;
>> +      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
>> +
>> +      loop_constraint_clear (epilog, LOOP_C_INFINITE);
>> +
>> +      /* We now must calculate the number of iterations for our epilogue.  */
>> +      tree cond_niters, niters;
>> +
>> +      /* Depending on whether we peel for gaps we take niters or niters - 1,
>> +	 we will refer to this as N - G, where N and G are the NITERS and
>> +	 GAP for the original loop.  */
>> +      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> +	? LOOP_VINFO_NITERSM1 (loop_vinfo)
>> +	: LOOP_VINFO_NITERS (loop_vinfo);
>> +
>> +      /* Here we build a vector factorization mask:
>> +	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */
>> +      tree vf_mask = build_int_cst (TREE_TYPE (niters),
>> +				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>> +      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
>> +			     vf_mask,
>> +			     build_one_cst (TREE_TYPE (vf_mask)));
>> +      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
>> +
>> +      /* Here we calculate:
>> +	 niters = N - ((N-G) & ~(VF -1)) */
>> +      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
>> +			    LOOP_VINFO_NITERS (loop_vinfo),
>> +			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
>> +					 niters,
>> +					 vf_mask));
> 
> Might be a daft question, sorry, but why does this need to be so
> complicated?  Couldn't we just use the final value of the main loop's
> IV to calculate how many iterations are left?
> 
> The current code wouldn't for example work for non-power-of-2 SVE vectors.
> vect_set_loop_condition_unmasked is structured to cope with that case
> (in length-agnostic mode only), even when an epilogue is needed.

Good call, as we discussed I changed my approach here. Rather than using 
a conditional expression to guard against skipping the main loop, I now 
use a phi-node to carry the IV.  This actually already exists, so I am 
duplicating here, but I didn't know what the best way was to "grab" this 
existing IV.
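
For the record, a small sketch of why the mask form is fragile.  It 
compares the formula from the quoted hunk with a division-based one on 
plain integers; the function names are hypothetical, and it assumes 
N >= G and VF > 0.

```cpp
// Mask-based formula from the quoted hunk: N - ((N - G) & ~(VF - 1)).
// Only meaningful when VF is a power of two.
static unsigned long
remaining_by_mask (unsigned long n, unsigned long g, unsigned long vf)
{
  return n - ((n - g) & ~(vf - 1));
}

// Division-based equivalent: N minus the whole vector iterations that fit
// in N - G.  Works for any non-zero VF.
static unsigned long
remaining_by_division (unsigned long n, unsigned long g, unsigned long vf)
{
  return n - ((n - g) / vf) * vf;
}
```

The two agree for a power-of-two VF such as 16, but for VF = 12 and 
N = 100 the mask version leaves 0 iterations where 4 are expected.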


>> +     skipped when dealing with epilogues as we assume we already checked them
>> +     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */
>> +  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
>>       {
>>         poly_uint64 niters_th = 0;
>>         unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
>> @@ -2307,14 +2342,8 @@ again:
>>      be vectorized.  */
>>   opt_loop_vec_info
>>   vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
>> -		   vec_info_shared *shared)
>> +		   vec_info_shared *shared, vector_sizes vector_sizes)
>>   {
>> -  auto_vector_sizes vector_sizes;
>> -
>> -  /* Autodetect first vector size we try.  */
>> -  current_vector_size = 0;
>> -  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
>> -						loop->simdlen != 0);
>>     unsigned int next_size = 0;
>>   
>>     DUMP_VECT_SCOPE ("analyze_loop_nest");
>> @@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
>>     poly_uint64 autodetected_vector_size = 0;
>>     opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
>>     poly_uint64 first_vector_size = 0;
>> +  poly_uint64 lowest_th = 0;
>> +  unsigned vectorized_loops = 0;
>> +  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
>>     while (1)
>>       {
>>         /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
>> @@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
>>   
>>         if (orig_loop_vinfo)
>>   	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
>> +      else if (vect_epilogues && first_loop_vinfo)
>> +	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
>>   
>>         opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
>>         if (res)
>>   	{
>>   	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
>> +	  vectorized_loops++;
>>   
>> -	  if (loop->simdlen
>> -	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>> -			   (unsigned HOST_WIDE_INT) loop->simdlen))
>> +	  if ((loop->simdlen
>> +	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>> +			    (unsigned HOST_WIDE_INT) loop->simdlen))
>> +	      || vect_epilogues)
>>   	    {
>>   	      if (first_loop_vinfo == NULL)
>>   		{
>>   		  first_loop_vinfo = loop_vinfo;
>> +		  lowest_th
>> +		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
>>   		  first_vector_size = current_vector_size;
>>   		  loop->aux = NULL;
>>   		}
>>   	      else
>> -		delete loop_vinfo;
>> +		{
>> +		  /* Keep track of vector sizes that we know we can vectorize
>> +		     the epilogue with.  */
>> +		  if (vect_epilogues)
>> +		    {
>> +		      loop->aux = NULL;
>> +		      first_loop_vinfo->epilogue_vsizes.reserve (1);
>> +		      first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);
>> +		      first_loop_vinfo->epilogue_vinfos.reserve (1);
>> +		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
> 
> I've messed you around, sorry, but the patches I committed this weekend
> mean we now store the vector size in the loop_vinfo.  It'd be good to
> avoid a separate epilogue_vsizes array if possible.

Rebased. Actually quite happy with that, makes for a cleaner patch on my 
end :)
> 
>> +
>> +	  stmt_vinfo
>> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);
> 
> Nit: fits one line.
> 
>> +
>> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
>> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
>> +
>> +	  mapping.put (gimple_phi_result (orig_stmt),
>> +			gimple_phi_result (new_stmt));
> 
> Nit: indented too far.
> 
>> +
>> +	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
>> +	    pattern_worklist.safe_push (stmt_vinfo);
>> +
>> +	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
>> +	  while (related_vinfo && related_vinfo != stmt_vinfo)
>> +	    {
>> +	      related_worklist.safe_push (related_vinfo);
>> +	      /* Set BB such that the assert in
>> +		'get_initial_def_for_reduction' is able to determine that
>> +		the BB of the related stmt is inside this loop.  */
>> +	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
>> +			     gimple_bb (new_stmt));
>> +	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
>> +	    }
>> +	}
>> +
>> +      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
>> +	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
>> +	{
>> +	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
>> +	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
>> +	  new_stmt = gsi_stmt (epilogue_gsi);
>> +
>> +	  stmt_vinfo
>> +	    = epilogue_vinfo->lookup_stmt (orig_stmt);
> 
> Fits on one line.
> 
>> +
>> +	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
>> +	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
>> +
>> +	  if (is_gimple_assign (orig_stmt))
>> +	    {
>> +	      gcc_assert (is_gimple_assign (new_stmt));
>> +	      mapping.put (gimple_assign_lhs (orig_stmt),
>> +			  gimple_assign_lhs (new_stmt));
>> +	    }
> 
> Why just assigns?  Don't we need to handle calls too?
> 
> Maybe just use gimple_get_lhs here.

Changed.
>> @@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
>>   		 LOCATION_FILE (vect_location.get_location_t ()),
>>   		 LOCATION_LINE (vect_location.get_location_t ()));
>>   
>> +  /* If this is an epilogue, we already know what vector sizes we will use for
>> +     vectorization as the analyzis was part of the main vectorized loop.  Use
>> +     these instead of going through all vector sizes again.  */
>> +  if (orig_loop_vinfo
>> +      && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())
>> +    {
>> +      vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
>> +      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
>> +      current_vector_size = vector_sizes[0];
>> +    }
>> +  else
>> +    {
>> +      /* Autodetect first vector size we try.  */
>> +      current_vector_size = 0;
>> +
>> +      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
>> +						    loop->simdlen != 0);
>> +      vector_sizes = auto_vector_sizes;
>> +    }
>> +
>>     /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
>> -  opt_loop_vec_info loop_vinfo
>> -    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
>> -  loop->aux = loop_vinfo;
>> +  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
>> +  if (loop_vec_info_for_loop (loop))
>> +    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
>> +  else
>> +    {
>> +      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
>> +      loop->aux = loop_vinfo;
>> +    }
> 
> I don't really understand what this is doing for the epilogue case.
> Do we call vect_analyze_loop again?  Are vector_sizes[1:] significant
> for epilogues?

The vector sizes code here is no longer needed after your patch.  The 
loop_vec_info code just checks whether the loop already has one set (which 
is the case for epilogues) and uses that; if not, it analyses the loop 
(which is the case for the first vectorization).  I'll add some comments.
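
Roughly, the intended control flow looks like the toy sketch below.  The 
types and the analysis counter are hypothetical stand-ins for loop, 
loop_vec_info and vect_analyze_loop, not the real interfaces.

```cpp
#include <memory>

// Toy stand-ins for loop_vec_info and class loop.
struct toy_vinfo { unsigned vf; };
struct toy_loop
{
  std::unique_ptr<toy_vinfo> aux;  // plays the role of loop->aux
};

// Stand-in for vect_analyze_loop; counts calls so that reuse is observable.
static toy_vinfo *
analyze (toy_loop &loop, unsigned &analysis_count)
{
  ++analysis_count;
  loop.aux = std::make_unique<toy_vinfo> (toy_vinfo{4});
  return loop.aux.get ();
}

// Use the info attached during the main loop's analysis when there is one
// (the epilogue case); otherwise analyse from scratch (the main loop case).
static toy_vinfo *
get_or_analyze (toy_loop &loop, unsigned &analysis_count)
{
  if (loop.aux)
    return loop.aux.get ();
  return analyze (loop, analysis_count);
}
```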
> 
> Thanks,
> Richard
>
Andre Vieira (lists) Oct. 25, 2019, 4:20 p.m. UTC | #13
Hi,

This is the reworked patch after your comments.

I have moved the epilogue check into the analysis, disguised as 
'!epilogue_vinfos.is_empty ()'.  This is because I realized that I am 
doing the "lowest threshold" check there.
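
As an illustration of that "lowest threshold" bookkeeping, here is a toy 
model.  It assumes a two-coefficient stand-in for poly_uint64 
(value = c0 + c1 * X for an unknown X >= 0) and re-implements simplified 
versions of known_le and ordered_p; it is not the real poly_int code.

```cpp
#include <cstdint>

// Toy two-coefficient polynomial: c0 + c1 * X, with X >= 0 unknown at
// compile time.
struct toy_poly
{
  std::uint64_t c0, c1;
};

// A <= B for every X >= 0 holds when each coefficient of A is <= that of B.
static bool
known_le (toy_poly a, toy_poly b)
{
  return a.c0 <= b.c0 && a.c1 <= b.c1;
}

// The two values can be ordered if one is known to be <= the other.
static bool
ordered_p (toy_poly a, toy_poly b)
{
  return known_le (a, b) || known_le (b, a);
}

// Keep the smaller threshold when the two are ordered; otherwise keep the
// current one, since no safe minimum is known.
static toy_poly
update_lowest (toy_poly lowest, toy_poly th)
{
  if (ordered_p (lowest, th))
    return known_le (lowest, th) ? lowest : th;
  return lowest;
}
```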

The only place where we may reject an epilogue_vinfo is when we know the 
number of scalar iterations and we realize that the number of iterations 
left after the main loop is not enough to enter the vectorized epilogue, 
so we optimize away that code-gen.  The only way we know this to be true 
is if both the number of scalar iterations and the peeling for alignment 
are known.  In that case we know we will enter the main loop regardless, 
so whether the threshold we use is for a lower VF or not shouldn't matter 
as much.  I would even like to think that the check isn't done, but I am 
not sure...  Might be worth checking as an optimization.


Is this OK for trunk?

gcc/ChangeLog:
2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
     and make the valueize function pointer also take a void pointer.
     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
     around vn_valueize, to call it without a context.
     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (find_in_mapping): New helper function.
     (update_epilogue_loop_vinfo): New function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and call update_epilogue_loop_vinfo to update it.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
     (vect_do_peeling, vect_update_inits_of_drs,
      determine_peel_for_niter, vect_analyze_loop): Add or update
     declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise
     analyse epilogue separately.



Cheers,
Andre
Richard Biener Oct. 28, 2019, 12:48 p.m. UTC | #14
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> 
> 
> On 22/10/2019 14:56, Richard Biener wrote:
> > On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:
> > 
> >> Hi Richi,
> >>
> >> See inline responses to your comments.
> >>
> >> On 11/10/2019 13:57, Richard Biener wrote:
> >>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:
> >>>
> >>>> Hi,
> >>>>
> >>
> >>>
> >>> +
> >>> +  /* Keep track of vector sizes we know we can vectorize the epilogue
> >>> with.  */
> >>> +  vector_sizes epilogue_vsizes;
> >>>    };
> >>>
> >>> please don't enlarge struct loop, instead track this somewhere
> >>> in the vectorizer (in loop_vinfo?  I see you already have
> >>> epilogue_vinfos there - so the loop_vinfo simply lacks
> >>> convenient access to the vector_size?)  I don't see any
> >>> use that could be trivially adjusted to look at a loop_vinfo
> >>> member instead.
> >>
> >> Done.
> >>>
> >>> For the vect_update_inits_of_drs this means that we'd possibly
> >>> do less CSE.  Not sure if really an issue.
> >>
> >> CSE of what exactly? You are afraid we are repeating a calculation here we
> >> have done elsewhere before?
> > 
> > All uses of those inits now possibly get the expression instead of
> > just the SSA name we inserted code for once.  But as said, we'll see.
> > 
> 
> This code changed after some comments from Richard Sandiford.
> 
> > +  /* We are done vectorizing the main loop, so now we update the
> > epilogues
> > +     stmt_vec_info's.  At the same time we set the gimple UID of each
> > +     statement in the epilogue, as these are used to look them up in the
> > +     epilogues loop_vec_info later.  We also keep track of what
> > +     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
> > +     need updating and we construct a mapping between variables defined
> > in
> > +     the main loop and their corresponding names in epilogue.  */
> > +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
> > 
> > so for the following code I wonder if you can make use of the
> > fact that loop copying also copies UIDs, so you should be able
> > to match stmts via their UIDs and get at the other loop infos
> > stmt_info by the copy loop stmt UID.
> > 
> > I wonder why you need no modification for the SLP tree?
> > 
> I checked with Tamar and the SLP tree works with the position of operands and
> not SSA_NAMES.  So we should be fine.

There's now SLP_TREE_SCALAR_OPS but only for invariants so I guess
we should indeed be fine here.  Everything else is already
stmt_infos which you patch with the new underlying stmts.

Richard.
Richard Biener Oct. 28, 2019, 2:16 p.m. UTC | #15
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> This is the reworked patch after your comments.
> 
> I have moved the epilogue check into the analysis form disguised under
> '!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing the
> "lowest threshold" check there.
> 
> The only place where we may reject an epilogue_vinfo is when we know the
> number of scalar iterations and we realize the number of iterations left after
> the main loop are not enough to enter the vectorized epilogue so we optimize
> away that code-gen.  The only way we know this to be true is if the number of
> scalar iterations are known and the peeling for alignment is known. So we know
> we will enter the main loop regardless, so whether the threshold we use is for
> a lower VF or not it shouldn't matter as much, I would even like to think that
> check isn't done, but I am not sure... Might be worth checking as an
> optimization.
> 
> 
> Is this OK for trunk?

+      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+          !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+       {
..
+         if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+           pattern_worklist.safe_push (stmt_vinfo);
+
+         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+         while (related_vinfo && related_vinfo != stmt_vinfo)
+           {

I think PHIs cannot have patterns.  You can assert
that STMT_VINFO_RELATED_STMT is NULL I think.

+         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+         while (related_vinfo && related_vinfo != stmt_vinfo)
+           {
+             related_worklist.safe_push (related_vinfo);
+             /* Set BB such that the assert in
+               'get_initial_def_for_reduction' is able to determine that
+               the BB of the related stmt is inside this loop.  */
+             gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+                            gimple_bb (new_stmt));
+             related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+           }

do we really keep references to "nested" patterns?  Thus, do you
need this loop?

+  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the
+     original main loop and thus need to be updated to refer to the cloned
+     variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+    {
...
+                 op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+                                        &find_in_mapping, &mapping);
+                 gimple_set_op (seq, j, op);

you do this for the pattern-def seq but not for the related one.
I guess you ran into this for COND_EXPR conditions.  I wonder whether
you could use a shared worklist for both the def-seq and the main pattern
stmt, or at least split out the replacement so you can share it.
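
For illustration only, a minimal standalone sketch of the shared-worklist
idea.  It does not use GCC's types: toy_stmt, name_map, replace_ops and
update_pattern are hypothetical stand-ins for gimple stmts, the main-loop to
epilogue mapping, and the replacement code in the patch.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a GIMPLE statement: just a list of operand names.
struct toy_stmt
{
  std::vector<std::string> ops;
};

// Toy stand-in for the main-loop -> epilogue name mapping.
using name_map = std::unordered_map<std::string, std::string>;

// Rewrite every operand of STMT that has an entry in MAPPING.
static void
replace_ops (toy_stmt &stmt, const name_map &mapping)
{
  for (std::string &op : stmt.ops)
    {
      auto it = mapping.find (op);
      if (it != mapping.end ())
        op = it->second;
    }
}

// Push the def-seq stmts and the main pattern stmt onto one worklist and
// run the same replacement over all of them.
static void
update_pattern (std::vector<toy_stmt> &def_seq, toy_stmt &pattern_stmt,
                const name_map &mapping)
{
  std::vector<toy_stmt *> worklist;
  for (toy_stmt &s : def_seq)
    worklist.push_back (&s);
  worklist.push_back (&pattern_stmt);

  for (toy_stmt *s : worklist)
    replace_ops (*s, mapping);
}
```

With one worklist the operand rewriting is written once, and any stmt kind
that needs it later only has to be pushed onto the list.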

+      /* Data references for gather loads and scatter stores do not use the
+        updated offset we set using ADVANCE.  Instead we have to make sure the
+        reference in the data references point to the corresponding copy of
+        the original in the epilogue.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+       {
+         int j;
+         if (TREE_CODE (DR_REF (dr)) == MEM_REF)
+           j = 0;
+         else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)
+           j = 1;
+         else
+           gcc_unreachable ();
+
+         if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))
+           {
+             DR_REF (dr) = unshare_expr (DR_REF (dr));
+             TREE_OPERAND (DR_REF (dr), j) = *new_op;
+           }

huh, do you really only ever see MEM_REF or ARRAY_REF here?
I would guess using simplify_replace_tree is safer.
There's also DR_BASE_ADDRESS - we seem to leave the DRs partially
updated, is that correct?
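
As a rough standalone illustration of why a recursive replacement is more
robust than picking a fixed operand index per tree code: the sketch below
walks a toy expression tree and asks a callback, which receives an opaque
context pointer, whether each node has a replacement.  toy_expr, find_in_map
and replace_expr are hypothetical names for this example only; this is not
the real simplify_replace_tree.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Toy expression node; a leaf is a name with no operands.
struct toy_expr
{
  std::string name;
  std::vector<std::shared_ptr<toy_expr>> ops;
};
using expr_ptr = std::shared_ptr<toy_expr>;
using expr_map = std::unordered_map<std::string, expr_ptr>;

// Callback type: returns a replacement for EXPR, or null if there is none.
typedef expr_ptr (*valueize_fn) (const expr_ptr &, void *);

// Example callback: CONTEXT is really an expr_map keyed by node name.
static expr_ptr
find_in_map (const expr_ptr &expr, void *context)
{
  expr_map *mapping = static_cast<expr_map *> (context);
  auto it = mapping->find (expr->name);
  return it == mapping->end () ? nullptr : it->second;
}

// Walk EXPR and replace every node the callback knows about, at any depth.
// Unchanged subtrees are shared; nodes on a changed path are copied, so the
// original tree is left intact.
static expr_ptr
replace_expr (const expr_ptr &expr, valueize_fn valueize, void *context)
{
  if (expr_ptr repl = valueize (expr, context))
    return repl;

  bool changed = false;
  std::vector<expr_ptr> new_ops;
  for (const expr_ptr &op : expr->ops)
    {
      expr_ptr new_op = replace_expr (op, valueize, context);
      changed |= (new_op != op);
      new_ops.push_back (new_op);
    }
  if (!changed)
    return expr;
  return std::make_shared<toy_expr> (toy_expr { expr->name, new_ops });
}
```

Because the walk does not care which tree code wraps the name, a reference
nested under an extra level (for example a component access around the
address computation) is handled the same way as the two cases special-cased
above.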

Otherwise looks OK to me.

Thanks,
Richard.


> gcc/ChangeLog:
> 2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>
> 
>     PR 88915
>     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
>     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
>     and make the valueize function pointer also take a void pointer.
>     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
>     around vn_valueize, to call it without a context.
>     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
>     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
>     (~_loop_vec_info): Release epilogue_vinfos.
>     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
>     number of iterations of epilogue.
>     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
>     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
>     versioning threshold needed for main loop.
>     (vect_analyze_loop): Likewise.
>     (find_in_mapping): New helper function.
>     (update_epilogue_loop_vinfo): New function.
>     (vect_transform_loop): When vectorizing epilogues re-use analysis done
>     on main loop and call update_epilogue_loop_vinfo to update it.
>     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
>     stmts on loop preheader edge.
>     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
>     we decided to vectorize epilogues.  Update epilogues NITERS and
>     construct ADVANCE to update epilogues data references where needed.
>     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
>     (vect_do_peeling, vect_update_inits_of_drs,
>      determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
>     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
>     created loop_vec_info's for epilogues when available.  Otherwise analyse
>     epilogue separately.
> 
> 
> 
> Cheers,
> Andre
>
Andre Vieira (lists) Oct. 28, 2019, 6:31 p.m. UTC | #16
Hi,

Reworked according to your comments, see inline for clarification.

Is this OK for trunk?

gcc/ChangeLog:
2019-10-28  Andre Vieira  <andre.simoesdiasvieira@arm.com>

     PR 88915
     * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
     * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
     and make the valueize function pointer also take a void pointer.
     * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
     around vn_valueize, to call it without a context.
     (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
     * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
     (~_loop_vec_info): Release epilogue_vinfos.
     (vect_analyze_loop_costing): Use knowledge of main VF to estimate
     number of iterations of epilogue.
     (vect_analyze_loop_2): Adapt to analyse main loop for all supported
     vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
     versioning threshold needed for main loop.
     (vect_analyze_loop): Likewise.
     (find_in_mapping): New helper function.
     (update_epilogue_loop_vinfo): New function.
     (vect_transform_loop): When vectorizing epilogues re-use analysis done
     on main loop and call update_epilogue_loop_vinfo to update it.
     * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
     stmts on loop preheader edge.
     (vect_do_peeling): Enable skip-vectors when doing loop versioning if
     we decided to vectorize epilogues.  Update epilogues NITERS and
     construct ADVANCE to update epilogues data references where needed.
     * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
     (vect_do_peeling, vect_update_inits_of_drs,
      determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
     * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
     created loop_vec_info's for epilogues when available.  Otherwise analyse
     epilogue separately.



Cheers,
Andre

On 28/10/2019 14:16, Richard Biener wrote:
> On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:
> 
>> Hi,
>>
>> This is the reworked patch after your comments.
>>
>> I have moved the epilogue check into the analysis, disguised under
>> '!epilogue_vinfos.is_empty ()'.  This is because I realized that I am doing
>> the "lowest threshold" check there.
>>
>> The only place where we may reject an epilogue_vinfo is when we know the
>> number of scalar iterations and we realize that the number of iterations left
>> after the main loop is not enough to enter the vectorized epilogue, so we
>> optimize away that code-gen.  The only way we know this to be true is if both
>> the number of scalar iterations and the peeling for alignment are known.  In
>> that case we know we will enter the main loop regardless, so whether the
>> threshold we use is for a lower VF or not shouldn't matter much.  I would
>> even like to think that check isn't done, but I am not sure...  Might be worth
>> checking as an optimization.
>>
>>
>> Is this OK for trunk?
> 
> +      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
> +          !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
> +       {
> ..
> +         if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
> +           pattern_worklist.safe_push (stmt_vinfo);
> +
> +         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
> +         while (related_vinfo && related_vinfo != stmt_vinfo)
> +           {
> 
> I think PHIs cannot have patterns.  You can assert
> that STMT_VINFO_RELATED_STMT is NULL I think.

Done.
> 
> +         related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
> +         while (related_vinfo && related_vinfo != stmt_vinfo)
> +           {
> +             related_worklist.safe_push (related_vinfo);
> +             /* Set BB such that the assert in
> +               'get_initial_def_for_reduction' is able to determine that
> +               the BB of the related stmt is inside this loop.  */
> +             gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
> +                            gimple_bb (new_stmt));
> +             related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
> +           }
> 
> do we really keep references to "nested" patterns?  Thus, do you
> need this loop?

Changed and added asserts.  They didn't trigger, so I suppose you are
right.  I didn't know at the time whether it was possible, so I just
erred on the side of caution.  I can remove the asserts if you want.
> 
> +  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the
> +     original main loop and thus need to be updated to refer to the cloned
> +     variables used in the epilogue.  */
> +  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
> +    {
> ...
> +                 op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
> +                                        &find_in_mapping, &mapping);
> +                 gimple_set_op (seq, j, op);
> 
> you do this for the pattern-def seq but not for the related one.
> I guess you ran into this for COND_EXPR conditions.  I wonder whether
> you could use a shared worklist for both the def-seq and the main pattern
> stmt, or at least split out the replacement so you can share it.

I think that was it yeah, reworked it now to use the same list. Less 
code, thanks!
> 
> +      /* Data references for gather loads and scatter stores do not use the
> +        updated offset we set using ADVANCE.  Instead we have to make sure the
> +        reference in the data references point to the corresponding copy of
> +        the original in the epilogue.  */
> +      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
> +       {
> +         int j;
> +         if (TREE_CODE (DR_REF (dr)) == MEM_REF)
> +           j = 0;
> +         else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)
> +           j = 1;
> +         else
> +           gcc_unreachable ();
> +
> +         if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))
> +           {
> +             DR_REF (dr) = unshare_expr (DR_REF (dr));
> +             TREE_OPERAND (DR_REF (dr), j) = *new_op;
> +           }
> 
> huh, do you really only ever see MEM_REF or ARRAY_REF here?
> I would guess using simplify_replace_tree is safer.
> There's also DR_BASE_ADDRESS - we seem to leave the DRs partially
> updated, is that correct?

Yeah can use simplify_replace_tree indeed.  And I have changed it so it 
updates DR_BASE_ADDRESS.  I think DR_BASE_ADDRESS never actually changed 
in the way we use data_references... Either way, replacing them if they 
do change is cleaner and more future proof.
> 
> Otherwise looks OK to me.
> 
> Thanks,
> Richard.
> 
> 
>> gcc/ChangeLog:
>> 2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>
>>
>>      PR 88915
>>      * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
>>      * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
>>      and make the valueize function pointer also take a void pointer.
>>      * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
>>      around vn_valueize, to call it without a context.
>>      (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
>>      * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
>>      (~_loop_vec_info): Release epilogue_vinfos.
>>      (vect_analyze_loop_costing): Use knowledge of main VF to estimate
>>      number of iterations of epilogue.
>>      (vect_analyze_loop_2): Adapt to analyse main loop for all supported
>>      vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
>>      versioning threshold needed for main loop.
>>      (vect_analyze_loop): Likewise.
>>      (find_in_mapping): New helper function.
>>      (update_epilogue_loop_vinfo): New function.
>>      (vect_transform_loop): When vectorizing epilogues re-use analysis done
>>      on main loop and call update_epilogue_loop_vinfo to update it.
>>      * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
>>      stmts on loop preheader edge.
>>      (vect_do_peeling): Enable skip-vectors when doing loop versioning if
>>      we decided to vectorize epilogues.  Update epilogues NITERS and
>>      construct ADVANCE to update epilogues data references where needed.
>>      * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
>>      (vect_do_peeling, vect_update_inits_of_drs,
>>       determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
>>      * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
>>      created loop_vec_info's for epilogues when available.  Otherwise analyse
>>      epilogue separately.
>>
>>
>>
>> Cheers,
>> Andre
>>
>
Richard Biener Oct. 29, 2019, 11:48 a.m. UTC | #17
On Mon, 28 Oct 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> Reworked according to your comments, see inline for clarification.
> 
> Is this OK for trunk?

+             gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
+             while (seq)
+               {
+                 stmt_worklist.safe_push (seq);
+                 seq = seq->next;
+               }

you're supposed to do the following, not access the ->next
implementation detail:

    for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi);
         gsi_next (&gsi))
      stmt_worklist.safe_push (gsi_stmt (gsi));


+      /* Data references for gather loads and scatter stores do not use the
+        updated offset we set using ADVANCE.  Instead we have to make sure the
+        reference in the data references point to the corresponding copy of
+        the original in the epilogue.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+       {
+         DR_REF (dr)
+           = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE,
+                                    &find_in_mapping, &mapping);
+         DR_BASE_ADDRESS (dr)
+           = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE, NULL_TREE,
+                                    &find_in_mapping, &mapping);
+       }

Hmm.  So for other DRs we account for the previous vector loop
by adjusting DR_OFFSET?  But STMT_VINFO_GATHER_SCATTER_P ends up
using (unconditionally) DR_REF here?  In that case it seems
best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and
DR_OFFSET?  I wonder how prologue peeling deals with
STMT_VINFO_GATHER_SCATTER_P ... I see the caller of
vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P.

I wonder if (as followup to not delay this further) we can
"offload" all the DR adjustment by storing ADVANCE in dr_vec_info
and accounting for that when we create the dataref pointers in
vectorizable_load/store?  That way we could avoid saving/restoring
DR_OFFSET as well. 
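
A toy sketch of the two options, purely to illustrate the trade-off.  None
of this is GCC code: toy_dr, toy_dr_vec_info, its advance field, bump_offset
and initial_offset are hypothetical names, and the real dr_vec_info has no
such field as far as this thread shows.

```cpp
#include <cassert>

// Toy data reference: an offset fixed at analysis time.
struct toy_dr
{
  long offset;
};

// Toy per-DR vectorizer info; ADVANCE is the hypothetical new field.
struct toy_dr_vec_info
{
  toy_dr *dr = nullptr;
  long advance = 0;
};

// Option 1: mutate the DR itself and hand back the old offset so the
// caller can restore it later.
static long
bump_offset (toy_dr &dr, long advance)
{
  long saved = dr.offset;
  dr.offset += advance;
  return saved;
}

// Option 2: leave the DR alone and fold ADVANCE in where the initial
// pointer offset is computed.
static long
initial_offset (const toy_dr_vec_info &info)
{
  return info.dr->offset + info.advance;
}
```

In the second form the shared DR is never modified, so there is nothing to
save or restore between the main loop and its epilogues.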

So, the patch is OK with the sequence iteration fixed.  I think
sorting out the above can be done as followup.

Thanks,
Richard.


Patch

diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..56ffa08a7dee54837441f0c743f8c0faa285c74b 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@  main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..3b5f14c45b5b9b601120c6776734bbafefe1e178 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2401,7 +2401,8 @@  class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 bool vect_epilogues_nomask)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2474,7 +2475,8 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues_nomask));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2966,9 +2968,7 @@  vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
    *COND_EXPR_STMT_LIST.  */
 
 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +2988,15 @@  vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..305ee2b06eabde9091049da829e6fc93161aa13f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1858,7 +1858,8 @@  vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
    for it.  The different analyses will record information in the
    loop_vec_info struct.  */
 static opt_result
-vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
+vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts,
+		     bool *vect_epilogues_nomask)
 {
   opt_result ok = opt_result::success ();
   int res;
@@ -2179,6 +2180,11 @@  start_over:
         }
     }
 
+  /* Disable epilogue vectorization if versioning is required because of the
+     iteration count.  TODO: Needs investigation as to whether it is possible
+     to vectorize epilogues in this case.  */
+  *vect_epilogues_nomask &= !LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+
   /* During peeling, we need to check if number of loop iterations is
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
@@ -2186,6 +2192,7 @@  start_over:
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
       poly_uint64 niters_th = 0;
+      unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
       if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
 	{
@@ -2206,6 +2213,14 @@  start_over:
       /* One additional iteration because of peeling for gap.  */
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	niters_th += 1;
+
+      /*  Use the same condition as vect_transform_loop to decide when to use
+	  the cost to determine a versioning threshold.  */
+      if (th >= vect_vf_for_cost (loop_vinfo)
+	  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && ordered_p (th, niters_th))
+	niters_th = ordered_max (poly_uint64 (th), niters_th);
+
       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
     }
 
@@ -2329,7 +2344,7 @@  again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, bool *vect_epilogues_nomask)
 {
   auto_vector_sizes vector_sizes;
 
@@ -2357,6 +2372,7 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  unsigned vectorized_loops = 0;
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2376,14 +2392,17 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
 
-      opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
+      opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts,
+					    vect_epilogues_nomask);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || *vect_epilogues_nomask)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
@@ -2392,7 +2411,13 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Set versioning threshold of the original LOOP_VINFO based
+		     on the last vectorization of the epilog.  */
+		  LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo)
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		  delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2401,7 +2426,12 @@  vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 	    }
 	}
       else
-	delete loop_vinfo;
+	{
+	  /* Disable epilogue vectorization if we cannot determine that the
+	     epilogues can be vectorized.  */
+	  *vect_epilogues_nomask &= vectorized_loops > 1;
+	  delete loop_vinfo;
+	}
 
       if (next_size == 0)
 	autodetected_vector_size = current_vector_size;
@@ -8468,7 +8498,7 @@  vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
    Returns scalar epilogue loop if any.  */
 
 class loop *
-vect_transform_loop (loop_vec_info loop_vinfo)
+vect_transform_loop (loop_vec_info loop_vinfo, bool vect_epilogues_nomask)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   class loop *epilogue = NULL;
@@ -8497,11 +8527,11 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "Profitability threshold is %d loop iterations.\n",
-                         th);
+			 th);
       check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8519,18 +8549,8 @@  vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
-      poly_uint64 versioning_threshold
-	= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-      if (check_profitability
-	  && ordered_p (poly_uint64 (th), versioning_threshold))
-	{
-	  versioning_threshold = ordered_max (poly_uint64 (th),
-					      versioning_threshold);
-	  check_profitability = false;
-	}
       class loop *sloop
-	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+	= vect_loop_versioning (loop_vinfo);
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8557,7 +8577,8 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      vect_epilogues_nomask);
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8818,7 +8839,7 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     epilogue = NULL;
 
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+  if (!vect_epilogues_nomask)
     epilogue = NULL;
 
   if (epilogue)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..e87170c592036a6f3f5330e1ebf5d125441861a6 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1480,10 +1480,10 @@  extern void vect_set_loop_condition (class loop *, loop_vec_info,
 extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
-class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
@@ -1610,7 +1610,8 @@  extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    bool *);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
@@ -1622,7 +1623,7 @@  extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 
 /* Drive for loop transformation stage.  */
-extern class loop *vect_transform_loop (loop_vec_info);
+extern class loop *vect_transform_loop (loop_vec_info, bool);
 extern opt_loop_vec_info vect_analyze_loop_form (class loop *,
 						 vec_info_shared *);
 extern bool vectorizable_live_operation (stmt_vec_info, gimple_stmt_iterator *,
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -61,6 +61,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "gimple.h"
 #include "predict.h"
+#include "params.h"
 #include "tree-pass.h"
 #include "ssa.h"
 #include "cgraph.h"
@@ -875,6 +876,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  bool vect_epilogues_nomask = PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -884,7 +886,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
   opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
+    = vect_analyze_loop (loop, orig_loop_vinfo, &shared, &vect_epilogues_nomask);
   loop->aux = loop_vinfo;
 
   if (!loop_vinfo)
@@ -980,7 +982,7 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 			 "loop vectorized using variable length vectors\n");
     }
 
-  loop_p new_loop = vect_transform_loop (loop_vinfo);
+  loop_p new_loop = vect_transform_loop (loop_vinfo, vect_epilogues_nomask);
   (*num_vectorized_loops)++;
   /* Now that the loop has been vectorized, allow it to be unrolled
      etc.  */
@@ -1013,8 +1015,13 @@  try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops"
+	 count.  */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }