
[RFC] Don't move cold code out of loop by checking bb count

Message ID 20210802050501.159058-1-luoxhu@linux.ibm.com
State New
Series [RFC] Don't move cold code out of loop by checking bb count

Commit Message

Xionghu Luo Aug. 2, 2021, 5:05 a.m. UTC
There was a patch trying to avoid moving cold blocks out of loops:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

This patch adds that profile count check to both gimple LIM
move_computations_worker and RTL loop-invariant.c find_invariants_bb:
if a loop bb is colder than the loop preheader, its statements are not
hoisted out of the loop.
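
As a minimal illustration (a hypothetical example, not taken from the
patch or its testsuite), the invariant multiply below sits in a block
that is colder than the preheader, so with this check it stays under
the unlikely branch instead of being hoisted into the hotter preheader:

  int f (int *a, int n, int x, int y)
  {
    int s = 0;
    for (int i = 0; i < n; i++)
      {
        if (__builtin_expect (a[i] < 0, 0))  /* cold block */
          s += x * y;                        /* invariant, but cold */
        s += a[i];
      }
    return s;
  }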

Also, the profile count in the loop split pass needs to be corrected to
avoid mismatched behavior between lim2 and lim4: currently the new loop
preheader generated by loop_version gets "[count: 0]:", so lim4, running
after the lsplt pass, unexpectedly moves statements out of the loop that
lim2 did not move.  This change fixes the 544.nab_r regression from
-1.55% to +0.46%.

SPEC2017 performance evaluation shows a 1% improvement in the intrate
GEOMEAN and no obvious regression elsewhere.  In particular,
500.perlbench_r improves by +7.52% (perf shows that the function
S_regtry of perlbench is largely improved), 548.exchange2_r by +1.98%,
and 526.blender_r by +1.00% on P8LE.

Bootstrapped and regression tested on P8LE; any comments?  Thanks.

gcc/ChangeLog:

	* loop-invariant.c (find_invariants_bb): Check profile count
	before motion.
	(find_invariants_body): Add argument.
	* tree-ssa-loop-im.c (move_computations_worker): Check profile
	count before motion.
	(execute_sm): Likewise.
	(execute_sm_exit): Check pointer validness.
	* tree-ssa-loop-split.c (split_loop): Correct probability.
	(do_split_loop_on_cond): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
---
 gcc/loop-invariant.c                    |  10 +-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
 gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
 gcc/tree-ssa-loop-split.c               |  14 +-
 4 files changed, 177 insertions(+), 13 deletions(-)

Comments

Richard Biener Aug. 6, 2021, 12:15 p.m. UTC | #1
On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> This patch does this profile count check in both gimple LIM
> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
> if the loop bb is colder than loop preheader, don't hoist it out of
> loop.
>
> Also, the profile count in loop split pass should be corrected to avoid
> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
> move statement out of loop unexpectely when lim2 didn't move it.  This
> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.

While I'm not familiar with the RTL invariant motion pass the patch there
looks reasonable.  Note that we should assess the profile quality
somehow - I'm not sure how to do that, CCed Honza for that.

For the GIMPLE part the patch looks quite complicated - but note it
probably has to be since LIM performs kind of a "CSE" on loads
(and stores for store-motion), so when there are multiple stmts
affected by a hoisting decision the biggest block count has to be
accounted for.  Likewise when there are dependent stmts involved,
possibly including conditional stmts (a "PHI"), the overall
cost should be looked at.
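
(As an illustration of the "CSE on loads" point -- a hypothetical sketch,
not taken from the thread: LIM would treat the two reads of *p below as a
single hoisting candidate, so the decision has to account for the hotter
of the two blocks, not only the cold one.)

   for (;;)
     {
        if (cold_cond)
          x += *p;      /* cold use of the load */
        y += *p;        /* hot use of the same load */
     }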

Now - GIMPLE LIM "costing" is somewhat backward right now
and it isn't set up to consider those multiple involved stmts.  Plus
the store-motion part does not have any cost part (but it depends
on previously decided invariant motions).

I think the way you implemented the check will cause no hoisting
to be performed instead of, say, hoisting to a different loop level
only.  This is possibly shown when you consider a loop nest like

  for (;;)
    if (unlikely_cond)
      for (;;)
         invariant;

we want to hoist 'invariant' but only from the inner loop even if it
is invariant also in the outer loop.  But for example if there is
a store motion opportunity like

  for (;;)
     {
        if (unlikely_cond)
          for (;;)
            a = ...;
        a = ...;
     }

we'd still want to perform the store motion on the outer loop.

Note that store-motion already performs part of the transform
before dependent code is moved in move_computations (that
you patched).

IIRC your main concern was the COND_EXPRs we insert
for hoisted conditional stmts?

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * loop-invariant.c (find_invariants_bb): Check profile count
>         before motion.
>         (find_invariants_body): Add argument.
>         * tree-ssa-loop-im.c (move_computations_worker): Check profile
>         count before motion.
>         (execute_sm): Likewise.
>         (execute_sm_exit): Check pointer validness.
>         * tree-ssa-loop-split.c (split_loop): Correct probability.
>         (do_split_loop_on_cond): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> ---
>  gcc/loop-invariant.c                    |  10 +-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
>  gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
>  gcc/tree-ssa-loop-split.c               |  14 +-
>  4 files changed, 177 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index bdc7b59dd5f..7b5d64d11f9 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>     call.  */
>
>  static void
> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> +                   bool always_executed)
>  {
>    rtx_insn *insn;
> +  basic_block preheader = loop_preheader_edge (loop)->src;
> +
> +  if (preheader->count > bb->count)
> +    return;
>
>    FOR_BB_INSNS (bb, insn)
>      {
> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>    unsigned i;
>
>    for (i = 0; i < loop->num_nodes; i++)
> -    find_invariants_bb (body[i],
> -                       bitmap_bit_p (always_reached, i),
> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>                         bitmap_bit_p (always_executed, i));
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 7de47edbcb3..2bfb5e8ec15 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
>           continue;
>         }
>
> +      edge e = loop_preheader_edge (level);
> +      if (e->src->count > bb->count)
> +       {
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           {
> +             fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
> +                      e->src->index, bb->index);
> +             print_gimple_stmt (dump_file, stmt, 0);
> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> +                      level->num);
> +           }
> +         gsi_next (&bsi);
> +         continue;
> +       }
> +      else
> +       {
> +         unsigned i;
> +         bool skip_phi_move = false;
> +         for (i = 0; i < gimple_phi_num_args (stmt); i++)
> +           {
> +             tree def = PHI_ARG_DEF (stmt, i);
> +
> +             if (TREE_CODE (def) != SSA_NAME)
> +               continue;
> +
> +             gimple *def_stmt = SSA_NAME_DEF_STMT (def);
> +
> +             if (!gimple_bb (def_stmt))
> +               continue;
> +
> +             if (!dominated_by_p (CDI_DOMINATORS, e->src,
> +                                  gimple_bb (def_stmt)))
> +               {
> +                 if (dump_file && (dump_flags & TDF_DETAILS))
> +                   {
> +                     fprintf (dump_file,
> +                              "PHI node NOT moved to %d [local count:%d] from "
> +                              "%d [local count:%d]:\n",
> +                              e->src->index, e->src->count.value (), bb->index,
> +                              bb->count.value ());
> +                     print_gimple_stmt (dump_file, stmt, 0);
> +                     fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> +                              level->num);
> +                   }
> +                 skip_phi_move = true;
> +                 break;
> +               }
> +           }
> +         if (skip_phi_move)
> +           {
> +             gsi_next (&bsi);
> +             continue;
> +           }
> +       }
> +
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
>           fprintf (dump_file, "Moving PHI node\n");
> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
>           tree lhs = gimple_assign_lhs (new_stmt);
>           SSA_NAME_RANGE_INFO (lhs) = NULL;
>         }
> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> +      gsi_insert_on_edge (e, new_stmt);
>        remove_phi_node (&bsi, false);
>      }
>
>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>      {
>        edge e;
> -
>        gimple *stmt = gsi_stmt (bsi);
>
>        lim_data = get_lim_data (stmt);
> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }
> +
> +      e = loop_preheader_edge (level);
> +      if (e->src->count > bb->count)
> +       {
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           {
> +             fprintf (dump_file,
> +                      "stmt: Statement NOT moved to %d [local count:%d] from "
> +                      "%d [local count:%d]:\n",
> +                      e->src->index, e->src->count.value (), bb->index,
> +                      bb->count.value ());
> +             print_gimple_stmt (dump_file, stmt, 0);
> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> +                      level->num);
> +           }
> +         gsi_next (&bsi);
> +         continue;
> +       }
> +      else
> +       {
> +         if (is_gimple_assign (stmt))
> +           {
> +             tree rhs1 = gimple_assign_rhs1 (stmt);
> +             tree rhs2 = gimple_assign_rhs2 (stmt);
> +             if (TREE_CODE (rhs1) == MEM_REF)
> +               {
> +                 rhs2 = TREE_OPERAND (rhs1, 1);
> +                 rhs1 = TREE_OPERAND (rhs1, 0);
> +               }
> +             gimple *stmt1 = NULL, *stmt2 = NULL;
> +             basic_block def_bb;
> +             if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
> +               {
> +                 stmt1 = SSA_NAME_DEF_STMT (rhs1);
> +                 def_bb = gimple_bb (stmt1);
> +                 if (stmt1
> +                     && def_bb
> +                     && (def_bb == bb
> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
> +                   {
> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> +                       {
> +                         fprintf (dump_file,
> +                                  "stmt1: Statement NOT moved to %d [local "
> +                                  "count:%d] from %d [local count:%d]:\n",
> +                                  e->src->index, e->src->count.value (),
> +                                  bb->index, bb->count.value ());
> +                         print_gimple_stmt (dump_file, stmt, 0);
> +                         fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
> +                                  cost, level->num);
> +                       }
> +                     gsi_next (&bsi);
> +                     continue;
> +                   }
> +               }
> +             if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
> +               {
> +                 stmt2 = SSA_NAME_DEF_STMT (rhs2);
> +                 def_bb = gimple_bb (stmt2);
> +                 if (stmt2 && def_bb
> +                     && (def_bb == bb
> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
> +                   {
> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> +                       {
> +                         fprintf (dump_file,
> +                                  "stmt2: Statement NOT moved to %d [local "
> +                                  "count:%d] from %d [local count:%d]:\n",
> +                                  e->src->index, e->src->count.value (),
> +                                  bb->index, bb->count.value ());
> +                         print_gimple_stmt (dump_file, stmt, 0);
> +                         fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
> +                                  cost, level->num);
> +                       }
> +                     gsi_next (&bsi);
> +                     continue;
> +                   }
> +               }
> +           }
> +       }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
>                    cost, level->num);
>         }
>
> -      e = loop_preheader_edge (level);
>        gcc_assert (!gimple_vdef (stmt));
>        if (gimple_vuse (stmt))
>         {
> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>    bool multi_threaded_model_p = false;
>    gimple_stmt_iterator gsi;
>    sm_aux *aux = new sm_aux;
> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> +
> +  edge e = loop_preheader_edge (loop);
> +  if (e->src->count > bb->count)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "Don't execute store motion of ");
> +         print_generic_expr (dump_file, ref->mem.ref);
> +         fprintf (dump_file, " from loop %d\n", loop->num);
> +       }
> +      return;
> +    }
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      {
> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>         }
>        else
>         {
> -         sm_aux *aux = *aux_map.get (ref);
> +         sm_aux **paux = aux_map.get (ref);
> +         sm_aux *aux;
> +         if (paux)
> +           aux = *paux;
> +         else
> +           continue;
>           if (!aux->store_flag || kind == sm_ord)
>             {
>               gassign *store;
> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> index 3a09bbc39e5..4cae82936b9 100644
> --- a/gcc/tree-ssa-loop-split.c
> +++ b/gcc/tree-ssa-loop-split.c
> @@ -577,14 +577,17 @@ split_loop (class loop *loop1)
>         if (!initial_true)
>           cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
>
> +       edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
> +                          ? EDGE_SUCC (bbs[i], 0)
> +                          : EDGE_SUCC (bbs[i], 1);
>         /* Now version the loop, placing loop2 after loop1 connecting
>            them, and fix up SSA form for that.  */
>         initialize_original_copy_tables ();
>         basic_block cond_bb;
>
>         class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> -                                          profile_probability::always (),
> -                                          profile_probability::always (),
> +                                          true_edge->probability,
> +                                          true_edge->probability.invert (),
>                                            profile_probability::always (),
>                                            profile_probability::always (),
>                                            true);
> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>    initialize_original_copy_tables ();
>
>    struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> -                                    profile_probability::always (),
> -                                    profile_probability::never (),
> +                                    invar_branch->probability,
> +                                    invar_branch->probability.invert (),
>                                      profile_probability::always (),
>                                      profile_probability::always (),
>                                      true);
> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>    to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
>    to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
>
> +  to_loop1->probability = invar_branch->probability.invert ();
> +  to_loop2->probability = invar_branch->probability;
> +
>    /* Due to introduction of a control flow edge from loop1 latch to loop2
>       pre-header, we should update PHIs in loop2 to reflect this connection
>       between loop1 and loop2.  */
> --
> 2.27.0.90.geebb51ba8c
>
Xionghu Luo Aug. 10, 2021, 2:03 a.m. UTC | #2
Hi,

On 2021/8/6 20:15, Richard Biener wrote:
> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> There was a patch trying to avoid move cold block out of loop:
>>
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>
>> Richard suggested to "never hoist anything from a bb with lower execution
>> frequency to a bb with higher one in LIM invariantness_dom_walker
>> before_dom_children".
>>
>> This patch does this profile count check in both gimple LIM
>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
>> if the loop bb is colder than loop preheader, don't hoist it out of
>> loop.
>>
>> Also, the profile count in loop split pass should be corrected to avoid
>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
>> move statement out of loop unexpectely when lim2 didn't move it.  This
>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>
>> SPEC2017 performance evaluation shows 1% performance improvement for
>> intrate GEOMEAN and no obvious regression for others.  Especially,
>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>> on P8LE.
>>
>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
> 
> While I'm not familiar with the RTL invariant motion pass the patch there
> looks reasonable.  Note that we should assess the profile quality
> somehow - I'm not sure how to do that, CCed Honza for that.

Thanks.

> 
> For the GIMPLE part the patch looks quite complicated - but note it
> probably has to be since LIM performs kind of a "CSE" on loads
> (and stores for store-motion), so when there are multiple stmts
> affected by a hoisting decision the biggest block count has to be
> accounted.  Likewise when there are dependent stmts involved
> that might include conditional stmts (a "PHI"), but the overall
> cost should be looked at.

Currently, the gimple code checks two situations with the patch:
1) The statement's or PHI's BB is *colder* than the preheader: don't move it
out of the loop;
2) The statement's or PHI's BB is *hotter* than the preheader, but one of its
RHS operands couldn't be moved out of the loop: also don't move it out of the
loop, to avoid a definition-does-not-dominate-use error.
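
Very roughly, the two cases look like this (a pseudocode sketch, just for
illustration):

  for (;;)
    {
      if (__builtin_expect (cond, 0))
        t = x / y;   /* case 1: invariant but in a colder bb, kept in the loop */
      r = t + 1;     /* case 2: this bb is hotter, but the def of 't' now stays
                        inside the loop, so it is not moved either */
    }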

Maybe I could collect the number of instructions not hoisted with the patch
on the regression tests and SPEC2017 to estimate how much "multiple stmts
affected" and "overall cost" need to be considered?  But it seems
move_computations_worker couldn't roll back if we still want to hoist multiple
stmts out during the iterations?

> 
> Now - GIMPLE LIM "costing" is somewhat backward right now
> and it isn't set up to consider those multiple involved stmts.  Plus
> the store-motion part does not have any cost part (but it depends
> on previously decided invariant motions).
> 
> I think the way you implemented the check will cause no hoisting
> to be performed instead of, say, hoisting to a different loop level
> only.  Possibly shown when you consider a loop nest like
> 
>    for (;;)
>      if (unlikely_cond)
>        for (;;)
>           invariant;
> 
> we want to hoist 'invariant' but only from the inner loop even if it
> is invariant also in the outer loop.


For this case, theoretically I think master GCC will optimize it to:

  invariant;
  for (;;)
    if (unlikely_cond)
      for (;;)
         ;

'invariant' is moved out of the outer loop, but with the patch, we will get:

  for (;;)
    if (unlikely_cond)
      {
        invariant;
        for (;;)
           ;
      }

'invariant' is *cold* for the outer loop, but it is still *hot* for the inner
loop, so it is hoisted out of the inner loop only; this is exactly what we
want, right?


>  But for example if there is
> a store motion opportunity like
> 
>    for (;;)
>       {
>          if (unlikely_cond)
>            for (;;)
>              a = ...;
>          a = ...;
>       }
> 
> we'd still want to perform the store motion on the outer loop.
> 
> Note that store-motion already performs part of the transform
> before dependent code is moved in move_computations (that
> you patched).

Yes.  do_store_motion runs before move_computations_worker; store motion
happens earlier in execute_sm, and I also added the check in execute_sm to
stop a cold store from being moved out of the loop.  So for your case, I
think my patch will similarly optimize it to:

  for (;;)
     {
        if (unlikely_cond)
          {
            for (;;)
              ;
            a = ...;
          }
     }
    a = ...;

Is this better?  I will construct cases to verify it.

> 
> IIRC your main concern were the COND_EXPRs we insert
> for hoisted conditional stmts?

Not sure what you mean here by COND_EXPRs?


Thanks,
Xionghu

> 
> Thanks,
> Richard.
> 
>> gcc/ChangeLog:
>>
>>          * loop-invariant.c (find_invariants_bb): Check profile count
>>          before motion.
>>          (find_invariants_body): Add argument.
>>          * tree-ssa-loop-im.c (move_computations_worker): Check profile
>>          count before motion.
>>          (execute_sm): Likewise.
>>          (execute_sm_exit): Check pointer validness.
>>          * tree-ssa-loop-split.c (split_loop): Correct probability.
>>          (do_split_loop_on_cond): Likewise.
>>
>> gcc/testsuite/ChangeLog:
>>
>>          * gcc.dg/tree-ssa/recip-3.c: Adjust.
>> ---
>>   gcc/loop-invariant.c                    |  10 +-
>>   gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
>>   gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
>>   gcc/tree-ssa-loop-split.c               |  14 +-
>>   4 files changed, 177 insertions(+), 13 deletions(-)
>>
>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>> index bdc7b59dd5f..7b5d64d11f9 100644
>> --- a/gcc/loop-invariant.c
>> +++ b/gcc/loop-invariant.c
>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>      call.  */
>>
>>   static void
>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>> +                   bool always_executed)
>>   {
>>     rtx_insn *insn;
>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>> +
>> +  if (preheader->count > bb->count)
>> +    return;
>>
>>     FOR_BB_INSNS (bb, insn)
>>       {
>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>     unsigned i;
>>
>>     for (i = 0; i < loop->num_nodes; i++)
>> -    find_invariants_bb (body[i],
>> -                       bitmap_bit_p (always_reached, i),
>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>                          bitmap_bit_p (always_executed, i));
>>   }
>>
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
>> index 638bf38db8c..641c91e719e 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
>> @@ -23,4 +23,4 @@ float h ()
>>          F[0] += E / d;
>>   }
>>
>> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
>> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>> index 7de47edbcb3..2bfb5e8ec15 100644
>> --- a/gcc/tree-ssa-loop-im.c
>> +++ b/gcc/tree-ssa-loop-im.c
>> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
>>            continue;
>>          }
>>
>> +      edge e = loop_preheader_edge (level);
>> +      if (e->src->count > bb->count)
>> +       {
>> +         if (dump_file && (dump_flags & TDF_DETAILS))
>> +           {
>> +             fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
>> +                      e->src->index, bb->index);
>> +             print_gimple_stmt (dump_file, stmt, 0);
>> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
>> +                      level->num);
>> +           }
>> +         gsi_next (&bsi);
>> +         continue;
>> +       }
>> +      else
>> +       {
>> +         unsigned i;
>> +         bool skip_phi_move = false;
>> +         for (i = 0; i < gimple_phi_num_args (stmt); i++)
>> +           {
>> +             tree def = PHI_ARG_DEF (stmt, i);
>> +
>> +             if (TREE_CODE (def) != SSA_NAME)
>> +               continue;
>> +
>> +             gimple *def_stmt = SSA_NAME_DEF_STMT (def);
>> +
>> +             if (!gimple_bb (def_stmt))
>> +               continue;
>> +
>> +             if (!dominated_by_p (CDI_DOMINATORS, e->src,
>> +                                  gimple_bb (def_stmt)))
>> +               {
>> +                 if (dump_file && (dump_flags & TDF_DETAILS))
>> +                   {
>> +                     fprintf (dump_file,
>> +                              "PHI node NOT moved to %d [local count:%d] from "
>> +                              "%d [local count:%d]:\n",
>> +                              e->src->index, e->src->count.value (), bb->index,
>> +                              bb->count.value ());
>> +                     print_gimple_stmt (dump_file, stmt, 0);
>> +                     fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
>> +                              level->num);
>> +                   }
>> +                 skip_phi_move = true;
>> +                 break;
>> +               }
>> +           }
>> +         if (skip_phi_move)
>> +           {
>> +             gsi_next (&bsi);
>> +             continue;
>> +           }
>> +       }
>> +
>>         if (dump_file && (dump_flags & TDF_DETAILS))
>>          {
>>            fprintf (dump_file, "Moving PHI node\n");
>> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
>>            tree lhs = gimple_assign_lhs (new_stmt);
>>            SSA_NAME_RANGE_INFO (lhs) = NULL;
>>          }
>> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
>> +      gsi_insert_on_edge (e, new_stmt);
>>         remove_phi_node (&bsi, false);
>>       }
>>
>>     for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>>       {
>>         edge e;
>> -
>>         gimple *stmt = gsi_stmt (bsi);
>>
>>         lim_data = get_lim_data (stmt);
>> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
>>         /* We do not really want to move conditionals out of the loop; we just
>>           placed it here to force its operands to be moved if necessary.  */
>>         if (gimple_code (stmt) == GIMPLE_COND)
>> -       continue;
>> +       {
>> +         gsi_next (&bsi);
>> +         continue;
>> +       }
>> +
>> +      e = loop_preheader_edge (level);
>> +      if (e->src->count > bb->count)
>> +       {
>> +         if (dump_file && (dump_flags & TDF_DETAILS))
>> +           {
>> +             fprintf (dump_file,
>> +                      "stmt: Statement NOT moved to %d [local count:%d] from "
>> +                      "%d [local count:%d]:\n",
>> +                      e->src->index, e->src->count.value (), bb->index,
>> +                      bb->count.value ());
>> +             print_gimple_stmt (dump_file, stmt, 0);
>> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
>> +                      level->num);
>> +           }
>> +         gsi_next (&bsi);
>> +         continue;
>> +       }
>> +      else
>> +       {
>> +         if (is_gimple_assign (stmt))
>> +           {
>> +             tree rhs1 = gimple_assign_rhs1 (stmt);
>> +             tree rhs2 = gimple_assign_rhs2 (stmt);
>> +             if (TREE_CODE (rhs1) == MEM_REF)
>> +               {
>> +                 rhs2 = TREE_OPERAND (rhs1, 1);
>> +                 rhs1 = TREE_OPERAND (rhs1, 0);
>> +               }
>> +             gimple *stmt1 = NULL, *stmt2 = NULL;
>> +             basic_block def_bb;
>> +             if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
>> +               {
>> +                 stmt1 = SSA_NAME_DEF_STMT (rhs1);
>> +                 def_bb = gimple_bb (stmt1);
>> +                 if (stmt1
>> +                     && def_bb
>> +                     && (def_bb == bb
>> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
>> +                   {
>> +                     if (dump_file && (dump_flags & TDF_DETAILS))
>> +                       {
>> +                         fprintf (dump_file,
>> +                                  "stmt1: Statement NOT moved to %d [local "
>> +                                  "count:%d] from %d [local count:%d]:\n",
>> +                                  e->src->index, e->src->count.value (),
>> +                                  bb->index, bb->count.value ());
>> +                         print_gimple_stmt (dump_file, stmt, 0);
>> +                         fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
>> +                                  cost, level->num);
>> +                       }
>> +                     gsi_next (&bsi);
>> +                     continue;
>> +                   }
>> +               }
>> +             if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
>> +               {
>> +                 stmt2 = SSA_NAME_DEF_STMT (rhs2);
>> +                 def_bb = gimple_bb (stmt2);
>> +                 if (stmt2 && def_bb
>> +                     && (def_bb == bb
>> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
>> +                   {
>> +                     if (dump_file && (dump_flags & TDF_DETAILS))
>> +                       {
>> +                         fprintf (dump_file,
>> +                                  "stmt2: Statement NOT moved to %d [local "
>> +                                  "count:%d] from %d [local count:%d]:\n",
>> +                                  e->src->index, e->src->count.value (),
>> +                                  bb->index, bb->count.value ());
>> +                         print_gimple_stmt (dump_file, stmt, 0);
>> +                         fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
>> +                                  cost, level->num);
>> +                       }
>> +                     gsi_next (&bsi);
>> +                     continue;
>> +                   }
>> +               }
>> +           }
>> +       }
>>
>>         if (dump_file && (dump_flags & TDF_DETAILS))
>>          {
>> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
>>                     cost, level->num);
>>          }
>>
>> -      e = loop_preheader_edge (level);
>>         gcc_assert (!gimple_vdef (stmt));
>>         if (gimple_vuse (stmt))
>>          {
>> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>>     bool multi_threaded_model_p = false;
>>     gimple_stmt_iterator gsi;
>>     sm_aux *aux = new sm_aux;
>> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
>> +
>> +  edge e = loop_preheader_edge (loop);
>> +  if (e->src->count > bb->count)
>> +    {
>> +      if (dump_file && (dump_flags & TDF_DETAILS))
>> +       {
>> +         fprintf (dump_file, "Don't execute store motion of ");
>> +         print_generic_expr (dump_file, ref->mem.ref);
>> +         fprintf (dump_file, " from loop %d\n", loop->num);
>> +       }
>> +      return;
>> +    }
>>
>>     if (dump_file && (dump_flags & TDF_DETAILS))
>>       {
>> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>>          }
>>         else
>>          {
>> -         sm_aux *aux = *aux_map.get (ref);
>> +         sm_aux **paux = aux_map.get (ref);
>> +         sm_aux *aux;
>> +         if (paux)
>> +           aux = *paux;
>> +         else
>> +           continue;
>>            if (!aux->store_flag || kind == sm_ord)
>>              {
>>                gassign *store;
>> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
>> index 3a09bbc39e5..4cae82936b9 100644
>> --- a/gcc/tree-ssa-loop-split.c
>> +++ b/gcc/tree-ssa-loop-split.c
>> @@ -577,14 +577,17 @@ split_loop (class loop *loop1)
>>          if (!initial_true)
>>            cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
>>
>> +       edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
>> +                          ? EDGE_SUCC (bbs[i], 0)
>> +                          : EDGE_SUCC (bbs[i], 1);
>>          /* Now version the loop, placing loop2 after loop1 connecting
>>             them, and fix up SSA form for that.  */
>>          initialize_original_copy_tables ();
>>          basic_block cond_bb;
>>
>>          class loop *loop2 = loop_version (loop1, cond, &cond_bb,
>> -                                          profile_probability::always (),
>> -                                          profile_probability::always (),
>> +                                          true_edge->probability,
>> +                                          true_edge->probability.invert (),
>>                                             profile_probability::always (),
>>                                             profile_probability::always (),
>>                                             true);
>> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>>     initialize_original_copy_tables ();
>>
>>     struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
>> -                                    profile_probability::always (),
>> -                                    profile_probability::never (),
>> +                                    invar_branch->probability,
>> +                                    invar_branch->probability.invert (),
>>                                       profile_probability::always (),
>>                                       profile_probability::always (),
>>                                       true);
>> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>>     to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
>>     to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
>>
>> +  to_loop1->probability = invar_branch->probability.invert ();
>> +  to_loop2->probability = invar_branch->probability;
>> +
>>     /* Due to introduction of a control flow edge from loop1 latch to loop2
>>        pre-header, we should update PHIs in loop2 to reflect this connection
>>        between loop1 and loop2.  */
>> --
>> 2.27.0.90.geebb51ba8c
>>
Ulrich Drepper Aug. 10, 2021, 4:25 a.m. UTC | #3
On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
> For this case, theorotically I think the master GCC will optimize it to:
>
>   invariant;
>   for (;;)
>     if (unlikely_cond)
>       for (;;)
>          ;
>
> 'invariant' is moved out of outer loop, but with the patch, it will get:
>
>   for (;;)
>     if (unlikely_cond)
>       {
>         invariant;
>         for (;;)
>            ;
>       }
>
> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
> so hoist it out of inner loop, this is exactly what we want, right?

Is relying on absolute numbers really what you want?  If the
'unlikely_cond' condition depends on the iteration count of the outer
loop, the probability of it being true in each individual iteration can
be low (at least that's how I use unlikely), but the overall
probability of needing the code is higher: 1 - (1 - p)^n, where 'p' is
the probability of 'unlikely_cond' and 'n' is the number of iterations.
That assumes complete independence of the loop iterations; otherwise
it's rather an upper limit.
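
(As a concrete, made-up data point: with p = 0.01 per iteration and
n = 200 iterations, 1 - (1 - 0.01)^200 is about 0.87, so a per-iteration
'unlikely' condition can still be very likely to trigger at least once
over the whole loop.)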

At the very least I'd generate code like this:

  first = true;
  for (;;)
    if (unlikely_cond)
      {
        if (first)
          {
            invariant;
            first = false;
          }
        for (;;)
           ;
      }

If it's worth hoisting the code, the extra test and flag should be
small in cost in comparison.

If 'unlikely_cond' does not in any way depend on the loop iteration
then I think your code generation is fine.
Richard Biener Aug. 26, 2021, 11:33 a.m. UTC | #4
On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> Hi,
>
> On 2021/8/6 20:15, Richard Biener wrote:
> > On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >> There was a patch trying to avoid move cold block out of loop:
> >>
> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>
> >> Richard suggested to "never hoist anything from a bb with lower execution
> >> frequency to a bb with higher one in LIM invariantness_dom_walker
> >> before_dom_children".
> >>
> >> This patch does this profile count check in both gimple LIM
> >> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
> >> if the loop bb is colder than loop preheader, don't hoist it out of
> >> loop.
> >>
> >> Also, the profile count in loop split pass should be corrected to avoid
> >> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
> >> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
> >> move statement out of loop unexpectely when lim2 didn't move it.  This
> >> change could fix regression on 544.nab_r from -1.55% to +0.46%.
> >>
> >> SPEC2017 performance evaluation shows 1% performance improvement for
> >> intrate GEOMEAN and no obvious regression for others.  Especially,
> >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >> on P8LE.
> >>
> >> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
> >
> > While I'm not familiar with the RTL invariant motion pass the patch there
> > looks reasonable.  Note that we should assess the profile quality
> > somehow - I'm not sure how to do that, CCed Honza for that.
>
> Thanks.
>
> >
> > For the GIMPLE part the patch looks quite complicated - but note it
> > probably has to be since LIM performs kind of a "CSE" on loads
> > (and stores for store-motion), so when there are multiple stmts
> > affected by a hoisting decision the biggest block count has to be
> > accounted.  Likewise when there are dependent stmts involved
> > that might include conditional stmts (a "PHI"), but the overall
> > cost should be looked at.
>
> Currently, The gimple code check two situations with the patch:
> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
> of loop;
> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
> couldn't be moved out of loop, also don't move it out of loop to avoid definition
> not dominates use error.

But part 2) is obviously already done.  What I tried to say is that your
heuristic doesn't integrate nicely with the pass, though I admitted it might
be a bit difficult to find a place to add this heuristic.

There is lim_data->cost which we could bias negatively, but then this is
a cost that is independent of the hoisting distance.  Still, doing this would
work at least for the case where the immediately enclosing loop preheader
is hotter than the stmt, and with that it would be a patch that's similarly
simple to the RTL one.
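
(For concreteness, a minimal sketch of such a bias -- where exactly in
compute_invariantness this would go, and how it interacts with LIM's level
selection, is left open:)

  /* After STMT's cost has been summed up: if the immediately enclosing
     loop's preheader is hotter than STMT's block, drop the cost to 0 as
     a negative bias against hoisting it.  */
  basic_block preheader = loop_preheader_edge (bb->loop_father)->src;
  if (preheader->count > bb->count)
    lim_data->cost = 0;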

Another possibility is to simply only adjust PHI processing in
compute_invariantness, capping movement according to the hotness
heuristic.  The same could be done for regular stmts there but I'm
not sure that will do good in the end since this function is supposed
to compute "correctness" (well, it also has the cost stuff), and it's
not the place to do overall cost considerations.

> May be I could collect the number of instructions not hoisted with the patch
> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
> and "overall cost" need to be considered?  But it seems move_computations_worker
> couldn't rollback if we still want to hoist multiple stmts out during the iterations?
>
> >
> > Now - GIMPLE LIM "costing" is somewhat backward right now
> > and it isn't set up to consider those multiple involved stmts.  Plus
> > the store-motion part does not have any cost part (but it depends
> > on previously decided invariant motions).
> >
> > I think the way you implemented the check will cause no hoisting
> > to be performed instead of, say, hoisting to a different loop level
> > only.  Possibly shown when you consider a loop nest like
> >
> >    for (;;)
> >      if (unlikely_cond)
> >        for (;;)
> >           invariant;
> >
> > we want to hoist 'invariant' but only from the inner loop even if it
> > is invariant also in the outer loop.
>
>
> For this case, theorotically I think the master GCC will optimize it to:
>
>   invariant;
>   for (;;)
>     if (unlikely_cond)
>       for (;;)
>          ;
>
> 'invariant' is moved out of outer loop, but with the patch, it will get:
>
>   for (;;)
>     if (unlikely_cond)
>       {
>         invariant;
>         for (;;)
>            ;
>       }
>
> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
> so hoist it out of inner loop, this is exactly what we want, right?

Yes.  I had doubts your patch would achieve that.

>
> >  But for example if there is
> > a store motion opportunity like
> >
> >    for (;;)
> >       {
> >          if (unlikely_cond)
> >            for (;;)
> >              a = ...;
> >          a = ...;
> >       }
> >
> > we'd still want to perform the store motion on the outer loop.
> >
> > Note that store-motion already performs part of the transform
> > before dependent code is moved in move_computations (that
> > you patched).
>
> Yes.  do_store_motion is running before move_computations_worker, store
> motion happens earlier in execute_sm, I also added the check in execute_sm
> to stop cold store moved out of loop.  So for your case, I think my patch
> will similarly optimize it to:
>
>   for (;;)
>      {
>         if (unlikely_cond)
>           {
>             for (;;)
>               ;
>             a = ...;
>           }
>      }
>     a = ...;
>
> Whether this is better?  Will construct cases to verify it.
>
> >
> > IIRC your main concern were the COND_EXPRs we insert
> > for hoisted conditional stmts?
>
> Not sure what you mean here of COND_EXPRs?

The PHIs we hoist and for which we insert COND_EXPRs.  IIRC that was
your original complaint.  So my question is whether we can fix the really
bad cases of moving PHIs with something local to compute_invariantness.
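
(Roughly, this is the transform in question -- a simplified sketch of what
move_computations_worker emits for a hoisted two-argument PHI:)

  /* Inside the loop (GIMPLE, simplified):  */
  if (inv_cond) goto bb_t; else goto bb_f;
  ...
  # x_3 = PHI <a_5(bb_t), b_6(bb_f)>

  /* After LIM hoists the PHI, the preheader instead gets a COND_EXPR:  */
  x_3 = inv_cond ? a_5 : b_6;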

Richard.

>
> Thanks,
> Xionghu
>
> >
> > Thanks,
> > Richard.
> >
> >> gcc/ChangeLog:
> >>
> >>          * loop-invariant.c (find_invariants_bb): Check profile count
> >>          before motion.
> >>          (find_invariants_body): Add argument.
> >>          * tree-ssa-loop-im.c (move_computations_worker): Check profile
> >>          count before motion.
> >>          (execute_sm): Likewise.
> >>          (execute_sm_exit): Check pointer validness.
> >>          * tree-ssa-loop-split.c (split_loop): Correct probability.
> >>          (do_split_loop_on_cond): Likewise.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>          * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >> ---
> >>   gcc/loop-invariant.c                    |  10 +-
> >>   gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
> >>   gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
> >>   gcc/tree-ssa-loop-split.c               |  14 +-
> >>   4 files changed, 177 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >> index bdc7b59dd5f..7b5d64d11f9 100644
> >> --- a/gcc/loop-invariant.c
> >> +++ b/gcc/loop-invariant.c
> >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
> >>      call.  */
> >>
> >>   static void
> >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >> +                   bool always_executed)
> >>   {
> >>     rtx_insn *insn;
> >> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >> +
> >> +  if (preheader->count > bb->count)
> >> +    return;
> >>
> >>     FOR_BB_INSNS (bb, insn)
> >>       {
> >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
> >>     unsigned i;
> >>
> >>     for (i = 0; i < loop->num_nodes; i++)
> >> -    find_invariants_bb (body[i],
> >> -                       bitmap_bit_p (always_reached, i),
> >> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>                          bitmap_bit_p (always_executed, i));
> >>   }
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> >> index 638bf38db8c..641c91e719e 100644
> >> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> >> @@ -23,4 +23,4 @@ float h ()
> >>          F[0] += E / d;
> >>   }
> >>
> >> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> >> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 7de47edbcb3..2bfb5e8ec15 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
> >>            continue;
> >>          }
> >>
> >> +      edge e = loop_preheader_edge (level);
> >> +      if (e->src->count > bb->count)
> >> +       {
> >> +         if (dump_file && (dump_flags & TDF_DETAILS))
> >> +           {
> >> +             fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
> >> +                      e->src->index, bb->index);
> >> +             print_gimple_stmt (dump_file, stmt, 0);
> >> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> >> +                      level->num);
> >> +           }
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >> +      else
> >> +       {
> >> +         unsigned i;
> >> +         bool skip_phi_move = false;
> >> +         for (i = 0; i < gimple_phi_num_args (stmt); i++)
> >> +           {
> >> +             tree def = PHI_ARG_DEF (stmt, i);
> >> +
> >> +             if (TREE_CODE (def) != SSA_NAME)
> >> +               continue;
> >> +
> >> +             gimple *def_stmt = SSA_NAME_DEF_STMT (def);
> >> +
> >> +             if (!gimple_bb (def_stmt))
> >> +               continue;
> >> +
> >> +             if (!dominated_by_p (CDI_DOMINATORS, e->src,
> >> +                                  gimple_bb (def_stmt)))
> >> +               {
> >> +                 if (dump_file && (dump_flags & TDF_DETAILS))
> >> +                   {
> >> +                     fprintf (dump_file,
> >> +                              "PHI node NOT moved to %d [local count:%d] from "
> >> +                              "%d [local count:%d]:\n",
> >> +                              e->src->index, e->src->count.value (), bb->index,
> >> +                              bb->count.value ());
> >> +                     print_gimple_stmt (dump_file, stmt, 0);
> >> +                     fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> >> +                              level->num);
> >> +                   }
> >> +                 skip_phi_move = true;
> >> +                 break;
> >> +               }
> >> +           }
> >> +         if (skip_phi_move)
> >> +           {
> >> +             gsi_next (&bsi);
> >> +             continue;
> >> +           }
> >> +       }
> >> +
> >>         if (dump_file && (dump_flags & TDF_DETAILS))
> >>          {
> >>            fprintf (dump_file, "Moving PHI node\n");
> >> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
> >>            tree lhs = gimple_assign_lhs (new_stmt);
> >>            SSA_NAME_RANGE_INFO (lhs) = NULL;
> >>          }
> >> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> >> +      gsi_insert_on_edge (e, new_stmt);
> >>         remove_phi_node (&bsi, false);
> >>       }
> >>
> >>     for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
> >>       {
> >>         edge e;
> >> -
> >>         gimple *stmt = gsi_stmt (bsi);
> >>
> >>         lim_data = get_lim_data (stmt);
> >> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
> >>         /* We do not really want to move conditionals out of the loop; we just
> >>           placed it here to force its operands to be moved if necessary.  */
> >>         if (gimple_code (stmt) == GIMPLE_COND)
> >> -       continue;
> >> +       {
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >> +
> >> +      e = loop_preheader_edge (level);
> >> +      if (e->src->count > bb->count)
> >> +       {
> >> +         if (dump_file && (dump_flags & TDF_DETAILS))
> >> +           {
> >> +             fprintf (dump_file,
> >> +                      "stmt: Statement NOT moved to %d [local count:%d] from "
> >> +                      "%d [local count:%d]:\n",
> >> +                      e->src->index, e->src->count.value (), bb->index,
> >> +                      bb->count.value ());
> >> +             print_gimple_stmt (dump_file, stmt, 0);
> >> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> >> +                      level->num);
> >> +           }
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >> +      else
> >> +       {
> >> +         if (is_gimple_assign (stmt))
> >> +           {
> >> +             tree rhs1 = gimple_assign_rhs1 (stmt);
> >> +             tree rhs2 = gimple_assign_rhs2 (stmt);
> >> +             if (TREE_CODE (rhs1) == MEM_REF)
> >> +               {
> >> +                 rhs2 = TREE_OPERAND (rhs1, 1);
> >> +                 rhs1 = TREE_OPERAND (rhs1, 0);
> >> +               }
> >> +             gimple *stmt1 = NULL, *stmt2 = NULL;
> >> +             basic_block def_bb;
> >> +             if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
> >> +               {
> >> +                 stmt1 = SSA_NAME_DEF_STMT (rhs1);
> >> +                 def_bb = gimple_bb (stmt1);
> >> +                 if (stmt1
> >> +                     && def_bb
> >> +                     && (def_bb == bb
> >> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
> >> +                   {
> >> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> >> +                       {
> >> +                         fprintf (dump_file,
> >> +                                  "stmt1: Statement NOT moved to %d [local "
> >> +                                  "count:%d] from %d [local count:%d]:\n",
> >> +                                  e->src->index, e->src->count.value (),
> >> +                                  bb->index, bb->count.value ());
> >> +                         print_gimple_stmt (dump_file, stmt, 0);
> >> +                         fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
> >> +                                  cost, level->num);
> >> +                       }
> >> +                     gsi_next (&bsi);
> >> +                     continue;
> >> +                   }
> >> +               }
> >> +             if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
> >> +               {
> >> +                 stmt2 = SSA_NAME_DEF_STMT (rhs2);
> >> +                 def_bb = gimple_bb (stmt2);
> >> +                 if (stmt2 && def_bb
> >> +                     && (def_bb == bb
> >> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
> >> +                   {
> >> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> >> +                       {
> >> +                         fprintf (dump_file,
> >> +                                  "stmt2: Statement NOT moved to %d [local "
> >> +                                  "count:%d] from %d [local count:%d]:\n",
> >> +                                  e->src->index, e->src->count.value (),
> >> +                                  bb->index, bb->count.value ());
> >> +                         print_gimple_stmt (dump_file, stmt, 0);
> >> +                         fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
> >> +                                  cost, level->num);
> >> +                       }
> >> +                     gsi_next (&bsi);
> >> +                     continue;
> >> +                   }
> >> +               }
> >> +           }
> >> +       }
> >>
> >>         if (dump_file && (dump_flags & TDF_DETAILS))
> >>          {
> >> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
> >>                     cost, level->num);
> >>          }
> >>
> >> -      e = loop_preheader_edge (level);
> >>         gcc_assert (!gimple_vdef (stmt));
> >>         if (gimple_vuse (stmt))
> >>          {
> >> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
> >>     bool multi_threaded_model_p = false;
> >>     gimple_stmt_iterator gsi;
> >>     sm_aux *aux = new sm_aux;
> >> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> >> +
> >> +  edge e = loop_preheader_edge (loop);
> >> +  if (e->src->count > bb->count)
> >> +    {
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +       {
> >> +         fprintf (dump_file, "Don't execute store motion of ");
> >> +         print_generic_expr (dump_file, ref->mem.ref);
> >> +         fprintf (dump_file, " from loop %d\n", loop->num);
> >> +       }
> >> +      return;
> >> +    }
> >>
> >>     if (dump_file && (dump_flags & TDF_DETAILS))
> >>       {
> >> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
> >>          }
> >>         else
> >>          {
> >> -         sm_aux *aux = *aux_map.get (ref);
> >> +         sm_aux **paux = aux_map.get (ref);
> >> +         sm_aux *aux;
> >> +         if (paux)
> >> +           aux = *paux;
> >> +         else
> >> +           continue;
> >>            if (!aux->store_flag || kind == sm_ord)
> >>              {
> >>                gassign *store;
> >> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> >> index 3a09bbc39e5..4cae82936b9 100644
> >> --- a/gcc/tree-ssa-loop-split.c
> >> +++ b/gcc/tree-ssa-loop-split.c
> >> @@ -577,14 +577,17 @@ split_loop (class loop *loop1)
> >>          if (!initial_true)
> >>            cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> >>
> >> +       edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
> >> +                          ? EDGE_SUCC (bbs[i], 0)
> >> +                          : EDGE_SUCC (bbs[i], 1);
> >>          /* Now version the loop, placing loop2 after loop1 connecting
> >>             them, and fix up SSA form for that.  */
> >>          initialize_original_copy_tables ();
> >>          basic_block cond_bb;
> >>
> >>          class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> >> -                                          profile_probability::always (),
> >> -                                          profile_probability::always (),
> >> +                                          true_edge->probability,
> >> +                                          true_edge->probability.invert (),
> >>                                             profile_probability::always (),
> >>                                             profile_probability::always (),
> >>                                             true);
> >> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
> >>     initialize_original_copy_tables ();
> >>
> >>     struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> >> -                                    profile_probability::always (),
> >> -                                    profile_probability::never (),
> >> +                                    invar_branch->probability,
> >> +                                    invar_branch->probability.invert (),
> >>                                       profile_probability::always (),
> >>                                       profile_probability::always (),
> >>                                       true);
> >> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
> >>     to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
> >>     to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
> >>
> >> +  to_loop1->probability = invar_branch->probability.invert ();
> >> +  to_loop2->probability = invar_branch->probability;
> >> +
> >>     /* Due to introduction of a control flow edge from loop1 latch to loop2
> >>        pre-header, we should update PHIs in loop2 to reflect this connection
> >>        between loop1 and loop2.  */
> >> --
> >> 2.27.0.90.geebb51ba8c
> >>
Xionghu Luo Sept. 9, 2021, 1:55 a.m. UTC | #5
On 2021/8/26 19:33, Richard Biener wrote:
> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> On 2021/8/6 20:15, Richard Biener wrote:
>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> This patch does this profile count check in both gimple LIM
>>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>> loop.
>>>>
>>>> Also, the profile count in loop split pass should be corrected to avoid
>>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
>>>> move statement out of loop unexpectely when lim2 didn't move it.  This
>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>>
>>> While I'm not familiar with the RTL invariant motion pass the patch there
>>> looks reasonable.  Note that we should assess the profile quality
>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>
>> Thanks.
>>
>>>
>>> For the GIMPLE part the patch looks quite complicated - but note it
>>> probably has to be since LIM performs kind of a "CSE" on loads
>>> (and stores for store-motion), so when there are multiple stmts
>>> affected by a hoisting decision the biggest block count has to be
>>> accounted.  Likewise when there are dependent stmts involved
>>> that might include conditional stmts (a "PHI"), but the overall
>>> cost should be looked at.
>>
>> Currently, The gimple code check two situations with the patch:
>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
>> of loop;
>> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
>> couldn't be moved out of loop, also don't move it out of loop to avoid definition
>> not dominates use error.
> 
> But part 2) is obviously already done.  What I tried to say is your heuristic
> doesn't integrate nicely with the pass but I admitted that it might be a bit
> difficult to find a place to add this heuristic.
> 
> There is lim_data->cost which we could bias negatively but then this is
> a cost that is independent on the hoisting distance.  But doing this would
> work at least for the case where the immediately enclosing loop preheader
> is hotter than the stmt and with this it would be a patch that's similarly
> simple as the RTL one.
> 
> Another possibility is to simply only adjust PHI processing in
> compute_invariantness, capping movement according to the hotness
> heuristic.  The same could be done for regular stmts there but I'm
> not sure that will do good in the end since this function is supposed
> to compute "correctness" (well, it also has the cost stuff), and it's
> not the place to do overall cost considerations.

Thanks.  I found that adding a function find_coldest_out_loop and
checking it in outermost_invariant_loop, to find the coldest invariant
loop between the outermost loop and the loop itself, also achieves this.
The earlier gimple check is then redundant and can be removed.
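
To make the intended behaviour concrete, here is a small hand-written
example (illustration only, not taken from the patch's testsuite; the
names are made up):

  /* The division is invariant in both loops, but the guarded block is
     cold relative to the outer loop, so the expectation is that it is
     hoisted only to the inner loop's preheader, not out of the outer
     loop.  */
  volatile int x;

  void
  example (int *a, int n, int m, int k)
  {
    for (int i = 0; i < m; i++)
      if (__builtin_expect (x, 0))      /* cold branch of the outer loop */
        for (int j = 0; j < n; j++)
          a[j] = k / 5;                 /* invariant in both loops */
  }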

> 
>> May be I could collect the number of instructions not hoisted with the patch
>> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
>> and "overall cost" need to be considered?  But it seems move_computations_worker
>> couldn't rollback if we still want to hoist multiple stmts out during the iterations?
>>
>>>
>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>> and it isn't set up to consider those multiple involved stmts.  Plus
>>> the store-motion part does not have any cost part (but it depends
>>> on previously decided invariant motions).
>>>
>>> I think the way you implemented the check will cause no hoisting
>>> to be performed instead of, say, hoisting to a different loop level
>>> only.  Possibly shown when you consider a loop nest like
>>>
>>>     for (;;)
>>>       if (unlikely_cond)
>>>         for (;;)
>>>            invariant;
>>>
>>> we want to hoist 'invariant' but only from the inner loop even if it
>>> is invariant also in the outer loop.
>>
>>
>> For this case, theorotically I think the master GCC will optimize it to:
>>
>>    invariant;
>>    for (;;)
>>      if (unlikely_cond)
>>        for (;;)
>>           ;
>>
>> 'invariant' is moved out of outer loop, but with the patch, it will get:
>>
>>    for (;;)
>>      if (unlikely_cond)
>>        {
>>          invariant;
>>          for (;;)
>>             ;
>>        }
>>
>> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
>> so hoist it out of inner loop, this is exactly what we want, right?
> 
> Yes.  I had doubts your patch would achieve that.
> 


The below updated patch could achieve it:


There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add find_coldest_out_loop to move invariants to
the expected target loop; if the profile count of the loop bb is colder
than the target loop's preheader, the statement won't be hoisted out of
the loop.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

Regression and bootstrap tested pass on P8LE, any comments?  Thanks.

gcc/ChangeLog:

	* loop-invariant.c (find_invariants_bb): Check profile count
	before motion.
	(find_invariants_body): Add argument.
	* tree-ssa-loop-im.c (find_coldest_out_loop): New function.
	(outermost_invariant_loop): Use find_coldest_out_loop.
	(determine_max_movement): Likewise.
	(move_computations_worker): Adjust and fix iteration update.
	(execute_sm): Likewise.
	(execute_sm_exit): Check pointer validness.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-16.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-17.c: New test.
---
  gcc/loop-invariant.c                       | 10 ++-
  gcc/tree-ssa-loop-im.c                     | 79 ++++++++++++++++++----
  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++
  5 files changed, 121 insertions(+), 16 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index fca0c2b24be..5c3be7bf0eb 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
     call.  */
  
  static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+		    bool always_executed)
  {
    rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;
  
    FOR_BB_INSNS (bb, insn)
      {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
    unsigned i;
  
    for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-			bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
  			bitmap_bit_p (always_executed, i));
  }
  
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index d9f75d5025e..f5ab6a734e7 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
    return ret;
  }
  
+/* Find coldest loop between outmost_loop and loop by comparing profile count.  */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block def_bb = NULL)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count)
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (loop_preheader_edge (min_loop)->src->count < min_count)
+	cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+
  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
     loop to that we could move the expression using DEF if it did not have
     other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop)
    struct lim_aux_data *lim_data;
  
    if (!def)
-    return superloop_at_depth (loop, 1);
+    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
  
    if (TREE_CODE (def) != SSA_NAME)
      {
        gcc_assert (is_gimple_min_invariant (def));
-      return superloop_at_depth (loop, 1);
+      return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
      }
  
    def_stmt = SSA_NAME_DEF_STMT (def);
    def_bb = gimple_bb (def_stmt);
    if (!def_bb)
-    return superloop_at_depth (loop, 1);
+    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, def_bb);
  
    max_loop = find_common_loop (loop, def_bb->loop_father);
  
@@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop)
  				 loop_outer (lim_data->max_loop));
    if (max_loop == loop)
      return NULL;
-  max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
+  max_loop = find_coldest_out_loop (max_loop, loop, def_bb);
+  if (!max_loop)
+    return NULL;
+  if (max_loop == loop)
+    return max_loop;
+  else
+    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
  
    return max_loop;
  }
@@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
    if (must_preserve_exec)
      level = ALWAYS_EXECUTED_IN (bb);
    else
-    level = superloop_at_depth (loop, 1);
+    level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb);
+
+  if (!level)
+    return false;
+
    lim_data->max_loop = level;
  
    if (gphi *phi = dyn_cast <gphi *> (stmt))
@@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
        if (ref
  	  && MEM_ANALYZABLE (ref))
  	{
-	  lim_data->max_loop = outermost_indep_loop (lim_data->max_loop,
-						     loop, ref);
+	  level = outermost_indep_loop (lim_data->max_loop, loop, ref);
+	  if (!level)
+	    return false;
+	  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
  	  if (!lim_data->max_loop)
  	    return false;
  	}
@@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb)
  	  continue;
  	}
  
+      edge e = loop_preheader_edge (level);
        if (dump_file && (dump_flags & TDF_DETAILS))
  	{
  	  fprintf (dump_file, "Moving PHI node\n");
@@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb)
  	  tree lhs = gimple_assign_lhs (new_stmt);
  	  SSA_NAME_RANGE_INFO (lhs) = NULL;
  	}
-      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
+      gsi_insert_on_edge (e, new_stmt);
        remove_phi_node (&bsi, false);
      }
  
    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
      {
        edge e;
-
        gimple *stmt = gsi_stmt (bsi);
  
        lim_data = get_lim_data (stmt);
@@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb)
        /* We do not really want to move conditionals out of the loop; we just
  	 placed it here to force its operands to be moved if necessary.  */
        if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
  
+      e = loop_preheader_edge (level);
        if (dump_file && (dump_flags & TDF_DETAILS))
  	{
  	  fprintf (dump_file, "Moving statement\n");
@@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb)
  		   cost, level->num);
  	}
  
-      e = loop_preheader_edge (level);
        gcc_assert (!gimple_vdef (stmt));
        if (gimple_vuse (stmt))
  	{
@@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
    bool multi_threaded_model_p = false;
    gimple_stmt_iterator gsi;
    sm_aux *aux = new sm_aux;
+  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
+
+  edge e = loop_preheader_edge (loop);
+  if (e->src->count > bb->count)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Don't execute store motion of ");
+	  print_generic_expr (dump_file, ref->mem.ref);
+	  fprintf (dump_file, " from loop %d\n", loop->num);
+	}
+      return;
+    }
  
    if (dump_file && (dump_flags & TDF_DETAILS))
      {
@@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
  	}
        else
  	{
-	  sm_aux *aux = *aux_map.get (ref);
+	  sm_aux **paux = aux_map.get (ref);
+	  sm_aux *aux;
+	  if (paux)
+	    aux = *paux;
+	  else
+	    continue;
  	  if (!aux->store_flag || kind == sm_ord)
  	    {
  	      gassign *store;
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
  	F[0] += E / d;
  }
  
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
new file mode 100644
index 00000000000..2303f3d5d86
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+assert_fail (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	assert_fail (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
new file mode 100644
index 00000000000..3b1c7c0cb3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+assert_fail (int, char *, char *);
+void
+foo (int *a, int n, int m, int k, int s)
+{
+  int i;
+  int j;
+
+  for (i = 0; i < m; i++)
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++)
+	  {
+	    assert_fail (k / 5, "one", "two");
+	  a[s] = k;
+	}
+      a[s] = s;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
Richard Biener Sept. 22, 2021, 9:14 a.m. UTC | #6
On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/8/26 19:33, Richard Biener wrote:
> > On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2021/8/6 20:15, Richard Biener wrote:
> >>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
> >>>>
> >>>> There was a patch trying to avoid move cold block out of loop:
> >>>>
> >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>>>
> >>>> Richard suggested to "never hoist anything from a bb with lower execution
> >>>> frequency to a bb with higher one in LIM invariantness_dom_walker
> >>>> before_dom_children".
> >>>>
> >>>> This patch does this profile count check in both gimple LIM
> >>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
> >>>> if the loop bb is colder than loop preheader, don't hoist it out of
> >>>> loop.
> >>>>
> >>>> Also, the profile count in loop split pass should be corrected to avoid
> >>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
> >>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
> >>>> move statement out of loop unexpectely when lim2 didn't move it.  This
> >>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
> >>>>
> >>>> SPEC2017 performance evaluation shows 1% performance improvement for
> >>>> intrate GEOMEAN and no obvious regression for others.  Especially,
> >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >>>> on P8LE.
> >>>>
> >>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
> >>>
> >>> While I'm not familiar with the RTL invariant motion pass the patch there
> >>> looks reasonable.  Note that we should assess the profile quality
> >>> somehow - I'm not sure how to do that, CCed Honza for that.
> >>
> >> Thanks.
> >>
> >>>
> >>> For the GIMPLE part the patch looks quite complicated - but note it
> >>> probably has to be since LIM performs kind of a "CSE" on loads
> >>> (and stores for store-motion), so when there are multiple stmts
> >>> affected by a hoisting decision the biggest block count has to be
> >>> accounted.  Likewise when there are dependent stmts involved
> >>> that might include conditional stmts (a "PHI"), but the overall
> >>> cost should be looked at.
> >>
> >> Currently, The gimple code check two situations with the patch:
> >> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
> >> of loop;
> >> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
> >> couldn't be moved out of loop, also don't move it out of loop to avoid definition
> >> not dominates use error.
> >
> > But part 2) is obviously already done.  What I tried to say is your heuristic
> > doesn't integrate nicely with the pass but I admitted that it might be a bit
> > difficult to find a place to add this heuristic.
> >
> > There is lim_data->cost which we could bias negatively but then this is
> > a cost that is independent on the hoisting distance.  But doing this would
> > work at least for the case where the immediately enclosing loop preheader
> > is hotter than the stmt and with this it would be a patch that's similarly
> > simple as the RTL one.
> >
> > Another possibility is to simply only adjust PHI processing in
> > compute_invariantness, capping movement according to the hotness
> > heuristic.  The same could be done for regular stmts there but I'm
> > not sure that will do good in the end since this function is supposed
> > to compute "correctness" (well, it also has the cost stuff), and it's
> > not the place to do overall cost considerations.
>
> Thanks.  I found that adding a function find_coldest_out_loop and check it in
> outermost_invariant_loop to find the coldest invariant loop between outermost
> loop and itself could also reach the purpose.  Then the gimple code check is
> redundant and could be removed.
>
> >
> >> May be I could collect the number of instructions not hoisted with the patch
> >> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
> >> and "overall cost" need to be considered?  But it seems move_computations_worker
> >> couldn't rollback if we still want to hoist multiple stmts out during the iterations?
> >>
> >>>
> >>> Now - GIMPLE LIM "costing" is somewhat backward right now
> >>> and it isn't set up to consider those multiple involved stmts.  Plus
> >>> the store-motion part does not have any cost part (but it depends
> >>> on previously decided invariant motions).
> >>>
> >>> I think the way you implemented the check will cause no hoisting
> >>> to be performed instead of, say, hoisting to a different loop level
> >>> only.  Possibly shown when you consider a loop nest like
> >>>
> >>>     for (;;)
> >>>       if (unlikely_cond)
> >>>         for (;;)
> >>>            invariant;
> >>>
> >>> we want to hoist 'invariant' but only from the inner loop even if it
> >>> is invariant also in the outer loop.
> >>
> >>
> >> For this case, theorotically I think the master GCC will optimize it to:
> >>
> >>    invariant;
> >>    for (;;)
> >>      if (unlikely_cond)
> >>        for (;;)
> >>           ;
> >>
> >> 'invariant' is moved out of outer loop, but with the patch, it will get:
> >>
> >>    for (;;)
> >>      if (unlikely_cond)
> >>        {
> >>          invariant;
> >>          for (;;)
> >>             ;
> >>        }
> >>
> >> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
> >> so hoist it out of inner loop, this is exactly what we want, right?
> >
> > Yes.  I had doubts your patch would achieve that.
> >
>
>
> The below updated patch could achieve it:
>
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
> expected target loop, then  if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.

Can you split the RTL and GIMPLE changes and measure them separately
please?

> gcc/ChangeLog:
>
>         * loop-invariant.c (find_invariants_bb): Check profile count
>         before motion.
>         (find_invariants_body): Add argument.
>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>         (outermost_invariant_loop): Use find_coldest_out_loop.
>         (determine_max_movement): Likewise.
>         (move_computations_worker): Adjust and fix iteration udpate.
>         (execute_sm): Likewise.
>         (execute_sm_exit): Check pointer validness.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-lim-16.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-17.c: New test.
> ---
>   gcc/loop-invariant.c                       | 10 ++-
>   gcc/tree-ssa-loop-im.c                     | 79 ++++++++++++++++++----
>   gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++
>   5 files changed, 121 insertions(+), 16 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index fca0c2b24be..5c3be7bf0eb 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>      call.  */
>
>   static void
> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> +                   bool always_executed)
>   {
>     rtx_insn *insn;
> +  basic_block preheader = loop_preheader_edge (loop)->src;
> +
> +  if (preheader->count > bb->count)
> +    return;
>
>     FOR_BB_INSNS (bb, insn)
>       {
> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>     unsigned i;
>
>     for (i = 0; i < loop->num_nodes; i++)
> -    find_invariants_bb (body[i],
> -                       bitmap_bit_p (always_reached, i),
> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>                         bitmap_bit_p (always_executed, i));
>   }
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index d9f75d5025e..f5ab6a734e7 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>     return ret;
>   }
>
> +/* Find coldest loop between outmost_loop and loop by comparing profile count.  */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block def_bb = NULL)
> +{
> +  class loop *cold_loop, *min_loop;
> +  cold_loop = min_loop = outmost_loop;
> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> +
> +  if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count)
> +    return NULL;
> +
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> +       cold_loop = min_loop;
> +    }
> +  return cold_loop;
> +}
> +
>   /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>      loop to that we could move the expression using DEF if it did not have
>      other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop)
>     struct lim_aux_data *lim_data;
>
>     if (!def)
> -    return superloop_at_depth (loop, 1);
> +    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
>
>     if (TREE_CODE (def) != SSA_NAME)
>       {
>         gcc_assert (is_gimple_min_invariant (def));
> -      return superloop_at_depth (loop, 1);
> +      return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
>       }
>
>     def_stmt = SSA_NAME_DEF_STMT (def);
>     def_bb = gimple_bb (def_stmt);
>     if (!def_bb)
> -    return superloop_at_depth (loop, 1);
> +    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, def_bb);
>
>     max_loop = find_common_loop (loop, def_bb->loop_father);
>
> @@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop)
>                                  loop_outer (lim_data->max_loop));
>     if (max_loop == loop)
>       return NULL;
> -  max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
> +  max_loop = find_coldest_out_loop (max_loop, loop, def_bb);
> +  if (!max_loop)
> +    return NULL;
> +  if (max_loop == loop)
> +    return max_loop;
> +  else
> +    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
>
>     return max_loop;
>   }

As said 'outermost_invariant_loop' is the "correctness" part and I
don't like changing
it this way.  Instead determine_max_movement is what should be adjusted ...

> @@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>     if (must_preserve_exec)
>       level = ALWAYS_EXECUTED_IN (bb);
>     else
> -    level = superloop_at_depth (loop, 1);
> +    level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb);

... which you do here (but you should apply that also to the must_preserve_exec
result).
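
A minimal sketch of that shape (editorial illustration, not the final
patch) would cap whichever level was selected, including the
ALWAYS_EXECUTED_IN result:

  /* Sketch only, inside determine_max_movement: run the hotness
     capping on both branches.  */
  if (must_preserve_exec)
    level = ALWAYS_EXECUTED_IN (bb);
  else
    level = superloop_at_depth (loop, 1);
  level = find_coldest_out_loop (level, loop, bb);
  if (!level)
    return false;
  lim_data->max_loop = level;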

> +
> +  if (!level)
> +    return false;
> +
>     lim_data->max_loop = level;
>
>     if (gphi *phi = dyn_cast <gphi *> (stmt))
> @@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>         if (ref
>           && MEM_ANALYZABLE (ref))
>         {
> -         lim_data->max_loop = outermost_indep_loop (lim_data->max_loop,
> -                                                    loop, ref);
> +         level = outermost_indep_loop (lim_data->max_loop, loop, ref);
> +         if (!level)
> +           return false;
> +         lim_data->max_loop = find_coldest_out_loop (level, loop, bb);

... why again here?  outermost_indep_loop honors the passed max_loop.

>           if (!lim_data->max_loop)
>             return false;
>         }
> @@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb)
>           continue;
>         }
>
> +      edge e = loop_preheader_edge (level);

unnecessary change

>         if (dump_file && (dump_flags & TDF_DETAILS))
>         {
>           fprintf (dump_file, "Moving PHI node\n");
> @@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb)
>           tree lhs = gimple_assign_lhs (new_stmt);
>           SSA_NAME_RANGE_INFO (lhs) = NULL;
>         }
> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> +      gsi_insert_on_edge (e, new_stmt);
>         remove_phi_node (&bsi, false);
>       }
>
>     for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>       {
>         edge e;
> -
>         gimple *stmt = gsi_stmt (bsi);
>
>         lim_data = get_lim_data (stmt);
> @@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb)
>         /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>         if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }

looks like an omission - do you now run into this?

>
> +      e = loop_preheader_edge (level);

unnecessary change

>         if (dump_file && (dump_flags & TDF_DETAILS))
>         {
>           fprintf (dump_file, "Moving statement\n");
> @@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb)
>                    cost, level->num);
>         }
>
> -      e = loop_preheader_edge (level);
>         gcc_assert (!gimple_vdef (stmt));
>         if (gimple_vuse (stmt))
>         {
> @@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>     bool multi_threaded_model_p = false;
>     gimple_stmt_iterator gsi;
>     sm_aux *aux = new sm_aux;
> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> +
> +  edge e = loop_preheader_edge (loop);
> +  if (e->src->count > bb->count)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "Don't execute store motion of ");
> +         print_generic_expr (dump_file, ref->mem.ref);
> +         fprintf (dump_file, " from loop %d\n", loop->num);
> +       }
> +      return;
> +    }

why do you need this?  I think you instead want to adjust 'can_sm_ref_p'
where you want to use sth like

  for_all_locs_in_loop (loop, ref, ...)

and ... being a lambda or function that checks loc->stmt and if at least one
reference is executed in a hot part of 'loop' then we should apply store-motion.
Do this last because it looks somewhat expensive.
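
For illustration, such a check might be shaped roughly like this (a
sketch only; the exact for_all_locs_in_loop interface and the
mem_ref_loc fields are assumed from context, and the functor name just
follows the later v3 ChangeLog):

  /* Sketch: keep REF as a store-motion candidate only if at least one
     of its locations in LOOP is not colder than the loop we would
     hoist to.  */
  class ref_in_loop_hot_body
  {
  public:
    ref_in_loop_hot_body (class loop *loop_) : l (loop_) {}
    bool operator () (mem_ref_loc *loc)
    {
      basic_block curr_bb = gimple_bb (loc->stmt);
      return find_coldest_out_loop (l, curr_bb->loop_father, curr_bb) != NULL;
    }
    class loop *l;
  };

  /* ... and in can_sm_ref_p:  */
  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
    return false;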

>
>     if (dump_file && (dump_flags & TDF_DETAILS))
>       {
> @@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>         }
>         else
>         {
> -         sm_aux *aux = *aux_map.get (ref);
> +         sm_aux **paux = aux_map.get (ref);
> +         sm_aux *aux;
> +         if (paux)
> +           aux = *paux;
> +         else
> +           continue;
>           if (!aux->store_flag || kind == sm_ord)
>             {
>               gassign *store;
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>   }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
> new file mode 100644
> index 00000000000..2303f3d5d86
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +assert_fail (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       assert_fail (k / 5, "one", "two");

I don't think these are very good testcases since 'assert' is usually
noreturn which would place the whole block outside of the loop (it's
a loop exit then).  But naming the function 'foo' would make it less
obviously pointless.
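
One possible adjustment along those lines (an untested sketch, shown
only to illustrate the comment) keeps the cold call but gives the
callee a neutral name, so the guarded block clearly stays a member of
the loop rather than looking like a loop exit:

  /* Same shape as ssa-lim-16.c, with the callee renamed so it is not
     mistaken for a noreturn assertion helper.  */
  volatile int x;
  void rare_update (int, char *, char *);

  void
  foo (int *a, int n, int k)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        if (__builtin_expect (x, 0))
          rare_update (k / 5, "one", "two");
        a[i] = k;
      }
  }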

> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
> new file mode 100644
> index 00000000000..3b1c7c0cb3e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +assert_fail (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int k, int s)
> +{
> +  int i;
> +  int j;
> +
> +  for (i = 0; i < m; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++)
> +         {
> +           assert_fail (k / 5, "one", "two");
> +         a[s] = k;
> +       }
> +      a[s] = s;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> --
> 2.27.0.90.geebb51ba8c
>
>
Xionghu Luo Sept. 23, 2021, 2:13 a.m. UTC | #7
On 2021/9/22 17:14, Richard Biener wrote:
> On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/8/26 19:33, Richard Biener wrote:
>>> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2021/8/6 20:15, Richard Biener wrote:
>>>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>>>>>>
>>>>>> There was a patch trying to avoid move cold block out of loop:
>>>>>>
>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>>>
>>>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>>>> before_dom_children".
>>>>>>
>>>>>> This patch does this profile count check in both gimple LIM
>>>>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
>>>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>>>> loop.
>>>>>>
>>>>>> Also, the profile count in loop split pass should be corrected to avoid
>>>>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
>>>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
>>>>>> move statement out of loop unexpectely when lim2 didn't move it.  This
>>>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>>>
>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>>>> on P8LE.
>>>>>>
>>>>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>>>>
>>>>> While I'm not familiar with the RTL invariant motion pass the patch there
>>>>> looks reasonable.  Note that we should assess the profile quality
>>>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> For the GIMPLE part the patch looks quite complicated - but note it
>>>>> probably has to be since LIM performs kind of a "CSE" on loads
>>>>> (and stores for store-motion), so when there are multiple stmts
>>>>> affected by a hoisting decision the biggest block count has to be
>>>>> accounted.  Likewise when there are dependent stmts involved
>>>>> that might include conditional stmts (a "PHI"), but the overall
>>>>> cost should be looked at.
>>>>
>>>> Currently, The gimple code check two situations with the patch:
>>>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
>>>> of loop;
>>>> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
>>>> couldn't be moved out of loop, also don't move it out of loop to avoid definition
>>>> not dominates use error.
>>>
>>> But part 2) is obviously already done.  What I tried to say is your heuristic
>>> doesn't integrate nicely with the pass but I admitted that it might be a bit
>>> difficult to find a place to add this heuristic.
>>>
>>> There is lim_data->cost which we could bias negatively but then this is
>>> a cost that is independent on the hoisting distance.  But doing this would
>>> work at least for the case where the immediately enclosing loop preheader
>>> is hotter than the stmt and with this it would be a patch that's similarly
>>> simple as the RTL one.
>>>
>>> Another possibility is to simply only adjust PHI processing in
>>> compute_invariantness, capping movement according to the hotness
>>> heuristic.  The same could be done for regular stmts there but I'm
>>> not sure that will do good in the end since this function is supposed
>>> to compute "correctness" (well, it also has the cost stuff), and it's
>>> not the place to do overall cost considerations.
>>
>> Thanks.  I found that adding a function find_coldest_out_loop and check it in
>> outermost_invariant_loop to find the coldest invariant loop between outermost
>> loop and itself could also reach the purpose.  Then the gimple code check is
>> redundant and could be removed.
>>
>>>
>>>> May be I could collect the number of instructions not hoisted with the patch
>>>> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
>>>> and "overall cost" need to be considered?  But it seems move_computations_worker
>>>> couldn't rollback if we still want to hoist multiple stmts out during the iterations?
>>>>
>>>>>
>>>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>>>> and it isn't set up to consider those multiple involved stmts.  Plus
>>>>> the store-motion part does not have any cost part (but it depends
>>>>> on previously decided invariant motions).
>>>>>
>>>>> I think the way you implemented the check will cause no hoisting
>>>>> to be performed instead of, say, hoisting to a different loop level
>>>>> only.  Possibly shown when you consider a loop nest like
>>>>>
>>>>>      for (;;)
>>>>>        if (unlikely_cond)
>>>>>          for (;;)
>>>>>             invariant;
>>>>>
>>>>> we want to hoist 'invariant' but only from the inner loop even if it
>>>>> is invariant also in the outer loop.
>>>>
>>>>
>>>> For this case, theorotically I think the master GCC will optimize it to:
>>>>
>>>>     invariant;
>>>>     for (;;)
>>>>       if (unlikely_cond)
>>>>         for (;;)
>>>>            ;
>>>>
>>>> 'invariant' is moved out of outer loop, but with the patch, it will get:
>>>>
>>>>     for (;;)
>>>>       if (unlikely_cond)
>>>>         {
>>>>           invariant;
>>>>           for (;;)
>>>>              ;
>>>>         }
>>>>
>>>> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
>>>> so hoist it out of inner loop, this is exactly what we want, right?
>>>
>>> Yes.  I had doubts your patch would achieve that.
>>>
>>
>>
>> The below updated patch could achieve it:
>>
>>
>> There was a patch trying to avoid move cold block out of loop:
>>
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>
>> Richard suggested to "never hoist anything from a bb with lower execution
>> frequency to a bb with higher one in LIM invariantness_dom_walker
>> before_dom_children".
>>
>> In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
>> expected target loop, then  if profile count of the loop bb is colder
>> than target loop preheader, it won't be hoisted out of loop.
>>
>> SPEC2017 performance evaluation shows 1% performance improvement for
>> intrate GEOMEAN and no obvious regression for others.  Especially,
>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>> on P8LE.
>>
>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
> 
> Can you split the RTL and GIMPLE changes and measure them separately
> please?

I did that before and got the data below; it differs slightly because it
uses ratios instead of seconds.  500.perlbench_r obviously benefits from
the RTL part of the change, while the GIMPLE part only improves exchange2
and blender, with a regression on nab which requires the loop split fix:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html

The reason is that lim2 doesn't hoist the code out of the loop, but loop
split generates a duplicated loop with an incorrect profile count on its
preheader and header bbs, so lim4 later hoists the code out of the loop
to an unexpected place; the same loop behaves differently in lim2 and
lim4.  With that patch, the regression is gone.
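
For reference, the loop-split adjustment from v1 that addresses this
(re-sketched from that diff, see the link above) passes the invariant
branch's measured probabilities to loop_version instead of always/never:

  /* From the v1 hunk for do_split_loop_on_cond: versioning with the
     real branch probability keeps the new preheader from ending up
     with a zero profile count.  */
  struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
                                     invar_branch->probability,
                                     invar_branch->probability.invert (),
                                     profile_probability::always (),
                                     profile_probability::always (),
                                     true);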


               Gimple+RTL |  Gimple lim | RTL loop-invariant
500.perlbench_r     8.03%    0.67%    7.69%
502.gcc_r           0.56%    0.37%    0.19%
505.mcf_r           0.19%   -0.19%    0.39%
520.omnetpp_r       0.83%    0.83%    0.83%
523.xalancbmk_r    -0.78%    0.00%   -1.04%
525.x264_r          0.17%    0.00%    0.00%
531.deepsjeng_r     0.00%    0.31%    0.00%
541.leela_r         0.00%   -0.31%    0.31%
548.exchange2_r     2.08%    1.85%    0.23%
557.xz_r            0.97%    0.00%    0.65%
503.bwaves_r       -0.12%    0.00%   -0.23%
507.cactuBSSN_r     0.00%    0.14%    0.00%
508.namd_r          0.00%    0.00%    0.00%
510.parest_r       -0.16%   -0.65%    0.00%
511.povray_r        0.30%    0.91%    0.91%
519.lbm_r           0.15%    0.00%    0.00%
521.wrf_r           0.00%    0.00%   -0.80%
526.blender_r       1.84%    0.26%    0.52%
527.cam4_r          0.28%    0.00%    0.00%
538.imagick_r       0.20%    0.00%    0.00%
544.nab_r          -1.55%   -0.78%    0.00%
549.fotonik3d_r    -0.25%    0.00%    0.00%
554.roms_r         -0.84%    0.00%   -0.63%
INT GEOMEAN         1.16%    0.35%    0.90%
FLOAT GEOMEAN      -0.01%   -0.01%   -0.02%
GEOMEAN             0.50%    0.15%    0.38%


Will address the other comments in a later reply.  Thanks.
Xionghu Luo Sept. 23, 2021, 2:16 a.m. UTC | #8
On 2021/9/23 10:13, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/9/22 17:14, Richard Biener wrote:
>> On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2021/8/26 19:33, Richard Biener wrote:
>>>> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> 
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2021/8/6 20:15, Richard Biener wrote:
>>>>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>> There was a patch trying to avoid move cold block out of loop:
>>>>>>>
>>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>>>>
>>>>>>> Richard suggested to "never hoist anything from a bb with lower 
>>>>>>> execution
>>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>>>>> before_dom_children".
>>>>>>>
>>>>>>> This patch does this profile count check in both gimple LIM
>>>>>>> move_computations_worker and RTL loop-invariant.c 
>>>>>>> find_invariants_bb,
>>>>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>>>>> loop.
>>>>>>>
>>>>>>> Also, the profile count in loop split pass should be corrected to 
>>>>>>> avoid
>>>>>>> lim2 and lim4 mismatch behavior, currently, the new loop 
>>>>>>> preheader generated
>>>>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt 
>>>>>>> pass will
>>>>>>> move statement out of loop unexpectely when lim2 didn't move it.  
>>>>>>> This
>>>>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>>>>
>>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>>>>> on P8LE.
>>>>>>>
>>>>>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>>>>>
>>>>>> While I'm not familiar with the RTL invariant motion pass the 
>>>>>> patch there
>>>>>> looks reasonable.  Note that we should assess the profile quality
>>>>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> For the GIMPLE part the patch looks quite complicated - but note it
>>>>>> probably has to be since LIM performs kind of a "CSE" on loads
>>>>>> (and stores for store-motion), so when there are multiple stmts
>>>>>> affected by a hoisting decision the biggest block count has to be
>>>>>> accounted.  Likewise when there are dependent stmts involved
>>>>>> that might include conditional stmts (a "PHI"), but the overall
>>>>>> cost should be looked at.
>>>>>
>>>>> Currently, The gimple code check two situations with the patch:
>>>>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move 
>>>>> it out
>>>>> of loop;
>>>>> 2) The statement or PHI's BB is *hotter* then preheader, but any of 
>>>>> it's rhs
>>>>> couldn't be moved out of loop, also don't move it out of loop to 
>>>>> avoid definition
>>>>> not dominates use error.
>>>>
>>>> But part 2) is obviously already done.  What I tried to say is your 
>>>> heuristic
>>>> doesn't integrate nicely with the pass but I admitted that it might 
>>>> be a bit
>>>> difficult to find a place to add this heuristic.
>>>>
>>>> There is lim_data->cost which we could bias negatively but then this is
>>>> a cost that is independent on the hoisting distance.  But doing this 
>>>> would
>>>> work at least for the case where the immediately enclosing loop 
>>>> preheader
>>>> is hotter than the stmt and with this it would be a patch that's 
>>>> similarly
>>>> simple as the RTL one.
>>>>
>>>> Another possibility is to simply only adjust PHI processing in
>>>> compute_invariantness, capping movement according to the hotness
>>>> heuristic.  The same could be done for regular stmts there but I'm
>>>> not sure that will do good in the end since this function is supposed
>>>> to compute "correctness" (well, it also has the cost stuff), and it's
>>>> not the place to do overall cost considerations.
>>>
>>> Thanks.  I found that adding a function find_coldest_out_loop and 
>>> check it in
>>> outermost_invariant_loop to find the coldest invariant loop between 
>>> outermost
>>> loop and itself could also reach the purpose.  Then the gimple code 
>>> check is
>>> redundant and could be removed.
>>>
>>>>
>>>>> May be I could collect the number of instructions not hoisted with 
>>>>> the patch
>>>>> on regression tests and SPEC2017 to do a estimation for "multiple 
>>>>> stmts affected"
>>>>> and "overall cost" need to be considered?  But it seems 
>>>>> move_computations_worker
>>>>> couldn't rollback if we still want to hoist multiple stmts out 
>>>>> during the iterations?
>>>>>
>>>>>>
>>>>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>>>>> and it isn't set up to consider those multiple involved stmts.  Plus
>>>>>> the store-motion part does not have any cost part (but it depends
>>>>>> on previously decided invariant motions).
>>>>>>
>>>>>> I think the way you implemented the check will cause no hoisting
>>>>>> to be performed instead of, say, hoisting to a different loop level
>>>>>> only.  Possibly shown when you consider a loop nest like
>>>>>>
>>>>>>      for (;;)
>>>>>>        if (unlikely_cond)
>>>>>>          for (;;)
>>>>>>             invariant;
>>>>>>
>>>>>> we want to hoist 'invariant' but only from the inner loop even if it
>>>>>> is invariant also in the outer loop.
>>>>>
>>>>>
>>>>> For this case, theorotically I think the master GCC will optimize 
>>>>> it to:
>>>>>
>>>>>     invariant;
>>>>>     for (;;)
>>>>>       if (unlikely_cond)
>>>>>         for (;;)
>>>>>            ;
>>>>>
>>>>> 'invariant' is moved out of outer loop, but with the patch, it will 
>>>>> get:
>>>>>
>>>>>     for (;;)
>>>>>       if (unlikely_cond)
>>>>>         {
>>>>>           invariant;
>>>>>           for (;;)
>>>>>              ;
>>>>>         }
>>>>>
>>>>> 'invariant' is *cold* for outer loop, but it is still *hot* for 
>>>>> inner loop,
>>>>> so hoist it out of inner loop, this is exactly what we want, right?
>>>>
>>>> Yes.  I had doubts your patch would achieve that.
>>>>
>>>
>>>
>>> The below updated patch could achieve it:
>>>
>>>
>>> There was a patch trying to avoid move cold block out of loop:
>>>
>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>
>>> Richard suggested to "never hoist anything from a bb with lower 
>>> execution
>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>> before_dom_children".
>>>
>>> In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
>>> expected target loop, then  if profile count of the loop bb is colder
>>> than target loop preheader, it won't be hoisted out of loop.
>>>
>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>> on P8LE.
>>>
>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>
>> Can you split the RTL and GIMPLE changes and measure them separately
>> please?
> 
> I did that before and got below data, it is slightly different due to
> using ratio instead of seconds, 500.perlbench_r obviously benefits
> from the RTL part change, while gimple part only improves exchange2
> and blender, with a regression on nab which requires the fix of loop
> split,
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html
> 
> Reason is lim2 doesn't hoist code out of loop, but loop split generates
> duplicated loop with incorrect profile count on preheader and header bb,
> then later lim4 hoists code out of loop to unexpected place.  Same loop
> shows mismatch behavior in lim2 and lim4.  With that patch, the regression
> is gone.
> 
> 
>                Gimple+RTL |  Gimple lim | RTL loop-invariant
> 500.perlbench_r     8.03%    0.67%    7.69%
> 502.gcc_r           0.56%    0.37%    0.19%
> 505.mcf_r           0.19%   -0.19%    0.39%
> 520.omnetpp_r       0.83%    0.83%    0.83%
> 523.xalancbmk_r    -0.78%    0.00%   -1.04%
> 525.x264_r          0.17%    0.00%    0.00%
> 531.deepsjeng_r     0.00%    0.31%    0.00%
> 541.leela_r         0.00%   -0.31%    0.31%
> 548.exchange2_r     2.08%    1.85%    0.23%
> 557.xz_r            0.97%    0.00%    0.65%
> 503.bwaves_r       -0.12%    0.00%   -0.23%
> 507.cactuBSSN_r     0.00%    0.14%    0.00%
> 508.namd_r          0.00%    0.00%    0.00%
> 510.parest_r       -0.16%   -0.65%    0.00%
> 511.povray_r        0.30%    0.91%    0.91%
> 519.lbm_r           0.15%    0.00%    0.00%
> 521.wrf_r           0.00%    0.00%   -0.80%
> 526.blender_r       1.84%    0.26%    0.52%
> 527.cam4_r          0.28%    0.00%    0.00%
> 538.imagick_r       0.20%    0.00%    0.00%
> 544.nab_r          -1.55%   -0.78%    0.00%
> 549.fotonik3d_r    -0.25%    0.00%    0.00%
> 554.roms_r         -0.84%    0.00%   -0.63%
> INT GEOMEAN         1.16%    0.35%    0.90%
> FLOAT GEOMEAN      -0.01%   -0.01%   -0.02%
> GEOMEAN             0.50%    0.15%    0.38%


BTW, feedback from another platform:


      I do see ~8% performance improvement for 500.perlbench on aarch64.


> 
> 
> Will address other comments in later reply.  Thanks.
> 
>
Xionghu Luo Sept. 24, 2021, 6:29 a.m. UTC | #9
Updated the patch to v3; I am not sure whether you prefer the pasted
style and continuing to link the previous thread, as Segher dislikes this...


[PATCH v3] Don't move cold code out of loop by checking bb count


Changes:
1. Handle max_loop in determine_max_movement instead of
outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
infinite loop when implementing v1 and the iteration is missed to be
updated actually.
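
For context, the hunk item 4 refers to, re-sketched from the v2 diff of
move_computations_worker (no new change here):

  /* We do not really want to move conditionals out of the loop; we just
     placed it here to force its operands to be moved if necessary.  */
  if (gimple_code (stmt) == GIMPLE_COND)
    {
      /* Advance the iterator before continuing, otherwise the walk over
         this block keeps re-visiting the same statement.  */
      gsi_next (&bsi);
      continue;
    }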

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add find_coldest_out_loop to move invariants to
the expected target loop; if the profile count of the loop bb is colder
than the target loop's preheader, the statement won't be hoisted out of
the loop.  Likewise for store motion: if all locations of the REF in the
loop are cold, don't do store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* loop-invariant.c (find_invariants_bb): Check profile count
	before motion.
	(find_invariants_body): Add argument.
	* tree-ssa-loop-im.c (find_coldest_out_loop): New function.
	(determine_max_movement): Use find_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration update.
	(execute_sm_exit): Check pointer validness.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
 gcc/loop-invariant.c                       | 10 ++--
 gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
 7 files changed, 165 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index fca0c2b24be..5c3be7bf0eb 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
    call.  */
 
 static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+		    bool always_executed)
 {
   rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;
 
   FOR_BB_INSNS (bb, insn)
     {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
   unsigned i;
 
   for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-			bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
 			bitmap_bit_p (always_executed, i));
 }
 
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 4b187c2cdaf..655fab03442 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block curr_bb)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (loop_preheader_edge (min_loop)->src->count < min_count)
+	cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
     level = superloop_at_depth (loop, 1);
-  lim_data->max_loop = level;
+  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
+  if (!lim_data->max_loop)
+    return false;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
@@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
   for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
     {
       edge e;
-
       gimple *stmt = gsi_stmt (bsi);
 
       lim_data = get_lim_data (stmt);
@@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
 	}
       else
 	{
-	  sm_aux *aux = *aux_map.get (ref);
+	  sm_aux **paux = aux_map.get (ref);
+	  sm_aux *aux;
+	  if (paux)
+	    aux = *paux;
+	  else
+	    continue;
 	  if (!aux->store_flag || kind == sm_ord)
 	    {
 	      gassign *store;
@@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
   return indep_p;
 }
 
+class ref_in_loop_hot_body
+{
+public:
+  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
+  bool operator () (mem_ref_loc *loc);
+  class loop *l;
+};
+
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  edge e = loop_preheader_edge (l);
+  if (e->src->count > curr_bb->count)
+    return false;
+  else
+    return true;
+}
+
 
 /* Returns true if we can perform store motion of REF from LOOP.  */
 
@@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!ref_indep_loop_p (loop, ref, sm_war))
     return false;
 
+  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+    return false;
+
   return true;
 }
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
new file mode 100644
index 00000000000..7326a230b3f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	bar (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..f0a99fa42b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int k, int s)
+{
+  int i;
+  int j;
+
+  for (i = 0; i < m; i++)
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++)
+	  {
+	    bar (k / 5, "one", "two");
+	  a[s] = k;
+	}
+      a[s] = s;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..bc60a040a70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,25 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of loop when bb is cold.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+    if (__builtin_expect (x, 0))
+      count++;
+
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
new file mode 100644
index 00000000000..fedaa3b7119
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
@@ -0,0 +1,28 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is hoisted out of loop when one of it's used bb is hot. */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m, int n)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+  {
+    if (__builtin_expect (x, 0))
+      count++;
+    count += n;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion of" 1 "lim2"  }  } */
+
Richard Biener Sept. 28, 2021, 12:09 p.m. UTC | #10
On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> Update the patch to v3, not sure whether you prefer the paste style
> and continue to link the previous thread as Segher dislikes this...
>
>
> [PATCH v3] Don't move cold code out of loop by checking bb count
>
>
> Changes:
> 1. Handle max_loop in determine_max_movement instead of
> outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
>
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> gcc/ChangeLog:
>
>         * loop-invariant.c (find_invariants_bb): Check profile count
>         before motion.
>         (find_invariants_body): Add argument.
>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>         (determine_max_movement): Use find_coldest_out_loop.
>         (move_computations_worker): Adjust and fix iteration udpate.
>         (execute_sm_exit): Check pointer validness.
>         (class ref_in_loop_hot_body): New functor.
>         (ref_in_loop_hot_body::operator): New.
>         (can_sm_ref_p): Use for_all_locs_in_loop.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> ---
>  gcc/loop-invariant.c                       | 10 ++--
>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>  7 files changed, 165 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index fca0c2b24be..5c3be7bf0eb 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>     call.  */
>
>  static void
> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> +                   bool always_executed)
>  {
>    rtx_insn *insn;
> +  basic_block preheader = loop_preheader_edge (loop)->src;
> +
> +  if (preheader->count > bb->count)
> +    return;
>
>    FOR_BB_INSNS (bb, insn)
>      {
> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>    unsigned i;
>
>    for (i = 0; i < loop->num_nodes; i++)
> -    find_invariants_bb (body[i],
> -                       bitmap_bit_p (always_reached, i),
> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>                         bitmap_bit_p (always_executed, i));
>  }
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..655fab03442 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>
> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block curr_bb)
> +{
> +  class loop *cold_loop, *min_loop;
> +  cold_loop = min_loop = outmost_loop;
> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> +
> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)

Honza - can you comment on whether we should compare BB counts this way?

I would suspect that for, say,

  for (...)
     if (a)
       X;
     else
       Y;

that the counts for X and Y will be less than that of the preheader of the loop
only when the loop is estimated to run once.  That is, should we really compare
to the preheader count, or maybe better to the _header_ count, which
would keep the number of iterations out of the equation?

If we look at maybe_hot_count_p, that's a quite sophisticated way to
compare a count to the "IPA hot" threshold; here we're comparing two
counts within a function, where it actually matters whether we use a<b
or !(a>=b), since 'unordered' is mapped to false (but there's no
ordered_p function).
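
To make the unordered case concrete, here is a minimal sketch (not part of
the patch; it assumes the profile_count helpers declared in profile-count.h):

  /* With the three-state overloads every relational operator returns false
     when a count is uninitialized, so the two spellings disagree exactly in
     that case.  */
  profile_count a = profile_count::uninitialized ();
  profile_count b = profile_count::from_gcov_type (100);

  bool colder = a < b;          /* false: 'unordered' is mapped to false */
  bool not_hotter = !(a >= b);  /* true: the negation flips the unordered case */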

Xionghu, you err on the side of not hoisting for unordered counts here

> +    return NULL;
> +
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (loop_preheader_edge (min_loop)->src->count < min_count)

but in the other direction here and on the side of not hoisting
in ref_in_loop_hot_body.

The three-state relational operator overloads are probably not the
very best idea...
(see profile-count.h for them)

> +       cold_loop = min_loop;
> +    }
> +  return cold_loop;
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
>      level = superloop_at_depth (loop, 1);
> -  lim_data->max_loop = level;
> +  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
> +  if (!lim_data->max_loop)
> +    return false;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>      {
> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>      {
>        edge e;
> -
>        gimple *stmt = gsi_stmt (bsi);
>
>        lim_data = get_lim_data (stmt);
> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>         }
>        else
>         {
> -         sm_aux *aux = *aux_map.get (ref);
> +         sm_aux **paux = aux_map.get (ref);
> +         sm_aux *aux;
> +         if (paux)
> +           aux = *paux;
> +         else
> +           continue;

do you really need this?  I doubt so.

>           if (!aux->store_flag || kind == sm_ord)
>             {
>               gassign *store;
> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>    return indep_p;
>  }
>
> +class ref_in_loop_hot_body
> +{
> +public:
> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> +  bool operator () (mem_ref_loc *loc);
> +  class loop *l;
> +};
> +
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  edge e = loop_preheader_edge (l);
> +  if (e->src->count > curr_bb->count)
> +    return false;
> +  else
> +    return true;
> +}
> +
>
>  /* Returns true if we can perform store motion of REF from LOOP.  */
>
> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    if (!ref_indep_loop_p (loop, ref, sm_war))
>      return false;
>

Add a comment here what this is about.

Otherwise the GIMPLE invariant motion parts look sensible, but I'd
really like to have
the issue on the profile_count API sorted out.

Can you split out the RTL invariant motion part to a separate patch please?

Thanks,
Richard.

> +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +    return false;
> +
>    return true;
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> new file mode 100644
> index 00000000000..7326a230b3f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       bar (k / 5, "one", "two");
> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..f0a99fa42b4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int k, int s)
> +{
> +  int i;
> +  int j;
> +
> +  for (i = 0; i < m; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++)
> +         {
> +           bar (k / 5, "one", "two");
> +         a[s] = k;
> +       }
> +      a[s] = s;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..bc60a040a70
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `count' is not hoisted out of loop when bb is cold.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  struct obj *next;
> +
> +} *q;
> +
> +void
> +func (int m)
> +{
> +  struct obj *p;
> +  for (int i = 0; i < m; i++)
> +    if (__builtin_expect (x, 0))
> +      count++;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> new file mode 100644
> index 00000000000..fedaa3b7119
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `count' is hoisted out of loop when one of it's used bb is hot. */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  struct obj *next;
> +
> +} *q;
> +
> +void
> +func (int m, int n)
> +{
> +  struct obj *p;
> +  for (int i = 0; i < m; i++)
> +  {
> +    if (__builtin_expect (x, 0))
> +      count++;
> +    count += n;
> +  }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion of" 1 "lim2"  }  } */
> +
> --
> 2.27.0.90.geebb51ba8c
>
>
Xionghu Luo Oct. 9, 2021, 3:44 a.m. UTC | #11
Hi,

On 2021/9/28 20:09, Richard Biener wrote:
> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> Update the patch to v3, not sure whether you prefer the paste style
>> and continue to link the previous thread as Segher dislikes this...
>>
>>
>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>
>>
>> Changes:
>> 1. Handle max_loop in determine_max_movement instead of
>> outermost_invariant_loop.
>> 2. Remove unnecessary changes.
>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>> infinite loop when implementing v1 and the iteration is missed to be
>> updated actually.
>>
>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>
>> There was a patch trying to avoid move cold block out of loop:
>>
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>
>> Richard suggested to "never hoist anything from a bb with lower execution
>> frequency to a bb with higher one in LIM invariantness_dom_walker
>> before_dom_children".
>>
>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>> expected target loop, if profile count of the loop bb is colder
>> than target loop preheader, it won't be hoisted out of loop.
>> Likely for store motion, if all locations of the REF in loop is cold,
>> don't do store motion of it.
>>
>> SPEC2017 performance evaluation shows 1% performance improvement for
>> intrate GEOMEAN and no obvious regression for others.  Especially,
>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>> on P8LE.
>>
>> gcc/ChangeLog:
>>
>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>         before motion.
>>         (find_invariants_body): Add argument.
>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>         (determine_max_movement): Use find_coldest_out_loop.
>>         (move_computations_worker): Adjust and fix iteration udpate.
>>         (execute_sm_exit): Check pointer validness.
>>         (class ref_in_loop_hot_body): New functor.
>>         (ref_in_loop_hot_body::operator): New.
>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>> ---
>>  gcc/loop-invariant.c                       | 10 ++--
>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>
>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>> index fca0c2b24be..5c3be7bf0eb 100644
>> --- a/gcc/loop-invariant.c
>> +++ b/gcc/loop-invariant.c
>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>     call.  */
>>
>>  static void
>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>> +                   bool always_executed)
>>  {
>>    rtx_insn *insn;
>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>> +
>> +  if (preheader->count > bb->count)
>> +    return;
>>
>>    FOR_BB_INSNS (bb, insn)
>>      {
>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>    unsigned i;
>>
>>    for (i = 0; i < loop->num_nodes; i++)
>> -    find_invariants_bb (body[i],
>> -                       bitmap_bit_p (always_reached, i),
>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>                         bitmap_bit_p (always_executed, i));
>>  }
>>
>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>> index 4b187c2cdaf..655fab03442 100644
>> --- a/gcc/tree-ssa-loop-im.c
>> +++ b/gcc/tree-ssa-loop-im.c
>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>    return ret;
>>  }
>>
>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
>> +
>> +static class loop *
>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>> +                      basic_block curr_bb)
>> +{
>> +  class loop *cold_loop, *min_loop;
>> +  cold_loop = min_loop = outmost_loop;
>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>> +
>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> 
> Honza - can you comment on whether we should compare BB counts this way?
> 
> I would suspect that for, say,
> 
>   for (...)
>      if (a)
>        X;
>      else
>        Y;
> 
> that the counts for X and Y will be less than that of the preheader of the loop
> only when the loop is estimated to run once.  That is, should we really compare
> the to the preheader count or maybe better to the _header_ count which
> would keep the number of iterations out of the equation?

I quickly tried replacing all the loop_preheader_edge (loop)->src uses with
loop_preheader_edge (loop)->dest; that causes many failures in
gcc.dg/tree-ssa/ssa-lim-*.c.  I didn't investigate deeply, but it seems
reasonable to compare the bb count with the preheader count, since both
gimple LIM and RTL loop-invariant move instructions to the *preheader*
rather than the *header* after analysis?
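
For intuition (numbers invented purely to illustrate the preheader-vs-header
question): if a loop is entered N = 10 times and runs M = 100 iterations per
entry, then roughly

  preheader->count ~ N                      = 10
  header->count    ~ N * M                  = 1000
  block guarded to run on 5% of iterations  ~ 0.05 * N * M = 50

so such a block is hotter than the preheader but colder than the header, and
the two comparison points can give opposite answers.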

> 
> If we look at maybe_hot_count_p that's a quite sophisticated thing to
> compare a count to the "IPA hot", here we're comparing two counts
> within a function where it actually matters whether we use a<b or
> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> function).
> 
> Xionghu, you error on the side of not hoisting for unordered counts here
> 
>> +    return NULL;
>> +
>> +  while (min_loop != loop)
>> +    {
>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> 
> but in the other direction here and on the side of not hoisting
> in ref_in_loop_hot_body.
> 
> The three-state relational operator overloads are probably not the
> very best idea...
> (see profile-count.h for them)
> 
Added a new function bb_colder_than_loop_preheader to encapsulate the
comparison.  If FALSE is returned due to the three-state inequality,
find_coldest_out_loop will return the original input for lim_data->max_loop,
and ref_in_loop_hot_body::operator () will return true to continue performing
store motion; both preserve the previous behavior.
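
In other words, a simplified sketch of the intended fallback (not the literal
code from the attached patch):

  bool colder = bb_colder_than_loop_preheader (bb->count, preheader->count);
  /* colder == true  -> decidably colder: restrict the motion
                        (NULL max_loop, no store motion).
     colder == false -> hotter, equal, or unordered: previous behavior.  */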


>> +       cold_loop = min_loop;
>> +    }
>> +  return cold_loop;
>> +}
>> +
>>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>>     loop to that we could move the expression using DEF if it did not have
>>     other operands, i.e. the outermost loop enclosing LOOP in that the value
>> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>>      level = ALWAYS_EXECUTED_IN (bb);
>>    else
>>      level = superloop_at_depth (loop, 1);
>> -  lim_data->max_loop = level;
>> +  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
>> +  if (!lim_data->max_loop)
>> +    return false;
>>
>>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>>      {
>> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
>>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>>      {
>>        edge e;
>> -
>>        gimple *stmt = gsi_stmt (bsi);
>>
>>        lim_data = get_lim_data (stmt);
>> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
>>        /* We do not really want to move conditionals out of the loop; we just
>>          placed it here to force its operands to be moved if necessary.  */
>>        if (gimple_code (stmt) == GIMPLE_COND)
>> -       continue;
>> +       {
>> +         gsi_next (&bsi);
>> +         continue;
>> +       }
>>
>>        if (dump_file && (dump_flags & TDF_DETAILS))
>>         {
>> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>>         }
>>        else
>>         {
>> -         sm_aux *aux = *aux_map.get (ref);
>> +         sm_aux **paux = aux_map.get (ref);
>> +         sm_aux *aux;
>> +         if (paux)
>> +           aux = *paux;
>> +         else
>> +           continue;
> 
> do you really need this?  I doubt so.

Removed.

> 
>>           if (!aux->store_flag || kind == sm_ord)
>>             {
>>               gassign *store;
>> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>>    return indep_p;
>>  }
>>
>> +class ref_in_loop_hot_body
>> +{
>> +public:
>> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
>> +  bool operator () (mem_ref_loc *loc);
>> +  class loop *l;
>> +};
>> +
>> +bool
>> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
>> +{
>> +  basic_block curr_bb = gimple_bb (loc->stmt);
>> +  edge e = loop_preheader_edge (l);
>> +  if (e->src->count > curr_bb->count)
>> +    return false;
>> +  else
>> +    return true;
>> +}
>> +
>>
>>  /* Returns true if we can perform store motion of REF from LOOP.  */
>>
>> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>>    if (!ref_indep_loop_p (loop, ref, sm_war))
>>      return false;
>>
> 
> Add a comment here what this is about.

Done.

> 
> Otherwise the GIMPLE invariant motion parts look sensible, but I'd
> really like to have
> the issue on the profile_count API sorted out.
> 
> Can you split out the RTL invariant motion part to a separate patch please?

Done.  Attached the two patches, thanks.


BR,
Xionghu
From 092d6df49c0027001c3ed9343f0d1e8c02232d95 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luoxhu@linux.ibm.com>
Date: Mon, 5 Jul 2021 03:57:11 -0500
Subject: [PATCH v4 1/2] Don't move cold code out of loop by checking bb count

v4 changes:
1. Factor the profile_count comparison out into the new function
bb_colder_than_loop_preheader.
2. Update ref_in_loop_hot_body::operator () to find the cold_loop before
comparing.
3. Split RTL invariant motion part out.
4. Remove aux changes.

v3 changes:
1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. The "gsi_next (&bsi);" in move_computations_worker is kept: in v1 the
iterator was not actually advanced there, which caused an infinite loop.

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add find_coldest_out_loop to choose the expected
target loop for each invariant: if the profile count of the loop bb is
colder than the target loop's preheader, the statement won't be hoisted
out of the loop.  Likewise for store motion: if all locations of the REF
in the loop are cold, don't do store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
	function.
	(find_coldest_out_loop): New function.
	(determine_max_movement): Use find_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration update.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 85 +++++++++++++++++++++-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 +++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 +++++++++
 6 files changed, 191 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 4b187c2cdaf..870e0a00512 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -417,6 +417,46 @@ movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Compare the profile count inequality of COUNT1 and COUNT2; the comparison
+   is three-state as stated in profile-count.h.  FALSE is returned if the
+   inequality cannot be decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}
+
+/* Find the coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+ */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block curr_bb)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  /* If bb_colder_than_loop_preheader returns false due to the three-state
+    comparison, OUTMOST_LOOP is finally returned to preserve the previous
+    behavior.  Otherwise return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb->count,
+					loop_preheader_edge (loop)->src->count))
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+	    loop_preheader_edge (min_loop)->src->count, min_count))
+	cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
     level = superloop_at_depth (loop, 1);
-  lim_data->max_loop = level;
+  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
+  if (!lim_data->max_loop)
+    return false;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
@@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -2887,6 +2932,35 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
   return indep_p;
 }
 
+class ref_in_loop_hot_body
+{
+public:
+  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
+  bool operator () (mem_ref_loc *loc);
+  class loop *l;
+};
+
+/* Find the coldest loop between loop L and the innermost loop; compare the
+   hotness of the current BB and the coldest loop's preheader by profile count.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  if (l != inner_loop)
+    cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  edge e = loop_preheader_edge (cold_loop);
+  /* If bb_colder_than_loop_preheader is false due to the three-state inequality
+     comparison, TRUE is returned to continue performing store motion.  */
+  if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count))
+    return false;
+  else
+    return true;
+}
+
 
 /* Returns true if we can perform store motion of REF from LOOP.  */
 
@@ -2941,6 +3015,13 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!ref_indep_loop_p (loop, ref, sm_war))
     return false;
 
+  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
+    candidate's profile count is hot.  A statement in a cold BB shouldn't be
+    moved out of its loop_father, nor out of LOOP if it is colder than LOOP's
+    preheader.  See ssa-lim-21.c.  */
+  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+    return false;
+
   return true;
 }
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
new file mode 100644
index 00000000000..7326a230b3f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	bar (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..f0a99fa42b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int k, int s)
+{
+  int i;
+  int j;
+
+  for (i = 0; i < m; i++)
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++)
+	  {
+	    bar (k / 5, "one", "two");
+	  a[s] = k;
+	}
+      a[s] = s;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..bc60a040a70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,25 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of loop when bb is cold.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+    if (__builtin_expect (x, 0))
+      count++;
+
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
new file mode 100644
index 00000000000..c38a858283f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
@@ -0,0 +1,35 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of the inner or the outer loop when it
+   is in a cold loop.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  int data1;
+  struct obj *next;
+};
+
+void
+func (int m, int n, int k, struct obj *a)
+{
+  struct obj *q = a;
+  for (int j = 0; j < m; j++)
+    if (__builtin_expect (m, 0))
+      for (int i = 0; i < m; i++)
+	{
+	  if (__builtin_expect (x, 0))
+	    {
+	      count++;
+	      q->data += 3; /* Not hoisted out to inner loop. */
+	    }
+	  count += n;
+	  q->data1 += k; /* Not hoisted out to outer loop. */
+	}
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
+
Richard Biener Oct. 15, 2021, 8:11 a.m. UTC | #12
On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> Hi,
>
> On 2021/9/28 20:09, Richard Biener wrote:
> > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >> Update the patch to v3, not sure whether you prefer the paste style
> >> and continue to link the previous thread as Segher dislikes this...
> >>
> >>
> >> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>
> >>
> >> Changes:
> >> 1. Handle max_loop in determine_max_movement instead of
> >> outermost_invariant_loop.
> >> 2. Remove unnecessary changes.
> >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> >> infinite loop when implementing v1 and the iteration is missed to be
> >> updated actually.
> >>
> >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>
> >> There was a patch trying to avoid move cold block out of loop:
> >>
> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>
> >> Richard suggested to "never hoist anything from a bb with lower execution
> >> frequency to a bb with higher one in LIM invariantness_dom_walker
> >> before_dom_children".
> >>
> >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >> expected target loop, if profile count of the loop bb is colder
> >> than target loop preheader, it won't be hoisted out of loop.
> >> Likely for store motion, if all locations of the REF in loop is cold,
> >> don't do store motion of it.
> >>
> >> SPEC2017 performance evaluation shows 1% performance improvement for
> >> intrate GEOMEAN and no obvious regression for others.  Especially,
> >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >> on P8LE.
> >>
> >> gcc/ChangeLog:
> >>
> >>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>         before motion.
> >>         (find_invariants_body): Add argument.
> >>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>         (determine_max_movement): Use find_coldest_out_loop.
> >>         (move_computations_worker): Adjust and fix iteration udpate.
> >>         (execute_sm_exit): Check pointer validness.
> >>         (class ref_in_loop_hot_body): New functor.
> >>         (ref_in_loop_hot_body::operator): New.
> >>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >> ---
> >>  gcc/loop-invariant.c                       | 10 ++--
> >>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>
> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >> index fca0c2b24be..5c3be7bf0eb 100644
> >> --- a/gcc/loop-invariant.c
> >> +++ b/gcc/loop-invariant.c
> >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
> >>     call.  */
> >>
> >>  static void
> >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >> +                   bool always_executed)
> >>  {
> >>    rtx_insn *insn;
> >> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >> +
> >> +  if (preheader->count > bb->count)
> >> +    return;
> >>
> >>    FOR_BB_INSNS (bb, insn)
> >>      {
> >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
> >>    unsigned i;
> >>
> >>    for (i = 0; i < loop->num_nodes; i++)
> >> -    find_invariants_bb (body[i],
> >> -                       bitmap_bit_p (always_reached, i),
> >> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>                         bitmap_bit_p (always_executed, i));
> >>  }
> >>
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 4b187c2cdaf..655fab03442 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>    return ret;
> >>  }
> >>
> >> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
> >> +
> >> +static class loop *
> >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >> +                      basic_block curr_bb)
> >> +{
> >> +  class loop *cold_loop, *min_loop;
> >> +  cold_loop = min_loop = outmost_loop;
> >> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >> +
> >> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> >
> > Honza - can you comment on whether we should compare BB counts this way?
> >
> > I would suspect that for, say,
> >
> >   for (...)
> >      if (a)
> >        X;
> >      else
> >        Y;
> >
> > that the counts for X and Y will be less than that of the preheader of the loop
> > only when the loop is estimated to run once.  That is, should we really compare
> > the to the preheader count or maybe better to the _header_ count which
> > would keep the number of iterations out of the equation?
>
> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> loop_preheader_edge (loop)->dest, it will cause many failures in
> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> reasonable to compare the bb count with preheader count as both gimple lim
> and RTL loop-invariant move instructions to *preheader* instead of *header*
> after analysis?

Hmm, yeah - guess I was confused here.

> >
> > If we look at maybe_hot_count_p that's a quite sophisticated thing to
> > compare a count to the "IPA hot", here we're comparing two counts
> > within a function where it actually matters whether we use a<b or
> > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> > function).
> >
> > Xionghu, you error on the side of not hoisting for unordered counts here
> >
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >
> > but in the other direction here and on the side of not hoisting
> > in ref_in_loop_hot_body.
> >
> > The three-state relational operator overloads are probably not the
> > very best idea...
> > (see profile-count.h for them)
> >
> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
> will return true to continue perform store motion, both preserve the previous
> behavior.

Thanks.  But I don't think the abstraction as written is useful:

+/* Compare the profile count inequality of COUNT1 and COUNT2; the comparison
+   is three-state as stated in profile-count.h.  FALSE is returned if the
+   inequality cannot be decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}

given the following seems to pass the preheader count in place of the BB count.

+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+       cold_loop = min_loop;

find_coldest_out_loop is also a bit weird, I think we want to find
the outermost loop between outmost_loop and loop that has a
lower count than the curr_bb count but

+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+       cold_loop = min_loop;

compares the outermost loop's count (min_count) against the preheader
count?  So we're searching for a cold loop with respect to its enclosing loop
here?

Why is this function not simply

+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+                      basic_block curr_bb)
+{
     while (bb_colder_than_loop_preheader (curr_bb->count,
               loop_preheader_edge (outermost_loop)->src->count))
        {
            if (outermost_loop == loop)
              return NULL;
            outermost_loop
              = superloop_at_depth (loop, loop_depth (outermost_loop) + 1);
        }
     return outermost_loop;
}

?

Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
find_coldest_out_loop and why it not simply does

+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
    if (bb_colder_than_loop_preheader (curr_bb->count,
                                        loop_preheader_edge (l)->src->count))
      return false;
   return true;
   }

?
>
> >> +       cold_loop = min_loop;
> >> +    }
> >> +  return cold_loop;
> >> +}
> >> +
> >>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
> >>     loop to that we could move the expression using DEF if it did not have
> >>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> >> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
> >>      level = ALWAYS_EXECUTED_IN (bb);
> >>    else
> >>      level = superloop_at_depth (loop, 1);
> >> -  lim_data->max_loop = level;
> >> +  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
> >> +  if (!lim_data->max_loop)
> >> +    return false;
> >>
> >>    if (gphi *phi = dyn_cast <gphi *> (stmt))
> >>      {
> >> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
> >>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
> >>      {
> >>        edge e;
> >> -
> >>        gimple *stmt = gsi_stmt (bsi);
> >>
> >>        lim_data = get_lim_data (stmt);
> >> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
> >>        /* We do not really want to move conditionals out of the loop; we just
> >>          placed it here to force its operands to be moved if necessary.  */
> >>        if (gimple_code (stmt) == GIMPLE_COND)
> >> -       continue;
> >> +       {
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >>
> >>        if (dump_file && (dump_flags & TDF_DETAILS))
> >>         {
> >> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
> >>         }
> >>        else
> >>         {
> >> -         sm_aux *aux = *aux_map.get (ref);
> >> +         sm_aux **paux = aux_map.get (ref);
> >> +         sm_aux *aux;
> >> +         if (paux)
> >> +           aux = *paux;
> >> +         else
> >> +           continue;
> >
> > do you really need this?  I doubt so.
>
> Removed.
>
> >
> >>           if (!aux->store_flag || kind == sm_ord)
> >>             {
> >>               gassign *store;
> >> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
> >>    return indep_p;
> >>  }
> >>
> >> +class ref_in_loop_hot_body
> >> +{
> >> +public:
> >> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> >> +  bool operator () (mem_ref_loc *loc);
> >> +  class loop *l;
> >> +};
> >> +
> >> +bool
> >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> >> +{
> >> +  basic_block curr_bb = gimple_bb (loc->stmt);
> >> +  edge e = loop_preheader_edge (l);
> >> +  if (e->src->count > curr_bb->count)
> >> +    return false;
> >> +  else
> >> +    return true;
> >> +}
> >> +
> >>
> >>  /* Returns true if we can perform store motion of REF from LOOP.  */
> >>
> >> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
> >>    if (!ref_indep_loop_p (loop, ref, sm_war))
> >>      return false;
> >>
> >
> > Add a comment here what this is about.
>
> Done.
>
> >
> > Otherwise the GIMPLE invariant motion parts look sensible, but I'd
> > really like to have
> > the issue on the profile_count API sorted out.
> >
> > Can you split out the RTL invariant motion part to a separate patch please?
>
> Done.  Attached the two patches, thanks.
>
>
> BR,
> Xionghu
Xionghu Luo Oct. 18, 2021, 4:29 a.m. UTC | #13
On 2021/10/15 16:11, Richard Biener wrote:
> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> On 2021/9/28 20:09, Richard Biener wrote:
>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> Update the patch to v3, not sure whether you prefer the paste style
>>>> and continue to link the previous thread as Segher dislikes this...
>>>>
>>>>
>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>>>
>>>>
>>>> Changes:
>>>> 1. Handle max_loop in determine_max_movement instead of
>>>> outermost_invariant_loop.
>>>> 2. Remove unnecessary changes.
>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>>>> infinite loop when implementing v1 and the iteration is missed to be
>>>> updated actually.
>>>>
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>>> expected target loop, if profile count of the loop bb is colder
>>>> than target loop preheader, it won't be hoisted out of loop.
>>>> Likely for store motion, if all locations of the REF in loop is cold,
>>>> don't do store motion of it.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>>>         before motion.
>>>>         (find_invariants_body): Add argument.
>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>>>         (determine_max_movement): Use find_coldest_out_loop.
>>>>         (move_computations_worker): Adjust and fix iteration udpate.
>>>>         (execute_sm_exit): Check pointer validness.
>>>>         (class ref_in_loop_hot_body): New functor.
>>>>         (ref_in_loop_hot_body::operator): New.
>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>>>> ---
>>>>  gcc/loop-invariant.c                       | 10 ++--
>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>>>
>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>>>> index fca0c2b24be..5c3be7bf0eb 100644
>>>> --- a/gcc/loop-invariant.c
>>>> +++ b/gcc/loop-invariant.c
>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>>>     call.  */
>>>>
>>>>  static void
>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>>>> +                   bool always_executed)
>>>>  {
>>>>    rtx_insn *insn;
>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>>>> +
>>>> +  if (preheader->count > bb->count)
>>>> +    return;
>>>>
>>>>    FOR_BB_INSNS (bb, insn)
>>>>      {
>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>>>    unsigned i;
>>>>
>>>>    for (i = 0; i < loop->num_nodes; i++)
>>>> -    find_invariants_bb (body[i],
>>>> -                       bitmap_bit_p (always_reached, i),
>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>>>                         bitmap_bit_p (always_executed, i));
>>>>  }
>>>>
>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>> index 4b187c2cdaf..655fab03442 100644
>>>> --- a/gcc/tree-ssa-loop-im.c
>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>>>    return ret;
>>>>  }
>>>>
>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
>>>> +
>>>> +static class loop *
>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>>> +                      basic_block curr_bb)
>>>> +{
>>>> +  class loop *cold_loop, *min_loop;
>>>> +  cold_loop = min_loop = outmost_loop;
>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>>>> +
>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
>>>
>>> Honza - can you comment on whether we should compare BB counts this way?
>>>
>>> I would suspect that for, say,
>>>
>>>   for (...)
>>>      if (a)
>>>        X;
>>>      else
>>>        Y;
>>>
>>> that the counts for X and Y will be less than that of the preheader of the loop
>>> only when the loop is estimated to run once.  That is, should we really compare
>>> the to the preheader count or maybe better to the _header_ count which
>>> would keep the number of iterations out of the equation?
>>
>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>> loop_preheader_edge (loop)->dest, it will cause many failures in
>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>> reasonable to compare the bb count with preheader count as both gimple lim
>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>> after analysis?
> 
> Hmm, yeah - guess I was confused here.
> 
>>>
>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
>>> compare a count to the "IPA hot", here we're comparing two counts
>>> within a function where it actually matters whether we use a<b or
>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
>>> function).
>>>
>>> Xionghu, you error on the side of not hoisting for unordered counts here
>>>
>>>> +    return NULL;
>>>> +
>>>> +  while (min_loop != loop)
>>>> +    {
>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
>>>
>>> but in the other direction here and on the side of not hoisting
>>> in ref_in_loop_hot_body.
>>>
>>> The three-state relational operator overloads are probably not the
>>> very best idea...
>>> (see profile-count.h for them)
>>>
>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
>> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
>> will return true to continue perform store motion, both preserve the previous
>> behavior.
> 
> Thanks.  But I don't think the abstraction as written is useful:
> 
> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> +{
> +  if (count1 < count2)
> +    return true;
> +  else
> +    return false;
> +}
> 
> given the following seems to pass the preheader count in place of the BB count.
> 
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +       cold_loop = min_loop;
> 
> find_coldest_out_loop is also a bit weird, I think we want to find
> the outermost loop between outmost_loop and loop that has a
> lower count than the curr_bb count but
> 
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +       cold_loop = min_loop;
> 
> compares the outermost loops count (min_count) against the preheader
> count?  So we're searching for a cold loop with respect to its enclosing loop
> here?

Let me try to explain how it works :)

find_coldest_out_loop does a two-step check:
1) Check whether curr_bb is cold in its own loop_father; if it is cold,
just return NULL, which means it should not be moved out at all;
2) curr_bb is NOT cold: assume the current loop L[m] is the coldest first,
then try to find a colder loop to hoist to from {L[1], L[2], ... L[m]};
if L[i]->count < L[m]->count, set cold_loop to L[i], until we find the loop
that has the smallest profile_count.


Take the updated ssa-lim-19.c as an example: check whether curr_bb (bb 5) is cold in
loop 3; if it is cold, just return NULL, otherwise select the coldest loop in
{loop1, loop2, loop3} and find that loop2 is colder than loop3, so return loop2 as
the target loop to hoist to.  The first check would AVOID hoisting if curr_bb were
colder than loop3 but still hotter than loop1 and loop2.  Not sure whether it is
possible to construct such a case?


gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

volatile int x;
void
bar (int, char *, char *);
void
foo (int *a, int n, int m, int s, int t)
{
  int i;
  int j;
  int k;

  for (i = 0; i < m; i++)  // loop 1
    {
      if (__builtin_expect (x, 0))
        for (j = 0; j < n; j++)   // loop 2
          for (k = 0; k < n; k++)   // loop 3
           {
             bar (s / 5, "one", "two");  // curr_bb 
             a[t] = s;
           }
      a[t] = t;  // curr_bb2
    }
}

The 4 invariant statements are moved to bb 11 (loop2) instead of bb 10 (loop1)
with this patch.
There are 6 possible orderings of the three loop preheader counts when curr_bb is
hotter than loop 3.  We need to compare the loop *preheader* hotness rather than
every Loop[i] against curr_bb's hotness, and find_coldest_out_loop must return the
coldest of those loops, otherwise unexpected behavior happens.

L1 > L2 > L3   =>  return L3
L1 > L3 > L2   =>  return L2
L2 > L1 > L3   =>  return L3
L2 > L3 > L1   =>  return L1
L3 > L1 > L2   =>  return L2
L3 > L2 > L1   =>  return L1

So bb_colder_than_loop_preheader does two kinds of checks: one compares the
L3 preheader count with the curr_bb count, the other compares the L3 preheader count
with the L1 preheader count, the L2 preheader count, etc...
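
With the counts in the dump below this is the "L3 > L1 > L2" row of the table:
the L3 preheader (bb 7) has count 118111600, the L1 preheader (bb 10) has count
16057869 and the L2 preheader (bb 11) has count 12992276, while curr_bb (bb 5)
has count 955630225, so the coldest preheader is L2's and the invariants end up
in bb 11.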


ssa-lim-19.c.138t.lim2:
...
   <bb 10> [local count: 16057869]:  // L1 preheader
-  _4 = s_22(D) / 5;
-  _5 = (long unsigned int) t_24(D);
-  _6 = _5 * 4;
-  _7 = a_25(D) + _6;
   _8 = (long unsigned int) t_24(D);
   _9 = _8 * 4;
   _10 = a_25(D) + _9;

   <bb 3> [local count: 145980626]:
   # i_34 = PHI <i_29(12), 0(10)>
   x.0_1 ={v} x;
   if (x.0_1 != 0)
     goto <bb 4>; [10.00%]
   else
     goto <bb 8>; [90.00%]

   <bb 4> [local count: 14598063]:
   if (n_20(D) > 0)
     goto <bb 11>; [89.00%]
   else
     goto <bb 8>; [11.00%]

   <bb 11> [local count: 12992276]:  // L2 preheader
+  _4 = s_22(D) / 5;
+  _5 = (long unsigned int) t_24(D);
+  _6 = _5 * 4;
+  _7 = a_25(D) + _6;
   goto <bb 7>; [100.00%]

   <bb 14> [local count: 850510901]:

   <bb 5> [local count: 955630225]:  // curr_bb
   # k_36 = PHI <k_27(14), 0(7)>
   bar (_4, "one", "two");
   *_7 = s_22(D);
   k_27 = k_36 + 1;
   if (n_20(D) > k_27)
     goto <bb 14>; [89.00%]
   else
     goto <bb 6>; [11.00%]

   <bb 6> [local count: 118111600]:
   j_21 = j_35 + 1;
   if (n_20(D) > j_21)
     goto <bb 13>; [89.00%]
   else
     goto <bb 8>; [11.00%]

   <bb 13> [local count: 105119324]:

   <bb 7> [local count: 118111600]:   // L3 preheader
   # j_35 = PHI <j_21(13), 0(11)>
   goto <bb 5>; [100.00%]

   <bb 8> [local count: 145980626]:
   *_10 = t_24(D);
   i_29 = i_34 + 1;

Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:

+/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}
+
+/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+ */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block curr_bb)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparison, OUTMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb->count,
+					loop_preheader_edge (loop)->src->count))
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+	    loop_preheader_edge (min_loop)->src->count, min_count))
+	cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+
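
For reference, this is roughly how the new helper is meant to be hooked into
determine_max_movement (just a sketch to show the intent; the variable names and
the exact placement are illustrative, the real hunk is in the v3 patch linked
above):

  /* Sketch, not the exact hunk: LEVEL is the outermost loop that
     determine_max_movement would otherwise allow hoisting to, LOOP is
     bb->loop_father and BB the block of the statement being analyzed.  */
  level = find_coldest_out_loop (level, loop, bb);
  if (!level)
    return false;
  lim_data->max_loop = level;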


> 
> Why is this function not simply
> 
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block curr_bb)
> +{
>      while (bb_colder_than_loop_preheader (curr_bb->count,
>                loop_preheader_edge (outermost_loop)->src->count))
>         {
>             if (outermost_loop == loop)
>               return NULL;
>             outermost_loop = superloop_at_depth (loop, loop_depth
> (outermost_loop) + 1);
>         }
>      return outermost_loop;
> }

If we change it like this, then when processing curr_bb (bb 5) the while condition
is false on the very first iteration, because curr_bb->count (955630225) is not
colder than the loop 1 preheader count (16057869), so loop 1 is returned immediately
even though the loop 2 preheader (count 12992276) is colder.  This doesn't do what
we want.

> 
> ?
> 
> Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
> find_coldest_out_loop and why it not simply does
> 
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
>     if (bb_colder_than_loop_preheader (curr_bb->count,
> loop_preheader_edge (l)->src->count))
>       return false;
>    return true;
>    }

Likewise, for this part,

+/* Find out the coldest loop between loop L and innermost loop, compare the
+   hotness between current BB and coldest loop preheader by profile count.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  if (l != inner_loop)
+    cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  edge e = loop_preheader_edge (cold_loop);
+  /*  If bb_colder_than_loop_preheader is false due to three-state inequality
+     comparison, TRUE is returned to continue performing store motion.  */
+  if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count))
+    return false;
+  else
+    return true;
+}

l is the input of ref_in_loop_hot_body; it is an outer loop, and we need to find a
cold_loop between l and inner_loop.  The reason is that there may be a colder loop
between l and inner_loop, which means we shouldn't do store motion from curr_bb to l
directly.
After reconsideration, I think the bb_colder_than_loop_preheader call could be
removed since curr_bb is already checked in find_coldest_out_loop.  The
"l != inner_loop" check can be removed as well:

+/* Find out the coldest loop between loop L and innermost loop.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  return true;
+}
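
And for completeness, the intended use in can_sm_ref_p stays as described in the
v3 changes above, roughly (sketch only, not the actual diff):

  /* Sketch: if no location of REF in LOOP is hot enough, i.e. all of them
     are colder than the preheader we would hoist to, give up on store
     motion of REF.  */
  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
    return false;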
Richard Biener Oct. 26, 2021, 1:20 p.m. UTC | #14
On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/10/15 16:11, Richard Biener wrote:
> > On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2021/9/28 20:09, Richard Biener wrote:
> >>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>>>
> >>>> Update the patch to v3, not sure whether you prefer the paste style
> >>>> and continue to link the previous thread as Segher dislikes this...
> >>>>
> >>>>
> >>>> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>>>
> >>>>
> >>>> Changes:
> >>>> 1. Handle max_loop in determine_max_movement instead of
> >>>> outermost_invariant_loop.
> >>>> 2. Remove unnecessary changes.
> >>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> >>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> >>>> infinite loop when implementing v1 and the iteration is missed to be
> >>>> updated actually.
> >>>>
> >>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>>>
> >>>> There was a patch trying to avoid move cold block out of loop:
> >>>>
> >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>>>
> >>>> Richard suggested to "never hoist anything from a bb with lower execution
> >>>> frequency to a bb with higher one in LIM invariantness_dom_walker
> >>>> before_dom_children".
> >>>>
> >>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >>>> expected target loop, if profile count of the loop bb is colder
> >>>> than target loop preheader, it won't be hoisted out of loop.
> >>>> Likely for store motion, if all locations of the REF in loop is cold,
> >>>> don't do store motion of it.
> >>>>
> >>>> SPEC2017 performance evaluation shows 1% performance improvement for
> >>>> intrate GEOMEAN and no obvious regression for others.  Especially,
> >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >>>> on P8LE.
> >>>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>>>         before motion.
> >>>>         (find_invariants_body): Add argument.
> >>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>>>         (determine_max_movement): Use find_coldest_out_loop.
> >>>>         (move_computations_worker): Adjust and fix iteration udpate.
> >>>>         (execute_sm_exit): Check pointer validness.
> >>>>         (class ref_in_loop_hot_body): New functor.
> >>>>         (ref_in_loop_hot_body::operator): New.
> >>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>>>
> >>>> gcc/testsuite/ChangeLog:
> >>>>
> >>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >>>> ---
> >>>>  gcc/loop-invariant.c                       | 10 ++--
> >>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>>>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>>>
> >>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >>>> index fca0c2b24be..5c3be7bf0eb 100644
> >>>> --- a/gcc/loop-invariant.c
> >>>> +++ b/gcc/loop-invariant.c
> >>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
> >>>>     call.  */
> >>>>
> >>>>  static void
> >>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> >>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >>>> +                   bool always_executed)
> >>>>  {
> >>>>    rtx_insn *insn;
> >>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >>>> +
> >>>> +  if (preheader->count > bb->count)
> >>>> +    return;
> >>>>
> >>>>    FOR_BB_INSNS (bb, insn)
> >>>>      {
> >>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
> >>>>    unsigned i;
> >>>>
> >>>>    for (i = 0; i < loop->num_nodes; i++)
> >>>> -    find_invariants_bb (body[i],
> >>>> -                       bitmap_bit_p (always_reached, i),
> >>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>>>                         bitmap_bit_p (always_executed, i));
> >>>>  }
> >>>>
> >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >>>> index 4b187c2cdaf..655fab03442 100644
> >>>> --- a/gcc/tree-ssa-loop-im.c
> >>>> +++ b/gcc/tree-ssa-loop-im.c
> >>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>>>    return ret;
> >>>>  }
> >>>>
> >>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
> >>>> +
> >>>> +static class loop *
> >>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >>>> +                      basic_block curr_bb)
> >>>> +{
> >>>> +  class loop *cold_loop, *min_loop;
> >>>> +  cold_loop = min_loop = outmost_loop;
> >>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >>>> +
> >>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> >>>
> >>> Honza - can you comment on whether we should compare BB counts this way?
> >>>
> >>> I would suspect that for, say,
> >>>
> >>>   for (...)
> >>>      if (a)
> >>>        X;
> >>>      else
> >>>        Y;
> >>>
> >>> that the counts for X and Y will be less than that of the preheader of the loop
> >>> only when the loop is estimated to run once.  That is, should we really compare
> >>> the to the preheader count or maybe better to the _header_ count which
> >>> would keep the number of iterations out of the equation?
> >>
> >> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> >> loop_preheader_edge (loop)->dest, it will cause many failures in
> >> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> >> reasonable to compare the bb count with preheader count as both gimple lim
> >> and RTL loop-invariant move instructions to *preheader* instead of *header*
> >> after analysis?
> >
> > Hmm, yeah - guess I was confused here.
> >
> >>>
> >>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
> >>> compare a count to the "IPA hot", here we're comparing two counts
> >>> within a function where it actually matters whether we use a<b or
> >>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> >>> function).
> >>>
> >>> Xionghu, you error on the side of not hoisting for unordered counts here
> >>>
> >>>> +    return NULL;
> >>>> +
> >>>> +  while (min_loop != loop)
> >>>> +    {
> >>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >>>
> >>> but in the other direction here and on the side of not hoisting
> >>> in ref_in_loop_hot_body.
> >>>
> >>> The three-state relational operator overloads are probably not the
> >>> very best idea...
> >>> (see profile-count.h for them)
> >>>
> >> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
> >> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
> >> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
> >> will return true to continue perform store motion, both preserve the previous
> >> behavior.
> >
> > Thanks.  But I don't think the abstraction as written is useful:
> >
> > +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> > +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> > +   decided.  */
> > +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> > +{
> > +  if (count1 < count2)
> > +    return true;
> > +  else
> > +    return false;
> > +}
> >
> > given the following seems to pass the preheader count in place of the BB count.
> >
> > +      if (bb_colder_than_loop_preheader (
> > +           loop_preheader_edge (min_loop)->src->count, min_count))
> > +       cold_loop = min_loop;
> >
> > find_coldest_out_loop is also a bit weird, I think we want to find
> > the outermost loop between outmost_loop and loop that has a
> > lower count than the curr_bb count but
> >
> > +  while (min_loop != loop)
> > +    {
> > +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> > +      if (bb_colder_than_loop_preheader (
> > +           loop_preheader_edge (min_loop)->src->count, min_count))
> > +       cold_loop = min_loop;
> >
> > compares the outermost loops count (min_count) against the preheader
> > count?  So we're searching for a cold loop with respect to its enclosing loop
> > here?
>
> Let me try to explain how it works :)
>
> find_coldest_out_loop does two steps check:
> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold,
> just return NULL which means it should not be moved out at all;
> 2)  curr_bb is NOT cold, assuming the current loop L[m] is the coldest first,
> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]},
> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop
> that has smallest profile_count.
>
>
> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in
> loop 3, if it is cold, just return NULL, otherwise select the coldest loop in
> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to
> be the target hoist loop.  The first check could AVOID hoist if curr_bb is colder
> than loop3, but it is still hot than loop1 and loop2.  Not sure whether it is possible
> to construct such cases?
>
>
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>
> volatile int x;
> void
> bar (int, char *, char *);
> void
> foo (int *a, int n, int m, int s, int t)
> {
>   int i;
>   int j;
>   int k;
>
>   for (i = 0; i < m; i++)  // loop 1
>     {
>       if (__builtin_expect (x, 0))
>         for (j = 0; j < n; j++)   // loop 2
>           for (k = 0; k < n; k++)   // loop 3
>            {
>              bar (s / 5, "one", "two");  // curr_bb
>              a[t] = s;
>            }
>       a[t] = t;  // curr_bb2
>     }
> }
>
> The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1)
> with this patch.
> There are totally 6 combinations when curr_bb is hotter than loop 3.  We need
> to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness",
> returning the coldest loop for this function find_coldest_out_loop, otherwise
> unexpected behavior happens.
>
> L1 > L2 > L3   =>  return L3
> L1 > L3 > L2   =>  return L2
> L2 > L1 > L3   =>  return L3
> L2 > L3 > L1   =>  return L1
> L3 > L1 > L2   =>  return L2
> L3 > L2 > L1   =>  return L1
>
> So bb_colder_than_loop_preheader does two kind of checks, one is checking
> L3 preheader count with curr_bb count, another is checking L3 preheader count
> with L1 preheader count, L2 preheader count, etc...
>
>
> ssa-lim-19.c.138t.lim2:
> ...
>    <bb 10> [local count: 16057869]:  // L1 preheader
> -  _4 = s_22(D) / 5;
> -  _5 = (long unsigned int) t_24(D);
> -  _6 = _5 * 4;
> -  _7 = a_25(D) + _6;
>    _8 = (long unsigned int) t_24(D);
>    _9 = _8 * 4;
>    _10 = a_25(D) + _9;
>
>    <bb 3> [local count: 145980626]:
>    # i_34 = PHI <i_29(12), 0(10)>
>    x.0_1 ={v} x;
>    if (x.0_1 != 0)
>      goto <bb 4>; [10.00%]
>    else
>      goto <bb 8>; [90.00%]
>
>    <bb 4> [local count: 14598063]:
>    if (n_20(D) > 0)
>      goto <bb 11>; [89.00%]
>    else
>      goto <bb 8>; [11.00%]
>
>    <bb 11> [local count: 12992276]:  // L2 preheader
> +  _4 = s_22(D) / 5;
> +  _5 = (long unsigned int) t_24(D);
> +  _6 = _5 * 4;
> +  _7 = a_25(D) + _6;
>    goto <bb 7>; [100.00%]
>
>    <bb 14> [local count: 850510901]:
>
>    <bb 5> [local count: 955630225]:  // curr_bb
>    # k_36 = PHI <k_27(14), 0(7)>
>    bar (_4, "one", "two");
>    *_7 = s_22(D);
>    k_27 = k_36 + 1;
>    if (n_20(D) > k_27)
>      goto <bb 14>; [89.00%]
>    else
>      goto <bb 6>; [11.00%]
>
>    <bb 6> [local count: 118111600]:
>    j_21 = j_35 + 1;
>    if (n_20(D) > j_21)
>      goto <bb 13>; [89.00%]
>    else
>      goto <bb 8>; [11.00%]
>
>    <bb 13> [local count: 105119324]:
>
>    <bb 7> [local count: 118111600]:   // L3 preheader
>    # j_35 = PHI <j_21(13), 0(11)>
>    goto <bb 5>; [100.00%]
>
>    <bb 8> [local count: 145980626]:
>    *_10 = t_24(D);
>    i_29 = i_34 + 1;
>
> Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:
>
> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> +{
> +  if (count1 < count2)
> +    return true;
> +  else
> +    return false;
> +}
> +
> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count.
> + */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block curr_bb)
> +{
> +  class loop *cold_loop, *min_loop;
> +  cold_loop = min_loop = outmost_loop;
> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb->count,
> +                                       loop_preheader_edge (loop)->src->count))
> +    return NULL;
> +
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +       cold_loop = min_loop;
> +    }
> +  return cold_loop;
> +}
> +
>
>
> >
> > Why is this function not simply
> >
> > +static class loop *
> > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> > +                      basic_block curr_bb)
> > +{
> >      while (bb_colder_than_loop_preheader (curr_bb->count,
> >                loop_preheader_edge (outermost_loop)->src->count))
> >         {
> >             if (outermost_loop == loop)
> >               return NULL;
> >             outermost_loop = superloop_at_depth (loop, loop_depth
> > (outermost_loop) + 1);
> >         }
> >      return outermost_loop;
> > }
>
> If change like this, when processing curr_bb(5), outermost_loop will
> return loop 1 since curr_bb->count > Loop1_prehead->count, the while
> loop stopped.  This doesn't meet what we want.

Why?  curr_bb is executed at least as often as loop1 preheader if
we look at the counts?  So either the counts do not really tell us
anything of help or I am missing something.  Are you merely
looking for a block with a lower count on the path from the outermost
loop entry to the block in question and deciding you do not want to
hoist further than that?  So it's not about avoiding hoisting to a hot place
but instead about hoisting to the coldest place within a loop nest?

So we have

  for (i = 0; i < m; i++)  // loop 1
    {
      if (__builtin_expect (x, 0))
        for (j = 0; j < n; j++)   // loop 2


   <bb 10> [local count: 16057869]:  // L1 preheader
       ...
 <bb 3> [local count: 145980626]:
   # i_34 = PHI <i_29(12), 0(10)>
 ...
   <bb 11> [local count: 12992276]:  // L2 preheader
   ...
    <bb 7> [local count: 118111600]:   // L3 preheader
   # j_35 = PHI <j_21(13), 0(11)>
   goto <bb 5>; [100.00%]

and we want to hoist to the L2 preheader because that's less frequently
executed than the L1 preheader (which is less frequently executed
than the L3 preheader or the block we are hoisting from).

I'm concerned about the compile-time complexity of re-evaluating counts on the
loop nest many times.  So it looks to me that we can pre-compute
this lowest-preheader-count loop for a loop nest at least for the
store-motion case where we know the outermost loop?


> >
> > ?
> >
> > Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
> > find_coldest_out_loop and why it not simply does
> >
> > +bool
> > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> > +{
> > +  basic_block curr_bb = gimple_bb (loc->stmt);
> >     if (bb_colder_than_loop_preheader (curr_bb->count,
> > loop_preheader_edge (l)->src->count))
> >       return false;
> >    return true;
> >    }
>
> Likely for this part,
>
> +/* Find out the coldest loop between loop L and innermost loop, compare the
> +   hotness between current BB and coldest loop preheader by profile count.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  class loop *cold_loop = l;
> +  if (l != inner_loop)
> +    cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
> +  if (!cold_loop)
> +    return false;
> +  edge e = loop_preheader_edge (cold_loop);
> +  /*  If bb_colder_than_loop_preheader is false due to three-state inequality
> +     comparision, TRUE is returned to continue perform store motion.  */
> +  if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count))
> +    return false;
> +  else
> +    return true;
> +}
>
> l is the input of ref_in_loop_hot_body, it is an out loop, we need to find a
> cold_loop between l and inner_loop.  Reason is there may be cold loop between
> l and inner_loop, which means we shouldn't do store-motion from curr_bb to l
> directly.
> After reconsideration, I think the bb_colder_than_loop_preheader could be
>  removed since curr_bb is checked in find_coldest_out_loop already.  And remove
> the "l != inner_loop" check:
>
> +/* Find out the coldest loop between loop L and innermost loop.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  class loop *cold_loop = l;
> +  cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
> +  if (!cold_loop)
> +    return false;
> +  return true;
> +}
>
>
> --
> Thanks,
> Xionghu
Xionghu Luo Oct. 27, 2021, 2:40 a.m. UTC | #15
On 2021/10/26 21:20, Richard Biener wrote:
> On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/10/15 16:11, Richard Biener wrote:
>>> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2021/9/28 20:09, Richard Biener wrote:
>>>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>>>
>>>>>> Update the patch to v3, not sure whether you prefer the paste style
>>>>>> and continue to link the previous thread as Segher dislikes this...
>>>>>>
>>>>>>
>>>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>>>>>
>>>>>>
>>>>>> Changes:
>>>>>> 1. Handle max_loop in determine_max_movement instead of
>>>>>> outermost_invariant_loop.
>>>>>> 2. Remove unnecessary changes.
>>>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
>>>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>>>>>> infinite loop when implementing v1 and the iteration is missed to be
>>>>>> updated actually.
>>>>>>
>>>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>>>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>>>>>
>>>>>> There was a patch trying to avoid move cold block out of loop:
>>>>>>
>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>>>
>>>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>>>> before_dom_children".
>>>>>>
>>>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>>>>> expected target loop, if profile count of the loop bb is colder
>>>>>> than target loop preheader, it won't be hoisted out of loop.
>>>>>> Likely for store motion, if all locations of the REF in loop is cold,
>>>>>> don't do store motion of it.
>>>>>>
>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>>>> on P8LE.
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>>>>>         before motion.
>>>>>>         (find_invariants_body): Add argument.
>>>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>>>>>         (determine_max_movement): Use find_coldest_out_loop.
>>>>>>         (move_computations_worker): Adjust and fix iteration udpate.
>>>>>>         (execute_sm_exit): Check pointer validness.
>>>>>>         (class ref_in_loop_hot_body): New functor.
>>>>>>         (ref_in_loop_hot_body::operator): New.
>>>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>>>>>> ---
>>>>>>  gcc/loop-invariant.c                       | 10 ++--
>>>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>>>>>
>>>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>>>>>> index fca0c2b24be..5c3be7bf0eb 100644
>>>>>> --- a/gcc/loop-invariant.c
>>>>>> +++ b/gcc/loop-invariant.c
>>>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>>>>>     call.  */
>>>>>>
>>>>>>  static void
>>>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>>>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>>>>>> +                   bool always_executed)
>>>>>>  {
>>>>>>    rtx_insn *insn;
>>>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>>>>>> +
>>>>>> +  if (preheader->count > bb->count)
>>>>>> +    return;
>>>>>>
>>>>>>    FOR_BB_INSNS (bb, insn)
>>>>>>      {
>>>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>>>>>    unsigned i;
>>>>>>
>>>>>>    for (i = 0; i < loop->num_nodes; i++)
>>>>>> -    find_invariants_bb (body[i],
>>>>>> -                       bitmap_bit_p (always_reached, i),
>>>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>>>>>                         bitmap_bit_p (always_executed, i));
>>>>>>  }
>>>>>>
>>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>>>> index 4b187c2cdaf..655fab03442 100644
>>>>>> --- a/gcc/tree-ssa-loop-im.c
>>>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>>>>>    return ret;
>>>>>>  }
>>>>>>
>>>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
>>>>>> +
>>>>>> +static class loop *
>>>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>>>>> +                      basic_block curr_bb)
>>>>>> +{
>>>>>> +  class loop *cold_loop, *min_loop;
>>>>>> +  cold_loop = min_loop = outmost_loop;
>>>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>>>>>> +
>>>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
>>>>>
>>>>> Honza - can you comment on whether we should compare BB counts this way?
>>>>>
>>>>> I would suspect that for, say,
>>>>>
>>>>>   for (...)
>>>>>      if (a)
>>>>>        X;
>>>>>      else
>>>>>        Y;
>>>>>
>>>>> that the counts for X and Y will be less than that of the preheader of the loop
>>>>> only when the loop is estimated to run once.  That is, should we really compare
>>>>> the to the preheader count or maybe better to the _header_ count which
>>>>> would keep the number of iterations out of the equation?
>>>>
>>>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>>>> loop_preheader_edge (loop)->dest, it will cause many failures in
>>>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>>>> reasonable to compare the bb count with preheader count as both gimple lim
>>>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>>>> after analysis?
>>>
>>> Hmm, yeah - guess I was confused here.
>>>
>>>>>
>>>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
>>>>> compare a count to the "IPA hot", here we're comparing two counts
>>>>> within a function where it actually matters whether we use a<b or
>>>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
>>>>> function).
>>>>>
>>>>> Xionghu, you error on the side of not hoisting for unordered counts here
>>>>>
>>>>>> +    return NULL;
>>>>>> +
>>>>>> +  while (min_loop != loop)
>>>>>> +    {
>>>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
>>>>>
>>>>> but in the other direction here and on the side of not hoisting
>>>>> in ref_in_loop_hot_body.
>>>>>
>>>>> The three-state relational operator overloads are probably not the
>>>>> very best idea...
>>>>> (see profile-count.h for them)
>>>>>
>>>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
>>>> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
>>>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
>>>> will return true to continue perform store motion, both preserve the previous
>>>> behavior.
>>>
>>> Thanks.  But I don't think the abstraction as written is useful:
>>>
>>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
>>> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
>>> +   decided.  */
>>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
>>> +{
>>> +  if (count1 < count2)
>>> +    return true;
>>> +  else
>>> +    return false;
>>> +}
>>>
>>> given the following seems to pass the preheader count in place of the BB count.
>>>
>>> +      if (bb_colder_than_loop_preheader (
>>> +           loop_preheader_edge (min_loop)->src->count, min_count))
>>> +       cold_loop = min_loop;
>>>
>>> find_coldest_out_loop is also a bit weird, I think we want to find
>>> the outermost loop between outmost_loop and loop that has a
>>> lower count than the curr_bb count but
>>>
>>> +  while (min_loop != loop)
>>> +    {
>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>> +      if (bb_colder_than_loop_preheader (
>>> +           loop_preheader_edge (min_loop)->src->count, min_count))
>>> +       cold_loop = min_loop;
>>>
>>> compares the outermost loops count (min_count) against the preheader
>>> count?  So we're searching for a cold loop with respect to its enclosing loop
>>> here?
>>
>> Let me try to explain how it works :)
>>
>> find_coldest_out_loop does two steps check:
>> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold,
>> just return NULL which means it should not be moved out at all;
>> 2)  curr_bb is NOT cold, assuming the current loop L[m] is the coldest first,
>> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]},
>> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop
>> that has smallest profile_count.
>>
>>
>> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in
>> loop 3, if it is cold, just return NULL, otherwise select the coldest loop in
>> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to
>> be the target hoist loop.  The first check could AVOID hoist if curr_bb is colder
>> than loop3, but it is still hot than loop1 and loop2.  Not sure whether it is possible
>> to construct such cases?
>>
>>
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>
>> volatile int x;
>> void
>> bar (int, char *, char *);
>> void
>> foo (int *a, int n, int m, int s, int t)
>> {
>>   int i;
>>   int j;
>>   int k;
>>
>>   for (i = 0; i < m; i++)  // loop 1
>>     {
>>       if (__builtin_expect (x, 0))
>>         for (j = 0; j < n; j++)   // loop 2
>>           for (k = 0; k < n; k++)   // loop 3
>>            {
>>              bar (s / 5, "one", "two");  // curr_bb
>>              a[t] = s;
>>            }
>>       a[t] = t;  // curr_bb2
>>     }
>> }
>>
>> The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1)
>> with this patch.
>> There are totally 6 combinations when curr_bb is hotter than loop 3.  We need
>> to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness",
>> returning the coldest loop for this function find_coldest_out_loop, otherwise
>> unexpected behavior happens.
>>
>> L1 > L2 > L3   =>  return L3
>> L1 > L3 > L2   =>  return L2
>> L2 > L1 > L3   =>  return L3
>> L2 > L3 > L1   =>  return L1
>> L3 > L1 > L2   =>  return L2
>> L3 > L2 > L1   =>  return L1
>>
>> So bb_colder_than_loop_preheader does two kind of checks, one is checking
>> L3 preheader count with curr_bb count, another is checking L3 preheader count
>> with L1 preheader count, L2 preheader count, etc...
>>
>>
>> ssa-lim-19.c.138t.lim2:
>> ...
>>    <bb 10> [local count: 16057869]:  // L1 preheader
>> -  _4 = s_22(D) / 5;
>> -  _5 = (long unsigned int) t_24(D);
>> -  _6 = _5 * 4;
>> -  _7 = a_25(D) + _6;
>>    _8 = (long unsigned int) t_24(D);
>>    _9 = _8 * 4;
>>    _10 = a_25(D) + _9;
>>
>>    <bb 3> [local count: 145980626]:
>>    # i_34 = PHI <i_29(12), 0(10)>
>>    x.0_1 ={v} x;
>>    if (x.0_1 != 0)
>>      goto <bb 4>; [10.00%]
>>    else
>>      goto <bb 8>; [90.00%]
>>
>>    <bb 4> [local count: 14598063]:
>>    if (n_20(D) > 0)
>>      goto <bb 11>; [89.00%]
>>    else
>>      goto <bb 8>; [11.00%]
>>
>>    <bb 11> [local count: 12992276]:  // L2 preheader
>> +  _4 = s_22(D) / 5;
>> +  _5 = (long unsigned int) t_24(D);
>> +  _6 = _5 * 4;
>> +  _7 = a_25(D) + _6;
>>    goto <bb 7>; [100.00%]
>>
>>    <bb 14> [local count: 850510901]:
>>
>>    <bb 5> [local count: 955630225]:  // curr_bb
>>    # k_36 = PHI <k_27(14), 0(7)>
>>    bar (_4, "one", "two");
>>    *_7 = s_22(D);
>>    k_27 = k_36 + 1;
>>    if (n_20(D) > k_27)
>>      goto <bb 14>; [89.00%]
>>    else
>>      goto <bb 6>; [11.00%]
>>
>>    <bb 6> [local count: 118111600]:
>>    j_21 = j_35 + 1;
>>    if (n_20(D) > j_21)
>>      goto <bb 13>; [89.00%]
>>    else
>>      goto <bb 8>; [11.00%]
>>
>>    <bb 13> [local count: 105119324]:
>>
>>    <bb 7> [local count: 118111600]:   // L3 preheader
>>    # j_35 = PHI <j_21(13), 0(11)>
>>    goto <bb 5>; [100.00%]
>>
>>    <bb 8> [local count: 145980626]:
>>    *_10 = t_24(D);
>>    i_29 = i_34 + 1;
>>
>> Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:
>>
>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
>> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
>> +   decided.  */
>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
>> +{
>> +  if (count1 < count2)
>> +    return true;
>> +  else
>> +    return false;
>> +}
>> +
>> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count.
>> + */
>> +
>> +static class loop *
>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>> +                      basic_block curr_bb)
>> +{
>> +  class loop *cold_loop, *min_loop;
>> +  cold_loop = min_loop = outmost_loop;
>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>> +
>> +  /* If bb_colder_than_loop_preheader returns false due to three-state
>> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
>> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
>> +  if (curr_bb
>> +      && bb_colder_than_loop_preheader (curr_bb->count,
>> +                                       loop_preheader_edge (loop)->src->count))
>> +    return NULL;
>> +
>> +  while (min_loop != loop)
>> +    {
>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>> +      if (bb_colder_than_loop_preheader (
>> +           loop_preheader_edge (min_loop)->src->count, min_count))
>> +       cold_loop = min_loop;
>> +    }
>> +  return cold_loop;
>> +}
>> +
>>
>>
>>>
>>> Why is this function not simply
>>>
>>> +static class loop *
>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>> +                      basic_block curr_bb)
>>> +{
>>>      while (bb_colder_than_loop_preheader (curr_bb->count,
>>>                loop_preheader_edge (outermost_loop)->src->count))
>>>         {
>>>             if (outermost_loop == loop)
>>>               return NULL;
>>>             outermost_loop = superloop_at_depth (loop, loop_depth
>>> (outermost_loop) + 1);
>>>         }
>>>      return outermost_loop;
>>> }
>>
>> If change like this, when processing curr_bb(5), outermost_loop will
>> return loop 1 since curr_bb->count > Loop1_prehead->count, the while
>> loop stopped.  This doesn't meet what we want.
> 
> Why?  curr_bb is executed at least as often as loop1 preheader if
> we look at the counts?  So either the counts do not really tell us
> anything of help or I am missing something.  Are you merely
> looking for a block with a lower count on the path from the outermost
> loop entry to the block in question and deciding you do not want to
> hoist further than that?  So it's not about not hoisting to a hot place
> but instead hoist to the coldest place within a loop nest?
> 
> So we have
> 
>   for (i = 0; i < m; i++)  // loop 1
>     {
>       if (__builtin_expect (x, 0))
>         for (j = 0; j < n; j++)   // loop 2
> 
> 
>    <bb 10> [local count: 16057869]:  // L1 preheader
>        ...
>  <bb 3> [local count: 145980626]:
>    # i_34 = PHI <i_29(12), 0(10)>
>  ...
>    <bb 11> [local count: 12992276]:  // L2 preheader
>    ...
>     <bb 7> [local count: 118111600]:   // L3 preheader
>    # j_35 = PHI <j_21(13), 0(11)>
>    goto <bb 5>; [100.00%]
> 
> and we want to hoist to the L2 preheader because that's less frequently
> executed than the L1 preheader (which is less frequently executed
> than the L3 preheader or the block we are hoisting from).

Yes, this is exactly what I want, sorry for not describing it clearly before ;(

The updated patch [1] may reflect find_coldest_out_loop better:
it first checks whether curr_bb is hotter than its loop's preheader; if it is not,
NULL is returned, which means there is no need to hoist at all.  Then it finds the
*coldest* preheader within the loop nest, starting from outmost_loop, to hoist to.


[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html


+/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+   It does a two-step check:
+   1) Check whether CURR_BB is cold in its own loop_father; if it is cold, just
+   return NULL, which means it should not be moved out at all;
+   2) CURR_BB is NOT cold: set cold_loop to LOOP, then iteratively search the loops
+   from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}; if L[i] is colder
+   than L[cold_loop], reset cold_loop to L[i] until we get the loop that has the
+   smallest profile_count.  */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block curr_bb)
+{
+  class loop *cold_loop;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparison, OUTMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb,
+					loop_preheader_edge (loop)->src))
+    return NULL;
+
+  cold_loop = loop;
+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
+					 loop_preheader_edge (cold_loop)->src))
+	cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }
+  return cold_loop;
+}
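
Applied to the ssa-lim-19.c dump quoted above, find_coldest_out_loop (loop1, loop3,
bb 5) first sees that bb 5 (count 955630225) is not colder than the loop3 preheader
(count 118111600), so it does not return NULL; the walk then starts with
cold_loop = loop3, replaces it with loop1 (preheader count 16057869 < 118111600)
and then with loop2 (12992276 < 16057869), so loop2 is returned as the hoist
target, matching the lim2 dump.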


> 
> I'm concerned with compile-time complexity re-evaluating counts on the
> loop nest many times.  So it looks to me that we can pre-compute
> this lowest-preheader-count loop for a loop nest at least for the
> store-motion case where we know the outermost loop?
> 
> 

But the lowest-preheader-count loop may change for a loop/bb with different
outermost loop.  For example if,

L1_preheader_count < L2_preheader_count < L3_preheader_count < L4_preheader_count < curr_bb_count

then,

find_coldest_out_loop (L1, loop, curr_bb)  => coldest preheader loop is L1
find_coldest_out_loop (L2, loop, curr_bb)  => coldest preheader loop is L2

So it would be a 1:N map?  Pre-compute it in find_coldest_out_loop
and also save it in lim_data in a new variable
coldest_preheader_loop[outmost_loop][coldest_preheader_loop]?
Each call of find_coldest_out_loop would check whether that variable is set
already, and only continue the search if
coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set?
Seems a bit complicated, and I am not sure whether it helps to reduce the
compile-time complexity, or am I misunderstanding...
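
For the store-motion case you mention, where the outermost loop is fixed, one
possible shape (a rough sketch with made-up names, not code I have tested or
posted) would be a single walk over the loop tree that records, for every
subloop, the coldest loop on the path up to that outermost loop:

/* Sketch only: COLDEST is indexed by loop->num and is assumed to be sized to
   number_of_loops (cfun) by the caller.  Record for LOOP and all its subloops
   the coldest loop on the path up to the outermost loop, where COLDEST_SO_FAR
   is the coldest loop seen on the path so far.  */

static void
fill_coldest_out_loop (class loop *loop, class loop *coldest_so_far,
		       vec<class loop *> &coldest)
{
  if (bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
				     loop_preheader_edge (coldest_so_far)->src))
    coldest_so_far = loop;
  coldest[loop->num] = coldest_so_far;
  for (class loop *inner = loop->inner; inner; inner = inner->next)
    fill_coldest_out_loop (inner, coldest_so_far, coldest);
}

The store-motion query would then just look up coldest[loop_father->num] instead
of walking the superloop chain for every location, at the cost of one pass over
the loop tree per outermost loop.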
Jan Hubicka Oct. 27, 2021, 12:54 p.m. UTC | #16
> Hi,
> 
> On 2021/9/28 20:09, Richard Biener wrote:
> > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >> Update the patch to v3, not sure whether you prefer the paste style
> >> and continue to link the previous thread as Segher dislikes this...
> >>
> >>
> >> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>
> >>
> >> Changes:
> >> 1. Handle max_loop in determine_max_movement instead of
> >> outermost_invariant_loop.
> >> 2. Remove unnecessary changes.
> >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> >> infinite loop when implementing v1 and the iteration is missed to be
> >> updated actually.
> >>
> >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>
> >> There was a patch trying to avoid move cold block out of loop:
> >>
> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>
> >> Richard suggested to "never hoist anything from a bb with lower execution
> >> frequency to a bb with higher one in LIM invariantness_dom_walker
> >> before_dom_children".
> >>
> >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >> expected target loop, if profile count of the loop bb is colder
> >> than target loop preheader, it won't be hoisted out of loop.
> >> Likely for store motion, if all locations of the REF in loop is cold,
> >> don't do store motion of it.
> >>
> >> SPEC2017 performance evaluation shows 1% performance improvement for
> >> intrate GEOMEAN and no obvious regression for others.  Especially,
> >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >> on P8LE.
> >>
> >> gcc/ChangeLog:
> >>
> >>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>         before motion.
> >>         (find_invariants_body): Add argument.
> >>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>         (determine_max_movement): Use find_coldest_out_loop.
> >>         (move_computations_worker): Adjust and fix iteration udpate.
> >>         (execute_sm_exit): Check pointer validness.
> >>         (class ref_in_loop_hot_body): New functor.
> >>         (ref_in_loop_hot_body::operator): New.
> >>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >> ---
> >>  gcc/loop-invariant.c                       | 10 ++--
> >>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>
> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >> index fca0c2b24be..5c3be7bf0eb 100644
> >> --- a/gcc/loop-invariant.c
> >> +++ b/gcc/loop-invariant.c
> >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
> >>     call.  */
> >>
> >>  static void
> >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >> +                   bool always_executed)
> >>  {
> >>    rtx_insn *insn;
> >> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >> +
> >> +  if (preheader->count > bb->count)
> >> +    return;
> >>
> >>    FOR_BB_INSNS (bb, insn)
> >>      {
> >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
> >>    unsigned i;
> >>
> >>    for (i = 0; i < loop->num_nodes; i++)
> >> -    find_invariants_bb (body[i],
> >> -                       bitmap_bit_p (always_reached, i),
> >> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>                         bitmap_bit_p (always_executed, i));
> >>  }
> >>
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 4b187c2cdaf..655fab03442 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>    return ret;
> >>  }
> >>
> >> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
> >> +
> >> +static class loop *
> >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >> +                      basic_block curr_bb)
> >> +{
> >> +  class loop *cold_loop, *min_loop;
> >> +  cold_loop = min_loop = outmost_loop;
> >> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >> +
> >> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> > 
> > Honza - can you comment on whether we should compare BB counts this way?
> > 
> > I would suspect that for, say,
> > 
> >   for (...)
> >      if (a)
> >        X;
> >      else
> >        Y;
> > 
> > that the counts for X and Y will be less than that of the preheader of the loop
> > only when the loop is estimated to run once.  That is, should we really compare
> > the to the preheader count or maybe better to the _header_ count which
> > would keep the number of iterations out of the equation?
> 
> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> loop_preheader_edge (loop)->dest, it will cause many failures in
> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> reasonable to compare the bb count with preheader count as both gimple lim
> and RTL loop-invariant move instructions to *preheader* instead of *header*
> after analysis?

I am not quite sure I understand what you are shooting for.  But if you have
a loop invariant inside a loop nest and you get a range of loops in the nest
where you want to move it, you want to pick the cheaper preheader count,
since the statement is going to be executed there.

For
> >   for (...)
> >      if (a)
> >        X;
> >      else
> >        Y;

You may have the frequency of X less than that of the preheader, i.e. when
the probability that a is true is lower than the expected iteration count.

If I understand correctly, you want to compare the sum of counts of all
BBs where the invariant is currently evaluated to the minimal count of the
preheader where you can move it.

If you have

for A
  for B
    for C
      invariant_computation

Usually you want to move it:

invariant_computation
for A
  for B
    for C

However, if for B usually iterates 0 times, it may happen that the preheader
of for C is executed less often than the preheaders of for A/B and you want:

for A
  for B
    invariant_computation
    for C
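
To make that concrete, here is a small illustrative C nest (not from the
patch; the function name and the profile assumption are made up) where the
inner preheader is the cold spot:

/* Illustration only: assume profile data says m is almost always 0, so the
   body of loop B, and with it the preheader of loop C, is almost never
   reached.  Hoisting x * y in front of loop A would execute it on every
   call, while placing it just before loop C (inside loop B) executes it
   almost never, which is the preferred spot.  */
int
example (int n, int m, int x, int y, int *a)
{
  int s = 0;
  for (int i = 0; i < n; i++)       /* loop A */
    for (int j = 0; j < m; j++)     /* loop B, usually 0 iterations */
      for (int k = 0; k < n; k++)   /* loop C */
        s += a[k] + x * y;          /* the invariant computation */
  return s;
}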
> 
> > 
> > If we look at maybe_hot_count_p that's a quite sophisticated thing to
> > compare a count to the "IPA hot", here we're comparing two counts
> > within a function where it actually matters whether we use a<b or
> > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> > function).
> > 
> > Xionghu, you error on the side of not hoisting for unordered counts here
> > 
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> > 
> > but in the other direction here and on the side of not hoisting
> > in ref_in_loop_hot_body.
> > 
> > The three-state relational operator overloads are probably not the
> > very best idea...
> > (see profile-count.h for them)

In the first version of the patch I had
 count1.known_le (count2)
which however made the code look quite ugly, and eventually I convinced
myself that three-state comparators are less pain than hard-to-read
conditionals...

But I guess we can encapsulate them when it makes the code easier to read.  I
would be OK with having known_XY comparator variants in profile-count.h.
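
For illustration only, such a variant might look roughly like the sketch
below (hypothetical, not existing profile-count.h API; the exact guard would
have to follow the comparability rules there):

/* Hypothetical sketch of a known_lt variant: report "less than" only when
   both counts are initialized, so an unordered comparison never silently
   decides the hoisting direction.  */
inline bool
known_lt (profile_count a, profile_count b)
{
  if (!a.initialized_p () || !b.initialized_p ())
    return false;
  return a < b;
}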

Honza
Xionghu Luo Oct. 28, 2021, 1:49 a.m. UTC | #17
On 2021/10/27 20:54, Jan Hubicka wrote:
>> Hi,
>>
>> On 2021/9/28 20:09, Richard Biener wrote:
>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> Update the patch to v3, not sure whether you prefer the paste style
>>>> and continue to link the previous thread as Segher dislikes this...
>>>>
>>>>
>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>>>
>>>>
>>>> Changes:
>>>> 1. Handle max_loop in determine_max_movement instead of
>>>> outermost_invariant_loop.
>>>> 2. Remove unnecessary changes.
>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>>>> infinite loop when implementing v1 and the iteration is missed to be
>>>> updated actually.
>>>>
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>>> expected target loop, if profile count of the loop bb is colder
>>>> than target loop preheader, it won't be hoisted out of loop.
>>>> Likely for store motion, if all locations of the REF in loop is cold,
>>>> don't do store motion of it.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>>>         before motion.
>>>>         (find_invariants_body): Add argument.
>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>>>         (determine_max_movement): Use find_coldest_out_loop.
>>>>         (move_computations_worker): Adjust and fix iteration udpate.
>>>>         (execute_sm_exit): Check pointer validness.
>>>>         (class ref_in_loop_hot_body): New functor.
>>>>         (ref_in_loop_hot_body::operator): New.
>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>>>> ---
>>>>  gcc/loop-invariant.c                       | 10 ++--
>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>>>
>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>>>> index fca0c2b24be..5c3be7bf0eb 100644
>>>> --- a/gcc/loop-invariant.c
>>>> +++ b/gcc/loop-invariant.c
>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>>>     call.  */
>>>>
>>>>  static void
>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>>>> +                   bool always_executed)
>>>>  {
>>>>    rtx_insn *insn;
>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>>>> +
>>>> +  if (preheader->count > bb->count)
>>>> +    return;
>>>>
>>>>    FOR_BB_INSNS (bb, insn)
>>>>      {
>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>>>    unsigned i;
>>>>
>>>>    for (i = 0; i < loop->num_nodes; i++)
>>>> -    find_invariants_bb (body[i],
>>>> -                       bitmap_bit_p (always_reached, i),
>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>>>                         bitmap_bit_p (always_executed, i));
>>>>  }
>>>>
>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>> index 4b187c2cdaf..655fab03442 100644
>>>> --- a/gcc/tree-ssa-loop-im.c
>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>>>    return ret;
>>>>  }
>>>>
>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
>>>> +
>>>> +static class loop *
>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>>> +                      basic_block curr_bb)
>>>> +{
>>>> +  class loop *cold_loop, *min_loop;
>>>> +  cold_loop = min_loop = outmost_loop;
>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>>>> +
>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
>>>
>>> Honza - can you comment on whether we should compare BB counts this way?
>>>
>>> I would suspect that for, say,
>>>
>>>   for (...)
>>>      if (a)
>>>        X;
>>>      else
>>>        Y;
>>>
>>> that the counts for X and Y will be less than that of the preheader of the loop
>>> only when the loop is estimated to run once.  That is, should we really compare
>>> the to the preheader count or maybe better to the _header_ count which
>>> would keep the number of iterations out of the equation?
>>
>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>> loop_preheader_edge (loop)->dest, it will cause many failures in
>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>> reasonable to compare the bb count with preheader count as both gimple lim
>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>> after analysis?
> 
> I am not quite sure I understand what you shoot for.  But if you have
> loop invariant inside a loop nest and you get range of loops in the nest
> where you want to move it, you want to pick chepaer preheader count,
> since the statement is going to be executed there.
> 
> For
>>>   for (...)
>>>      if (a)
>>>        X;
>>>      else
>>>        Y;
> 
> You may have frequency of X less then preheader i.e. when probability
> that a is true is lower than the expected iteration count.
> 
> If I understand correctly, you want to compare sum of counts of all
> BBs where invariant evaulates currently to the minimal count of
> preheader where you can move it.
> 
> If you have
> 
> for A
>   for B
>     for C
>       invariant_computation
> 
> Usually you want to move it:
> 
> invariant_computation
> for A
>   for B
>     for C
> 
> However if for B usually iterates 0 times, it may happen that preheader
> of for C is executed less often then preheaders of for A/B and you want:
> 
> for A
>   for B
>     invariant_computation
>     for C

Thanks, this is what I am trying to do in both gimple lim and RTL loop-invariant motion. 

In gimple lim, the newly added function find_coldest_out_loop[1] checks whether
invariant_computation is hotter than C_preheader; if so, it finds the coldest
preheader starting from the outermost loop of the nest, and if B is the coldest
it resets the outmost_loop to B, which avoids hoisting cold statements into hot
loops and so reduces execution counts.  The gimple-only change could improve
500.perlbench_r and 548.exchange2_r a bit[2].

The RTL patch needs only a small check like the one below; it could improve
500.perlbench_r performance by ~8% [2] on at least Power and aarch64.


[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580110.html


This is the other patch Richard and I are expecting your review on :)


From 468e0b252a6b4a8b648c4a49850ed337ab5e03e1 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luoxhu@linux.ibm.com>
Date: Fri, 8 Oct 2021 22:05:39 -0500
Subject: [PATCH v4 2/2] loop-invariant: Don't move cold bb instructions to preheader in RTL

gcc/ChangeLog:

	* loop-invariant.c (find_invariants_bb): Check profile count
	before motion.
	(find_invariants_body): Add argument.
---
 gcc/loop-invariant.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index fca0c2b24be..5c3be7bf0eb 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
    call.  */
 
 static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+		    bool always_executed)
 {
   rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;
 
   FOR_BB_INSNS (bb, insn)
     {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
   unsigned i;
 
   for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-			bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
 			bitmap_bit_p (always_executed, i));
 }
Richard Biener Oct. 29, 2021, 11:48 a.m. UTC | #18
On Wed, Oct 27, 2021 at 4:40 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/10/26 21:20, Richard Biener wrote:
> > On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 2021/10/15 16:11, Richard Biener wrote:
> >>> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 2021/9/28 20:09, Richard Biener wrote:
> >>>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>>>>>
> >>>>>> Update the patch to v3, not sure whether you prefer the paste style
> >>>>>> and continue to link the previous thread as Segher dislikes this...
> >>>>>>
> >>>>>>
> >>>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>>>>>
> >>>>>>
> >>>>>> Changes:
> >>>>>> 1. Handle max_loop in determine_max_movement instead of
> >>>>>> outermost_invariant_loop.
> >>>>>> 2. Remove unnecessary changes.
> >>>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> >>>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> >>>>>> infinite loop when implementing v1 and the iteration is missed to be
> >>>>>> updated actually.
> >>>>>>
> >>>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >>>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>>>>>
> >>>>>> There was a patch trying to avoid move cold block out of loop:
> >>>>>>
> >>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>>>>>
> >>>>>> Richard suggested to "never hoist anything from a bb with lower execution
> >>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
> >>>>>> before_dom_children".
> >>>>>>
> >>>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >>>>>> expected target loop, if profile count of the loop bb is colder
> >>>>>> than target loop preheader, it won't be hoisted out of loop.
> >>>>>> Likely for store motion, if all locations of the REF in loop is cold,
> >>>>>> don't do store motion of it.
> >>>>>>
> >>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
> >>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
> >>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >>>>>> on P8LE.
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>>>>>         before motion.
> >>>>>>         (find_invariants_body): Add argument.
> >>>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>>>>>         (determine_max_movement): Use find_coldest_out_loop.
> >>>>>>         (move_computations_worker): Adjust and fix iteration udpate.
> >>>>>>         (execute_sm_exit): Check pointer validness.
> >>>>>>         (class ref_in_loop_hot_body): New functor.
> >>>>>>         (ref_in_loop_hot_body::operator): New.
> >>>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>>>>>
> >>>>>> gcc/testsuite/ChangeLog:
> >>>>>>
> >>>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >>>>>> ---
> >>>>>>  gcc/loop-invariant.c                       | 10 ++--
> >>>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>>>>>
> >>>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >>>>>> index fca0c2b24be..5c3be7bf0eb 100644
> >>>>>> --- a/gcc/loop-invariant.c
> >>>>>> +++ b/gcc/loop-invariant.c
> >>>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
> >>>>>>     call.  */
> >>>>>>
> >>>>>>  static void
> >>>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> >>>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >>>>>> +                   bool always_executed)
> >>>>>>  {
> >>>>>>    rtx_insn *insn;
> >>>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >>>>>> +
> >>>>>> +  if (preheader->count > bb->count)
> >>>>>> +    return;
> >>>>>>
> >>>>>>    FOR_BB_INSNS (bb, insn)
> >>>>>>      {
> >>>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
> >>>>>>    unsigned i;
> >>>>>>
> >>>>>>    for (i = 0; i < loop->num_nodes; i++)
> >>>>>> -    find_invariants_bb (body[i],
> >>>>>> -                       bitmap_bit_p (always_reached, i),
> >>>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>>>>>                         bitmap_bit_p (always_executed, i));
> >>>>>>  }
> >>>>>>
> >>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >>>>>> index 4b187c2cdaf..655fab03442 100644
> >>>>>> --- a/gcc/tree-ssa-loop-im.c
> >>>>>> +++ b/gcc/tree-ssa-loop-im.c
> >>>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>>>>>    return ret;
> >>>>>>  }
> >>>>>>
> >>>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
> >>>>>> +
> >>>>>> +static class loop *
> >>>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >>>>>> +                      basic_block curr_bb)
> >>>>>> +{
> >>>>>> +  class loop *cold_loop, *min_loop;
> >>>>>> +  cold_loop = min_loop = outmost_loop;
> >>>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >>>>>> +
> >>>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> >>>>>
> >>>>> Honza - can you comment on whether we should compare BB counts this way?
> >>>>>
> >>>>> I would suspect that for, say,
> >>>>>
> >>>>>   for (...)
> >>>>>      if (a)
> >>>>>        X;
> >>>>>      else
> >>>>>        Y;
> >>>>>
> >>>>> that the counts for X and Y will be less than that of the preheader of the loop
> >>>>> only when the loop is estimated to run once.  That is, should we really compare
> >>>>> the to the preheader count or maybe better to the _header_ count which
> >>>>> would keep the number of iterations out of the equation?
> >>>>
> >>>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> >>>> loop_preheader_edge (loop)->dest, it will cause many failures in
> >>>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> >>>> reasonable to compare the bb count with preheader count as both gimple lim
> >>>> and RTL loop-invariant move instructions to *preheader* instead of *header*
> >>>> after analysis?
> >>>
> >>> Hmm, yeah - guess I was confused here.
> >>>
> >>>>>
> >>>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
> >>>>> compare a count to the "IPA hot", here we're comparing two counts
> >>>>> within a function where it actually matters whether we use a<b or
> >>>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> >>>>> function).
> >>>>>
> >>>>> Xionghu, you error on the side of not hoisting for unordered counts here
> >>>>>
> >>>>>> +    return NULL;
> >>>>>> +
> >>>>>> +  while (min_loop != loop)
> >>>>>> +    {
> >>>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >>>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >>>>>
> >>>>> but in the other direction here and on the side of not hoisting
> >>>>> in ref_in_loop_hot_body.
> >>>>>
> >>>>> The three-state relational operator overloads are probably not the
> >>>>> very best idea...
> >>>>> (see profile-count.h for them)
> >>>>>
> >>>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
> >>>> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
> >>>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
> >>>> will return true to continue perform store motion, both preserve the previous
> >>>> behavior.
> >>>
> >>> Thanks.  But I don't think the abstraction as written is useful:
> >>>
> >>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> >>> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> >>> +   decided.  */
> >>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> >>> +{
> >>> +  if (count1 < count2)
> >>> +    return true;
> >>> +  else
> >>> +    return false;
> >>> +}
> >>>
> >>> given the following seems to pass the preheader count in place of the BB count.
> >>>
> >>> +      if (bb_colder_than_loop_preheader (
> >>> +           loop_preheader_edge (min_loop)->src->count, min_count))
> >>> +       cold_loop = min_loop;
> >>>
> >>> find_coldest_out_loop is also a bit weird, I think we want to find
> >>> the outermost loop between outmost_loop and loop that has a
> >>> lower count than the curr_bb count but
> >>>
> >>> +  while (min_loop != loop)
> >>> +    {
> >>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >>> +      if (bb_colder_than_loop_preheader (
> >>> +           loop_preheader_edge (min_loop)->src->count, min_count))
> >>> +       cold_loop = min_loop;
> >>>
> >>> compares the outermost loops count (min_count) against the preheader
> >>> count?  So we're searching for a cold loop with respect to its enclosing loop
> >>> here?
> >>
> >> Let me try to explain how it works :)
> >>
> >> find_coldest_out_loop does two steps check:
> >> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold,
> >> just return NULL which means it should not be moved out at all;
> >> 2)  curr_bb is NOT cold, assuming the current loop L[m] is the coldest first,
> >> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]},
> >> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop
> >> that has smallest profile_count.
> >>
> >>
> >> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in
> >> loop 3, if it is cold, just return NULL, otherwise select the coldest loop in
> >> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to
> >> be the target hoist loop.  The first check could AVOID hoist if curr_bb is colder
> >> than loop3, but it is still hot than loop1 and loop2.  Not sure whether it is possible
> >> to construct such cases?
> >>
> >>
> >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>
> >> volatile int x;
> >> void
> >> bar (int, char *, char *);
> >> void
> >> foo (int *a, int n, int m, int s, int t)
> >> {
> >>   int i;
> >>   int j;
> >>   int k;
> >>
> >>   for (i = 0; i < m; i++)  // loop 1
> >>     {
> >>       if (__builtin_expect (x, 0))
> >>         for (j = 0; j < n; j++)   // loop 2
> >>           for (k = 0; k < n; k++)   // loop 3
> >>            {
> >>              bar (s / 5, "one", "two");  // curr_bb
> >>              a[t] = s;
> >>            }
> >>       a[t] = t;  // curr_bb2
> >>     }
> >> }
> >>
> >> The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1)
> >> with this patch.
> >> There are totally 6 combinations when curr_bb is hotter than loop 3.  We need
> >> to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness",
> >> returning the coldest loop for this function find_coldest_out_loop, otherwise
> >> unexpected behavior happens.
> >>
> >> L1 > L2 > L3   =>  return L3
> >> L1 > L3 > L2   =>  return L2
> >> L2 > L1 > L3   =>  return L3
> >> L2 > L3 > L1   =>  return L1
> >> L3 > L1 > L2   =>  return L2
> >> L3 > L2 > L1   =>  return L1
> >>
> >> So bb_colder_than_loop_preheader does two kind of checks, one is checking
> >> L3 preheader count with curr_bb count, another is checking L3 preheader count
> >> with L1 preheader count, L2 preheader count, etc...
> >>
> >>
> >> ssa-lim-19.c.138t.lim2:
> >> ...
> >>    <bb 10> [local count: 16057869]:  // L1 preheader
> >> -  _4 = s_22(D) / 5;
> >> -  _5 = (long unsigned int) t_24(D);
> >> -  _6 = _5 * 4;
> >> -  _7 = a_25(D) + _6;
> >>    _8 = (long unsigned int) t_24(D);
> >>    _9 = _8 * 4;
> >>    _10 = a_25(D) + _9;
> >>
> >>    <bb 3> [local count: 145980626]:
> >>    # i_34 = PHI <i_29(12), 0(10)>
> >>    x.0_1 ={v} x;
> >>    if (x.0_1 != 0)
> >>      goto <bb 4>; [10.00%]
> >>    else
> >>      goto <bb 8>; [90.00%]
> >>
> >>    <bb 4> [local count: 14598063]:
> >>    if (n_20(D) > 0)
> >>      goto <bb 11>; [89.00%]
> >>    else
> >>      goto <bb 8>; [11.00%]
> >>
> >>    <bb 11> [local count: 12992276]:  // L2 preheader
> >> +  _4 = s_22(D) / 5;
> >> +  _5 = (long unsigned int) t_24(D);
> >> +  _6 = _5 * 4;
> >> +  _7 = a_25(D) + _6;
> >>    goto <bb 7>; [100.00%]
> >>
> >>    <bb 14> [local count: 850510901]:
> >>
> >>    <bb 5> [local count: 955630225]:  // curr_bb
> >>    # k_36 = PHI <k_27(14), 0(7)>
> >>    bar (_4, "one", "two");
> >>    *_7 = s_22(D);
> >>    k_27 = k_36 + 1;
> >>    if (n_20(D) > k_27)
> >>      goto <bb 14>; [89.00%]
> >>    else
> >>      goto <bb 6>; [11.00%]
> >>
> >>    <bb 6> [local count: 118111600]:
> >>    j_21 = j_35 + 1;
> >>    if (n_20(D) > j_21)
> >>      goto <bb 13>; [89.00%]
> >>    else
> >>      goto <bb 8>; [11.00%]
> >>
> >>    <bb 13> [local count: 105119324]:
> >>
> >>    <bb 7> [local count: 118111600]:   // L3 preheader
> >>    # j_35 = PHI <j_21(13), 0(11)>
> >>    goto <bb 5>; [100.00%]
> >>
> >>    <bb 8> [local count: 145980626]:
> >>    *_10 = t_24(D);
> >>    i_29 = i_34 + 1;
> >>
> >> Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:
> >>
> >> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> >> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> >> +   decided.  */
> >> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> >> +{
> >> +  if (count1 < count2)
> >> +    return true;
> >> +  else
> >> +    return false;
> >> +}
> >> +
> >> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count.
> >> + */
> >> +
> >> +static class loop *
> >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >> +                      basic_block curr_bb)
> >> +{
> >> +  class loop *cold_loop, *min_loop;
> >> +  cold_loop = min_loop = outmost_loop;
> >> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >> +
> >> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> >> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
> >> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> >> +  if (curr_bb
> >> +      && bb_colder_than_loop_preheader (curr_bb->count,
> >> +                                       loop_preheader_edge (loop)->src->count))
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (bb_colder_than_loop_preheader (
> >> +           loop_preheader_edge (min_loop)->src->count, min_count))
> >> +       cold_loop = min_loop;
> >> +    }
> >> +  return cold_loop;
> >> +}
> >> +
> >>
> >>
> >>>
> >>> Why is this function not simply
> >>>
> >>> +static class loop *
> >>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >>> +                      basic_block curr_bb)
> >>> +{
> >>>      while (bb_colder_than_loop_preheader (curr_bb->count,
> >>>                loop_preheader_edge (outermost_loop)->src->count))
> >>>         {
> >>>             if (outermost_loop == loop)
> >>>               return NULL;
> >>>             outermost_loop = superloop_at_depth (loop, loop_depth
> >>> (outermost_loop) + 1);
> >>>         }
> >>>      return outermost_loop;
> >>> }
> >>
> >> If change like this, when processing curr_bb(5), outermost_loop will
> >> return loop 1 since curr_bb->count > Loop1_prehead->count, the while
> >> loop stopped.  This doesn't meet what we want.
> >
> > Why?  curr_bb is executed at least as often as loop1 preheader if
> > we look at the counts?  So either the counts do not really tell us
> > anything of help or I am missing something.  Are you merely
> > looking for a block with a lower count on the path from the outermost
> > loop entry to the block in question and deciding you do not want to
> > hoist further than that?  So it's not about not hoisting to a hot place
> > but instead hoist to the coldest place within a loop nest?
> >
> > So we have
> >
> >   for (i = 0; i < m; i++)  // loop 1
> >     {
> >       if (__builtin_expect (x, 0))
> >         for (j = 0; j < n; j++)   // loop 2
> >
> >
> >    <bb 10> [local count: 16057869]:  // L1 preheader
> >        ...
> >  <bb 3> [local count: 145980626]:
> >    # i_34 = PHI <i_29(12), 0(10)>
> >  ...
> >    <bb 11> [local count: 12992276]:  // L2 preheader
> >    ...
> >     <bb 7> [local count: 118111600]:   // L3 preheader
> >    # j_35 = PHI <j_21(13), 0(11)>
> >    goto <bb 5>; [100.00%]
> >
> > and we want to hoist to the L2 preheader because that's less frequently
> > executed than the L1 preheader (which is less frequently executed
> > than the L3 preheader or the block we are hoisting from).
>
> Yes, this is exactly what I want, sorry for not describe it clear before ;(

OK, thanks for confirming ;)

> The updated patch[1] may reflect find_coldest_out_loop better:
> It first check whether curr_bb is hotter than it's preheader, if false, return NULL
> which means no need hoist at all; Then find a *coldest* preheader to hoist
> within a loop nest from outmost_loop.
>
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
>
> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
> +   It does two steps check:
> +   1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> +   return NULL which means it should not be moved out at all;
> +   2)  CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search loops
> +   from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder
> +   than L[cold_loop], reset cold_loop to L[i] until get the loop that has
> +   smallest profile_count.  */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block curr_bb)
> +{
> +  class loop *cold_loop;
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb,
> +                                       loop_preheader_edge (loop)->src))
> +    return NULL;
> +
> +  cold_loop = loop;
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
> +                                        loop_preheader_edge (cold_loop)->src))
> +       cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
> +  return cold_loop;
> +}
>
>
> >
> > I'm concerned with compile-time complexity re-evaluating counts on the
> > loop nest many times.  So it looks to me that we can pre-compute
> > this lowest-preheader-count loop for a loop nest at least for the
> > store-motion case where we know the outermost loop?
> >
> >
>
> But the lowest-preheader-count loop may change for a loop/bb with different
> outermost loop.  For example if,
>
> L1_preheader_count < L2_preheader_count < L3_preheader_count < L4_preheader_count < curr_bb_count
>
> then,
>
> find_coldest_out_loop (L1, loop, curr_bb)  => coldest preheader loop is L1
> find_coldest_out_loop (L2, loop, curr_bb)  => coldest preheader loop is L2
>
> So it will be a 1:N map?

I'm talking about the can_sm_ref_p call, in that context 'loop' will
be the outermost loop of
interest, and we are calling this for all stores in a loop.  We're doing

+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  return find_coldest_out_loop (l, inner_loop, curr_bb);

for each location the ref is accessed and the intent was to see
whether there's at least one
that we would like to move to 'loop'.  Indeed since we only know the
common outer loop
but not the inner loop we are hoisting from, there's not a single "coldest"
loop to cache and so
any caching we might want to perform could be applied to the other case as well.

I suppose the most natural thing to cache is for each loop the outer loop where
its outer loop preheader would be hotter than the outer loops preheader so that

+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge
(outmost_loop)->src,
+                                        loop_preheader_edge (cold_loop)->src))
+       cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }

could be instead written as

  coldest_loop = coldest_outermost_loop[loop->num];
  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
    return outermost_loop;
  return coldest_loop;

?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
It should be possible to compute such cache in a DFS walk of the loop tree
(the loop iterator by default visits in such order).
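
As a rough sketch of that pre-computation (illustrative only, not the posted
patch; the function name is made up and the array name follows the snippet
above), a single walk over the loop tree could propagate the coldest ancestor
down to each loop:

/* Sketch: for each loop record the loop with the coldest preheader on the
   path from the loop tree root down to it.  COLDEST_SO_FAR is the answer
   for the parent; the loop itself wins if its preheader is colder.  A later
   query then only needs the depth comparison shown above.  */
static void
fill_coldest_cache (class loop *coldest_so_far, class loop *loop)
{
  if (coldest_so_far == NULL
      || (loop_preheader_edge (loop)->src->count
	  < loop_preheader_edge (coldest_so_far)->src->count))
    coldest_so_far = loop;
  coldest_outermost_loop[loop->num] = coldest_so_far;
  for (class loop *inner = loop->inner; inner; inner = inner->next)
    fill_coldest_cache (coldest_so_far, inner);
}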

>  Pre-compute it in find_coldest_out_loop
> and save it also in lim_data with a new variable
> coldest_preheader_loop[outmost_loop][coldest_preheader_loop]?
> each call of find_coldest_out_loop will check whether that variable is set
> already, only continue the search if
> coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set?
> Seems a bit complicated and not sure whether it helps to reduce
> compile-time complexity or I am misunderstanding...
>
>
> --
> Thanks,
> Xionghu
Xionghu Luo Nov. 3, 2021, 6:49 a.m. UTC | #19
On 2021/10/29 19:48, Richard Biener wrote:
> I'm talking about the can_sm_ref_p call, in that context 'loop' will
> be the outermost loop of
> interest, and we are calling this for all stores in a loop.  We're doing
> 
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return find_coldest_out_loop (l, inner_loop, curr_bb);
> 
> for each location the ref is accessed and the intent was to see
> whether there's at least one
> that we would like to move to 'loop'.  Indeed since we only know the
> common outer loop
> but not the inner we are hosting from there's not a single "coldest"
> loop to cache and so
> any caching we might want to perform could be applied to the other case as well.
> 
> I suppose the most natural thing to cache is for each loop the outer loop where
> its outer loop preheader would be hotter than the outer loops preheader so that
> 
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> (outmost_loop)->src,
> +                                        loop_preheader_edge (cold_loop)->src))
> +       cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
> 
> could be instead written as
> 
>   coldest_loop = coldest_outermost_loop[loop->num];
>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>     return outermost_loop;
>   return coldest_loop;
> 
> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> It should be possible to compute such cache in a DFS walk of the loop tree
> (the loop iterator by default visits in such order).


Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
conforms to your comments.  Though the patch passed all my added tests (the
coverage is not sufficient), I am still a bit worried about the case where the
pre-computed coldest_loop is outside of outermost_loop but outermost_loop is
not the COLDEST LOOP, i.e. (outer->inner)

 [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],

then the function find_coldest_out_loop will return a loop that does not accord
with our expectation: it should return second_coldest_loop instead of
outermost_loop?
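
For instance, with made-up preheader counts along that chain of, say,
coldest_loop = 10, outermost_loop = 50, second_coldest_loop = 20 and
loop = 100, loop_depth (coldest_loop) < loop_depth (outermost_loop) holds, so
the depth check in get_coldest_out_loop falls back to outermost_loop
(preheader count 50) even though second_coldest_loop (preheader count 20)
would be the colder valid target.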


Changes:
1. Add function fill_coldest_out_loop to pre-compute the coldest
outermost loop for each loop.
2. Rename find_coldest_out_loop to get_coldest_out_loop.
3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.

v5 changes:
1. Refine comments for new functions.
2. Use basic_block instead of count in bb_colder_than_loop_preheader
to align with function name.
3. Refine with simpler implementation for get_coldest_out_loop and
ref_in_loop_hot_body::operator for better understanding.

v4 changes:
1. Sort out profile_count comparison to function bb_cold_than_loop_preheader.
2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
3. Split RTL invariant motion part out.
4. Remove aux changes.

v3 changes:
1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
infinite loop when implementing v1 and the iteration is missed to be
updated actually.

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add get_coldest_out_loop to move invariants to
expected target loop, if profile count of the loop bb is colder
than target loop preheader, it won't be hoisted out of loop.
Likewise for store motion, if all locations of the REF in the loop are cold,
don't do store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
	function.
	(get_coldest_out_loop): New function.
	(determine_max_movement): Use get_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration update.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.
	(fill_coldest_out_loop): New.
	(loop_invariant_motion_in_fun): Call fill_coldest_out_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 111 ++++++++++++++++++++-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 ++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 +++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 ++++++
 7 files changed, 251 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 4b187c2cdaf..d3390385fd9 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -146,6 +146,9 @@ public:
 enum dep_kind { lim_raw, sm_war, sm_waw };
 enum dep_state { dep_unknown, dep_independent, dep_dependent };
 
+/* coldest outermost loop for given loop.  */
+class loop **coldest_outermost_loop;
+
 /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
 
 static void
@@ -417,6 +420,43 @@ movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Compare the profile count inequality of BB and PREHEADER; it is three-state
+   as stated in profile-count.h.  FALSE is returned if the inequality cannot
+   be decided.  */
+bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
+{
+  gcc_assert (bb && preheader);
+  return bb->count < preheader->count;
+}
+
+/* Check coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+   It does a two-step check:
+   1) Check whether CURR_BB is cold in its own loop_father; if it is cold, just
+   return NULL, which means it should not be moved out at all;
+   2) CURR_BB is NOT cold; check whether the pre-computed COLDEST_LOOP is
+   outside of OUTMOST_LOOP.  */
+
+static class loop *
+get_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		      basic_block curr_bb)
+{
+  gcc_assert (outmost_loop == loop || flow_loop_nested_p (outmost_loop, loop));
+  class loop *cold_loop;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparison, OUTMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb,
+					loop_preheader_edge (loop)->src))
+    return NULL;
+
+  class loop *coldest_loop = coldest_outermost_loop[loop->num];
+  if (loop_depth (coldest_loop) < loop_depth (outmost_loop))
+    return outmost_loop;
+  return coldest_loop;
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
     level = superloop_at_depth (loop, 1);
-  lim_data->max_loop = level;
+  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
+  if (!lim_data->max_loop)
+    return false;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
@@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -2887,6 +2932,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
   return indep_p;
 }
 
+class ref_in_loop_hot_body
+{
+public:
+  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
+  bool operator () (mem_ref_loc *loc);
+  class loop *l;
+};
+
+/* Check the coldest loop between loop L and the innermost loop.  If there is
+   one cold loop between L and INNER_LOOP, store motion can be performed;
+   otherwise, no cold loop means no store motion.  get_coldest_out_loop also
+   handles the case when L is INNER_LOOP.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  return get_coldest_out_loop (l, inner_loop, curr_bb);
+}
+
 
 /* Returns true if we can perform store motion of REF from LOOP.  */
 
@@ -2941,6 +3006,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!ref_indep_loop_p (loop, ref, sm_war))
     return false;
 
+  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
+    candidate's profile count is hot.  A statement in a cold BB shouldn't be
+    moved out of its loop_father.  */
+  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+    return false;
+
   return true;
 }
 
@@ -3153,6 +3224,34 @@ fill_always_executed_in (void)
     fill_always_executed_in_1 (loop, contains_call);
 }
 
+/* Find the coldest loop preheader from the loop tree root to LOOP.  Set LOOP
+   to cold_loop, then iteratively search loops from {L[outmost_loop],
+   L[outmost_loop+1], ... L[loop]}; if L[i] is colder than L[cold_loop], reset
+   cold_loop to L[i] until we get the loop that has the smallest profile_count.
+   Then recursively set each inner loop.  */
+
+void
+fill_coldest_out_loop (class loop *loop)
+{
+  class loop *outmost_loop = current_loops->tree_root->inner;
+  class loop *cold_loop = loop;
+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (
+	    loop_preheader_edge (outmost_loop)->src,
+	    loop_preheader_edge (cold_loop)->src))
+	cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }
+
+  coldest_outermost_loop[loop->num] = cold_loop;
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n",
+		 loop->num, cold_loop->num);
+
+  for (loop = loop->inner; loop; loop = loop->next)
+    fill_coldest_out_loop (loop);
+}
 
 /* Compute the global information needed by the loop invariant motion pass.  */
 
@@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+
+  free (coldest_outermost_loop);
 }
 
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
@@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
 
+  /* Pre-compute coldest outermost loop of each loop.  */
+  class loop *loop;
+  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
+  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
+    fill_coldest_out_loop (loop);
+
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
new file mode 100644
index 00000000000..7326a230b3f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	bar (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..51c1913d003
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  for (k = 0; k < n; k++) // Loop 3
+	    {
+	      bar (s / 5, "one", "two");
+	      a[t] = s;
+	    }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..bc60a040a70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,25 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of loop when bb is cold.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+    if (__builtin_expect (x, 0))
+      count++;
+
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
new file mode 100644
index 00000000000..ffe6f8f699d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
@@ -0,0 +1,35 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `data' and `data1' are not hoisted out of the inner loop and the
+   outer loop when they are in a cold loop.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  int data1;
+  struct obj *next;
+};
+
+void
+func (int m, int n, int k, struct obj *a)
+{
+  struct obj *q = a;
+  for (int j = 0; j < m; j++)
+    if (__builtin_expect (m, 0))
+      for (int i = 0; i < m; i++)
+	{
+	  if (__builtin_expect (x, 0))
+	    {
+	      count++;
+	      q->data += 3; /* Not hoisted out to inner loop. */
+	    }
+	  count += n;
+	  q->data1 += k; /* Not hoisted out to outer loop. */
+	}
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
new file mode 100644
index 00000000000..16ba4ceb8ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+volatile int y;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  if (__builtin_expect (y, 0))
+	    for (k = 0; k < n; k++) // Loop 3
+	      {
+		bar (s / 5, "one", "two");
+		a[t] = s;
+	      }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
+
Xionghu Luo Nov. 3, 2021, 1:29 p.m. UTC | #20
On 2021/10/29 19:48, Richard Biener wrote:
> I'm talking about the can_sm_ref_p call, in that context 'loop' will
> be the outermost loop of
> interest, and we are calling this for all stores in a loop.  We're doing
> 
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return find_coldest_out_loop (l, inner_loop, curr_bb);
> 
> for each location the ref is accessed and the intent was to see
> whether there's at least one
> that we would like to move to 'loop'.  Indeed since we only know the
> common outer loop
> but not the inner we are hosting from there's not a single "coldest"
> loop to cache and so
> any caching we might want to perform could be applied to the other case as well.
> 
> I suppose the most natural thing to cache is for each loop the outer loop where
> its outer loop preheader would be hotter than the outer loops preheader so that
> 
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> (outmost_loop)->src,
> +                                        loop_preheader_edge (cold_loop)->src))
> +       cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
> 
> could be instead written as
> 
>   coldest_loop = coldest_outermost_loop[loop->num];
>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>     return outermost_loop;
>   return coldest_loop;
> 
> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> It should be possible to compute such cache in a DFS walk of the loop tree
> (the loop iterator by default visits in such order).



Thanks.  I updated the patch with your suggestion; I am not sure whether it
strictly conforms to your comments.  Though the patch passes all of my added
tests (the coverage is not exhaustive), I am still a bit worried about the case
where the pre-computed coldest_loop is outside of outermost_loop but
outermost_loop is not the COLDEST loop, i.e. (outer->inner)

 [loop tree root, coldest_loop, outermost_loop, ..., second_coldest_loop, ..., loop],

then function find_coldest_out_loop will return a loop that does NOT accord with
our expectation: shouldn't it return second_coldest_loop instead of outermost_loop?
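
To make the concern concrete, here is the cached lookup from the patch's
get_coldest_out_loop with that nesting annotated (illustration only; the
depths are hypothetical and outmost_loop is the outermost_loop of the example):

  /* Under [root, coldest_loop, outmost_loop, ..., second_coldest_loop,
     ..., loop], coldest_loop lies outside outmost_loop ...  */
  class loop *coldest_loop = coldest_outermost_loop[loop->num];
  if (loop_depth (coldest_loop) < loop_depth (outmost_loop))
    /* ... so this branch is taken and outmost_loop is returned, even
       though second_coldest_loop would be the colder hoisting target
       still inside outmost_loop.  */
    return outmost_loop;
  return coldest_loop;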


Changes:
1. Add function fill_coldest_out_loop to pre-compute the coldest
outermost loop for each loop.
2. Rename find_coldest_out_loop to get_coldest_out_loop.
3. Add testcase ssa-lim-22.c to differentiate it from ssa-lim-19.c.

v5 changes:
1. Refine comments for new functions.
2. Use basic_block instead of count in bb_colder_than_loop_preheader
to align with function name.
3. Simplify the implementation of get_coldest_out_loop and
ref_in_loop_hot_body::operator () for better readability.

v4 changes:
1. Factor the profile_count comparison out into function bb_cold_than_loop_preheader.
2. Update ref_in_loop_hot_body::operator () to find cold_loop before comparing.
3. Split RTL invariant motion part out.
4. Remove aux changes.

v3 changes:
1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
infinite loop when implementing v1 and the iteration is missed to be
updated actually.
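
A sketch of why the explicit advance is needed (the loop skeleton follows
move_computations_worker; the unrelated details are elided):

  /* The statement walk deliberately has no increment expression, so every
     `continue' path must advance BSI itself, or the walk would revisit the
     same statement forever.  */
  for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
    {
      gimple *stmt = gsi_stmt (bsi);

      if (gimple_code (stmt) == GIMPLE_COND)
	{
	  gsi_next (&bsi);	/* Advance before skipping the conditional.  */
	  continue;
	}

      /* ... statements that are moved are removed with gsi_remove, which
	 already advances BSI to the next statement.  */
    }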

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In the gimple LIM analysis, add get_coldest_out_loop to pick the expected
target loop for invariants: if the profile count of the loop bb is colder
than the target loop's preheader, the statement won't be hoisted out of the
loop.  Likewise for store motion: if all locations of the REF in the loop
are cold, don't perform store motion on it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
	function.
	(get_coldest_out_loop): New function.
	(determine_max_movement): Use get_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration update.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.
	(fill_coldest_out_loop): New.
	(loop_invariant_motion_in_fun): Call fill_coldest_out_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 111 ++++++++++++++++++++-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 ++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 +++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 ++++++
 7 files changed, 251 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 4b187c2cdaf..d3390385fd9 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -146,6 +146,9 @@ public:
 enum dep_kind { lim_raw, sm_war, sm_waw };
 enum dep_state { dep_unknown, dep_independent, dep_dependent };
 
+/* Coldest outermost loop for a given loop.  */
+class loop **coldest_outermost_loop;
+
 /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
 
 static void
@@ -417,6 +420,43 @@ movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Compare the profile counts of BB and PREHEADER.  The comparison is
+   three-state as stated in profile-count.h; FALSE is returned if the
+   inequality cannot be decided.  */
+bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
+{
+  gcc_assert (bb && preheader);
+  return bb->count < preheader->count;
+}
+
+/* Check the coldest loop between OUTMOST_LOOP and LOOP by comparing profile
+   counts.  It does a two-step check:
+   1) Check whether CURR_BB is cold in its own loop_father; if it is cold,
+   just return NULL, which means it should not be moved out at all;
+   2) If CURR_BB is NOT cold, check whether the pre-computed COLDEST_LOOP is
+   outside of OUTMOST_LOOP.  */
+
+static class loop *
+get_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		      basic_block curr_bb)
+{
+  gcc_assert (outmost_loop == loop || flow_loop_nested_p (outmost_loop, loop));
+  class loop *cold_loop;
+
+  /* If bb_colder_than_loop_preheader returns false due to the three-state
+    comparison, OUTMOST_LOOP is eventually returned to preserve the behavior.
+    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb,
+					loop_preheader_edge (loop)->src))
+    return NULL;
+
+  class loop *coldest_loop = coldest_outermost_loop[loop->num];
+  if (loop_depth (coldest_loop) < loop_depth (outmost_loop))
+    return outmost_loop;
+  return coldest_loop;
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
     level = superloop_at_depth (loop, 1);
-  lim_data->max_loop = level;
+  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
+  if (!lim_data->max_loop)
+    return false;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
@@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -2887,6 +2932,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
   return indep_p;
 }
 
+class ref_in_loop_hot_body
+{
+public:
+  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
+  bool operator () (mem_ref_loc *loc);
+  class loop *l;
+};
+
+/* Check the coldest loop between loop L and the innermost loop.  If there is
+   a cold loop between L and INNER_LOOP, store motion can be performed;
+   otherwise, no cold loop means no store motion.  get_coldest_out_loop also
+   handles the case when L is INNER_LOOP.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  return get_coldest_out_loop (l, inner_loop, curr_bb);
+}
+
 
 /* Returns true if we can perform store motion of REF from LOOP.  */
 
@@ -2941,6 +3006,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!ref_indep_loop_p (loop, ref, sm_war))
     return false;
 
+  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
+    candidate's profile count is hot.  A statement in a cold BB shouldn't be
+    moved out of its loop_father.  */
+  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+    return false;
+
   return true;
 }
 
@@ -3153,6 +3224,34 @@ fill_always_executed_in (void)
     fill_always_executed_in_1 (loop, contains_call);
 }
 
+/* Find the coldest loop preheader from the loop tree root down to LOOP.  Set
+   cold_loop to LOOP, then iteratively search the loops {L[outmost_loop],
+   L[outmost_loop+1], ..., L[loop]}; if L[i] is colder than L[cold_loop], reset
+   cold_loop to L[i], until the loop with the smallest profile_count is found.
+   Then recursively handle each inner loop.  */
+
+void
+fill_coldest_out_loop (class loop *loop)
+{
+  class loop *outmost_loop = current_loops->tree_root->inner;
+  class loop *cold_loop = loop;
+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (
+	    loop_preheader_edge (outmost_loop)->src,
+	    loop_preheader_edge (cold_loop)->src))
+	cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }
+
+  coldest_outermost_loop[loop->num] = cold_loop;
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n",
+		 loop->num, cold_loop->num);
+
+  for (loop = loop->inner; loop; loop = loop->next)
+    fill_coldest_out_loop (loop);
+}
 
 /* Compute the global information needed by the loop invariant motion pass.  */
 
@@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+
+  free (coldest_outermost_loop);
 }
 
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
@@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
 
+  /* Pre-compute coldest outermost loop of each loop.  */
+  class loop *loop;
+  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
+  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
+    fill_coldest_out_loop (loop);
+
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
new file mode 100644
index 00000000000..7326a230b3f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	bar (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..51c1913d003
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  for (k = 0; k < n; k++) // Loop 3
+	    {
+	      bar (s / 5, "one", "two");
+	      a[t] = s;
+	    }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..bc60a040a70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,25 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of the loop when its bb is cold.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+    if (__builtin_expect (x, 0))
+      count++;
+
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
new file mode 100644
index 00000000000..ffe6f8f699d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
@@ -0,0 +1,35 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `data' and 'data1' are not hoisted out of the inner and outer
+   loops when they are in a cold loop.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  int data1;
+  struct obj *next;
+};
+
+void
+func (int m, int n, int k, struct obj *a)
+{
+  struct obj *q = a;
+  for (int j = 0; j < m; j++)
+    if (__builtin_expect (m, 0))
+      for (int i = 0; i < m; i++)
+	{
+	  if (__builtin_expect (x, 0))
+	    {
+	      count++;
+	      q->data += 3; /* Not hoisted out to inner loop. */
+	    }
+	  count += n;
+	  q->data1 += k; /* Not hoisted out to outer loop. */
+	}
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
new file mode 100644
index 00000000000..16ba4ceb8ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+volatile int y;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  if (__builtin_expect (y, 0))
+	    for (k = 0; k < n; k++) // Loop 3
+	      {
+		bar (s / 5, "one", "two");
+		a[t] = s;
+	      }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
+
Richard Biener Nov. 4, 2021, 1 p.m. UTC | #21
On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/10/29 19:48, Richard Biener wrote:
> > I'm talking about the can_sm_ref_p call, in that context 'loop' will
> > be the outermost loop of
> > interest, and we are calling this for all stores in a loop.  We're doing
> >
> > +bool
> > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> > +{
> > +  basic_block curr_bb = gimple_bb (loc->stmt);
> > +  class loop *inner_loop = curr_bb->loop_father;
> > +  return find_coldest_out_loop (l, inner_loop, curr_bb);
> >
> > for each location the ref is accessed and the intent was to see
> > whether there's at least one
> > that we would like to move to 'loop'.  Indeed since we only know the
> > common outer loop
> > but not the inner we are hosting from there's not a single "coldest"
> > loop to cache and so
> > any caching we might want to perform could be applied to the other case as well.
> >
> > I suppose the most natural thing to cache is for each loop the outer loop where
> > its outer loop preheader would be hotter than the outer loops preheader so that
> >
> > +  while (outmost_loop != loop)
> > +    {
> > +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> > (outmost_loop)->src,
> > +                                        loop_preheader_edge (cold_loop)->src))
> > +       cold_loop = outmost_loop;
> > +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> > +    }
> >
> > could be instead written as
> >
> >   coldest_loop = coldest_outermost_loop[loop->num];
> >   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >     return outermost_loop;
> >   return coldest_loop;
> >
> > ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> > It should be possible to compute such cache in a DFS walk of the loop tree
> > (the loop iterator by default visits in such order).
>
>
>
> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>
>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>
> then function find_coldest_out_loop will return a loop NOT accord with our
> expectation, that should return second_coldest_loop instead of outermost_loop?

Hmm, interesting - yes.  I guess the common case will be that the pre-computed
outermost loop will be the loop at depth 1, since outer loops tend to be colder
than inner loops?  That would then defeat the whole exercise.

To optimize the common case, while not avoiding iteration in the cases we care
about, we could instead cache the next outermost loop that is _not_ colder
than loop.  So for your [ ... ] example above we'd have
hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
candidate would then be 'second_coldest_loop' and we'd then iterate
to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
cold candidate we can compare against?  For the common case we'd
have hotter_than_inner_loop[loop] == NULL (no such loop) and we then
simply pick 'outermost_loop'.
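
For illustration, the lookup could then look like the following sketch (the
hotter_than_inner_loop[] array and its exact semantics are assumptions taken
from the paragraph above, not a final implementation):

  /* Assumed cache: hotter_than_inner_loop[L->num] is the innermost loop
     outside L whose preheader is not colder than L's preheader, or NULL
     if there is no such loop.  */
  class loop *hotter = hotter_than_inner_loop[loop->num];

  /* Common case: every loop from outermost_loop down to loop has a colder
     preheader than loop, so simply pick outermost_loop.  */
  if (hotter == NULL || loop_depth (hotter) < loop_depth (outermost_loop))
    return outermost_loop;

  /* Otherwise the loop just inside hotter on the path down to loop (the
     second_coldest_loop of the example) is the first candidate; further
     candidates would be found by chasing hotter_than_inner_loop[] from
     hotter and comparing preheader counts.  This sketch stops at the
     first candidate.  */
  return superloop_at_depth (loop, loop_depth (hotter) + 1);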

One comment on the patch itself below.

>
>
> Changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
>
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
>
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
>
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
>
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> gcc/ChangeLog:
>
>         * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
>         function.
>         (get_coldest_out_loop): New function.
>         (determine_max_movement): Use get_coldest_out_loop.
>         (move_computations_worker): Adjust and fix iteration udpate.
>         (class ref_in_loop_hot_body): New functor.
>         (ref_in_loop_hot_body::operator): New.
>         (can_sm_ref_p): Use for_all_locs_in_loop.
>         (fill_coldest_out_loop): New.
>         (loop_invariant_motion_in_fun): Call fill_coldest_out_loop.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> ---
>  gcc/tree-ssa-loop-im.c                     | 111 ++++++++++++++++++++-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 ++++++
>  7 files changed, 251 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..d3390385fd9 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,9 @@ public:
>  enum dep_kind { lim_raw, sm_war, sm_waw };
>  enum dep_state { dep_unknown, dep_independent, dep_dependent };
>
> +/* coldest outermost loop for given loop.  */
> +class loop **coldest_outermost_loop;
> +
>  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
>
>  static void
> @@ -417,6 +420,43 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>
> +/* Compare the profile count inequality of bb and preheader, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
> +{
> +  gcc_assert (bb && preheader);
> +  return bb->count < preheader->count;
> +}
> +
> +/* Check coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
> +   It does two steps check:
> +   1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> +   return NULL which means it should not be moved out at all;
> +   2)  CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> +   OUTMOST_LOOP.  */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                     basic_block curr_bb)
> +{
> +  gcc_assert (outmost_loop == loop || flow_loop_nested_p (outmost_loop, loop));
> +  class loop *cold_loop;
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb,
> +                                       loop_preheader_edge (loop)->src))
> +    return NULL;
> +
> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> +  if (loop_depth (coldest_loop) < loop_depth (outmost_loop))
> +    return outmost_loop;
> +  return coldest_loop;
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
>      level = superloop_at_depth (loop, 1);
> -  lim_data->max_loop = level;
> +  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
> +  if (!lim_data->max_loop)
> +    return false;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>      {
> @@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> @@ -2887,6 +2932,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>    return indep_p;
>  }
>
> +class ref_in_loop_hot_body
> +{
> +public:
> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> +  bool operator () (mem_ref_loc *loc);
> +  class loop *l;
> +};
> +
> +/* Check the coldest loop between loop L and innermost loop.  If there is one
> +   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
> +   no cold loop means no store motion.  get_coldest_out_loop also handles cases
> +   when l is inner_loop.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return get_coldest_out_loop (l, inner_loop, curr_bb);
> +}
> +
>
>  /* Returns true if we can perform store motion of REF from LOOP.  */
>
> @@ -2941,6 +3006,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    if (!ref_indep_loop_p (loop, ref, sm_war))
>      return false;
>
> +  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
> +    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
> +    out of it's loop_father.  */
> +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +    return false;
> +
>    return true;
>  }
>
> @@ -3153,6 +3224,34 @@ fill_always_executed_in (void)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>
> +/* Find the coldest loop preheader from loop tree root to LOOP.  Set LOOP to
> +   cold_loop, then iteratively search loops from {L[outmost_loop],
> +   L[outmost_loop+1], ... L[loop]}, if L[i] is colder than L[cold_loop], reset
> +   cold_loop to L[i] until get the loop that has smallest profile_count.
> +   Then recursively set each inner loop.  */
> +
> +void
> +fill_coldest_out_loop (class loop *loop)
> +{
> +  class loop *outmost_loop = current_loops->tree_root->inner;

That should be superloop_at_depth (loop, 1); otherwise it's wrong
when the function has more than one loop at the outermost level.
I was also hoping to avoid this loop by passing down the current
coldest loop as the single one to compare against.

But with the above discussion things will look different anyway I guess.
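
A minimal sketch of that suggestion against the line quoted above (illustrative
only, not a reviewed change):

  -  class loop *outmost_loop = current_loops->tree_root->inner;
  +  class loop *outmost_loop = superloop_at_depth (loop, 1);

For a loop already at depth 1, superloop_at_depth (loop, 1) yields the loop
itself, so the while loop below still terminates immediately in that case.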

> +  class loop *cold_loop = loop;
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (outmost_loop)->src,
> +           loop_preheader_edge (cold_loop)->src))
> +       cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
> +
> +  coldest_outermost_loop[loop->num] = cold_loop;
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n",
> +                loop->num, cold_loop->num);
> +
> +  for (loop = loop->inner; loop; loop = loop->next)
> +    fill_coldest_out_loop (loop);
> +}
>
>  /* Compute the global information needed by the loop invariant motion pass.  */
>
> @@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void)
>      free_affine_expand_cache (&memory_accesses.ttae_cache);
>
>    free (bb_loop_postorder);
> +
> +  free (coldest_outermost_loop);
>  }
>
>  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> @@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
>    fill_always_executed_in ();
>
> +  /* Pre-compute coldest outermost loop of each loop.  */
> +  class loop *loop;
> +  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> +  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> +    fill_coldest_out_loop (loop);
> +
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> new file mode 100644
> index 00000000000..7326a230b3f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       bar (k / 5, "one", "two");
> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..51c1913d003
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++) // Loop 2
> +         for (k = 0; k < n; k++) // Loop 3
> +           {
> +             bar (s / 5, "one", "two");
> +             a[t] = s;
> +           }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..bc60a040a70
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `count' is not hoisted out of loop when bb is cold.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  struct obj *next;
> +
> +} *q;
> +
> +void
> +func (int m)
> +{
> +  struct obj *p;
> +  for (int i = 0; i < m; i++)
> +    if (__builtin_expect (x, 0))
> +      count++;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> new file mode 100644
> index 00000000000..ffe6f8f699d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
> +   when it is in cold loop.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  int data1;
> +  struct obj *next;
> +};
> +
> +void
> +func (int m, int n, int k, struct obj *a)
> +{
> +  struct obj *q = a;
> +  for (int j = 0; j < m; j++)
> +    if (__builtin_expect (m, 0))
> +      for (int i = 0; i < m; i++)
> +       {
> +         if (__builtin_expect (x, 0))
> +           {
> +             count++;
> +             q->data += 3; /* Not hoisted out to inner loop. */
> +           }
> +         count += n;
> +         q->data1 += k; /* Not hoisted out to outer loop. */
> +       }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> new file mode 100644
> index 00000000000..16ba4ceb8ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +volatile int y;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++) // Loop 2
> +         if (__builtin_expect (y, 0))
> +           for (k = 0; k < n; k++) // Loop 3
> +             {
> +               bar (s / 5, "one", "two");
> +               a[t] = s;
> +             }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> +
> --
> 2.27.0.90.geebb51ba8c
>
>
>

Patch

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index bdc7b59dd5f..7b5d64d11f9 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@  find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
    call.  */
 
 static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+		    bool always_executed)
 {
   rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;
 
   FOR_BB_INSNS (bb, insn)
     {
@@ -1214,8 +1219,7 @@  find_invariants_body (class loop *loop, basic_block *body,
   unsigned i;
 
   for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-			bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
 			bitmap_bit_p (always_executed, i));
 }
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@  float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 7de47edbcb3..2bfb5e8ec15 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -1147,6 +1147,61 @@  move_computations_worker (basic_block bb)
 	  continue;
 	}
 
+      edge e = loop_preheader_edge (level);
+      if (e->src->count > bb->count)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
+		       e->src->index, bb->index);
+	      print_gimple_stmt (dump_file, stmt, 0);
+	      fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
+		       level->num);
+	    }
+	  gsi_next (&bsi);
+	  continue;
+	}
+      else
+	{
+	  unsigned i;
+	  bool skip_phi_move = false;
+	  for (i = 0; i < gimple_phi_num_args (stmt); i++)
+	    {
+	      tree def = PHI_ARG_DEF (stmt, i);
+
+	      if (TREE_CODE (def) != SSA_NAME)
+		continue;
+
+	      gimple *def_stmt = SSA_NAME_DEF_STMT (def);
+
+	      if (!gimple_bb (def_stmt))
+		continue;
+
+	      if (!dominated_by_p (CDI_DOMINATORS, e->src,
+				   gimple_bb (def_stmt)))
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    {
+		      fprintf (dump_file,
+			       "PHI node NOT moved to %d [local count:%d] from "
+			       "%d [local count:%d]:\n",
+			       e->src->index, e->src->count.value (), bb->index,
+			       bb->count.value ());
+		      print_gimple_stmt (dump_file, stmt, 0);
+		      fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
+			       level->num);
+		    }
+		  skip_phi_move = true;
+		  break;
+		}
+	    }
+	  if (skip_phi_move)
+	    {
+	      gsi_next (&bsi);
+	      continue;
+	    }
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
 	  fprintf (dump_file, "Moving PHI node\n");
@@ -1184,14 +1239,13 @@  move_computations_worker (basic_block bb)
 	  tree lhs = gimple_assign_lhs (new_stmt);
 	  SSA_NAME_RANGE_INFO (lhs) = NULL;
 	}
-      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
+      gsi_insert_on_edge (e, new_stmt);
       remove_phi_node (&bsi, false);
     }
 
   for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
     {
       edge e;
-
       gimple *stmt = gsi_stmt (bsi);
 
       lim_data = get_lim_data (stmt);
@@ -1214,7 +1268,90 @@  move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
+
+      e = loop_preheader_edge (level);
+      if (e->src->count > bb->count)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file,
+		       "stmt: Statement NOT moved to %d [local count:%d] from "
+		       "%d [local count:%d]:\n",
+		       e->src->index, e->src->count.value (), bb->index,
+		       bb->count.value ());
+	      print_gimple_stmt (dump_file, stmt, 0);
+	      fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
+		       level->num);
+	    }
+	  gsi_next (&bsi);
+	  continue;
+	}
+      else
+	{
+	  if (is_gimple_assign (stmt))
+	    {
+	      tree rhs1 = gimple_assign_rhs1 (stmt);
+	      tree rhs2 = gimple_assign_rhs2 (stmt);
+	      if (TREE_CODE (rhs1) == MEM_REF)
+		{
+		  rhs2 = TREE_OPERAND (rhs1, 1);
+		  rhs1 = TREE_OPERAND (rhs1, 0);
+		}
+	      gimple *stmt1 = NULL, *stmt2 = NULL;
+	      basic_block def_bb;
+	      if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
+		{
+		  stmt1 = SSA_NAME_DEF_STMT (rhs1);
+		  def_bb = gimple_bb (stmt1);
+		  if (stmt1
+		      && def_bb
+		      && (def_bb == bb
+			  || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
+		    {
+		      if (dump_file && (dump_flags & TDF_DETAILS))
+			{
+			  fprintf (dump_file,
+				   "stmt1: Statement NOT moved to %d [local "
+				   "count:%d] from %d [local count:%d]:\n",
+				   e->src->index, e->src->count.value (),
+				   bb->index, bb->count.value ());
+			  print_gimple_stmt (dump_file, stmt, 0);
+			  fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
+				   cost, level->num);
+			}
+		      gsi_next (&bsi);
+		      continue;
+		    }
+		}
+	      if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
+		{
+		  stmt2 = SSA_NAME_DEF_STMT (rhs2);
+		  def_bb = gimple_bb (stmt2);
+		  if (stmt2 && def_bb
+		      && (def_bb == bb
+			  || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
+		    {
+		      if (dump_file && (dump_flags & TDF_DETAILS))
+			{
+			  fprintf (dump_file,
+				   "stmt2: Statement NOT moved to %d [local "
+				   "count:%d] from %d [local count:%d]:\n",
+				   e->src->index, e->src->count.value (),
+				   bb->index, bb->count.value ());
+			  print_gimple_stmt (dump_file, stmt, 0);
+			  fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
+				   cost, level->num);
+			}
+		      gsi_next (&bsi);
+		      continue;
+		    }
+		}
+	    }
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -1224,7 +1361,6 @@  move_computations_worker (basic_block bb)
 		   cost, level->num);
 	}
 
-      e = loop_preheader_edge (level);
       gcc_assert (!gimple_vdef (stmt));
       if (gimple_vuse (stmt))
 	{
@@ -2094,6 +2230,19 @@  execute_sm (class loop *loop, im_mem_ref *ref,
   bool multi_threaded_model_p = false;
   gimple_stmt_iterator gsi;
   sm_aux *aux = new sm_aux;
+  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
+
+  edge e = loop_preheader_edge (loop);
+  if (e->src->count > bb->count)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Don't execute store motion of ");
+	  print_generic_expr (dump_file, ref->mem.ref);
+	  fprintf (dump_file, " from loop %d\n", loop->num);
+	}
+      return;
+    }
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -2202,7 +2351,12 @@  execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
 	}
       else
 	{
-	  sm_aux *aux = *aux_map.get (ref);
+	  sm_aux **paux = aux_map.get (ref);
+	  sm_aux *aux;
+	  if (paux)
+	    aux = *paux;
+	  else
+	    continue;
 	  if (!aux->store_flag || kind == sm_ord)
 	    {
 	      gassign *store;
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..4cae82936b9 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -577,14 +577,17 @@  split_loop (class loop *loop1)
 	if (!initial_true)
 	  cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond); 
 
+	edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
+			   ? EDGE_SUCC (bbs[i], 0)
+			   : EDGE_SUCC (bbs[i], 1);
 	/* Now version the loop, placing loop2 after loop1 connecting
 	   them, and fix up SSA form for that.  */
 	initialize_original_copy_tables ();
 	basic_block cond_bb;
 
 	class loop *loop2 = loop_version (loop1, cond, &cond_bb,
-					   profile_probability::always (),
-					   profile_probability::always (),
+					   true_edge->probability,
+					   true_edge->probability.invert (),
 					   profile_probability::always (),
 					   profile_probability::always (),
 					   true);
@@ -1486,8 +1489,8 @@  do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
   initialize_original_copy_tables ();
 
   struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
-				     profile_probability::always (),
-				     profile_probability::never (),
+				     invar_branch->probability,
+				     invar_branch->probability.invert (),
 				     profile_probability::always (),
 				     profile_probability::always (),
 				     true);
@@ -1530,6 +1533,9 @@  do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
   to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
   to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
 
+  to_loop1->probability = invar_branch->probability.invert ();
+  to_loop2->probability = invar_branch->probability;
+
   /* Due to introduction of a control flow edge from loop1 latch to loop2
      pre-header, we should update PHIs in loop2 to reflect this connection
      between loop1 and loop2.  */