Message ID | 20210802050501.159058-1-luoxhu@linux.ibm.com |
---|---|
State | New |
Series | [RFC] Don't move cold code out of loop by checking bb count |
On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>
> There was a patch trying to avoid moving cold blocks out of loops:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> This patch does this profile count check in both gimple LIM
> move_computations_worker and RTL loop-invariant.c find_invariants_bb:
> if the loop bb is colder than the loop preheader, don't hoist it out of
> the loop.
>
> Also, the profile count in the loop split pass should be corrected to avoid
> mismatched behavior between lim2 and lim4.  Currently, the new loop
> preheader generated by loop_version is set to "[count: 0]:", so lim4 after
> the lsplit pass will unexpectedly move a statement out of the loop when
> lim2 didn't move it.  This change could fix the regression on 544.nab_r
> from -1.55% to +0.46%.
>
> SPEC2017 performance evaluation shows a 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  In particular,
> 500.perlbench_r +7.52% (perf shows function S_regtry of perlbench is
> largely improved), 548.exchange2_r +1.98%, and 526.blender_r +1.00%
> on P8LE.
>
> Regression and bootstrap tests pass on P8LE; any comments?  Thanks.

While I'm not familiar with the RTL invariant motion pass, the patch there
looks reasonable.  Note that we should assess the profile quality
somehow - I'm not sure how to do that, CCed Honza for that.

For the GIMPLE part the patch looks quite complicated - but note it
probably has to be since LIM performs kind of a "CSE" on loads
(and stores for store-motion), so when there are multiple stmts
affected by a hoisting decision the biggest block count has to be
accounted.  Likewise when there are dependent stmts involved
that might include conditional stmts (a "PHI"), but the overall
cost should be looked at.
Now - GIMPLE LIM "costing" is somewhat backward right now
and it isn't set up to consider those multiple involved stmts.  Plus
the store-motion part does not have any cost part (but it depends
on previously decided invariant motions).

I think the way you implemented the check will cause no hoisting
to be performed instead of, say, hoisting to a different loop level
only.  Possibly shown when you consider a loop nest like

  for (;;)
    if (unlikely_cond)
      for (;;)
         invariant;

we want to hoist 'invariant' but only from the inner loop even if it
is invariant also in the outer loop.  But for example if there is
a store motion opportunity like

  for (;;)
     {
        if (unlikely_cond)
          for (;;)
            a = ...;
        a = ...;
     }

we'd still want to perform the store motion on the outer loop.

Note that store-motion already performs part of the transform
before dependent code is moved in move_computations (that
you patched).

IIRC your main concern was the COND_EXPRs we insert
for hoisted conditional stmts?

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * loop-invariant.c (find_invariants_bb): Check profile count
>         before motion.
>         (find_invariants_body): Add argument.
>         * tree-ssa-loop-im.c (move_computations_worker): Check profile
>         count before motion.
>         (execute_sm): Likewise.
>         (execute_sm_exit): Check pointer validness.
>         * tree-ssa-loop-split.c (split_loop): Correct probability.
>         (do_split_loop_on_cond): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> ---
>  gcc/loop-invariant.c                    |  10 +-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
>  gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
>  gcc/tree-ssa-loop-split.c               |  14 +-
>  4 files changed, 177 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index bdc7b59dd5f..7b5d64d11f9 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>     call.  */
>
>  static void
> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> +                    bool always_executed)
>  {
>    rtx_insn *insn;
> +  basic_block preheader = loop_preheader_edge (loop)->src;
> +
> +  if (preheader->count > bb->count)
> +    return;
>
>    FOR_BB_INSNS (bb, insn)
>      {
> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>    unsigned i;
>
>    for (i = 0; i < loop->num_nodes; i++)
> -    find_invariants_bb (body[i],
> -                        bitmap_bit_p (always_reached, i),
> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>                          bitmap_bit_p (always_executed, i));
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>    F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 7de47edbcb3..2bfb5e8ec15 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
>            continue;
>          }
>
> +      edge e = loop_preheader_edge (level);
> +      if (e->src->count > bb->count)
> +        {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            {
> +              fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
> +                       e->src->index, bb->index);
> +              print_gimple_stmt (dump_file, stmt, 0);
> +              fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> +                       level->num);
> +            }
> +          gsi_next (&bsi);
> +          continue;
> +        }
> +      else
> +        {
> +          unsigned i;
> +          bool skip_phi_move = false;
> +          for (i = 0; i < gimple_phi_num_args (stmt); i++)
> +            {
> +              tree def = PHI_ARG_DEF (stmt, i);
> +
> +              if (TREE_CODE (def) != SSA_NAME)
> +                continue;
> +
> +              gimple *def_stmt = SSA_NAME_DEF_STMT (def);
> +
> +              if (!gimple_bb (def_stmt))
> +                continue;
> +
> +              if (!dominated_by_p (CDI_DOMINATORS, e->src,
> +                                   gimple_bb (def_stmt)))
> +                {
> +                  if (dump_file && (dump_flags & TDF_DETAILS))
> +                    {
> +                      fprintf (dump_file,
> +                               "PHI node NOT moved to %d [local count:%d] from "
> +                               "%d [local count:%d]:\n",
> +                               e->src->index, e->src->count.value (), bb->index,
> +                               bb->count.value ());
> +                      print_gimple_stmt (dump_file, stmt, 0);
> +                      fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> +                               level->num);
> +                    }
> +                  skip_phi_move = true;
> +                  break;
> +                }
> +            }
> +          if (skip_phi_move)
> +            {
> +              gsi_next (&bsi);
> +              continue;
> +            }
> +        }
> +
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
>            fprintf (dump_file, "Moving PHI node\n");
> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
>            tree lhs = gimple_assign_lhs (new_stmt);
>            SSA_NAME_RANGE_INFO (lhs) = NULL;
>          }
> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> +      gsi_insert_on_edge (e, new_stmt);
>        remove_phi_node (&bsi, false);
>      }
>
>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>      {
>        edge e;
> -
>        gimple *stmt = gsi_stmt (bsi);
>
>        lim_data = get_lim_data (stmt);
> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>           placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -        continue;
> +        {
> +          gsi_next (&bsi);
> +          continue;
> +        }
> +
> +      e = loop_preheader_edge (level);
> +      if (e->src->count > bb->count)
> +        {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            {
> +              fprintf (dump_file,
> +                       "stmt: Statement NOT moved to %d [local count:%d] from "
> +                       "%d [local count:%d]:\n",
> +                       e->src->index, e->src->count.value (), bb->index,
> +                       bb->count.value ());
> +              print_gimple_stmt (dump_file, stmt, 0);
> +              fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> +                       level->num);
> +            }
> +          gsi_next (&bsi);
> +          continue;
> +        }
> +      else
> +        {
> +          if (is_gimple_assign (stmt))
> +            {
> +              tree rhs1 = gimple_assign_rhs1 (stmt);
> +              tree rhs2 = gimple_assign_rhs2 (stmt);
> +              if (TREE_CODE (rhs1) == MEM_REF)
> +                {
> +                  rhs2 = TREE_OPERAND (rhs1, 1);
> +                  rhs1 = TREE_OPERAND (rhs1, 0);
> +                }
> +              gimple *stmt1 = NULL, *stmt2 = NULL;
> +              basic_block def_bb;
> +              if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
> +                {
> +                  stmt1 = SSA_NAME_DEF_STMT (rhs1);
> +                  def_bb = gimple_bb (stmt1);
> +                  if (stmt1
> +                      && def_bb
> +                      && (def_bb == bb
> +                          || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
> +                    {
> +                      if (dump_file && (dump_flags & TDF_DETAILS))
> +                        {
> +                          fprintf (dump_file,
> +                                   "stmt1: Statement NOT moved to %d [local "
> +                                   "count:%d] from %d [local count:%d]:\n",
> +                                   e->src->index, e->src->count.value (),
> +                                   bb->index, bb->count.value ());
> +                          print_gimple_stmt (dump_file, stmt, 0);
> +                          fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
> +                                   cost, level->num);
> +                        }
> +                      gsi_next (&bsi);
> +                      continue;
> +                    }
> +                }
> +              if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
> +                {
> +                  stmt2 = SSA_NAME_DEF_STMT (rhs2);
> +                  def_bb = gimple_bb (stmt2);
> +                  if (stmt2 && def_bb
> +                      && (def_bb == bb
> +                          || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
> +                    {
> +                      if (dump_file && (dump_flags & TDF_DETAILS))
> +                        {
> +                          fprintf (dump_file,
> +                                   "stmt2: Statement NOT moved to %d [local "
> +                                   "count:%d] from %d [local count:%d]:\n",
> +                                   e->src->index, e->src->count.value (),
> +                                   bb->index, bb->count.value ());
> +                          print_gimple_stmt (dump_file, stmt, 0);
> +                          fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
> +                                   cost, level->num);
> +                        }
> +                      gsi_next (&bsi);
> +                      continue;
> +                    }
> +                }
> +            }
> +        }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
>                     cost, level->num);
>          }
>
> -      e = loop_preheader_edge (level);
>        gcc_assert (!gimple_vdef (stmt));
>        if (gimple_vuse (stmt))
>          {
> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>    bool multi_threaded_model_p = false;
>    gimple_stmt_iterator gsi;
>    sm_aux *aux = new sm_aux;
> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> +
> +  edge e = loop_preheader_edge (loop);
> +  if (e->src->count > bb->count)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        {
> +          fprintf (dump_file, "Don't execute store motion of ");
> +          print_generic_expr (dump_file, ref->mem.ref);
> +          fprintf (dump_file, " from loop %d\n", loop->num);
> +        }
> +      return;
> +    }
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      {
> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>          }
>        else
>          {
> -          sm_aux *aux = *aux_map.get (ref);
> +          sm_aux **paux = aux_map.get (ref);
> +          sm_aux *aux;
> +          if (paux)
> +            aux = *paux;
> +          else
> +            continue;
>            if (!aux->store_flag || kind == sm_ord)
>              {
>                gassign *store;
> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> index 3a09bbc39e5..4cae82936b9 100644
> --- a/gcc/tree-ssa-loop-split.c
> +++ b/gcc/tree-ssa-loop-split.c
> @@ -577,14 +577,17 @@ split_loop (class loop *loop1)
>          if (!initial_true)
>            cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
>
> +        edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
> +                           ? EDGE_SUCC (bbs[i], 0)
> +                           : EDGE_SUCC (bbs[i], 1);
>          /* Now version the loop, placing loop2 after loop1 connecting
>             them, and fix up SSA form for that.  */
>          initialize_original_copy_tables ();
>          basic_block cond_bb;
>
>          class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> -                                          profile_probability::always (),
> -                                          profile_probability::always (),
> +                                          true_edge->probability,
> +                                          true_edge->probability.invert (),
>                                            profile_probability::always (),
>                                            profile_probability::always (),
>                                            true);
> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>    initialize_original_copy_tables ();
>
>    struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> -                                     profile_probability::always (),
> -                                     profile_probability::never (),
> +                                     invar_branch->probability,
> +                                     invar_branch->probability.invert (),
>                                       profile_probability::always (),
>                                       profile_probability::always (),
>                                       true);
> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>    to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
>    to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
>
> +  to_loop1->probability = invar_branch->probability.invert ();
> +  to_loop2->probability = invar_branch->probability;
> +
>    /* Due to introduction of a control flow edge from loop1 latch to loop2
>       pre-header, we should update PHIs in loop2 to reflect this connection
>       between loop1 and loop2.  */
> --
> 2.27.0.90.geebb51ba8c
>
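For readers skimming the diff: every check the patch adds has the same shape, comparing the profile count of the loop preheader with that of the block holding the candidate statement. The following is only a minimal sketch of that comparison, not GCC code. `struct block` and `worth_hoisting` are hypothetical names, and the plain integer count is a simplifying assumption standing in for GCC's richer profile count type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a basic block; only the profile count
   matters for this illustration.  */
struct block
{
  int index;
  uint64_t count; /* estimated execution count (assumed plain integer) */
};

/* Mirrors the shape of the patch's test "preheader->count > bb->count":
   hoisting is refused only when the preheader is strictly hotter than
   the block the statement currently lives in.  */
static bool
worth_hoisting (const struct block *preheader, const struct block *bb)
{
  return !(preheader->count > bb->count);
}
```

With a preheader count of 1000, a body block with count 100000 would still be hoisted from, while a block with count 10 would be left alone; equal counts still allow hoisting because the comparison is strict.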
Hi,

On 2021/8/6 20:15, Richard Biener wrote:
> While I'm not familiar with the RTL invariant motion pass the patch there
> looks reasonable.  Note that we should assess the profile quality
> somehow - I'm not sure how to do that, CCed Honza for that.

Thanks.

>
> For the GIMPLE part the patch looks quite complicated - but note it
> probably has to be since LIM performs kind of a "CSE" on loads
> (and stores for store-motion), so when there are multiple stmts
> affected by a hoisting decision the biggest block count has to be
> accounted.
> Likewise when there are dependent stmts involved
> that might include conditional stmts (a "PHI"), but the overall
> cost should be looked at.

Currently, the gimple code checks two situations with the patch:
1) The statement's or PHI's BB is *colder* than the preheader: don't move
it out of the loop;
2) The statement's or PHI's BB is *hotter* than the preheader, but one of
its rhs operands couldn't be moved out of the loop: also don't move it out
of the loop, to avoid a "definition does not dominate use" error.

Maybe I could collect the number of instructions not hoisted with the patch
on the regression tests and SPEC2017 to estimate whether "multiple stmts
affected" and "overall cost" need to be considered?  But it seems
move_computations_worker couldn't roll back if we still want to hoist
multiple stmts out during the iterations?

>
> Now - GIMPLE LIM "costing" is somewhat backward right now
> and it isn't set up to consider those multiple involved stmts.  Plus
> the store-motion part does not have any cost part (but it depends
> on previously decided invariant motions).
>
> I think the way you implemented the check will cause no hoisting
> to be performed instead of, say, hoisting to a different loop level
> only.  Possibly shown when you consider a loop nest like
>
>   for (;;)
>     if (unlikely_cond)
>       for (;;)
>          invariant;
>
> we want to hoist 'invariant' but only from the inner loop even if it
> is invariant also in the outer loop.

For this case, theoretically I think the master GCC will optimize it to:

  invariant;
  for (;;)
    if (unlikely_cond)
      for (;;)
         ;

'invariant' is moved out of the outer loop, but with the patch, it will get:

  for (;;)
    if (unlikely_cond)
      {
        invariant;
        for (;;)
           ;
      }

'invariant' is *cold* for the outer loop, but it is still *hot* for the
inner loop, so it is hoisted out of the inner loop only; this is exactly
what we want, right?

> But for example if there is
> a store motion opportunity like
>
>   for (;;)
>      {
>         if (unlikely_cond)
>           for (;;)
>             a = ...;
>         a = ...;
>      }
>
> we'd still want to perform the store motion on the outer loop.
>
> Note that store-motion already performs part of the transform
> before dependent code is moved in move_computations (that
> you patched).

Yes.  do_store_motion runs before move_computations_worker, and store
motion happens earlier in execute_sm; I also added the check in execute_sm
to stop cold stores from being moved out of the loop.  So for your case, I
think my patch will similarly optimize it to:

  for (;;)
     {
        if (unlikely_cond)
          {
            for (;;)
              ;
            a = ...;
          }
     }
  a = ...;

Is this better?  I will construct cases to verify it.

>
> IIRC your main concern were the COND_EXPRs we insert
> for hoisted conditional stmts?

Not sure what you mean by COND_EXPRs here?

Thanks,
Xionghu
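To make the three placements of 'invariant' discussed above concrete, here is a small counting sketch. It is an illustration under stated assumptions, not a model of what either compiler version really emits: `count_invariant_runs`, `struct counts`, and the periodic `unlikely_cond` stand-in are all hypothetical, and the loops are given finite trip counts so the counts can be compared.

```c
#include <stdbool.h>

/* How many times the invariant computation executes per placement.  */
struct counts
{
  long original;    /* invariant left inside the inner loop */
  long outer_hoist; /* hoisted in front of the outer loop */
  long inner_hoist; /* hoisted in front of the inner loop only */
};

/* Hypothetical stand-in for unlikely_cond: true on every PERIOD-th outer
   iteration, and never true when PERIOD exceeds the outer trip count.
   PERIOD is assumed to be positive.  */
static bool
unlikely_cond (long i, long period)
{
  return i % period == period - 1;
}

static struct counts
count_invariant_runs (long outer, long inner, long period)
{
  struct counts c = { 0, 0, 0 };

  /* Original nest: one evaluation per inner iteration.  */
  for (long i = 0; i < outer; i++)
    if (unlikely_cond (i, period))
      for (long j = 0; j < inner; j++)
        c.original++;

  /* Hoisted out of both loops: exactly one evaluation, even if the
     cold path is never taken.  */
  c.outer_hoist = 1;

  /* Hoisted out of the inner loop only: one evaluation each time the
     cold branch is entered.  */
  for (long i = 0; i < outer; i++)
    if (unlikely_cond (i, period))
      c.inner_hoist++;

  return c;
}
```

Calling `count_invariant_runs (1000, 100, 100)` gives 1000, 1, and 10 evaluations respectively, while a period larger than the outer trip count gives 0, 1, and 0. The second case is the one the cold-block heuristic targets: hoisting past the outer loop pays for a computation that may never be needed, and it also extends the live range of the result across the whole outer loop.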
On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> For this case, theoretically I think the master GCC will optimize it to:
>
>   invariant;
>   for (;;)
>     if (unlikely_cond)
>       for (;;)
>          ;
>
> 'invariant' is moved out of outer loop, but with the patch, it will get:
>
>   for (;;)
>     if (unlikely_cond)
>       {
>         invariant;
>         for (;;)
>            ;
>       }
>
> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
> so hoist it out of inner loop, this is exactly what we want, right?

Is relying on absolute numbers really what you want?  If the
'unlikely_cond' condition depends on the iteration count of the outer
loop, the probability of it being true in each individual iteration can
be low (at least that's how I use unlikely) but the overall
probability of needing the code is higher:

  1 - (1 - p)^n

if 'p' is the probability of 'unlikely_cond' and 'n' is the number of
iterations.  This assumes complete independence of the loop iterations;
otherwise it's rather an upper limit.

At the very least I'd generate code like this:

  first = true;
  for (;;)
    if (unlikely_cond)
      {
        if (first)
          {
            invariant;
            first = false;
          }
        for (;;)
           ;
      }

If it's worth hoisting the code then the extra test and flag should be
small in cost in comparison.

If 'unlikely_cond' does not in any way depend on the loop iteration
then I think your code generation is fine.
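The probability argument and the flag-guarded form can be tried out with a short sketch. Everything here is illustrative: `prob_needed` and `run_with_flag` are hypothetical helper names, the periodic condition is an assumed stand-in for 'unlikely_cond', and the loops have finite trip counts so the result can be observed.

```c
#include <stdbool.h>

/* Probability that the cold path is entered at least once in N
   iterations, assuming each iteration independently takes it with
   probability P: 1 - (1 - p)^n.  Computed by repeated multiplication
   so that no math library is needed.  */
static double
prob_needed (double p, unsigned n)
{
  double never = 1.0; /* probability the condition was never true */
  for (unsigned i = 0; i < n; i++)
    never *= 1.0 - p;
  return 1.0 - never;
}

/* Flag-guarded lazy hoisting: the invariant is evaluated at most once,
   and only if the cold branch is actually entered.  Returns the number
   of invariant evaluations and adds the inner trip count to
   *INNER_ITERS.  PERIOD is assumed to be positive.  */
static long
run_with_flag (long outer, long inner, long period, long *inner_iters)
{
  long invariant_runs = 0;
  bool first = true;

  for (long i = 0; i < outer; i++)
    if (i % period == period - 1) /* stand-in for unlikely_cond */
      {
        if (first)
          {
            invariant_runs++; /* "invariant;" would be computed here */
            first = false;
          }
        for (long j = 0; j < inner; j++)
          (*inner_iters)++;
      }

  return invariant_runs;
}
```

For example, a branch taken with probability 0.01 per iteration is needed with probability above 0.9999 over 1000 independent iterations, which is why a low per-iteration probability alone says little about whether the hoisted value will be used. The flagged loop evaluates the invariant once if the branch is ever taken and not at all otherwise, at the price of one extra test per taken branch.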
On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> Hi,
>
> On 2021/8/6 20:15, Richard Biener wrote:
> > While I'm not familiar with the RTL invariant motion pass the patch there
> > looks reasonable.  Note that we should assess the profile quality
> > somehow - I'm not sure how to do that, CCed Honza for that.
>
> Thanks.
>
> >
> > For the GIMPLE part the patch looks quite complicated - but note it
> > probably has to be since LIM performs kind of a "CSE" on loads
> > (and stores for store-motion), so when there are multiple stmts
> > affected by a hoisting decision the biggest block count has to be
> > accounted.  Likewise when there are dependent stmts involved
> > that might include conditional stmts (a "PHI"), but the overall
> > cost should be looked at.
>
> Currently, The gimple code check two situations with the patch:
> 1) The statement or PHI's BB is *colder* then preheader, don't move it out
> of loop;
> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
> couldn't be moved out of loop, also don't move it out of loop to avoid definition
> not dominates use error.

But part 2) is obviously already done.  What I tried to say is that your
heuristic doesn't integrate nicely with the pass, but I admitted that it
might be a bit difficult to find a place to add this heuristic.

There is lim_data->cost which we could bias negatively, but then this is a
cost that is independent of the hoisting distance.  But doing this would
work at least for the case where the immediately enclosing loop preheader
is hotter than the stmt, and with this it would be a patch that's similarly
simple as the RTL one.

Another possibility is to simply only adjust PHI processing in
compute_invariantness, capping movement according to the hotness heuristic.
The same could be done for regular stmts there but I'm not sure that will
do good in the end since this function is supposed to compute "correctness"
(well, it also has the cost stuff), and it's not the place to do overall
cost considerations.

> May be I could collect the number of instructions not hoisted with the patch
> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
> and "overall cost" need to be considered?
But it seems move_computations_worker > couldn't rollback if we still want to hoist multiple stmts out during the iterations? > > > > > Now - GIMPLE LIM "costing" is somewhat backward right now > > and it isn't set up to consider those multiple involved stmts. Plus > > the store-motion part does not have any cost part (but it depends > > on previously decided invariant motions). > > > > I think the way you implemented the check will cause no hoisting > > to be performed instead of, say, hoisting to a different loop level > > only. Possibly shown when you consider a loop nest like > > > > for (;;) > > if (unlikely_cond) > > for (;;) > > invariant; > > > > we want to hoist 'invariant' but only from the inner loop even if it > > is invariant also in the outer loop. > > > For this case, theorotically I think the master GCC will optimize it to: > > invariant; > for (;;) > if (unlikely_cond) > for (;;) > ; > > 'invariant' is moved out of outer loop, but with the patch, it will get: > > for (;;) > if (unlikely_cond) > { > invariant; > for (;;) > ; > } > > 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop, > so hoist it out of inner loop, this is exactly what we want, right? Yes. I had doubts your patch would achieve that. > > > But for example if there is > > a store motion opportunity like > > > > for (;;) > > { > > if (unlikely_cond) > > for (;;) > > a = ...; > > a = ...; > > } > > > > we'd still want to perform the store motion on the outer loop. > > > > Note that store-motion already performs part of the transform > > before dependent code is moved in move_computations (that > > you patched). > > Yes. do_store_motion is running before move_computations_worker, store > motion happens earlier in execute_sm, I also added the check in execute_sm > to stop cold store moved out of loop. 
So for your case, I think my patch > will similarly optimize it to: > > for (;;) > { > if (unlikely_cond) > { > for (;;) > ; > a = ...; > } > } > a = ...; > > Whether this is better? Will construct cases to verify it. > > > > > IIRC your main concern were the COND_EXPRs we insert > > for hoisted conditional stmts? > > Not sure what you mean here of COND_EXPRs? The PHIs we hoist and for which we insert COND_EXPRs. IIRC that was your original complaint. So my question is whether we can fix the really bad cases moving PHIs with sth local to compute_invariantness. Richard. > > Thanks, > Xionghu > > > > > Thanks, > > Richard. > > > >> gcc/ChangeLog: > >> > >> * loop-invariant.c (find_invariants_bb): Check profile count > >> before motion. > >> (find_invariants_body): Add argument. > >> * tree-ssa-loop-im.c (move_computations_worker): Check profile > >> count before motion. > >> (execute_sm): Likewise. > >> (execute_sm_exit): Check pointer validness. > >> * tree-ssa-loop-split.c (split_loop): Correct probability. > >> (do_split_loop_on_cond): Likewise. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/recip-3.c: Adjust. > >> --- > >> gcc/loop-invariant.c | 10 +- > >> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > >> gcc/tree-ssa-loop-im.c | 164 +++++++++++++++++++++++- > >> gcc/tree-ssa-loop-split.c | 14 +- > >> 4 files changed, 177 insertions(+), 13 deletions(-) > >> > >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > >> index bdc7b59dd5f..7b5d64d11f9 100644 > >> --- a/gcc/loop-invariant.c > >> +++ b/gcc/loop-invariant.c > >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > >> call. 
*/ > >> > >> static void > >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > >> + bool always_executed) > >> { > >> rtx_insn *insn; > >> + basic_block preheader = loop_preheader_edge (loop)->src; > >> + > >> + if (preheader->count > bb->count) > >> + return; > >> > >> FOR_BB_INSNS (bb, insn) > >> { > >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > >> unsigned i; > >> > >> for (i = 0; i < loop->num_nodes; i++) > >> - find_invariants_bb (body[i], > >> - bitmap_bit_p (always_reached, i), > >> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > >> bitmap_bit_p (always_executed, i)); > >> } > >> > >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > >> index 638bf38db8c..641c91e719e 100644 > >> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > >> @@ -23,4 +23,4 @@ float h () > >> F[0] += E / d; > >> } > >> > >> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ > >> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index 7de47edbcb3..2bfb5e8ec15 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb) > >> continue; > >> } > >> > >> + edge e = loop_preheader_edge (level); > >> + if (e->src->count > bb->count) > >> + { > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + { > >> + fprintf (dump_file, "PHI node NOT moved to %d from %d:\n", > >> + e->src->index, bb->index); > >> + print_gimple_stmt (dump_file, stmt, 0); > >> + fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost, > >> + level->num); > >> + } > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> + else > >> + { > >> + unsigned i; > >> + bool 
skip_phi_move = false; > >> + for (i = 0; i < gimple_phi_num_args (stmt); i++) > >> + { > >> + tree def = PHI_ARG_DEF (stmt, i); > >> + > >> + if (TREE_CODE (def) != SSA_NAME) > >> + continue; > >> + > >> + gimple *def_stmt = SSA_NAME_DEF_STMT (def); > >> + > >> + if (!gimple_bb (def_stmt)) > >> + continue; > >> + > >> + if (!dominated_by_p (CDI_DOMINATORS, e->src, > >> + gimple_bb (def_stmt))) > >> + { > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + { > >> + fprintf (dump_file, > >> + "PHI node NOT moved to %d [local count:%d] from " > >> + "%d [local count:%d]:\n", > >> + e->src->index, e->src->count.value (), bb->index, > >> + bb->count.value ()); > >> + print_gimple_stmt (dump_file, stmt, 0); > >> + fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost, > >> + level->num); > >> + } > >> + skip_phi_move = true; > >> + break; > >> + } > >> + } > >> + if (skip_phi_move) > >> + { > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> + } > >> + > >> if (dump_file && (dump_flags & TDF_DETAILS)) > >> { > >> fprintf (dump_file, "Moving PHI node\n"); > >> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb) > >> tree lhs = gimple_assign_lhs (new_stmt); > >> SSA_NAME_RANGE_INFO (lhs) = NULL; > >> } > >> - gsi_insert_on_edge (loop_preheader_edge (level), new_stmt); > >> + gsi_insert_on_edge (e, new_stmt); > >> remove_phi_node (&bsi, false); > >> } > >> > >> for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) > >> { > >> edge e; > >> - > >> gimple *stmt = gsi_stmt (bsi); > >> > >> lim_data = get_lim_data (stmt); > >> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb) > >> /* We do not really want to move conditionals out of the loop; we just > >> placed it here to force its operands to be moved if necessary. 
*/ > >> if (gimple_code (stmt) == GIMPLE_COND) > >> - continue; > >> + { > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> + > >> + e = loop_preheader_edge (level); > >> + if (e->src->count > bb->count) > >> + { > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + { > >> + fprintf (dump_file, > >> + "stmt: Statement NOT moved to %d [local count:%d] from " > >> + "%d [local count:%d]:\n", > >> + e->src->index, e->src->count.value (), bb->index, > >> + bb->count.value ()); > >> + print_gimple_stmt (dump_file, stmt, 0); > >> + fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost, > >> + level->num); > >> + } > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> + else > >> + { > >> + if (is_gimple_assign (stmt)) > >> + { > >> + tree rhs1 = gimple_assign_rhs1 (stmt); > >> + tree rhs2 = gimple_assign_rhs2 (stmt); > >> + if (TREE_CODE (rhs1) == MEM_REF) > >> + { > >> + rhs2 = TREE_OPERAND (rhs1, 1); > >> + rhs1 = TREE_OPERAND (rhs1, 0); > >> + } > >> + gimple *stmt1 = NULL, *stmt2 = NULL; > >> + basic_block def_bb; > >> + if (rhs1 && TREE_CODE (rhs1) == SSA_NAME) > >> + { > >> + stmt1 = SSA_NAME_DEF_STMT (rhs1); > >> + def_bb = gimple_bb (stmt1); > >> + if (stmt1 > >> + && def_bb > >> + && (def_bb == bb > >> + || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb))) > >> + { > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + { > >> + fprintf (dump_file, > >> + "stmt1: Statement NOT moved to %d [local " > >> + "count:%d] from %d [local count:%d]:\n", > >> + e->src->index, e->src->count.value (), > >> + bb->index, bb->count.value ()); > >> + print_gimple_stmt (dump_file, stmt, 0); > >> + fprintf (dump_file, "(cost %u) out of loop %d.\n\n", > >> + cost, level->num); > >> + } > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> + } > >> + if (rhs2 && TREE_CODE (rhs2) == SSA_NAME) > >> + { > >> + stmt2 = SSA_NAME_DEF_STMT (rhs2); > >> + def_bb = gimple_bb (stmt2); > >> + if (stmt2 && def_bb > >> + && (def_bb == bb > >> + || !dominated_by_p 
(CDI_DOMINATORS, e->src, def_bb))) > >> + { > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + { > >> + fprintf (dump_file, > >> + "stmt2: Statement NOT moved to %d [local " > >> + "count:%d] from %d [local count:%d]:\n", > >> + e->src->index, e->src->count.value (), > >> + bb->index, bb->count.value ()); > >> + print_gimple_stmt (dump_file, stmt, 0); > >> + fprintf (dump_file, "(cost %u) out of loop %d.\n\n", > >> + cost, level->num); > >> + } > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> + } > >> + } > >> + } > >> > >> if (dump_file && (dump_flags & TDF_DETAILS)) > >> { > >> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb) > >> cost, level->num); > >> } > >> > >> - e = loop_preheader_edge (level); > >> gcc_assert (!gimple_vdef (stmt)); > >> if (gimple_vuse (stmt)) > >> { > >> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref, > >> bool multi_threaded_model_p = false; > >> gimple_stmt_iterator gsi; > >> sm_aux *aux = new sm_aux; > >> + basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt); > >> + > >> + edge e = loop_preheader_edge (loop); > >> + if (e->src->count > bb->count) > >> + { > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + { > >> + fprintf (dump_file, "Don't execute store motion of "); > >> + print_generic_expr (dump_file, ref->mem.ref); > >> + fprintf (dump_file, " from loop %d\n", loop->num); > >> + } > >> + return; > >> + } > >> > >> if (dump_file && (dump_flags & TDF_DETAILS)) > >> { > >> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq, > >> } > >> else > >> { > >> - sm_aux *aux = *aux_map.get (ref); > >> + sm_aux **paux = aux_map.get (ref); > >> + sm_aux *aux; > >> + if (paux) > >> + aux = *paux; > >> + else > >> + continue; > >> if (!aux->store_flag || kind == sm_ord) > >> { > >> gassign *store; > >> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c > >> index 3a09bbc39e5..4cae82936b9 100644 > >> --- 
a/gcc/tree-ssa-loop-split.c > >> +++ b/gcc/tree-ssa-loop-split.c > >> @@ -577,14 +577,17 @@ split_loop (class loop *loop1) > >> if (!initial_true) > >> cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond); > >> > >> + edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE > >> + ? EDGE_SUCC (bbs[i], 0) > >> + : EDGE_SUCC (bbs[i], 1); > >> /* Now version the loop, placing loop2 after loop1 connecting > >> them, and fix up SSA form for that. */ > >> initialize_original_copy_tables (); > >> basic_block cond_bb; > >> > >> class loop *loop2 = loop_version (loop1, cond, &cond_bb, > >> - profile_probability::always (), > >> - profile_probability::always (), > >> + true_edge->probability, > >> + true_edge->probability.invert (), > >> profile_probability::always (), > >> profile_probability::always (), > >> true); > >> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch) > >> initialize_original_copy_tables (); > >> > >> struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL, > >> - profile_probability::always (), > >> - profile_probability::never (), > >> + invar_branch->probability, > >> + invar_branch->probability.invert (), > >> profile_probability::always (), > >> profile_probability::always (), > >> true); > >> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch) > >> to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE; > >> to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE; > >> > >> + to_loop1->probability = invar_branch->probability.invert (); > >> + to_loop2->probability = invar_branch->probability; > >> + > >> /* Due to introduction of a control flow edge from loop1 latch to loop2 > >> pre-header, we should update PHIs in loop2 to reflect this connection > >> between loop1 and loop2. */ > >> -- > >> 2.27.0.90.geebb51ba8c > >>
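For readers following the review comment about the loop nest: the difference between refusing to hoist at all and capping how far a statement is hoisted can be pictured with a toy model. The C++ below is only a sketch under stated assumptions and is not code from the patch. ToyLoop, colder_than_preheader and hoistable_levels are hypothetical names, and plain integers stand in for GCC's profile_count.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a loop; only the execution count of its preheader
// matters here.  Plain integers replace GCC's profile_count (assumption).
struct ToyLoop
{
  std::uint64_t preheader_count;
};

// Test in the spirit of the first patch: the target preheader runs more
// often than the block holding the statement, so hoisting there is refused.
bool
colder_than_preheader (std::uint64_t bb_count, const ToyLoop &target)
{
  return target.preheader_count > bb_count;
}

// Hypothetical capped variant.  NEST lists the loops enclosing the block,
// outermost first.  Walk from the innermost loop outward and count how
// many loops the statement can leave before reaching a preheader that is
// hotter than the block.
int
hoistable_levels (std::uint64_t bb_count, const std::vector<ToyLoop> &nest)
{
  assert (!nest.empty ());
  int levels = 0;
  for (auto it = nest.rbegin (); it != nest.rend (); ++it)
    {
      if (colder_than_preheader (bb_count, *it))
        break;
      ++levels;
    }
  return levels;
}
```

Take an outer preheader count of 10000, an inner preheader count of 100 (the unlikely branch) and an inner body count of 1000. colder_than_preheader rejects the outer preheader as a target, while hoistable_levels still lets the statement leave the inner loop only, which is the outcome asked for in the loop-nest example.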
On 2021/8/26 19:33, Richard Biener wrote:
> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> On 2021/8/6 20:15, Richard Biener wrote:
>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> This patch does this profile count check in both gimple LIM
>>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>> loop.
>>>>
>>>> Also, the profile count in loop split pass should be corrected to avoid
>>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
>>>> move statement out of loop unexpectely when lim2 didn't move it. This
>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others. Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks.
>>>
>>> While I'm not familiar with the RTL invariant motion pass the patch there
>>> looks reasonable. Note that we should assess the profile quality
>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>
>> Thanks.
>>
>>>
>>> For the GIMPLE part the patch looks quite complicated - but note it
>>> probably has to be since LIM performs kind of a "CSE" on loads
>>> (and stores for store-motion), so when there are multiple stmts
>>> affected by a hoisting decision the biggest block count has to be
>>> accounted. Likewise when there are dependent stmts involved
>>> that might include conditional stmts (a "PHI"), but the overall
>>> cost should be looked at.
>>
>> Currently, The gimple code check two situations with the patch:
>> 1) The statement or PHI's BB is *colder* then preheader, don't move it out
>> of loop;
>> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
>> couldn't be moved out of loop, also don't move it out of loop to avoid definition
>> not dominates use error.
>
> But part 2) is obviously already done. What I tried to say is your heuristic
> doesn't integrate nicely with the pass but I admitted that it might be a bit
> difficult to find a place to add this heuristic.
>
> There is lim_data->cost which we could bias negatively but then this is
> a cost that is independent on the hoisting distance. But doing this would
> work at least for the case where the immediately enclosing loop preheader
> is hotter than the stmt and with this it would be a patch that's similarly
> simple as the RTL one.
>
> Another possibility is to simply only adjust PHI processing in
> compute_invariantness, capping movement according to the hotness
> heuristic. The same could be done for regular stmts there but I'm
> not sure that will do good in the end since this function is supposed
> to compute "correctness" (well, it also has the cost stuff), and it's
> not the place to do overall cost considerations.

Thanks. I found that adding a function find_coldest_out_loop and checking
it in outermost_invariant_loop, to find the coldest invariant loop between
the outermost loop and the loop itself, also serves the purpose. The gimple
code check is then redundant and can be removed.

>
>> May be I could collect the number of instructions not hoisted with the patch
>> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
>> and "overall cost" need to be considered? But it seems move_computations_worker
>> couldn't rollback if we still want to hoist multiple stmts out during the iterations?
>>
>>>
>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>> and it isn't set up to consider those multiple involved stmts. Plus
>>> the store-motion part does not have any cost part (but it depends
>>> on previously decided invariant motions).
>>>
>>> I think the way you implemented the check will cause no hoisting
>>> to be performed instead of, say, hoisting to a different loop level
>>> only. Possibly shown when you consider a loop nest like
>>>
>>>   for (;;)
>>>     if (unlikely_cond)
>>>       for (;;)
>>>         invariant;
>>>
>>> we want to hoist 'invariant' but only from the inner loop even if it
>>> is invariant also in the outer loop.
>>
>>
>> For this case, theorotically I think the master GCC will optimize it to:
>>
>>   invariant;
>>   for (;;)
>>     if (unlikely_cond)
>>       for (;;)
>>         ;
>>
>> 'invariant' is moved out of outer loop, but with the patch, it will get:
>>
>>   for (;;)
>>     if (unlikely_cond)
>>       {
>>         invariant;
>>         for (;;)
>>           ;
>>       }
>>
>> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
>> so hoist it out of inner loop, this is exactly what we want, right?
>
> Yes. I had doubts your patch would achieve that.
>

The updated patch below achieves it:


There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add find_coldest_out_loop to move invariants to expected target loop, then if profile count of the loop bb is colder than target loop preheader, it won't be hoisted out of loop. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% on P8LE. Regression and bootstrap tested pass on P8LE, any comments? Thanks. gcc/ChangeLog: * loop-invariant.c (find_invariants_bb): Check profile count before motion. (find_invariants_body): Add argument. * tree-ssa-loop-im.c (find_coldest_out_loop): New function. (outermost_invariant_loop): Use find_coldest_out_loop. (determine_max_movement): Likewise. (move_computations_worker): Adjust and fix iteration udpate. (execute_sm): Likewise. (execute_sm_exit): Check pointer validness. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/recip-3.c: Adjust. * gcc.dg/tree-ssa/ssa-lim-16.c: New test. * gcc.dg/tree-ssa/ssa-lim-17.c: New test. --- gcc/loop-invariant.c | 10 ++- gcc/tree-ssa-loop-im.c | 79 ++++++++++++++++++---- gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++ 5 files changed, 121 insertions(+), 16 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index fca0c2b24be..5c3be7bf0eb 100644 --- a/gcc/loop-invariant.c +++ b/gcc/loop-invariant.c @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) call. 
*/ static void -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, + bool always_executed) { rtx_insn *insn; + basic_block preheader = loop_preheader_edge (loop)->src; + + if (preheader->count > bb->count) + return; FOR_BB_INSNS (bb, insn) { @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, unsigned i; for (i = 0; i < loop->num_nodes; i++) - find_invariants_bb (body[i], - bitmap_bit_p (always_reached, i), + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), bitmap_bit_p (always_executed, i)); } diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index d9f75d5025e..f5ab6a734e7 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) return ret; } +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ + +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block def_bb = NULL) +{ + class loop *cold_loop, *min_loop; + cold_loop = min_loop = outmost_loop; + profile_count min_count = loop_preheader_edge (min_loop)->src->count; + + if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count) + return NULL; + + while (min_loop != loop) + { + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); + if (loop_preheader_edge (min_loop)->src->count < min_count) + cold_loop = min_loop; + } + return cold_loop; +} + /* Suppose that operand DEF is used inside the LOOP. Returns the outermost loop to that we could move the expression using DEF if it did not have other operands, i.e. 
the outermost loop enclosing LOOP in that the value @@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop) struct lim_aux_data *lim_data; if (!def) - return superloop_at_depth (loop, 1); + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop); if (TREE_CODE (def) != SSA_NAME) { gcc_assert (is_gimple_min_invariant (def)); - return superloop_at_depth (loop, 1); + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop); } def_stmt = SSA_NAME_DEF_STMT (def); def_bb = gimple_bb (def_stmt); if (!def_bb) - return superloop_at_depth (loop, 1); + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, def_bb); max_loop = find_common_loop (loop, def_bb->loop_father); @@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop) loop_outer (lim_data->max_loop)); if (max_loop == loop) return NULL; - max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1); + max_loop = find_coldest_out_loop (max_loop, loop, def_bb); + if (!max_loop) + return NULL; + if (max_loop == loop) + return max_loop; + else + max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1); return max_loop; } @@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) if (must_preserve_exec) level = ALWAYS_EXECUTED_IN (bb); else - level = superloop_at_depth (loop, 1); + level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb); + + if (!level) + return false; + lim_data->max_loop = level; if (gphi *phi = dyn_cast <gphi *> (stmt)) @@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) if (ref && MEM_ANALYZABLE (ref)) { - lim_data->max_loop = outermost_indep_loop (lim_data->max_loop, - loop, ref); + level = outermost_indep_loop (lim_data->max_loop, loop, ref); + if (!level) + return false; + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); if (!lim_data->max_loop) return false; } @@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb) continue; } + 
edge e = loop_preheader_edge (level); if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "Moving PHI node\n"); @@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb) tree lhs = gimple_assign_lhs (new_stmt); SSA_NAME_RANGE_INFO (lhs) = NULL; } - gsi_insert_on_edge (loop_preheader_edge (level), new_stmt); + gsi_insert_on_edge (e, new_stmt); remove_phi_node (&bsi, false); } for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) { edge e; - gimple *stmt = gsi_stmt (bsi); lim_data = get_lim_data (stmt); @@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb) /* We do not really want to move conditionals out of the loop; we just placed it here to force its operands to be moved if necessary. */ if (gimple_code (stmt) == GIMPLE_COND) - continue; + { + gsi_next (&bsi); + continue; + } + e = loop_preheader_edge (level); if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "Moving statement\n"); @@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb) cost, level->num); } - e = loop_preheader_edge (level); gcc_assert (!gimple_vdef (stmt)); if (gimple_vuse (stmt)) { @@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref, bool multi_threaded_model_p = false; gimple_stmt_iterator gsi; sm_aux *aux = new sm_aux; + basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt); + + edge e = loop_preheader_edge (loop); + if (e->src->count > bb->count) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "Don't execute store motion of "); + print_generic_expr (dump_file, ref->mem.ref); + fprintf (dump_file, " from loop %d\n", loop->num); + } + return; + } if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq, } else { - sm_aux *aux = *aux_map.get (ref); + sm_aux **paux = aux_map.get (ref); + sm_aux *aux; + if (paux) + aux = *paux; + else + continue; if (!aux->store_flag || kind == sm_ord) { 
gassign *store; diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c new file mode 100644 index 00000000000..2303f3d5d86 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +assert_fail (int, char *, char *); +void +foo (int *a, int n, int k) +{ + int i; + + for (i = 0; i < n; i++) + { + if (__builtin_expect (x, 0)) + assert_fail (k / 5, "one", "two"); + a[i] = k; + } +} + +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c new file mode 100644 index 00000000000..3b1c7c0cb3e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +assert_fail (int, char *, char *); +void +foo (int *a, int n, int m, int k, int s) +{ + int i; + int j; + + for (i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) + { + assert_fail (k / 5, "one", "two"); + a[s] = k; + } + a[s] = s; + } +} + +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
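The selection that find_coldest_out_loop performs in the updated patch can be pictured with a small stand-alone model. This is only a sketch under stated assumptions, not the GCC implementation: coldest_out_loop is a hypothetical name, plain integers stand in for profile_count, and the loop nest is reduced to a vector of preheader counts ordered from the outermost candidate to the innermost loop. One deliberate difference: as posted, the while loop compares every preheader against the count of the outermost candidate, because min_count is not updated inside the loop. The sketch keeps a running minimum instead, which is an assumption about the intent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// PREHEADER_COUNTS[0] belongs to the outermost candidate loop and
// PREHEADER_COUNTS.back () to the innermost loop (toy model, assumption).
// DEF_BB_COUNT optionally gives the count of the block holding the
// statement, mirroring the optional def_bb argument in the patch.
// Returns the index of the loop with the coldest preheader, or -1 when
// the block is colder than the innermost preheader, in which case
// nothing should be hoisted.
int
coldest_out_loop (const std::vector<std::uint64_t> &preheader_counts,
                  const std::uint64_t *def_bb_count = nullptr)
{
  assert (!preheader_counts.empty ());

  if (def_bb_count && *def_bb_count < preheader_counts.back ())
    return -1;

  // Keep a running minimum; on ties the outer loop wins.
  std::size_t coldest = 0;
  for (std::size_t i = 1; i < preheader_counts.size (); ++i)
    if (preheader_counts[i] < preheader_counts[coldest])
      coldest = i;
  return static_cast<int> (coldest);
}
```

For the nest from the review discussion, with preheader counts {10000, 100}, the function picks index 1, the inner loop, so an invariant in the cold inner loop would be placed in the inner preheader and not in front of the outer loop.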
On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > > > > On 2021/8/26 19:33, Richard Biener wrote: > > On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >> > >> Hi, > >> > >> On 2021/8/6 20:15, Richard Biener wrote: > >>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote: > >>>> > >>>> There was a patch trying to avoid move cold block out of loop: > >>>> > >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >>>> > >>>> Richard suggested to "never hoist anything from a bb with lower execution > >>>> frequency to a bb with higher one in LIM invariantness_dom_walker > >>>> before_dom_children". > >>>> > >>>> This patch does this profile count check in both gimple LIM > >>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb, > >>>> if the loop bb is colder than loop preheader, don't hoist it out of > >>>> loop. > >>>> > >>>> Also, the profile count in loop split pass should be corrected to avoid > >>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated > >>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will > >>>> move statement out of loop unexpectely when lim2 didn't move it. This > >>>> change could fix regression on 544.nab_r from -1.55% to +0.46%. > >>>> > >>>> SPEC2017 performance evaluation shows 1% performance improvement for > >>>> intrate GEOMEAN and no obvious regression for others. Especially, > >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >>>> on P8LE. > >>>> > >>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks. > >>> > >>> While I'm not familiar with the RTL invariant motion pass the patch there > >>> looks reasonable. Note that we should assess the profile quality > >>> somehow - I'm not sure how to do that, CCed Honza for that. > >> > >> Thanks. 
> >> > >>> > >>> For the GIMPLE part the patch looks quite complicated - but note it > >>> probably has to be since LIM performs kind of a "CSE" on loads > >>> (and stores for store-motion), so when there are multiple stmts > >>> affected by a hoisting decision the biggest block count has to be > >>> accounted. Likewise when there are dependent stmts involved > >>> that might include conditional stmts (a "PHI"), but the overall > >>> cost should be looked at. > >> > >> Currently, The gimple code check two situations with the patch: > >> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out > >> of loop; > >> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs > >> couldn't be moved out of loop, also don't move it out of loop to avoid definition > >> not dominates use error. > > > > But part 2) is obviously already done. What I tried to say is your heuristic > > doesn't integrate nicely with the pass but I admitted that it might be a bit > > difficult to find a place to add this heuristic. > > > > There is lim_data->cost which we could bias negatively but then this is > > a cost that is independent on the hoisting distance. But doing this would > > work at least for the case where the immediately enclosing loop preheader > > is hotter than the stmt and with this it would be a patch that's similarly > > simple as the RTL one. > > > > Another possibility is to simply only adjust PHI processing in > > compute_invariantness, capping movement according to the hotness > > heuristic. The same could be done for regular stmts there but I'm > > not sure that will do good in the end since this function is supposed > > to compute "correctness" (well, it also has the cost stuff), and it's > > not the place to do overall cost considerations. > > Thanks. 
I found that adding a function find_coldest_out_loop and check it in > outermost_invariant_loop to find the coldest invariant loop between outermost > loop and itself could also reach the purpose. Then the gimple code check is > redundant and could be removed. > > > > >> May be I could collect the number of instructions not hoisted with the patch > >> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected" > >> and "overall cost" need to be considered? But it seems move_computations_worker > >> couldn't rollback if we still want to hoist multiple stmts out during the iterations? > >> > >>> > >>> Now - GIMPLE LIM "costing" is somewhat backward right now > >>> and it isn't set up to consider those multiple involved stmts. Plus > >>> the store-motion part does not have any cost part (but it depends > >>> on previously decided invariant motions). > >>> > >>> I think the way you implemented the check will cause no hoisting > >>> to be performed instead of, say, hoisting to a different loop level > >>> only. Possibly shown when you consider a loop nest like > >>> > >>> for (;;) > >>> if (unlikely_cond) > >>> for (;;) > >>> invariant; > >>> > >>> we want to hoist 'invariant' but only from the inner loop even if it > >>> is invariant also in the outer loop. > >> > >> > >> For this case, theorotically I think the master GCC will optimize it to: > >> > >> invariant; > >> for (;;) > >> if (unlikely_cond) > >> for (;;) > >> ; > >> > >> 'invariant' is moved out of outer loop, but with the patch, it will get: > >> > >> for (;;) > >> if (unlikely_cond) > >> { > >> invariant; > >> for (;;) > >> ; > >> } > >> > >> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop, > >> so hoist it out of inner loop, this is exactly what we want, right? > > > > Yes. I had doubts your patch would achieve that. 
> > > > > The below updated patch could achieve it: > > > There was a patch trying to avoid move cold block out of loop: > > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > > Richard suggested to "never hoist anything from a bb with lower execution > frequency to a bb with higher one in LIM invariantness_dom_walker > before_dom_children". > > In gimple LIM analysis, add find_coldest_out_loop to move invariants to > expected target loop, then if profile count of the loop bb is colder > than target loop preheader, it won't be hoisted out of loop. > > SPEC2017 performance evaluation shows 1% performance improvement for > intrate GEOMEAN and no obvious regression for others. Especially, > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > on P8LE. > > Regression and bootstrap tested pass on P8LE, any comments? Thanks. Can you split the RTL and GIMPLE changes and measure them separately please? > gcc/ChangeLog: > > * loop-invariant.c (find_invariants_bb): Check profile count > before motion. > (find_invariants_body): Add argument. > * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > (outermost_invariant_loop): Use find_coldest_out_loop. > (determine_max_movement): Likewise. > (move_computations_worker): Adjust and fix iteration udpate. > (execute_sm): Likewise. > (execute_sm_exit): Check pointer validness. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/recip-3.c: Adjust. > * gcc.dg/tree-ssa/ssa-lim-16.c: New test. > * gcc.dg/tree-ssa/ssa-lim-17.c: New test. 
> --- > gcc/loop-invariant.c | 10 ++- > gcc/tree-ssa-loop-im.c | 79 ++++++++++++++++++---- > gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++ > 5 files changed, 121 insertions(+), 16 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > > diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > index fca0c2b24be..5c3be7bf0eb 100644 > --- a/gcc/loop-invariant.c > +++ b/gcc/loop-invariant.c > @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > call. */ > > static void > -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > + bool always_executed) > { > rtx_insn *insn; > + basic_block preheader = loop_preheader_edge (loop)->src; > + > + if (preheader->count > bb->count) > + return; > > FOR_BB_INSNS (bb, insn) > { > @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > unsigned i; > > for (i = 0; i < loop->num_nodes; i++) > - find_invariants_bb (body[i], > - bitmap_bit_p (always_reached, i), > + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > bitmap_bit_p (always_executed, i)); > } > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index d9f75d5025e..f5ab6a734e7 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > return ret; > } > > +/* Find coldest loop between outmost_loop and loop by comapring profile count. 
*/ > + > +static class loop * > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block def_bb = NULL) > +{ > + class loop *cold_loop, *min_loop; > + cold_loop = min_loop = outmost_loop; > + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > + > + if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count) > + return NULL; > + > + while (min_loop != loop) > + { > + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > + if (loop_preheader_edge (min_loop)->src->count < min_count) > + cold_loop = min_loop; > + } > + return cold_loop; > +} > + > /* Suppose that operand DEF is used inside the LOOP. Returns the outermost > loop to that we could move the expression using DEF if it did not have > other operands, i.e. the outermost loop enclosing LOOP in that the value > @@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop) > struct lim_aux_data *lim_data; > > if (!def) > - return superloop_at_depth (loop, 1); > + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop); > > if (TREE_CODE (def) != SSA_NAME) > { > gcc_assert (is_gimple_min_invariant (def)); > - return superloop_at_depth (loop, 1); > + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop); > } > > def_stmt = SSA_NAME_DEF_STMT (def); > def_bb = gimple_bb (def_stmt); > if (!def_bb) > - return superloop_at_depth (loop, 1); > + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, def_bb); > > max_loop = find_common_loop (loop, def_bb->loop_father); > > @@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop) > loop_outer (lim_data->max_loop)); > if (max_loop == loop) > return NULL; > - max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1); > + max_loop = find_coldest_out_loop (max_loop, loop, def_bb); > + if (!max_loop) > + return NULL; > + if (max_loop == loop) > + return max_loop; > + else > + max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1); 
>
>    return max_loop;
>  }

As said 'outermost_invariant_loop' is the "correctness" part and I don't like
changing it this way.  Instead determine_max_movement is what should be
adjusted ...

> @@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>    if (must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
> -    level = superloop_at_depth (loop, 1);
> +    level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb);

... which you do here (but you should apply that also to the
must_preserve_exec result).

> +
> +  if (!level)
> +    return false;
> +
>    lim_data->max_loop = level;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
> @@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>        if (ref
>            && MEM_ANALYZABLE (ref))
>          {
> -          lim_data->max_loop = outermost_indep_loop (lim_data->max_loop,
> -                                                     loop, ref);
> +          level = outermost_indep_loop (lim_data->max_loop, loop, ref);
> +          if (!level)
> +            return false;
> +          lim_data->max_loop = find_coldest_out_loop (level, loop, bb);

... why again here?  outermost_indep_loop honors the passed max_loop.
>            if (!lim_data->max_loop)
>              return false;
>          }
> @@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb)
>            continue;
>          }
>
> +      edge e = loop_preheader_edge (level);

unnecessary change

>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
>            fprintf (dump_file, "Moving PHI node\n");
> @@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb)
>            tree lhs = gimple_assign_lhs (new_stmt);
>            SSA_NAME_RANGE_INFO (lhs) = NULL;
>          }
> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> +      gsi_insert_on_edge (e, new_stmt);
>        remove_phi_node (&bsi, false);
>      }
>
>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>      {
>        edge e;
> -
>        gimple *stmt = gsi_stmt (bsi);
>
>        lim_data = get_lim_data (stmt);
> @@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>           placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -        continue;
> +        {
> +          gsi_next (&bsi);
> +          continue;
> +        }

looks like an omission - do you now run into this?
>
> +      e = loop_preheader_edge (level);

unnecessary change

>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
>            fprintf (dump_file, "Moving statement\n");
> @@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb)
>                     cost, level->num);
>          }
>
> -      e = loop_preheader_edge (level);
>        gcc_assert (!gimple_vdef (stmt));
>        if (gimple_vuse (stmt))
>          {
> @@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>    bool multi_threaded_model_p = false;
>    gimple_stmt_iterator gsi;
>    sm_aux *aux = new sm_aux;
> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> +
> +  edge e = loop_preheader_edge (loop);
> +  if (e->src->count > bb->count)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        {
> +          fprintf (dump_file, "Don't execute store motion of ");
> +          print_generic_expr (dump_file, ref->mem.ref);
> +          fprintf (dump_file, " from loop %d\n", loop->num);
> +        }
> +      return;
> +    }

Why do you need this?  I think you instead want to adjust 'can_sm_ref_p',
using something like for_all_locs_in_loop (loop, ref, ...) with ... being a
lambda or function that checks loc->stmt; if at least one reference is
executed in a hot part of 'loop' then we should apply store-motion.  Do this
last because it looks somewhat expensive.
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      {
> @@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>          }
>        else
>          {
> -          sm_aux *aux = *aux_map.get (ref);
> +          sm_aux **paux = aux_map.get (ref);
> +          sm_aux *aux;
> +          if (paux)
> +            aux = *paux;
> +          else
> +            continue;
>            if (!aux->store_flag || kind == sm_ord)
>              {
>                gassign *store;
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>    F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
> new file mode 100644
> index 00000000000..2303f3d5d86
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +assert_fail (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +        assert_fail (k / 5, "one", "two");

I don't think these are very good testcases since 'assert' is usually
noreturn which would place the whole block outside of the loop (it's a
loop exit then).  But naming the function 'foo' would make it less
obviously pointless.
> + a[i] = k; > + } > +} > + > +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > new file mode 100644 > index 00000000000..3b1c7c0cb3e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > @@ -0,0 +1,26 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +assert_fail (int, char *, char *); > +void > +foo (int *a, int n, int m, int k, int s) > +{ > + int i; > + int j; > + > + for (i = 0; i < m; i++) > + { > + if (__builtin_expect (x, 0)) > + for (j = 0; j < n; j++) > + { > + assert_fail (k / 5, "one", "two"); > + a[s] = k; > + } > + a[s] = s; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ > -- > 2.27.0.90.geebb51ba8c > >
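
For readers following the nested-loop example discussed in this review, the
intended choice of hoist target can be modeled outside of GCC.  The block
below is only a minimal sketch under stated assumptions, not the patch's
code: ToyLoop, preheader_count and coldest_hoist_target are hypothetical
names, and plain integers stand in for GCC's profile_count.  It refuses to
hoist when the statement's block is colder than the innermost preheader, and
otherwise picks the enclosing loop with the coldest preheader.  Note that the
sketch tracks a running minimum while walking the nest, whereas the quoted
find_coldest_out_loop as posted initializes min_count once and does not
update it inside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a loop: only the execution count of its
// preheader block matters for this model.
struct ToyLoop {
  std::uint64_t preheader_count;
};

// `nest` lists the candidate loops from outermost (index 0) to the loop
// that directly contains the statement (last element).
// Returns the index of the loop whose preheader should receive the hoisted
// statement, or -1 if the statement's block is colder than the innermost
// preheader, in which case hoisting is not profitable at all.
int coldest_hoist_target(const std::vector<ToyLoop>& nest,
                         std::uint64_t stmt_bb_count) {
  assert(!nest.empty());
  if (stmt_bb_count < nest.back().preheader_count)
    return -1;

  // Track the running minimum; ties keep the outer loop so the statement
  // is hoisted as far out as possible at equal cost.
  std::size_t best = 0;
  for (std::size_t i = 1; i < nest.size(); ++i)
    if (nest[i].preheader_count < nest[best].preheader_count)
      best = i;
  return static_cast<int>(best);
}
```

With an outer preheader count of 10 and an inner preheader count of 5 (the
inner loop sits behind an unlikely condition), a statement in the inner body
is assigned to the inner loop's preheader only, which is the behavior asked
for in the review.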
On 2021/9/22 17:14, Richard Biener wrote: > On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: >> >> >> >> On 2021/8/26 19:33, Richard Biener wrote: >>> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: >>>> >>>> Hi, >>>> >>>> On 2021/8/6 20:15, Richard Biener wrote: >>>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote: >>>>>> >>>>>> There was a patch trying to avoid move cold block out of loop: >>>>>> >>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>>>>> >>>>>> Richard suggested to "never hoist anything from a bb with lower execution >>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker >>>>>> before_dom_children". >>>>>> >>>>>> This patch does this profile count check in both gimple LIM >>>>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb, >>>>>> if the loop bb is colder than loop preheader, don't hoist it out of >>>>>> loop. >>>>>> >>>>>> Also, the profile count in loop split pass should be corrected to avoid >>>>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated >>>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will >>>>>> move statement out of loop unexpectely when lim2 didn't move it. This >>>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%. >>>>>> >>>>>> SPEC2017 performance evaluation shows 1% performance improvement for >>>>>> intrate GEOMEAN and no obvious regression for others. Especially, >>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>>>>> on P8LE. >>>>>> >>>>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks. >>>>> >>>>> While I'm not familiar with the RTL invariant motion pass the patch there >>>>> looks reasonable. Note that we should assess the profile quality >>>>> somehow - I'm not sure how to do that, CCed Honza for that. 
>>>> >>>> Thanks. >>>> >>>>> >>>>> For the GIMPLE part the patch looks quite complicated - but note it >>>>> probably has to be since LIM performs kind of a "CSE" on loads >>>>> (and stores for store-motion), so when there are multiple stmts >>>>> affected by a hoisting decision the biggest block count has to be >>>>> accounted. Likewise when there are dependent stmts involved >>>>> that might include conditional stmts (a "PHI"), but the overall >>>>> cost should be looked at. >>>> >>>> Currently, The gimple code check two situations with the patch: >>>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out >>>> of loop; >>>> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs >>>> couldn't be moved out of loop, also don't move it out of loop to avoid definition >>>> not dominates use error. >>> >>> But part 2) is obviously already done. What I tried to say is your heuristic >>> doesn't integrate nicely with the pass but I admitted that it might be a bit >>> difficult to find a place to add this heuristic. >>> >>> There is lim_data->cost which we could bias negatively but then this is >>> a cost that is independent on the hoisting distance. But doing this would >>> work at least for the case where the immediately enclosing loop preheader >>> is hotter than the stmt and with this it would be a patch that's similarly >>> simple as the RTL one. >>> >>> Another possibility is to simply only adjust PHI processing in >>> compute_invariantness, capping movement according to the hotness >>> heuristic. The same could be done for regular stmts there but I'm >>> not sure that will do good in the end since this function is supposed >>> to compute "correctness" (well, it also has the cost stuff), and it's >>> not the place to do overall cost considerations. >> >> Thanks. 
I found that adding a function find_coldest_out_loop and check it in >> outermost_invariant_loop to find the coldest invariant loop between outermost >> loop and itself could also reach the purpose. Then the gimple code check is >> redundant and could be removed. >> >>> >>>> May be I could collect the number of instructions not hoisted with the patch >>>> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected" >>>> and "overall cost" need to be considered? But it seems move_computations_worker >>>> couldn't rollback if we still want to hoist multiple stmts out during the iterations? >>>> >>>>> >>>>> Now - GIMPLE LIM "costing" is somewhat backward right now >>>>> and it isn't set up to consider those multiple involved stmts. Plus >>>>> the store-motion part does not have any cost part (but it depends >>>>> on previously decided invariant motions). >>>>> >>>>> I think the way you implemented the check will cause no hoisting >>>>> to be performed instead of, say, hoisting to a different loop level >>>>> only. Possibly shown when you consider a loop nest like >>>>> >>>>> for (;;) >>>>> if (unlikely_cond) >>>>> for (;;) >>>>> invariant; >>>>> >>>>> we want to hoist 'invariant' but only from the inner loop even if it >>>>> is invariant also in the outer loop. >>>> >>>> >>>> For this case, theorotically I think the master GCC will optimize it to: >>>> >>>> invariant; >>>> for (;;) >>>> if (unlikely_cond) >>>> for (;;) >>>> ; >>>> >>>> 'invariant' is moved out of outer loop, but with the patch, it will get: >>>> >>>> for (;;) >>>> if (unlikely_cond) >>>> { >>>> invariant; >>>> for (;;) >>>> ; >>>> } >>>> >>>> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop, >>>> so hoist it out of inner loop, this is exactly what we want, right? >>> >>> Yes. I had doubts your patch would achieve that. 
>>>
>>
>>
>> The below updated patch could achieve it:
>>
>>
>> There was a patch trying to avoid move cold block out of loop:
>>
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>
>> Richard suggested to "never hoist anything from a bb with lower execution
>> frequency to a bb with higher one in LIM invariantness_dom_walker
>> before_dom_children".
>>
>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>> expected target loop, then if profile count of the loop bb is colder
>> than target loop preheader, it won't be hoisted out of loop.
>>
>> SPEC2017 performance evaluation shows 1% performance improvement for
>> intrate GEOMEAN and no obvious regression for others. Especially,
>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>> on P8LE.
>>
>> Regression and bootstrap tested pass on P8LE, any comments? Thanks.
>
> Can you split the RTL and GIMPLE changes and measure them separately
> please?

I did that before and got the data below.  It differs slightly from the
earlier numbers because it uses ratios instead of seconds.  500.perlbench_r
clearly benefits from the RTL part of the change, while the GIMPLE part only
improves exchange2 and blender, with a regression on nab that requires the
loop split fix:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html

The reason is that lim2 doesn't hoist the code out of the loop, but loop
split generates a duplicated loop with an incorrect profile count on the
preheader and header bb, so the later lim4 hoists the code out of the loop to
an unexpected place.  The same loop shows mismatched behavior in lim2 and
lim4.  With that patch, the regression is gone.

                 | Gimple+RTL | Gimple lim | RTL loop-invariant
500.perlbench_r  |    8.03%   |    0.67%   |    7.69%
502.gcc_r        |    0.56%   |    0.37%   |    0.19%
505.mcf_r        |    0.19%   |   -0.19%   |    0.39%
520.omnetpp_r    |    0.83%   |    0.83%   |    0.83%
523.xalancbmk_r  |   -0.78%   |    0.00%   |   -1.04%
525.x264_r       |    0.17%   |    0.00%   |    0.00%
531.deepsjeng_r  |    0.00%   |    0.31%   |    0.00%
541.leela_r      |    0.00%   |   -0.31%   |    0.31%
548.exchange2_r  |    2.08%   |    1.85%   |    0.23%
557.xz_r         |    0.97%   |    0.00%   |    0.65%
503.bwaves_r     |   -0.12%   |    0.00%   |   -0.23%
507.cactuBSSN_r  |    0.00%   |    0.14%   |    0.00%
508.namd_r       |    0.00%   |    0.00%   |    0.00%
510.parest_r     |   -0.16%   |   -0.65%   |    0.00%
511.povray_r     |    0.30%   |    0.91%   |    0.91%
519.lbm_r        |    0.15%   |    0.00%   |    0.00%
521.wrf_r        |    0.00%   |    0.00%   |   -0.80%
526.blender_r    |    1.84%   |    0.26%   |    0.52%
527.cam4_r       |    0.28%   |    0.00%   |    0.00%
538.imagick_r    |    0.20%   |    0.00%   |    0.00%
544.nab_r        |   -1.55%   |   -0.78%   |    0.00%
549.fotonik3d_r  |   -0.25%   |    0.00%   |    0.00%
554.roms_r       |   -0.84%   |    0.00%   |   -0.63%
INT GEOMEAN      |    1.16%   |    0.35%   |    0.90%
FLOAT GEOMEAN    |   -0.01%   |   -0.01%   |   -0.02%
GEOMEAN          |    0.50%   |    0.15%   |    0.38%

Will address other comments in later reply.  Thanks.
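
The mail does not say how the GEOMEAN rows were derived.  Assuming the usual
convention of a geometric mean over per-benchmark ratios, the arithmetic can
be sketched as follows; geomean_percent is a hypothetical helper name, and
each percentage p is treated as the ratio 1 + p/100.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark changes given in percent
// (e.g. 8.03 means +8.03%).  Each change is converted to a ratio
// 1 + p/100, the ratios are averaged in log space, and the result is
// converted back to a percentage.
double geomean_percent(const std::vector<double>& percent_changes) {
  assert(!percent_changes.empty());
  double log_sum = 0.0;
  for (double p : percent_changes) {
    const double ratio = 1.0 + p / 100.0;
    assert(ratio > 0.0);  // a change of -100% or less has no geometric mean
    log_sum += std::log(ratio);
  }
  const double mean_ratio =
      std::exp(log_sum / static_cast<double>(percent_changes.size()));
  return (mean_ratio - 1.0) * 100.0;
}
```

Feeding it the ten intrate rows of the Gimple+RTL column gives a value close
to the reported INT GEOMEAN; an exact match is not expected because the
per-benchmark entries are already rounded.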
On 2021/9/23 10:13, Xionghu Luo via Gcc-patches wrote: > > > On 2021/9/22 17:14, Richard Biener wrote: >> On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: >>> >>> >>> >>> On 2021/8/26 19:33, Richard Biener wrote: >>>> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> >>>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> On 2021/8/6 20:15, Richard Biener wrote: >>>>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> >>>>>> wrote: >>>>>>> >>>>>>> There was a patch trying to avoid move cold block out of loop: >>>>>>> >>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>>>>>> >>>>>>> Richard suggested to "never hoist anything from a bb with lower >>>>>>> execution >>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker >>>>>>> before_dom_children". >>>>>>> >>>>>>> This patch does this profile count check in both gimple LIM >>>>>>> move_computations_worker and RTL loop-invariant.c >>>>>>> find_invariants_bb, >>>>>>> if the loop bb is colder than loop preheader, don't hoist it out of >>>>>>> loop. >>>>>>> >>>>>>> Also, the profile count in loop split pass should be corrected to >>>>>>> avoid >>>>>>> lim2 and lim4 mismatch behavior, currently, the new loop >>>>>>> preheader generated >>>>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt >>>>>>> pass will >>>>>>> move statement out of loop unexpectely when lim2 didn't move it. >>>>>>> This >>>>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%. >>>>>>> >>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for >>>>>>> intrate GEOMEAN and no obvious regression for others. Especially, >>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>>>>>> on P8LE. >>>>>>> >>>>>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks. 
>>>>>> >>>>>> While I'm not familiar with the RTL invariant motion pass the >>>>>> patch there >>>>>> looks reasonable. Note that we should assess the profile quality >>>>>> somehow - I'm not sure how to do that, CCed Honza for that. >>>>> >>>>> Thanks. >>>>> >>>>>> >>>>>> For the GIMPLE part the patch looks quite complicated - but note it >>>>>> probably has to be since LIM performs kind of a "CSE" on loads >>>>>> (and stores for store-motion), so when there are multiple stmts >>>>>> affected by a hoisting decision the biggest block count has to be >>>>>> accounted. Likewise when there are dependent stmts involved >>>>>> that might include conditional stmts (a "PHI"), but the overall >>>>>> cost should be looked at. >>>>> >>>>> Currently, The gimple code check two situations with the patch: >>>>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move >>>>> it out >>>>> of loop; >>>>> 2) The statement or PHI's BB is *hotter* then preheader, but any of >>>>> it's rhs >>>>> couldn't be moved out of loop, also don't move it out of loop to >>>>> avoid definition >>>>> not dominates use error. >>>> >>>> But part 2) is obviously already done. What I tried to say is your >>>> heuristic >>>> doesn't integrate nicely with the pass but I admitted that it might >>>> be a bit >>>> difficult to find a place to add this heuristic. >>>> >>>> There is lim_data->cost which we could bias negatively but then this is >>>> a cost that is independent on the hoisting distance. But doing this >>>> would >>>> work at least for the case where the immediately enclosing loop >>>> preheader >>>> is hotter than the stmt and with this it would be a patch that's >>>> similarly >>>> simple as the RTL one. >>>> >>>> Another possibility is to simply only adjust PHI processing in >>>> compute_invariantness, capping movement according to the hotness >>>> heuristic. 
The same could be done for regular stmts there but I'm >>>> not sure that will do good in the end since this function is supposed >>>> to compute "correctness" (well, it also has the cost stuff), and it's >>>> not the place to do overall cost considerations. >>> >>> Thanks. I found that adding a function find_coldest_out_loop and >>> check it in >>> outermost_invariant_loop to find the coldest invariant loop between >>> outermost >>> loop and itself could also reach the purpose. Then the gimple code >>> check is >>> redundant and could be removed. >>> >>>> >>>>> May be I could collect the number of instructions not hoisted with >>>>> the patch >>>>> on regression tests and SPEC2017 to do a estimation for "multiple >>>>> stmts affected" >>>>> and "overall cost" need to be considered? But it seems >>>>> move_computations_worker >>>>> couldn't rollback if we still want to hoist multiple stmts out >>>>> during the iterations? >>>>> >>>>>> >>>>>> Now - GIMPLE LIM "costing" is somewhat backward right now >>>>>> and it isn't set up to consider those multiple involved stmts. Plus >>>>>> the store-motion part does not have any cost part (but it depends >>>>>> on previously decided invariant motions). >>>>>> >>>>>> I think the way you implemented the check will cause no hoisting >>>>>> to be performed instead of, say, hoisting to a different loop level >>>>>> only. Possibly shown when you consider a loop nest like >>>>>> >>>>>> for (;;) >>>>>> if (unlikely_cond) >>>>>> for (;;) >>>>>> invariant; >>>>>> >>>>>> we want to hoist 'invariant' but only from the inner loop even if it >>>>>> is invariant also in the outer loop. 
>>>>> >>>>> >>>>> For this case, theorotically I think the master GCC will optimize >>>>> it to: >>>>> >>>>> invariant; >>>>> for (;;) >>>>> if (unlikely_cond) >>>>> for (;;) >>>>> ; >>>>> >>>>> 'invariant' is moved out of outer loop, but with the patch, it will >>>>> get: >>>>> >>>>> for (;;) >>>>> if (unlikely_cond) >>>>> { >>>>> invariant; >>>>> for (;;) >>>>> ; >>>>> } >>>>> >>>>> 'invariant' is *cold* for outer loop, but it is still *hot* for >>>>> inner loop, >>>>> so hoist it out of inner loop, this is exactly what we want, right? >>>> >>>> Yes. I had doubts your patch would achieve that. >>>> >>> >>> >>> The below updated patch could achieve it: >>> >>> >>> There was a patch trying to avoid move cold block out of loop: >>> >>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>> >>> Richard suggested to "never hoist anything from a bb with lower >>> execution >>> frequency to a bb with higher one in LIM invariantness_dom_walker >>> before_dom_children". >>> >>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >>> expected target loop, then if profile count of the loop bb is colder >>> than target loop preheader, it won't be hoisted out of loop. >>> >>> SPEC2017 performance evaluation shows 1% performance improvement for >>> intrate GEOMEAN and no obvious regression for others. Especially, >>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>> on P8LE. >>> >>> Regression and bootstrap tested pass on P8LE, any comments? Thanks. >> >> Can you split the RTL and GIMPLE changes and measure them separately >> please? 
>
> I did that before and got below data, it is slightly different due to
> using ratio instead of seconds, 500.perlbench_r obviously benefits
> from the RTL part change, while gimple part only improves exchange2
> and blender, with a regression on nab which requires the fix of loop
> split,
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html
>
> Reason is lim2 doesn't hoist code out of loop, but loop split generates
> duplicated loop with incorrect profile count on preheader and header bb,
> then later lim4 hoists code out of loop to unexpected place.  Same loop
> shows mismatch behavior in lim2 and lim4.  With that patch, the regression
> is gone.
>
>
>                  | Gimple+RTL | Gimple lim | RTL loop-invariant
> 500.perlbench_r  |    8.03%   |    0.67%   |    7.69%
> 502.gcc_r        |    0.56%   |    0.37%   |    0.19%
> 505.mcf_r        |    0.19%   |   -0.19%   |    0.39%
> 520.omnetpp_r    |    0.83%   |    0.83%   |    0.83%
> 523.xalancbmk_r  |   -0.78%   |    0.00%   |   -1.04%
> 525.x264_r       |    0.17%   |    0.00%   |    0.00%
> 531.deepsjeng_r  |    0.00%   |    0.31%   |    0.00%
> 541.leela_r      |    0.00%   |   -0.31%   |    0.31%
> 548.exchange2_r  |    2.08%   |    1.85%   |    0.23%
> 557.xz_r         |    0.97%   |    0.00%   |    0.65%
> 503.bwaves_r     |   -0.12%   |    0.00%   |   -0.23%
> 507.cactuBSSN_r  |    0.00%   |    0.14%   |    0.00%
> 508.namd_r       |    0.00%   |    0.00%   |    0.00%
> 510.parest_r     |   -0.16%   |   -0.65%   |    0.00%
> 511.povray_r     |    0.30%   |    0.91%   |    0.91%
> 519.lbm_r        |    0.15%   |    0.00%   |    0.00%
> 521.wrf_r        |    0.00%   |    0.00%   |   -0.80%
> 526.blender_r    |    1.84%   |    0.26%   |    0.52%
> 527.cam4_r       |    0.28%   |    0.00%   |    0.00%
> 538.imagick_r    |    0.20%   |    0.00%   |    0.00%
> 544.nab_r        |   -1.55%   |   -0.78%   |    0.00%
> 549.fotonik3d_r  |   -0.25%   |    0.00%   |    0.00%
> 554.roms_r       |   -0.84%   |    0.00%   |   -0.63%
> INT GEOMEAN      |    1.16%   |    0.35%   |    0.90%
> FLOAT GEOMEAN    |   -0.01%   |   -0.01%   |   -0.02%
> GEOMEAN          |    0.50%   |    0.15%   |    0.38%

BTW, feedback from other platform: I do see ~8% performance improvement
for 500.perlbench on aarch64.

>
>
> Will address other comments in later reply.  Thanks.
>
>
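
The store-motion check requested in the review (apply store motion if at
least one reference is executed in a hot part of the loop) reduces to an
any-of test over the locations of the reference.  The block below is a
minimal sketch under stated assumptions and does not model GCC's data
structures: ToyLoc, bb_count and worth_store_motion are hypothetical names,
integers stand in for profile_count, and std::any_of takes the place of the
walk over locations that the review describes.  A location counts as hot when
its block is at least as frequent as the loop preheader, the same comparison
the posted patch uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one location of a memory reference inside a
// loop: only the execution count of its basic block is kept.
struct ToyLoc {
  std::uint64_t bb_count;
};

// Returns true if at least one location of the reference sits in a block
// that is not colder than the loop preheader.  An empty location list
// yields false, so no store motion is attempted in that case.
bool worth_store_motion(const std::vector<ToyLoc>& locs,
                        std::uint64_t preheader_count) {
  return std::any_of(locs.begin(), locs.end(),
                     [preheader_count](const ToyLoc& loc) {
                       return loc.bb_count >= preheader_count;
                     });
}
```

Requiring only one hot location, rather than all of them, keeps store motion
enabled for references that are written both on a hot path and on a rarely
taken error path.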
Update the patch to v3, not sure whether you prefer the paste style and
continue to link the previous thread as Segher dislikes this...


[PATCH v3] Don't move cold code out of loop by checking bb count


Changes:
1. Handle max_loop in determine_max_movement instead of
   outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in
   can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept: without it the
   iterator is never advanced for GIMPLE_COND, which caused an infinite
   loop when implementing v1.

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html

There was a patch trying to avoid moving cold blocks out of loops:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add find_coldest_out_loop to move invariants to
the expected target loop; if the profile count of the loop bb is colder
than the target loop preheader, it won't be hoisted out of the loop.
Likewise for store motion, if all locations of the REF in the loop are
cold, don't do store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r +1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* loop-invariant.c (find_invariants_bb): Check profile count
	before motion.
	(find_invariants_body): Add argument.
	* tree-ssa-loop-im.c (find_coldest_out_loop): New function.
	(determine_max_movement): Use find_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration update.
	(execute_sm_exit): Check pointer validness.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
(can_sm_ref_p): Use for_all_locs_in_loop. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/recip-3.c: Adjust. * gcc.dg/tree-ssa/ssa-lim-18.c: New test. * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. --- gcc/loop-invariant.c | 10 ++-- gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ 7 files changed, 165 insertions(+), 8 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index fca0c2b24be..5c3be7bf0eb 100644 --- a/gcc/loop-invariant.c +++ b/gcc/loop-invariant.c @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) call. 
*/ static void -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, + bool always_executed) { rtx_insn *insn; + basic_block preheader = loop_preheader_edge (loop)->src; + + if (preheader->count > bb->count) + return; FOR_BB_INSNS (bb, insn) { @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, unsigned i; for (i = 0; i < loop->num_nodes; i++) - find_invariants_bb (body[i], - bitmap_bit_p (always_reached, i), + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), bitmap_bit_p (always_executed, i)); } diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 4b187c2cdaf..655fab03442 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) return ret; } +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ + +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ + class loop *cold_loop, *min_loop; + cold_loop = min_loop = outmost_loop; + profile_count min_count = loop_preheader_edge (min_loop)->src->count; + + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) + return NULL; + + while (min_loop != loop) + { + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); + if (loop_preheader_edge (min_loop)->src->count < min_count) + cold_loop = min_loop; + } + return cold_loop; +} + /* Suppose that operand DEF is used inside the LOOP. Returns the outermost loop to that we could move the expression using DEF if it did not have other operands, i.e. 
the outermost loop enclosing LOOP in that the value @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) level = ALWAYS_EXECUTED_IN (bb); else level = superloop_at_depth (loop, 1); - lim_data->max_loop = level; + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); + if (!lim_data->max_loop) + return false; if (gphi *phi = dyn_cast <gphi *> (stmt)) { @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb) for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) { edge e; - gimple *stmt = gsi_stmt (bsi); lim_data = get_lim_data (stmt); @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb) /* We do not really want to move conditionals out of the loop; we just placed it here to force its operands to be moved if necessary. */ if (gimple_code (stmt) == GIMPLE_COND) - continue; + { + gsi_next (&bsi); + continue; + } if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq, } else { - sm_aux *aux = *aux_map.get (ref); + sm_aux **paux = aux_map.get (ref); + sm_aux *aux; + if (paux) + aux = *paux; + else + continue; if (!aux->store_flag || kind == sm_ord) { gassign *store; @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) return indep_p; } +class ref_in_loop_hot_body +{ +public: + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} + bool operator () (mem_ref_loc *loc); + class loop *l; +}; + +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + edge e = loop_preheader_edge (l); + if (e->src->count > curr_bb->count) + return false; + else + return true; +} + /* Returns true if we can perform store motion of REF from LOOP. 
*/ @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) if (!ref_indep_loop_p (loop, ref, sm_war)) return false; + if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop))) + return false; + return true; } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c new file mode 100644 index 00000000000..7326a230b3f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int k) +{ + int i; + + for (i = 0; i < n; i++) + { + if (__builtin_expect (x, 0)) + bar (k / 5, "one", "two"); + a[i] = k; + } +} + +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..f0a99fa42b4 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int k, int s) +{ + int i; + int j; + + for (i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) + { + bar (k / 5, "one", "two"); + a[s] = k; + } + a[s] = s; + } +} + +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + diff --git 
a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..bc60a040a70 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is not hoisted out of loop when bb is cold. */ + +int count; +volatile int x; + +struct obj { + int data; + struct obj *next; + +} *q; + +void +func (int m) +{ + struct obj *p; + for (int i = 0; i < m; i++) + if (__builtin_expect (x, 0)) + count++; + +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c new file mode 100644 index 00000000000..fedaa3b7119 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is hoisted out of loop when one of it's used bb is hot. */ + +int count; +volatile int x; + +struct obj { + int data; + struct obj *next; + +} *q; + +void +func (int m, int n) +{ + struct obj *p; + for (int i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + count++; + count += n; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion of" 1 "lim2" } } */ +
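The selection that find_coldest_out_loop is meant to perform in the patch above can be modeled with plain integers. What follows is a minimal C++ sketch under stated assumptions, not code from the patch: `coldest_out_loop`, the vector-of-counts representation, and the -1 return value are hypothetical, and real GCC counts are `profile_count` objects rather than `long` values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not GCC code.  preheader_counts[0] is the preheader
// count of the outermost loop the statement may legally move to, and
// preheader_counts.back () is that of the loop containing the statement.
// Returns the index of the loop with the coldest preheader, or -1 when the
// statement's block is colder than its own loop's preheader.
int
coldest_out_loop (const std::vector<long> &preheader_counts, long bb_count)
{
  assert (!preheader_counts.empty ());

  // A block colder than the innermost preheader is not worth hoisting.
  if (bb_count < preheader_counts.back ())
    return -1;

  // Walk from the outermost loop inwards, remembering the coldest preheader.
  std::size_t coldest = 0;
  for (std::size_t i = 1; i < preheader_counts.size (); ++i)
    if (preheader_counts[i] < preheader_counts[coldest])
      coldest = i;
  return static_cast<int> (coldest);
}
```

With counts shaped like ssa-lim-19.c, where a hot outer loop wraps a cold inner loop, `coldest_out_loop ({100, 10}, 100)` selects index 1, so the statement would leave the inner loop only. The sketch tracks a running minimum, whereas the hunk as posted appears to compare every level against the outermost preheader's count, since min_count is never updated inside the while loop.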
On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > > Update the patch to v3, not sure whether you prefer the paste style > and continue to link the previous thread as Segher dislikes this... > > > [PATCH v3] Don't move cold code out of loop by checking bb count > > > Changes: > 1. Handle max_loop in determine_max_movement instead of > outermost_invariant_loop. > 2. Remove unnecessary changes. > 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. > 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > infinite loop when implementing v1 and the iteration is missed to be > updated actually. > > v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > > There was a patch trying to avoid move cold block out of loop: > > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > > Richard suggested to "never hoist anything from a bb with lower execution > frequency to a bb with higher one in LIM invariantness_dom_walker > before_dom_children". > > In gimple LIM analysis, add find_coldest_out_loop to move invariants to > expected target loop, if profile count of the loop bb is colder > than target loop preheader, it won't be hoisted out of loop. > Likely for store motion, if all locations of the REF in loop is cold, > don't do store motion of it. > > SPEC2017 performance evaluation shows 1% performance improvement for > intrate GEOMEAN and no obvious regression for others. Especially, > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > on P8LE. > > gcc/ChangeLog: > > * loop-invariant.c (find_invariants_bb): Check profile count > before motion. > (find_invariants_body): Add argument. > * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > (determine_max_movement): Use find_coldest_out_loop. 
> (move_computations_worker): Adjust and fix iteration udpate. > (execute_sm_exit): Check pointer validness. > (class ref_in_loop_hot_body): New functor. > (ref_in_loop_hot_body::operator): New. > (can_sm_ref_p): Use for_all_locs_in_loop. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/recip-3.c: Adjust. > * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > --- > gcc/loop-invariant.c | 10 ++-- > gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- > gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ > 7 files changed, 165 insertions(+), 8 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > > diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > index fca0c2b24be..5c3be7bf0eb 100644 > --- a/gcc/loop-invariant.c > +++ b/gcc/loop-invariant.c > @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > call. 
*/ > > static void > -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > + bool always_executed) > { > rtx_insn *insn; > + basic_block preheader = loop_preheader_edge (loop)->src; > + > + if (preheader->count > bb->count) > + return; > > FOR_BB_INSNS (bb, insn) > { > @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > unsigned i; > > for (i = 0; i < loop->num_nodes; i++) > - find_invariants_bb (body[i], > - bitmap_bit_p (always_reached, i), > + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > bitmap_bit_p (always_executed, i)); > } > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index 4b187c2cdaf..655fab03442 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > return ret; > } > > +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ > + > +static class loop * > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block curr_bb) > +{ > + class loop *cold_loop, *min_loop; > + cold_loop = min_loop = outmost_loop; > + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > + > + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) Honza - can you comment on whether we should compare BB counts this way? I would suspect that for, say, for (...) if (a) X; else Y; that the counts for X and Y will be less than that of the preheader of the loop only when the loop is estimated to run once. That is, should we really compare the to the preheader count or maybe better to the _header_ count which would keep the number of iterations out of the equation? 
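The arithmetic behind this concern can be sketched under a simple assumed model: the header runs `iterations` times per entry from the preheader, and a branch inside the body is taken with a fixed percentage. `BlockCounts`, `estimate_counts`, and both predicates are hypothetical names, and the integer scaling is only a stand-in for GCC's actual profile estimation.

```cpp
#include <cassert>

// Hypothetical static-profile model, not GCC code.
struct BlockCounts
{
  long preheader;  // executions of the loop preheader
  long header;     // executions of the loop header
  long branch;     // executions of one arm of the if/else in the body
};

// Assumes the header runs ITERATIONS times per preheader execution and the
// arm is taken BRANCH_PERCENT percent of the time.
BlockCounts
estimate_counts (long preheader, long iterations, long branch_percent)
{
  assert (iterations >= 1 && branch_percent >= 0 && branch_percent <= 100);
  long header = preheader * iterations;
  long branch = header * branch_percent / 100;
  return {preheader, header, branch};
}

// Comparison against the preheader depends on the iteration estimate.
bool
colder_than_preheader (const BlockCounts &c)
{
  return c.branch < c.preheader;
}

// Comparison against the header depends only on the branch probability.
bool
colder_than_header (const BlockCounts &c)
{
  return c.branch < c.header;
}
```

Under this model a 50% arm in a loop entered with count 100 is colder than the preheader for one iteration (50 < 100) but not for ten (500 > 100), while it is colder than the header in both cases.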
If we look at maybe_hot_count_p that's a quite sophisticated thing to
compare a count to the "IPA hot", here we're comparing two counts
within a function where it actually matters whether we use a<b or
!(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
function).

Xionghu, you err on the side of not hoisting for unordered counts here

> +    return NULL;
> +
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (loop_preheader_edge (min_loop)->src->count < min_count)

but in the other direction here and on the side of not hoisting
in ref_in_loop_hot_body.

The three-state relational operator overloads are probably not the
very best idea...
(see profile-count.h for them)

> +        cold_loop = min_loop;
> +    }
> +  return cold_loop;
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
>      level = superloop_at_depth (loop, 1);
> -  lim_data->max_loop = level;
> +  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
> +  if (!lim_data->max_loop)
> +    return false;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>      {
> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>      {
>        edge e;
> -
>        gimple *stmt = gsi_stmt (bsi);
>
>        lim_data = get_lim_data (stmt);
> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>           placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -        continue;
> +        {
> +          gsi_next (&bsi);
> +          continue;
> +        }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>          }
>        else
>          {
> -          sm_aux *aux = *aux_map.get (ref);
> +          sm_aux **paux = aux_map.get (ref);
> +          sm_aux *aux;
> +          if (paux)
> +            aux = *paux;
> +          else
> +            continue;

do you really need this?  I doubt so.

>            if (!aux->store_flag || kind == sm_ord)
>              {
>                gassign *store;
> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>    return indep_p;
>  }
>
> +class ref_in_loop_hot_body
> +{
> +public:
> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> +  bool operator () (mem_ref_loc *loc);
> +  class loop *l;
> +};
> +
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  edge e = loop_preheader_edge (l);
> +  if (e->src->count > curr_bb->count)
> +    return false;
> +  else
> +    return true;
> +}
> +
>
>  /* Returns true if we can perform store motion of REF from LOOP.  */
>
> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    if (!ref_indep_loop_p (loop, ref, sm_war))
>      return false;
>

Add a comment here what this is about.

Otherwise the GIMPLE invariant motion parts look sensible, but I'd
really like to have the issue on the profile_count API sorted out.

Can you split out the RTL invariant motion part to a separate patch please?

Thanks,
Richard.
> + if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop))) > + return false; > + > return true; > } > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > index 638bf38db8c..641c91e719e 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > @@ -23,4 +23,4 @@ float h () > F[0] += E / d; > } > > -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ > +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > new file mode 100644 > index 00000000000..7326a230b3f > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > @@ -0,0 +1,20 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +bar (int, char *, char *); > +void > +foo (int *a, int n, int k) > +{ > + int i; > + > + for (i = 0; i < n; i++) > + { > + if (__builtin_expect (x, 0)) > + bar (k / 5, "one", "two"); > + a[i] = k; > + } > +} > + > +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..f0a99fa42b4 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,27 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +bar (int, char *, char *); > +void > +foo (int *a, int n, int m, int k, int s) > +{ > + int i; > + int j; > + > + for (i = 0; i < m; i++) > + { > + if (__builtin_expect (x, 0)) > + for (j = 0; j < n; j++) > + { > + bar (k / 5, "one", "two"); > + a[s] = k; > + } > + a[s] = s; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ > + > diff --git 
a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..bc60a040a70 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,25 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +/* Test that `count' is not hoisted out of loop when bb is cold. */ > + > +int count; > +volatile int x; > + > +struct obj { > + int data; > + struct obj *next; > + > +} *q; > + > +void > +func (int m) > +{ > + struct obj *p; > + for (int i = 0; i < m; i++) > + if (__builtin_expect (x, 0)) > + count++; > + > +} > + > +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > new file mode 100644 > index 00000000000..fedaa3b7119 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > @@ -0,0 +1,28 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +/* Test that `count' is hoisted out of loop when one of it's used bb is hot. */ > + > +int count; > +volatile int x; > + > +struct obj { > + int data; > + struct obj *next; > + > +} *q; > + > +void > +func (int m, int n) > +{ > + struct obj *p; > + for (int i = 0; i < m; i++) > + { > + if (__builtin_expect (x, 0)) > + count++; > + count += n; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion of" 1 "lim2" } } */ > + > -- > 2.27.0.90.geebb51ba8c > >
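The difference between a<b and !(a>=b) raised in the review can be seen in a small model of a count that may be unknown. This is a sketch under stated assumptions: `Count`, `skip_hoist_strict`, and `skip_hoist_negated` are hypothetical, and the real operators in profile-count.h may distinguish more cases than a single `known` flag.

```cpp
// Hypothetical stand-in for a profile count that may be unknown.
struct Count
{
  long value;
  bool known;
};

// Both relations answer false when either operand is unknown, mirroring
// the "unordered is mapped to false" behaviour described in the review.
bool
operator< (const Count &a, const Count &b)
{
  return a.known && b.known && a.value < b.value;
}

bool
operator>= (const Count &a, const Count &b)
{
  return a.known && b.known && a.value >= b.value;
}

// Skips hoisting only when the block is provably colder than the preheader;
// an unknown count therefore still hoists.
bool
skip_hoist_strict (const Count &bb, const Count &preheader)
{
  return bb < preheader;
}

// Skips hoisting unless the block is provably at least as hot as the
// preheader; an unknown count therefore does not hoist.
bool
skip_hoist_negated (const Count &bb, const Count &preheader)
{
  return !(bb >= preheader);
}
```

For two known counts the predicates agree. When the block's count is unknown, `skip_hoist_strict` returns false and `skip_hoist_negated` returns true, so the spelling of the comparison decides which way an undecidable case falls.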
Hi, On 2021/9/28 20:09, Richard Biener wrote: > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: >> >> Update the patch to v3, not sure whether you prefer the paste style >> and continue to link the previous thread as Segher dislikes this... >> >> >> [PATCH v3] Don't move cold code out of loop by checking bb count >> >> >> Changes: >> 1. Handle max_loop in determine_max_movement instead of >> outermost_invariant_loop. >> 2. Remove unnecessary changes. >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused >> infinite loop when implementing v1 and the iteration is missed to be >> updated actually. >> >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html >> >> There was a patch trying to avoid move cold block out of loop: >> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >> >> Richard suggested to "never hoist anything from a bb with lower execution >> frequency to a bb with higher one in LIM invariantness_dom_walker >> before_dom_children". >> >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >> expected target loop, if profile count of the loop bb is colder >> than target loop preheader, it won't be hoisted out of loop. >> Likely for store motion, if all locations of the REF in loop is cold, >> don't do store motion of it. >> >> SPEC2017 performance evaluation shows 1% performance improvement for >> intrate GEOMEAN and no obvious regression for others. Especially, >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >> on P8LE. >> >> gcc/ChangeLog: >> >> * loop-invariant.c (find_invariants_bb): Check profile count >> before motion. >> (find_invariants_body): Add argument. 
>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. >> (determine_max_movement): Use find_coldest_out_loop. >> (move_computations_worker): Adjust and fix iteration udpate. >> (execute_sm_exit): Check pointer validness. >> (class ref_in_loop_hot_body): New functor. >> (ref_in_loop_hot_body::operator): New. >> (can_sm_ref_p): Use for_all_locs_in_loop. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/tree-ssa/recip-3.c: Adjust. >> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >> --- >> gcc/loop-invariant.c | 10 ++-- >> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- >> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ >> 7 files changed, 165 insertions(+), 8 deletions(-) >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c >> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c >> index fca0c2b24be..5c3be7bf0eb 100644 >> --- a/gcc/loop-invariant.c >> +++ b/gcc/loop-invariant.c >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) >> call. 
*/ >> >> static void >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, >> + bool always_executed) >> { >> rtx_insn *insn; >> + basic_block preheader = loop_preheader_edge (loop)->src; >> + >> + if (preheader->count > bb->count) >> + return; >> >> FOR_BB_INSNS (bb, insn) >> { >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, >> unsigned i; >> >> for (i = 0; i < loop->num_nodes; i++) >> - find_invariants_bb (body[i], >> - bitmap_bit_p (always_reached, i), >> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), >> bitmap_bit_p (always_executed, i)); >> } >> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >> index 4b187c2cdaf..655fab03442 100644 >> --- a/gcc/tree-ssa-loop-im.c >> +++ b/gcc/tree-ssa-loop-im.c >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) >> return ret; >> } >> >> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ >> + >> +static class loop * >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >> + basic_block curr_bb) >> +{ >> + class loop *cold_loop, *min_loop; >> + cold_loop = min_loop = outmost_loop; >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >> + >> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > > Honza - can you comment on whether we should compare BB counts this way? > > I would suspect that for, say, > > for (...) > if (a) > X; > else > Y; > > that the counts for X and Y will be less than that of the preheader of the loop > only when the loop is estimated to run once. That is, should we really compare > the to the preheader count or maybe better to the _header_ count which > would keep the number of iterations out of the equation? 
I quickly tried replacing every loop_preheader_edge (loop)->src with
loop_preheader_edge (loop)->dest, which causes many failures in
gcc.dg/tree-ssa/ssa-lim-*.c.  I didn't investigate deeply, but it seems
reasonable to compare the bb count with the preheader count, since both
gimple LIM and RTL loop-invariant move instructions to the *preheader*
rather than the *header* after analysis?

>
> If we look at maybe_hot_count_p that's a quite sophisticated thing to
> compare a count to the "IPA hot", here we're comparing two counts
> within a function where it actually matters whether we use a<b or
> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> function).
>
> Xionghu, you err on the side of not hoisting for unordered counts here
>
>> +    return NULL;
>> +
>> +  while (min_loop != loop)
>> +    {
>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
>
> but in the other direction here and on the side of not hoisting
> in ref_in_loop_hot_body.
>
> The three-state relational operator overloads are probably not the
> very best idea...
> (see profile-count.h for them)
>

I added a new function, bb_colder_than_loop_preheader, to encapsulate
the comparison.  If it returns FALSE because the three-state inequality
cannot be decided, find_coldest_out_loop returns the original input for
lim_data->max_loop, and ref_in_loop_hot_body::operator () returns true
so that store motion still proceeds; both preserve the previous behavior.

>> +        cold_loop = min_loop;
>> +    }
>> +  return cold_loop;
>> +}
>> +
>>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>>     loop to that we could move the expression using DEF if it did not have
>>     other operands, i.e. the outermost loop enclosing LOOP in that the value
>> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>>      level = ALWAYS_EXECUTED_IN (bb);
>>    else
>>      level = superloop_at_depth (loop, 1);
>> -  lim_data->max_loop = level;
>> +  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
>> +  if (!lim_data->max_loop)
>> +    return false;
>>
>>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>>      {
>> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
>>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>>      {
>>        edge e;
>> -
>>        gimple *stmt = gsi_stmt (bsi);
>>
>>        lim_data = get_lim_data (stmt);
>> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
>>        /* We do not really want to move conditionals out of the loop; we just
>>           placed it here to force its operands to be moved if necessary.  */
>>        if (gimple_code (stmt) == GIMPLE_COND)
>> -        continue;
>> +        {
>> +          gsi_next (&bsi);
>> +          continue;
>> +        }
>>
>>        if (dump_file && (dump_flags & TDF_DETAILS))
>>          {
>> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>>          }
>>        else
>>          {
>> -          sm_aux *aux = *aux_map.get (ref);
>> +          sm_aux **paux = aux_map.get (ref);
>> +          sm_aux *aux;
>> +          if (paux)
>> +            aux = *paux;
>> +          else
>> +            continue;
>
> do you really need this?  I doubt so.

Removed.
> >> if (!aux->store_flag || kind == sm_ord) >> { >> gassign *store; >> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) >> return indep_p; >> } >> >> +class ref_in_loop_hot_body >> +{ >> +public: >> + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} >> + bool operator () (mem_ref_loc *loc); >> + class loop *l; >> +}; >> + >> +bool >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) >> +{ >> + basic_block curr_bb = gimple_bb (loc->stmt); >> + edge e = loop_preheader_edge (l); >> + if (e->src->count > curr_bb->count) >> + return false; >> + else >> + return true; >> +} >> + >> >> /* Returns true if we can perform store motion of REF from LOOP. */ >> >> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) >> if (!ref_indep_loop_p (loop, ref, sm_war)) >> return false; >> > > Add a comment here what this is about. Done. > > Otherwise the GIMPLE invariant motion parts look sensible, but I'd > really like to have > the issue on the profile_count API sorted out. > > Can you split out the RTL invariant motion part to a separate patch please? Done. Attached the two patches, thanks. BR, Xionghu From 092d6df49c0027001c3ed9343f0d1e8c02232d95 Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Mon, 5 Jul 2021 03:57:11 -0500 Subject: [PATCH v4 1/2] Don't move cold code out of loop by checking bb count v4 changes: 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader. 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare. 3. Split RTL invariant motion part out. 4. Remove aux changes. v3 changes: 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop. 2. Remove unnecessary changes. 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused infinite loop when implementing v1 and the iteration is missed to be updated actually. 
v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html There was a patch trying to avoid move cold block out of loop: https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html Richard suggested to "never hoist anything from a bb with lower execution frequency to a bb with higher one in LIM invariantness_dom_walker before_dom_children". In gimple LIM analysis, add find_coldest_out_loop to move invariants to expected target loop, if profile count of the loop bb is colder than target loop preheader, it won't be hoisted out of loop. Likely for store motion, if all locations of the REF in loop is cold, don't do store motion of it. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% on P8LE. gcc/ChangeLog: (find_invariants_body): Add argument. * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New function. (find_coldest_out_loop): New function. (determine_max_movement): Use find_coldest_out_loop. (move_computations_worker): Adjust and fix iteration udpate. (execute_sm_exit): Check pointer validness. (class ref_in_loop_hot_body): New functor. (ref_in_loop_hot_body::operator): New. (can_sm_ref_p): Use for_all_locs_in_loop. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/recip-3.c: Adjust. * gcc.dg/tree-ssa/ssa-lim-18.c: New test. * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. * gcc.dg/tree-ssa/ssa-lim-21.c: New test. 
--- gcc/tree-ssa-loop-im.c | 85 +++++++++++++++++++++- gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 +++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 +++++++++ 6 files changed, 191 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 4b187c2cdaf..870e0a00512 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -417,6 +417,46 @@ movement_possibility (gimple *stmt) return ret; } +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state + as stated in profile-count.h, FALSE is returned if inequality cannot be + decided. */ +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) +{ + if (count1 < count2) + return true; + else + return false; +} + +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count. + */ + +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ + class loop *cold_loop, *min_loop; + cold_loop = min_loop = outmost_loop; + profile_count min_count = loop_preheader_edge (min_loop)->src->count; + + /* If bb_colder_than_loop_preheader returns false due to three-state + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. 
*/ + if (curr_bb + && bb_colder_than_loop_preheader (curr_bb->count, + loop_preheader_edge (loop)->src->count)) + return NULL; + + while (min_loop != loop) + { + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); + if (bb_colder_than_loop_preheader ( + loop_preheader_edge (min_loop)->src->count, min_count)) + cold_loop = min_loop; + } + return cold_loop; +} + /* Suppose that operand DEF is used inside the LOOP. Returns the outermost loop to that we could move the expression using DEF if it did not have other operands, i.e. the outermost loop enclosing LOOP in that the value @@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) level = ALWAYS_EXECUTED_IN (bb); else level = superloop_at_depth (loop, 1); - lim_data->max_loop = level; + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); + if (!lim_data->max_loop) + return false; if (gphi *phi = dyn_cast <gphi *> (stmt)) { @@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb) /* We do not really want to move conditionals out of the loop; we just placed it here to force its operands to be moved if necessary. */ if (gimple_code (stmt) == GIMPLE_COND) - continue; + { + gsi_next (&bsi); + continue; + } if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -2887,6 +2932,35 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) return indep_p; } +class ref_in_loop_hot_body +{ +public: + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} + bool operator () (mem_ref_loc *loc); + class loop *l; +}; + +/* Find out the coldest loop between loop L and innermost loop, compare the + hotness between current BB and coldest loop preheader by profile count. 
*/ +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + class loop *inner_loop = curr_bb->loop_father; + class loop *cold_loop = l; + if (l != inner_loop) + cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb); + if (!cold_loop) + return false; + edge e = loop_preheader_edge (cold_loop); + /* If bb_colder_than_loop_preheader is false due to three-state inequality + comparision, TRUE is returned to continue perform store motion. */ + if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count)) + return false; + else + return true; +} + /* Returns true if we can perform store motion of REF from LOOP. */ @@ -2941,6 +3015,13 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) if (!ref_indep_loop_p (loop, ref, sm_war)) return false; + /* Verify whether the candidate is hot for LOOP. Only do store motion if the + candidate's profile count is hot. Statement in cold BB shouldn't be moved + out of it's loop_father, also it shouldn't be moved out of LOOP if it is + colder than LOOP's preheader. See ssa-lim-21.c. 
*/ + if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop))) + return false; + return true; } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c new file mode 100644 index 00000000000..7326a230b3f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int k) +{ + int i; + + for (i = 0; i < n; i++) + { + if (__builtin_expect (x, 0)) + bar (k / 5, "one", "two"); + a[i] = k; + } +} + +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..f0a99fa42b4 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int k, int s) +{ + int i; + int j; + + for (i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) + { + bar (k / 5, "one", "two"); + a[s] = k; + } + a[s] = s; + } +} + +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..bc60a040a70 --- /dev/null +++ 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is not hoisted out of loop when bb is cold. */ + +int count; +volatile int x; + +struct obj { + int data; + struct obj *next; + +} *q; + +void +func (int m) +{ + struct obj *p; + for (int i = 0; i < m; i++) + if (__builtin_expect (x, 0)) + count++; + +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c new file mode 100644 index 00000000000..c38a858283f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c @@ -0,0 +1,35 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is not hoisted out of inner loop and outer loop when it is + in cold loop. */ + +int count; +volatile int x; + +struct obj { + int data; + int data1; + struct obj *next; +}; + +void +func (int m, int n, int k, struct obj *a) +{ + struct obj *q = a; + for (int j = 0; j < m; j++) + if (__builtin_expect (m, 0)) + for (int i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + { + count++; + q->data += 3; /* Not hoisted out to inner loop. */ + } + count += n; + q->data1 += k; /* Not hoisted out to outer loop. */ + } +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ +
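For readers following the move_computations_worker hunk: the statement loop there has no increment clause in its `for` header, so every path through the body, including each `continue`, has to advance the iterator itself. The hunk adds `gsi_next (&bsi)` before the `continue` on the GIMPLE_COND path for that reason. Below is a minimal standalone sketch of the same loop shape. It assumes a `std::vector` index in place of GCC's `gimple_stmt_iterator`; the `Stmt` type and the `count_movable` helper are hypothetical and are not GCC API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GIMPLE statement; not a GCC type.
struct Stmt
{
  bool has_lim_data;  // the analysis recorded movement data for it
  bool is_cond;       // models gimple_code (stmt) == GIMPLE_COND
};

// Count the statements that would be moved.  The for header has no
// increment clause, so each path below must step `i` on its own.
inline std::size_t
count_movable (const std::vector<Stmt> &stmts)
{
  std::size_t moved = 0;
  for (std::size_t i = 0; i < stmts.size (); )
    {
      const Stmt &s = stmts[i];
      if (!s.has_lim_data)
        {
          ++i;
          continue;
        }
      if (s.is_cond)
        {
          ++i;  // the step the hunk adds; without it `i` never changes here
          continue;
        }
      ++moved;
      ++i;      // models the iterator update done when a statement is moved
    }
  return moved;
}
```

In this sketch, dropping the marked `++i` makes the loop spin on the first conditional it meets. Whether the real GCC loop behaves the same way depends on state (such as the per-statement lim data) that the sketch does not model.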
On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > > Hi, > > On 2021/9/28 20:09, Richard Biener wrote: > > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >> > >> Update the patch to v3, not sure whether you prefer the paste style > >> and continue to link the previous thread as Segher dislikes this... > >> > >> > >> [PATCH v3] Don't move cold code out of loop by checking bb count > >> > >> > >> Changes: > >> 1. Handle max_loop in determine_max_movement instead of > >> outermost_invariant_loop. > >> 2. Remove unnecessary changes. > >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. > >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > >> infinite loop when implementing v1 and the iteration is missed to be > >> updated actually. > >> > >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > >> > >> There was a patch trying to avoid move cold block out of loop: > >> > >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >> > >> Richard suggested to "never hoist anything from a bb with lower execution > >> frequency to a bb with higher one in LIM invariantness_dom_walker > >> before_dom_children". > >> > >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to > >> expected target loop, if profile count of the loop bb is colder > >> than target loop preheader, it won't be hoisted out of loop. > >> Likely for store motion, if all locations of the REF in loop is cold, > >> don't do store motion of it. > >> > >> SPEC2017 performance evaluation shows 1% performance improvement for > >> intrate GEOMEAN and no obvious regression for others. Especially, > >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >> on P8LE. 
> >> > >> gcc/ChangeLog: > >> > >> * loop-invariant.c (find_invariants_bb): Check profile count > >> before motion. > >> (find_invariants_body): Add argument. > >> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > >> (determine_max_movement): Use find_coldest_out_loop. > >> (move_computations_worker): Adjust and fix iteration udpate. > >> (execute_sm_exit): Check pointer validness. > >> (class ref_in_loop_hot_body): New functor. > >> (ref_in_loop_hot_body::operator): New. > >> (can_sm_ref_p): Use for_all_locs_in_loop. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/recip-3.c: Adjust. > >> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > >> --- > >> gcc/loop-invariant.c | 10 ++-- > >> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- > >> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ > >> 7 files changed, 165 insertions(+), 8 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > >> > >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > >> index fca0c2b24be..5c3be7bf0eb 100644 > >> --- a/gcc/loop-invariant.c > >> +++ b/gcc/loop-invariant.c > >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > >> call. 
*/ > >> > >> static void > >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > >> + bool always_executed) > >> { > >> rtx_insn *insn; > >> + basic_block preheader = loop_preheader_edge (loop)->src; > >> + > >> + if (preheader->count > bb->count) > >> + return; > >> > >> FOR_BB_INSNS (bb, insn) > >> { > >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > >> unsigned i; > >> > >> for (i = 0; i < loop->num_nodes; i++) > >> - find_invariants_bb (body[i], > >> - bitmap_bit_p (always_reached, i), > >> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > >> bitmap_bit_p (always_executed, i)); > >> } > >> > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index 4b187c2cdaf..655fab03442 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > >> return ret; > >> } > >> > >> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ > >> + > >> +static class loop * > >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >> + basic_block curr_bb) > >> +{ > >> + class loop *cold_loop, *min_loop; > >> + cold_loop = min_loop = outmost_loop; > >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > >> + > >> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > > > > Honza - can you comment on whether we should compare BB counts this way? > > > > I would suspect that for, say, > > > > for (...) > > if (a) > > X; > > else > > Y; > > > > that the counts for X and Y will be less than that of the preheader of the loop > > only when the loop is estimated to run once. That is, should we really compare > > the to the preheader count or maybe better to the _header_ count which > > would keep the number of iterations out of the equation? 
>
> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> loop_preheader_edge (loop)->dest, it will cause many failures in
> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> reasonable to compare the bb count with preheader count as both gimple lim
> and RTL loop-invariant move instructions to *preheader* instead of *header*
> after analysis?

Hmm, yeah - guess I was confused here.

> >
> > If we look at maybe_hot_count_p that's a quite sophisticated thing to
> > compare a count to the "IPA hot", here we're comparing two counts
> > within a function where it actually matters whether we use a<b or
> > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> > function).
> >
> > Xionghu, you error on the side of not hoisting for unordered counts here
> >
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >
> > but in the other direction here and on the side of not hoisting
> > in ref_in_loop_hot_body.
> >
> > The three-state relational operator overloads are probably not the
> > very best idea...
> > (see profile-count.h for them)
> >
> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
> if FALSE is returned due to three-state inequality, find_coldest_out_loop
> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
> will return true to continue perform store motion, both preserve the previous
> behavior.

Thanks. But I don't think the abstraction as written is useful:

+/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}

given the following seems to pass the preheader count in place of the BB count.

+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+        cold_loop = min_loop;

find_coldest_out_loop is also a bit weird, I think we want to find
the outermost loop between outmost_loop and loop that has a
lower count than the curr_bb count but

+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+        cold_loop = min_loop;

compares the outermost loops count (min_count) against the preheader
count? So we're searching for a cold loop with respect to its enclosing loop
here?

Why is this function not simply

+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+                       basic_block curr_bb)
+{
  while (bb_colder_than_loop_preheader (curr_bb->count,
             loop_preheader_edge (outermost_loop)->src->count))
    {
      if (outermost_loop == loop)
        return NULL;
      outermost_loop = superloop_at_depth (loop, loop_depth (outermost_loop) + 1);
    }
  return outermost_loop;
}

?

Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
find_coldest_out_loop and why it not simply does

+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
  if (bb_colder_than_loop_preheader (curr_bb->count,
             loop_preheader_edge (l)->src->count))
    return false;
  return true;
}

?

>
> >> +       cold_loop = min_loop;
> >> +    }
> >> +  return cold_loop;
> >> +}
> >> +
> >>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
> >>     loop to that we could move the expression using DEF if it did not have
> >>     other operands, i.e.
the outermost loop enclosing LOOP in that the value > >> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) > >> level = ALWAYS_EXECUTED_IN (bb); > >> else > >> level = superloop_at_depth (loop, 1); > >> - lim_data->max_loop = level; > >> + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); > >> + if (!lim_data->max_loop) > >> + return false; > >> > >> if (gphi *phi = dyn_cast <gphi *> (stmt)) > >> { > >> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb) > >> for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) > >> { > >> edge e; > >> - > >> gimple *stmt = gsi_stmt (bsi); > >> > >> lim_data = get_lim_data (stmt); > >> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb) > >> /* We do not really want to move conditionals out of the loop; we just > >> placed it here to force its operands to be moved if necessary. */ > >> if (gimple_code (stmt) == GIMPLE_COND) > >> - continue; > >> + { > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> > >> if (dump_file && (dump_flags & TDF_DETAILS)) > >> { > >> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq, > >> } > >> else > >> { > >> - sm_aux *aux = *aux_map.get (ref); > >> + sm_aux **paux = aux_map.get (ref); > >> + sm_aux *aux; > >> + if (paux) > >> + aux = *paux; > >> + else > >> + continue; > > > > do you really need this? I doubt so. > > Removed. 
> > > > >> if (!aux->store_flag || kind == sm_ord) > >> { > >> gassign *store; > >> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) > >> return indep_p; > >> } > >> > >> +class ref_in_loop_hot_body > >> +{ > >> +public: > >> + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} > >> + bool operator () (mem_ref_loc *loc); > >> + class loop *l; > >> +}; > >> + > >> +bool > >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) > >> +{ > >> + basic_block curr_bb = gimple_bb (loc->stmt); > >> + edge e = loop_preheader_edge (l); > >> + if (e->src->count > curr_bb->count) > >> + return false; > >> + else > >> + return true; > >> +} > >> + > >> > >> /* Returns true if we can perform store motion of REF from LOOP. */ > >> > >> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) > >> if (!ref_indep_loop_p (loop, ref, sm_war)) > >> return false; > >> > > > > Add a comment here what this is about. > > Done. > > > > > Otherwise the GIMPLE invariant motion parts look sensible, but I'd > > really like to have > > the issue on the profile_count API sorted out. > > > > Can you split out the RTL invariant motion part to a separate patch please? > > Done. Attached the two patches, thanks. > > > BR, > Xionghu
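The point above about `a<b` versus `!(a>=b)` can be made concrete with a small standalone model. This is only a sketch under stated assumptions: `Count` is a simplified, hypothetical stand-in for GCC's `profile_count` that keeps nothing but a "known" flag and a value, and `colder_strict` / `colder_negated` are illustrative names, not functions from the patch.

```cpp
#include <cstdint>

// Simplified stand-in for profile_count: a count that may be unknown.
struct Count
{
  bool known;
  std::uint64_t value;
};

// Both comparisons answer "false" when either side is unknown, which is
// the "unordered is mapped to false" behaviour discussed in the thread.
inline bool operator< (Count a, Count b)
{ return a.known && b.known && a.value < b.value; }

inline bool operator>= (Count a, Count b)
{ return a.known && b.known && a.value >= b.value; }

// Unknown counts => "not colder" => the caller keeps hoisting as before.
inline bool colder_strict (Count bb, Count preheader)
{ return bb < preheader; }

// Unknown counts => "colder" => the caller refuses to hoist.
inline bool colder_negated (Count bb, Count preheader)
{ return !(bb >= preheader); }
```

For two known counts both helpers agree. They differ only when a count is unknown, so the choice of spelling decides whether a pass errs on the side of hoisting or of not hoisting in that case.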
On 2021/10/15 16:11, Richard Biener wrote:
> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> [...]
>>
>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
>> if FALSE is returned due to three-state inequality, find_coldest_out_loop
>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
>> will return true to continue perform store motion, both preserve the previous
>> behavior.
>
> Thanks. But I don't think the abstraction as written is useful:
>
> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.
*/
> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> +{
> +  if (count1 < count2)
> +    return true;
> +  else
> +    return false;
> +}
>
> given the following seems to pass the preheader count in place of the BB count.
>
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +        cold_loop = min_loop;
>
> find_coldest_out_loop is also a bit weird, I think we want to find
> the outermost loop between outmost_loop and loop that has a
> lower count than the curr_bb count but
>
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +        cold_loop = min_loop;
>
> compares the outermost loops count (min_count) against the preheader
> count? So we're searching for a cold loop with respect to its enclosing loop
> here?

Let me try to explain how it works :)

find_coldest_out_loop performs a two-step check:
1) Check whether curr_bb is cold in its own loop_father. If it is cold,
   return NULL, which means it should not be moved out at all.
2) If curr_bb is NOT cold, assume at first that the current loop L[m] is the
   coldest, then search {L[1], L[2], ... L[m]} for a colder loop to hoist to:
   if L[i]->count < L[m]->count, set cold_loop to L[i], and continue until the
   loop with the smallest profile_count is found.

Take the updated ssa-lim-19.c as an example: first check whether curr_bb (bb 5)
is cold in loop 3. If it is, return NULL; otherwise select the coldest loop in
{loop1, loop2, loop3}. Here loop2 is colder than loop3, so loop2 is returned as
the target loop to hoist to. The first check avoids hoisting when curr_bb is
colder than loop3 but still hotter than loop1 and loop2. I am not sure whether
such a case can actually be constructed.
gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

volatile int x;
void
bar (int, char *, char *);
void
foo (int *a, int n, int m, int s, int t)
{
  int i;
  int j;
  int k;

  for (i = 0; i < m; i++) // loop 1
    {
      if (__builtin_expect (x, 0))
        for (j = 0; j < n; j++) // loop 2
          for (k = 0; k < n; k++) // loop 3
            {
              bar (s / 5, "one", "two"); // curr_bb
              a[t] = s;
            }
      a[t] = t; // curr_bb2
    }
}

With this patch, the 4 invariant statements are moved to bb 11 (loop2)
instead of bb 10 (loop1).
When curr_bb is hotter than loop 3, there are 6 possible orderings of the
loop preheader counts. find_coldest_out_loop has to compare the loop preheader
counts with each other, rather than compare every Loop[i] with curr_bb, and
return the coldest loop; otherwise unexpected behavior happens.

L1 > L2 > L3  =>  return L3
L1 > L3 > L2  =>  return L2
L2 > L1 > L3  =>  return L3
L2 > L3 > L1  =>  return L1
L3 > L1 > L2  =>  return L2
L3 > L2 > L1  =>  return L1

So bb_colder_than_loop_preheader is used for two kinds of checks: one compares
the curr_bb count with the L3 preheader count, the other compares the L3
preheader count with the L1 preheader count, the L2 preheader count, etc.

ssa-lim-19.c.138t.lim2:
...
   <bb 10> [local count: 16057869]:  // L1 preheader
-  _4 = s_22(D) / 5;
-  _5 = (long unsigned int) t_24(D);
-  _6 = _5 * 4;
-  _7 = a_25(D) + _6;
   _8 = (long unsigned int) t_24(D);
   _9 = _8 * 4;
   _10 = a_25(D) + _9;

   <bb 3> [local count: 145980626]:
   # i_34 = PHI <i_29(12), 0(10)>
   x.0_1 ={v} x;
   if (x.0_1 != 0)
     goto <bb 4>; [10.00%]
   else
     goto <bb 8>; [90.00%]

   <bb 4> [local count: 14598063]:
   if (n_20(D) > 0)
     goto <bb 11>; [89.00%]
   else
     goto <bb 8>; [11.00%]

   <bb 11> [local count: 12992276]:  // L2 preheader
+  _4 = s_22(D) / 5;
+  _5 = (long unsigned int) t_24(D);
+  _6 = _5 * 4;
+  _7 = a_25(D) + _6;
   goto <bb 7>; [100.00%]

   <bb 14> [local count: 850510901]:

   <bb 5> [local count: 955630225]:  // curr_bb
   # k_36 = PHI <k_27(14), 0(7)>
   bar (_4, "one", "two");
   *_7 = s_22(D);
   k_27 = k_36 + 1;
   if (n_20(D) > k_27)
     goto <bb 14>; [89.00%]
   else
     goto <bb 6>; [11.00%]

   <bb 6> [local count: 118111600]:
   j_21 = j_35 + 1;
   if (n_20(D) > j_21)
     goto <bb 13>; [89.00%]
   else
     goto <bb 8>; [11.00%]

   <bb 13> [local count: 105119324]:

   <bb 7> [local count: 118111600]:  // L3 preheader
   # j_35 = PHI <j_21(13), 0(11)>
   goto <bb 5>; [100.00%]

   <bb 8> [local count: 145980626]:
   *_10 = t_24(D);
   i_29 = i_34 + 1;

Re-pasting bb_colder_than_loop_preheader and find_coldest_out_loop:

+/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}
+
+/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+ */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+                       basic_block curr_bb)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+     comparison, OUTMOST_LOOP is returned finally to preserve the behavior.
+     Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb->count,
+                                        loop_preheader_edge (loop)->src->count))
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+        cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+

>
> Why is this function not simply
>
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                       basic_block curr_bb)
> +{
>   while (bb_colder_than_loop_preheader (curr_bb->count,
>              loop_preheader_edge (outermost_loop)->src->count))
>     {
>       if (outermost_loop == loop)
>         return NULL;
>       outermost_loop = superloop_at_depth (loop, loop_depth
> (outermost_loop) + 1);
>     }
>   return outermost_loop;
> }

With this change, when processing curr_bb (bb 5), the function would return
loop 1: curr_bb->count > Loop1_preheader->count, so the while loop stops
immediately. This is not what we want.

>
> ?
>
> Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
> find_coldest_out_loop and why it not simply does
>
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
>   if (bb_colder_than_loop_preheader (curr_bb->count,
>              loop_preheader_edge (l)->src->count))
>     return false;
>   return true;
> }

Likewise for this part,

+/* Find out the coldest loop between loop L and innermost loop, compare the
+   hotness between current BB and coldest loop preheader by profile count.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  if (l != inner_loop)
+    cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  edge e = loop_preheader_edge (cold_loop);
+  /* If bb_colder_than_loop_preheader is false due to three-state inequality
+     comparision, TRUE is returned to continue perform store motion.  */
+  if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count))
+    return false;
+  else
+    return true;
+}

l is the input of ref_in_loop_hot_body and it is an outer loop, so we need to
find a cold_loop between l and inner_loop. The reason is that there may be a
cold loop between l and inner_loop, in which case we shouldn't do store motion
from curr_bb to l directly.
After reconsideration, I think the bb_colder_than_loop_preheader call can be
removed, since curr_bb is already checked in find_coldest_out_loop. The
"l != inner_loop" check can be removed as well:

+/* Find out the coldest loop between loop L and innermost loop.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  return true;
+}
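The two-step selection described in this mail (give up if the block is colder than its own loop's preheader, otherwise pick the loop with the coldest preheader) can be sketched outside of GCC as follows. This is a hedged illustration, not the patch itself: plain integers stand in for `profile_count`, a vector ordered from outermost to innermost stands in for the loop nest, and `pick_coldest_loop` is a hypothetical name. The sketch tracks a running minimum, which is what the six-ordering table describes. The re-pasted loop compares each preheader against the initial `min_count` only, so the two may disagree for some orderings (for example L1 > L3 > L2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// preheader_counts[0] belongs to the outermost candidate loop and
// preheader_counts.back () to the loop that contains the block.
// Returns the index of the loop with the coldest preheader, or -1 when
// the block is colder than the preheader of its own loop (no hoisting).
inline int
pick_coldest_loop (const std::vector<std::uint64_t> &preheader_counts,
                   std::uint64_t bb_count)
{
  assert (!preheader_counts.empty ());

  // Step 1: a block that is cold inside its own loop is never hoisted.
  if (bb_count < preheader_counts.back ())
    return -1;

  // Step 2: running minimum over the preheader counts; the strict "<"
  // keeps the outer loop when two preheaders are equally cold.
  std::size_t coldest = 0;
  for (std::size_t i = 1; i < preheader_counts.size (); ++i)
    if (preheader_counts[i] < preheader_counts[coldest])
      coldest = i;
  return static_cast<int> (coldest);
}
```

With the counts from the lim2 dump above (L1, L2 and L3 preheaders at 16057869, 12992276 and 118111600, curr_bb at 955630225) the sketch selects index 1, i.e. loop 2, which matches the hoisting target reported for ssa-lim-19.c.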
On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > > > > On 2021/10/15 16:11, Richard Biener wrote: > > On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >> > >> Hi, > >> > >> On 2021/9/28 20:09, Richard Biener wrote: > >>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >>>> > >>>> Update the patch to v3, not sure whether you prefer the paste style > >>>> and continue to link the previous thread as Segher dislikes this... > >>>> > >>>> > >>>> [PATCH v3] Don't move cold code out of loop by checking bb count > >>>> > >>>> > >>>> Changes: > >>>> 1. Handle max_loop in determine_max_movement instead of > >>>> outermost_invariant_loop. > >>>> 2. Remove unnecessary changes. > >>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. > >>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > >>>> infinite loop when implementing v1 and the iteration is missed to be > >>>> updated actually. > >>>> > >>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > >>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > >>>> > >>>> There was a patch trying to avoid move cold block out of loop: > >>>> > >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >>>> > >>>> Richard suggested to "never hoist anything from a bb with lower execution > >>>> frequency to a bb with higher one in LIM invariantness_dom_walker > >>>> before_dom_children". > >>>> > >>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to > >>>> expected target loop, if profile count of the loop bb is colder > >>>> than target loop preheader, it won't be hoisted out of loop. > >>>> Likely for store motion, if all locations of the REF in loop is cold, > >>>> don't do store motion of it. 
> >>>> > >>>> SPEC2017 performance evaluation shows 1% performance improvement for > >>>> intrate GEOMEAN and no obvious regression for others. Especially, > >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >>>> on P8LE. > >>>> > >>>> gcc/ChangeLog: > >>>> > >>>> * loop-invariant.c (find_invariants_bb): Check profile count > >>>> before motion. > >>>> (find_invariants_body): Add argument. > >>>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > >>>> (determine_max_movement): Use find_coldest_out_loop. > >>>> (move_computations_worker): Adjust and fix iteration udpate. > >>>> (execute_sm_exit): Check pointer validness. > >>>> (class ref_in_loop_hot_body): New functor. > >>>> (ref_in_loop_hot_body::operator): New. > >>>> (can_sm_ref_p): Use for_all_locs_in_loop. > >>>> > >>>> gcc/testsuite/ChangeLog: > >>>> > >>>> * gcc.dg/tree-ssa/recip-3.c: Adjust. > >>>> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >>>> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. 
> >>>> --- > >>>> gcc/loop-invariant.c | 10 ++-- > >>>> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- > >>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ > >>>> 7 files changed, 165 insertions(+), 8 deletions(-) > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > >>>> > >>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > >>>> index fca0c2b24be..5c3be7bf0eb 100644 > >>>> --- a/gcc/loop-invariant.c > >>>> +++ b/gcc/loop-invariant.c > >>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > >>>> call. 
*/ > >>>> > >>>> static void > >>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > >>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > >>>> + bool always_executed) > >>>> { > >>>> rtx_insn *insn; > >>>> + basic_block preheader = loop_preheader_edge (loop)->src; > >>>> + > >>>> + if (preheader->count > bb->count) > >>>> + return; > >>>> > >>>> FOR_BB_INSNS (bb, insn) > >>>> { > >>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > >>>> unsigned i; > >>>> > >>>> for (i = 0; i < loop->num_nodes; i++) > >>>> - find_invariants_bb (body[i], > >>>> - bitmap_bit_p (always_reached, i), > >>>> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > >>>> bitmap_bit_p (always_executed, i)); > >>>> } > >>>> > >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >>>> index 4b187c2cdaf..655fab03442 100644 > >>>> --- a/gcc/tree-ssa-loop-im.c > >>>> +++ b/gcc/tree-ssa-loop-im.c > >>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > >>>> return ret; > >>>> } > >>>> > >>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ > >>>> + > >>>> +static class loop * > >>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >>>> + basic_block curr_bb) > >>>> +{ > >>>> + class loop *cold_loop, *min_loop; > >>>> + cold_loop = min_loop = outmost_loop; > >>>> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > >>>> + > >>>> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > >>> > >>> Honza - can you comment on whether we should compare BB counts this way? > >>> > >>> I would suspect that for, say, > >>> > >>> for (...) > >>> if (a) > >>> X; > >>> else > >>> Y; > >>> > >>> that the counts for X and Y will be less than that of the preheader of the loop > >>> only when the loop is estimated to run once. 
That is, should we really compare > >>> the to the preheader count or maybe better to the _header_ count which > >>> would keep the number of iterations out of the equation? > >> > >> I quickly tried to replace all the loop_preheader_edge (loop)->src with > >> loop_preheader_edge (loop)->dest, it will cause many failures in > >> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems > >> reasonable to compare the bb count with preheader count as both gimple lim > >> and RTL loop-invariant move instructions to *preheader* instead of *header* > >> after analysis? > > > > Hmm, yeah - guess I was confused here. > > > >>> > >>> If we look at maybe_hot_count_p that's a quite sophisticated thing to > >>> compare a count to the "IPA hot", here we're comparing two counts > >>> within a function where it actually matters whether we use a<b or > >>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p > >>> function). > >>> > >>> Xionghu, you error on the side of not hoisting for unordered counts here > >>> > >>>> + return NULL; > >>>> + > >>>> + while (min_loop != loop) > >>>> + { > >>>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > >>>> + if (loop_preheader_edge (min_loop)->src->count < min_count) > >>> > >>> but in the other direction here and on the side of not hoisting > >>> in ref_in_loop_hot_body. > >>> > >>> The three-state relational operator overloads are probably not the > >>> very best idea... > >>> (see profile-count.h for them) > >>> > >> Added new function bb_colder_than_loop_preheader to encapsulate the comparision, > >> if FALSE is returned due to three-state inequality, find_coldest_out_loop > >> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () > >> will return true to continue perform store motion, both preserve the previous > >> behavior. > > > > Thanks. 
But I don't think the abstraction as written is useful: > > > > +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state > > + as stated in profile-count.h, FALSE is returned if inequality cannot be > > + decided. */ > > +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) > > +{ > > + if (count1 < count2) > > + return true; > > + else > > + return false; > > +} > > > > given the following seems to pass the preheader count in place of the BB count. > > > > + if (bb_colder_than_loop_preheader ( > > + loop_preheader_edge (min_loop)->src->count, min_count)) > > + cold_loop = min_loop; > > > > find_coldest_out_loop is also a bit weird, I think we want to find > > the outermost loop between outmost_loop and loop that has a > > lower count than the curr_bb count but > > > > + while (min_loop != loop) > > + { > > + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > > + if (bb_colder_than_loop_preheader ( > > + loop_preheader_edge (min_loop)->src->count, min_count)) > > + cold_loop = min_loop; > > > > compares the outermost loops count (min_count) against the preheader > > count? So we're searching for a cold loop with respect to its enclosing loop > > here? > > Let me try to explain how it works :) > > find_coldest_out_loop does two steps check: > 1) Check whether curr_bb is cold in it's own loop_father, if it is cold, > just return NULL which means it should not be moved out at all; > 2) curr_bb is NOT cold, assuming the current loop L[m] is the coldest first, > than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]}, > if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop > that has smallest profile_count. 
> > > Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in > loop 3, if it is cold, just return NULL, otherwise select the coldest loop in > {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to > be the target hoist loop. The first check could AVOID hoist if curr_bb is colder > than loop3, but it is still hot than loop1 and loop2. Not sure whether it is possible > to construct such cases? > > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > > volatile int x; > void > bar (int, char *, char *); > void > foo (int *a, int n, int m, int s, int t) > { > int i; > int j; > int k; > > for (i = 0; i < m; i++) // loop 1 > { > if (__builtin_expect (x, 0)) > for (j = 0; j < n; j++) // loop 2 > for (k = 0; k < n; k++) // loop 3 > { > bar (s / 5, "one", "two"); // curr_bb > a[t] = s; > } > a[t] = t; // curr_bb2 > } > } > > The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1) > with this patch. > There are totally 6 combinations when curr_bb is hotter than loop 3. We need > to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness", > returning the coldest loop for this function find_coldest_out_loop, otherwise > unexpected behavior happens. > > L1 > L2 > L3 => return L3 > L1 > L3 > L2 => return L2 > L2 > L1 > L3 => return L3 > L2 > L3 > L1 => return L1 > L3 > L1 > L2 => return L2 > L3 > L2 > L1 => return L1 > > So bb_colder_than_loop_preheader does two kind of checks, one is checking > L3 preheader count with curr_bb count, another is checking L3 preheader count > with L1 preheader count, L2 preheader count, etc... > > > ssa-lim-19.c.138t.lim2: > ... 
> <bb 10> [local count: 16057869]: // L1 preheader > - _4 = s_22(D) / 5; > - _5 = (long unsigned int) t_24(D); > - _6 = _5 * 4; > - _7 = a_25(D) + _6; > _8 = (long unsigned int) t_24(D); > _9 = _8 * 4; > _10 = a_25(D) + _9; > > <bb 3> [local count: 145980626]: > # i_34 = PHI <i_29(12), 0(10)> > x.0_1 ={v} x; > if (x.0_1 != 0) > goto <bb 4>; [10.00%] > else > goto <bb 8>; [90.00%] > > <bb 4> [local count: 14598063]: > if (n_20(D) > 0) > goto <bb 11>; [89.00%] > else > goto <bb 8>; [11.00%] > > <bb 11> [local count: 12992276]: // L2 preheader > + _4 = s_22(D) / 5; > + _5 = (long unsigned int) t_24(D); > + _6 = _5 * 4; > + _7 = a_25(D) + _6; > goto <bb 7>; [100.00%] > > <bb 14> [local count: 850510901]: > > <bb 5> [local count: 955630225]: // curr_bb > # k_36 = PHI <k_27(14), 0(7)> > bar (_4, "one", "two"); > *_7 = s_22(D); > k_27 = k_36 + 1; > if (n_20(D) > k_27) > goto <bb 14>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 6> [local count: 118111600]: > j_21 = j_35 + 1; > if (n_20(D) > j_21) > goto <bb 13>; [89.00%] > else > goto <bb 8>; [11.00%] > > <bb 13> [local count: 105119324]: > > <bb 7> [local count: 118111600]: // L3 preheader > # j_35 = PHI <j_21(13), 0(11)> > goto <bb 5>; [100.00%] > > <bb 8> [local count: 145980626]: > *_10 = t_24(D); > i_29 = i_34 + 1; > > Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop: > > +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state > + as stated in profile-count.h, FALSE is returned if inequality cannot be > + decided. */ > +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) > +{ > + if (count1 < count2) > + return true; > + else > + return false; > +} > + > +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count. 
> + */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                       basic_block curr_bb)
> +{
> +  class loop *cold_loop, *min_loop;
> +  cold_loop = min_loop = outmost_loop;
> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +     comparison, OUTMOST_LOOP is returned finally to preserve the behavior.
> +     Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb->count,
> +                                        loop_preheader_edge (loop)->src->count))
> +    return NULL;
> +
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (bb_colder_than_loop_preheader (
> +            loop_preheader_edge (min_loop)->src->count, min_count))
> +        cold_loop = min_loop;
> +    }
> +  return cold_loop;
> +}
> +
>
>
> >
> > Why is this function not simply
> >
> > +static class loop *
> > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> > +                       basic_block curr_bb)
> > +{
> >    while (bb_colder_than_loop_preheader (curr_bb->count,
> >               loop_preheader_edge (outermost_loop)->src->count))
> >      {
> >          if (outermost_loop == loop)
> >            return NULL;
> >          outermost_loop = superloop_at_depth (loop,
> >                                               loop_depth (outermost_loop) + 1);
> >      }
> >    return outermost_loop;
> > }
>
> With this change, when processing curr_bb (bb 5), the function returns
> loop 1: curr_bb->count > Loop1_preheader->count, so the while loop stops
> immediately.  This is not what we want.

Why?  curr_bb is executed at least as often as loop1 preheader if
we look at the counts?  So either the counts do not really tell us
anything of help or I am missing something.  Are you merely
looking for a block with a lower count on the path from the outermost
loop entry to the block in question and deciding you do not want to
hoist further than that?  So it's not about not hoisting to a hot place
but instead about hoisting to the coldest place within a loop nest?

So we have

for (i = 0; i < m; i++)  // loop 1
  {
    if (__builtin_expect (x, 0))
      for (j = 0; j < n; j++)   // loop 2


<bb 10> [local count: 16057869]:  // L1 preheader
...
<bb 3> [local count: 145980626]:
  # i_34 = PHI <i_29(12), 0(10)>
...
<bb 11> [local count: 12992276]:  // L2 preheader
...
<bb 7> [local count: 118111600]:  // L3 preheader
  # j_35 = PHI <j_21(13), 0(11)>
  goto <bb 5>; [100.00%]

and we want to hoist to the L2 preheader because that's less frequently
executed than the L1 preheader (which is less frequently executed
than the L3 preheader or the block we are hoisting from).

I'm concerned about the compile-time cost of re-evaluating counts on the
loop nest many times.  So it looks to me that we can pre-compute
this lowest-preheader-count loop for a loop nest, at least for the
store-motion case where we know the outermost loop?

> >
> > ?
> >
> > Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
> > find_coldest_out_loop and why it not simply does
> >
> > +bool
> > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> > +{
> > +  basic_block curr_bb = gimple_bb (loc->stmt);
> >    if (bb_colder_than_loop_preheader (curr_bb->count,
> >               loop_preheader_edge (l)->src->count))
> >      return false;
> >    return true;
> >  }
>
> Likewise for this part,
>
> +/* Find out the coldest loop between loop L and innermost loop, compare the
> +   hotness between current BB and coldest loop preheader by profile count.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  class loop *cold_loop = l;
> +  if (l != inner_loop)
> +    cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
> +  if (!cold_loop)
> +    return false;
> +  edge e = loop_preheader_edge (cold_loop);
> +  /* If bb_colder_than_loop_preheader is false due to three-state inequality
> +     comparison, TRUE is returned to continue perform store motion.  */
> +  if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count))
> +    return false;
> +  else
> +    return true;
> +}
>
> l is the input of ref_in_loop_hot_body and is an outer loop; we need to find
> a cold_loop between l and inner_loop.  The reason is that there may be a cold
> loop between l and inner_loop, in which case we shouldn't do store motion from
> curr_bb to l directly.
> After reconsideration, I think the bb_colder_than_loop_preheader check could
> be removed, since curr_bb is already checked in find_coldest_out_loop.  The
> "l != inner_loop" check can be removed as well:
>
> +/* Find out the coldest loop between loop L and innermost loop.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  class loop *cold_loop = l;
> +  cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
> +  if (!cold_loop)
> +    return false;
> +  return true;
> +}
>
>
> --
> Thanks,
> Xionghu
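To make the difference between the two searches concrete, the following is a small standalone sketch using the preheader counts from the ssa-lim-19.c dump. It is only a toy model under stated assumptions: a loop nest is a vector of preheader counts ordered from the outermost to the innermost loop, plain integers replace `profile_count` (so unordered counts are not modelled), -1 stands for NULL, and the function names `first_not_hotter_loop` and `coldest_preheader_loop` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (an assumption for illustration, not GCC's data structures):
// preheader_counts[0] belongs to the outermost loop, the last element to
// the innermost loop that contains curr_bb.

// The simpler walk: skip outer loops whose preheader is hotter than
// curr_bb and return the first loop that is not.  -1 stands for NULL.
int first_not_hotter_loop(const std::vector<std::uint64_t> &preheader_counts,
                          std::uint64_t curr_bb_count)
{
  assert(!preheader_counts.empty());
  const int innermost = static_cast<int>(preheader_counts.size()) - 1;
  int i = 0;
  while (curr_bb_count < preheader_counts[i])
    {
      if (i == innermost)
        return -1;
      ++i;
    }
  return i;
}

// The coldest-preheader search: first refuse to hoist when curr_bb is
// colder than its own loop's preheader, then pick the loop with the
// smallest preheader count, scanning from the outermost loop inwards.
int coldest_preheader_loop(const std::vector<std::uint64_t> &preheader_counts,
                           std::uint64_t curr_bb_count)
{
  assert(!preheader_counts.empty());
  const int innermost = static_cast<int>(preheader_counts.size()) - 1;
  if (curr_bb_count < preheader_counts[innermost])
    return -1;
  int cold = innermost;
  for (int i = 0; i < innermost; ++i)
    if (preheader_counts[i] < preheader_counts[cold])
      cold = i;
  return cold;
}
```

With counts {16057869, 12992276, 118111600} for L1, L2, L3 and 955630225 for bb 5, the simpler walk stops at index 0 (L1), while the coldest-preheader search selects index 1 (L2).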
On 2021/10/26 21:20, Richard Biener wrote:
> Why?  curr_bb is executed at least as often as loop1 preheader if
> we look at the counts?  So either the counts do not really tell us
> anything of help or I am missing something.  Are you merely
> looking for a block with a lower count on the path from the outermost
> loop entry to the block in question and deciding you do not want to
> hoist further than that?
So it's not about not hoisting to a hot place
> but instead hoist to the coldest place within a loop nest?
>
> So we have
>
>  for (i = 0; i < m; i++)  // loop 1
>    {
>      if (__builtin_expect (x, 0))
>        for (j = 0; j < n; j++)  // loop 2
>
>
>   <bb 10> [local count: 16057869]:  // L1 preheader
> ...
>   <bb 3> [local count: 145980626]:
>   # i_34 = PHI <i_29(12), 0(10)>
> ...
>   <bb 11> [local count: 12992276]:  // L2 preheader
> ...
>   <bb 7> [local count: 118111600]:  // L3 preheader
>   # j_35 = PHI <j_21(13), 0(11)>
>   goto <bb 5>; [100.00%]
>
> and we want to hoist to the L2 preheader because that's less frequently
> executed than the L1 preheader (which is less frequently executed
> than the L3 preheader or the block we are hoisting from).

Yes, this is exactly what I want, sorry for not describing it clearly before ;(
The updated patch[1] may reflect find_coldest_out_loop better: it first checks
whether curr_bb is hotter than its preheader; if not, it returns NULL, which
means there is no need to hoist at all.  Then it finds the *coldest* preheader
to hoist to within the loop nest, starting from outmost_loop.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html

+/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+   It does two steps check:
+   1) Check whether CURR_BB is cold in its own loop_father, if it is cold, just
+   return NULL which means it should not be moved out at all;
+   2) CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search loops
+   from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder
+   than L[cold_loop], reset cold_loop to L[i] until get the loop that has
+   smallest profile_count.  */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+                       basic_block curr_bb)
+{
+  class loop *cold_loop;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+     comparison, OUTMOST_LOOP is returned finally to preserve the behavior.
+     Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb,
+                                        loop_preheader_edge (loop)->src))
+    return NULL;
+
+  cold_loop = loop;
+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
+                                         loop_preheader_edge (cold_loop)->src))
+        cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }
+  return cold_loop;
+}

>
> I'm concerned with compile-time complexity re-evaluating counts on the
> loop nest many times.  So it looks to me that we can pre-compute
> this lowest-preheader-count loop for a loop nest at least for the
> store-motion case where we know the outermost loop?
>
>

But the lowest-preheader-count loop may change for a loop/bb with a different
outermost loop.  For example, if

  L1_preheader_count < L2_preheader_count < L3_preheader_count
    < L4_preheader_count < curr_bb_count

then

  find_coldest_out_loop (L1, loop, curr_bb) => coldest preheader loop is L1
  find_coldest_out_loop (L2, loop, curr_bb) => coldest preheader loop is L2

So it will be a 1:N map?  Pre-compute it in find_coldest_out_loop and also save
it in lim_data with a new variable
coldest_preheader_loop[outmost_loop][coldest_preheader_loop]?  Each call of
find_coldest_out_loop would check whether that variable is already set, and
only continue the search if
coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set?
That seems a bit complicated, and I am not sure whether it helps to reduce
compile-time complexity, or maybe I am misunderstanding...
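
To make the two-step search and the 1:N concern concrete, here is a minimal
standalone sketch.  It is a toy model under stated assumptions, not the patch
itself: toy_find_coldest_out_loop is a hypothetical name, loops are
represented only by their nesting depth, and preheader counts are plain
integers, so the three-state comparison of profile_count is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model, not GCC code: preheader_count[d] is the preheader count of the
// loop at nesting depth d (0 = outermost loop of the nest).
// Returns the depth of the loop to hoist to, or -1 for "do not hoist"
// (the NULL return in the patch).
int toy_find_coldest_out_loop (const std::vector<std::uint64_t> &preheader_count,
                               std::size_t outermost, std::size_t innermost,
                               std::uint64_t curr_bb_count)
{
  assert (outermost <= innermost && innermost < preheader_count.size ());

  // Step 1: a block colder than its own loop's preheader stays where it is.
  if (curr_bb_count < preheader_count[innermost])
    return -1;

  // Step 2: start from the innermost loop and walk from the outermost loop
  // inward, keeping the loop with the strictly smallest preheader count.
  std::size_t cold = innermost;
  for (std::size_t d = outermost; d < innermost; ++d)
    if (preheader_count[d] < preheader_count[cold])
      cold = d;
  return static_cast<int> (cold);
}
```

With the counts from the ssa-lim-19.c dump (L1 preheader 16057869, L2
preheader 12992276, L3 preheader 118111600, curr_bb 955630225), a search from
depth 0 to depth 2 picks depth 1, i.e. the L2 preheader.  With strictly
increasing preheader counts, starting the search at L1 or at L2 gives
different answers, which is why a single cached result per loop would not be
enough.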
> Hi, > > On 2021/9/28 20:09, Richard Biener wrote: > > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >> > >> Update the patch to v3, not sure whether you prefer the paste style > >> and continue to link the previous thread as Segher dislikes this... > >> > >> > >> [PATCH v3] Don't move cold code out of loop by checking bb count > >> > >> > >> Changes: > >> 1. Handle max_loop in determine_max_movement instead of > >> outermost_invariant_loop. > >> 2. Remove unnecessary changes. > >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. > >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > >> infinite loop when implementing v1 and the iteration is missed to be > >> updated actually. > >> > >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > >> > >> There was a patch trying to avoid move cold block out of loop: > >> > >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >> > >> Richard suggested to "never hoist anything from a bb with lower execution > >> frequency to a bb with higher one in LIM invariantness_dom_walker > >> before_dom_children". > >> > >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to > >> expected target loop, if profile count of the loop bb is colder > >> than target loop preheader, it won't be hoisted out of loop. > >> Likely for store motion, if all locations of the REF in loop is cold, > >> don't do store motion of it. > >> > >> SPEC2017 performance evaluation shows 1% performance improvement for > >> intrate GEOMEAN and no obvious regression for others. Especially, > >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >> on P8LE. 
> >> > >> gcc/ChangeLog: > >> > >> * loop-invariant.c (find_invariants_bb): Check profile count > >> before motion. > >> (find_invariants_body): Add argument. > >> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > >> (determine_max_movement): Use find_coldest_out_loop. > >> (move_computations_worker): Adjust and fix iteration udpate. > >> (execute_sm_exit): Check pointer validness. > >> (class ref_in_loop_hot_body): New functor. > >> (ref_in_loop_hot_body::operator): New. > >> (can_sm_ref_p): Use for_all_locs_in_loop. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/recip-3.c: Adjust. > >> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > >> --- > >> gcc/loop-invariant.c | 10 ++-- > >> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- > >> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ > >> 7 files changed, 165 insertions(+), 8 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > >> > >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > >> index fca0c2b24be..5c3be7bf0eb 100644 > >> --- a/gcc/loop-invariant.c > >> +++ b/gcc/loop-invariant.c > >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > >> call. 
*/ > >> > >> static void > >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > >> + bool always_executed) > >> { > >> rtx_insn *insn; > >> + basic_block preheader = loop_preheader_edge (loop)->src; > >> + > >> + if (preheader->count > bb->count) > >> + return; > >> > >> FOR_BB_INSNS (bb, insn) > >> { > >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > >> unsigned i; > >> > >> for (i = 0; i < loop->num_nodes; i++) > >> - find_invariants_bb (body[i], > >> - bitmap_bit_p (always_reached, i), > >> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > >> bitmap_bit_p (always_executed, i)); > >> } > >> > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index 4b187c2cdaf..655fab03442 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > >> return ret; > >> } > >> > >> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ > >> + > >> +static class loop * > >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >> + basic_block curr_bb) > >> +{ > >> + class loop *cold_loop, *min_loop; > >> + cold_loop = min_loop = outmost_loop; > >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > >> + > >> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > > > > Honza - can you comment on whether we should compare BB counts this way? > > > > I would suspect that for, say, > > > > for (...) > > if (a) > > X; > > else > > Y; > > > > that the counts for X and Y will be less than that of the preheader of the loop > > only when the loop is estimated to run once. That is, should we really compare > > the to the preheader count or maybe better to the _header_ count which > > would keep the number of iterations out of the equation? 
>
> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> loop_preheader_edge (loop)->dest, it will cause many failures in
> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> reasonable to compare the bb count with preheader count as both gimple lim
> and RTL loop-invariant move instructions to *preheader* instead of *header*
> after analysis?

I am not quite sure I understand what you are shooting for.  But if you have
a loop invariant inside a loop nest and you get a range of loops in the nest
where you want to move it, you want to pick the cheaper preheader count,
since the statement is going to be executed there.

For
> > for (...)
> >   if (a)
> >     X;
> >   else
> >     Y;

you may have the frequency of X lower than that of the preheader, i.e. when
the probability that a is true is lower than the reciprocal of the expected
iteration count.

If I understand correctly, you want to compare the sum of counts of all
BBs where the invariant currently evaluates to the minimal count of a
preheader where you can move it.

If you have

for A
  for B
    for C
      invariant_computation

usually you want to move it:

invariant_computation
for A
  for B
    for C

However, if for B usually iterates 0 times, it may happen that the preheader
of for C is executed less often than the preheaders of for A/B and you want:

for A
  for B
    invariant_computation
    for C
>
> >
> > If we look at maybe_hot_count_p that's a quite sophisticated thing to
> > compare a count to the "IPA hot", here we're comparing two counts
> > within a function where it actually matters whether we use a<b or
> > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> > function).
> >
> > Xionghu, you error on the side of not hoisting for unordered counts here
> >
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >
> > but in the other direction here and on the side of not hoisting
> > in ref_in_loop_hot_body.
> >
> > The three-state relational operator overloads are probably not the
> > very best idea...
> > (see profile-count.h for them)

In the first version of the patch I had count1.known_le (count2), which
however made the code look quite ugly, and eventually I convinced myself that
three-state comparators are less pain than hard-to-read conditionals...
But I guess we can encapsulate them when it makes the code easier to read.

I would be OK with having known_XY comparator variants in profile-count.h

Honza
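
The a<b versus !(a>=b) point can be illustrated with a small standalone
sketch.  This is only a toy under stated assumptions: toy_count, toy_lt and
toy_ge are hypothetical names, an "unknown" count is modelled with an empty
std::optional, and nothing here is the real profile_count interface.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy stand-in for a profile count that may be unknown; this is not GCC's
// profile_count class.
using toy_count = std::optional<std::uint64_t>;

// Three-state style "less than": false when either side is unknown.
bool toy_lt (toy_count a, toy_count b)
{
  return a.has_value () && b.has_value () && *a < *b;
}

// Three-state style "greater or equal": also false when either side is
// unknown.
bool toy_ge (toy_count a, toy_count b)
{
  return a.has_value () && b.has_value () && *a >= *b;
}

// With an unknown count the two spellings of "colder" disagree.
void toy_demo ()
{
  toy_count unknown;          // no profile data
  toy_count known = 100;
  assert (!toy_lt (unknown, known));  // "a < b" says: not colder
  assert (!toy_ge (unknown, known));  // so "!(a >= b)" says: colder
}
```

So a check written as "bb < preheader" treats unordered counts as "not cold"
and keeps hoisting, while one written as "!(bb >= preheader)" treats them as
"cold" and stops hoisting; which spelling is used at each call site decides
which side the pass errs on.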
On 2021/10/27 20:54, Jan Hubicka wrote: >> Hi, >> >> On 2021/9/28 20:09, Richard Biener wrote: >>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: >>>> >>>> Update the patch to v3, not sure whether you prefer the paste style >>>> and continue to link the previous thread as Segher dislikes this... >>>> >>>> >>>> [PATCH v3] Don't move cold code out of loop by checking bb count >>>> >>>> >>>> Changes: >>>> 1. Handle max_loop in determine_max_movement instead of >>>> outermost_invariant_loop. >>>> 2. Remove unnecessary changes. >>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. >>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused >>>> infinite loop when implementing v1 and the iteration is missed to be >>>> updated actually. >>>> >>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html >>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html >>>> >>>> There was a patch trying to avoid move cold block out of loop: >>>> >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>>> >>>> Richard suggested to "never hoist anything from a bb with lower execution >>>> frequency to a bb with higher one in LIM invariantness_dom_walker >>>> before_dom_children". >>>> >>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >>>> expected target loop, if profile count of the loop bb is colder >>>> than target loop preheader, it won't be hoisted out of loop. >>>> Likely for store motion, if all locations of the REF in loop is cold, >>>> don't do store motion of it. >>>> >>>> SPEC2017 performance evaluation shows 1% performance improvement for >>>> intrate GEOMEAN and no obvious regression for others. Especially, >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>>> on P8LE. 
>>>> >>>> gcc/ChangeLog: >>>> >>>> * loop-invariant.c (find_invariants_bb): Check profile count >>>> before motion. >>>> (find_invariants_body): Add argument. >>>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. >>>> (determine_max_movement): Use find_coldest_out_loop. >>>> (move_computations_worker): Adjust and fix iteration udpate. >>>> (execute_sm_exit): Check pointer validness. >>>> (class ref_in_loop_hot_body): New functor. >>>> (ref_in_loop_hot_body::operator): New. >>>> (can_sm_ref_p): Use for_all_locs_in_loop. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.dg/tree-ssa/recip-3.c: Adjust. >>>> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >>>> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >>>> --- >>>> gcc/loop-invariant.c | 10 ++-- >>>> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- >>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ >>>> 7 files changed, 165 insertions(+), 8 deletions(-) >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c >>>> >>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c >>>> index fca0c2b24be..5c3be7bf0eb 100644 >>>> --- a/gcc/loop-invariant.c >>>> +++ b/gcc/loop-invariant.c >>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) >>>> call. 
*/ >>>> >>>> static void >>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) >>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, >>>> + bool always_executed) >>>> { >>>> rtx_insn *insn; >>>> + basic_block preheader = loop_preheader_edge (loop)->src; >>>> + >>>> + if (preheader->count > bb->count) >>>> + return; >>>> >>>> FOR_BB_INSNS (bb, insn) >>>> { >>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, >>>> unsigned i; >>>> >>>> for (i = 0; i < loop->num_nodes; i++) >>>> - find_invariants_bb (body[i], >>>> - bitmap_bit_p (always_reached, i), >>>> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), >>>> bitmap_bit_p (always_executed, i)); >>>> } >>>> >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>> index 4b187c2cdaf..655fab03442 100644 >>>> --- a/gcc/tree-ssa-loop-im.c >>>> +++ b/gcc/tree-ssa-loop-im.c >>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) >>>> return ret; >>>> } >>>> >>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ >>>> + >>>> +static class loop * >>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >>>> + basic_block curr_bb) >>>> +{ >>>> + class loop *cold_loop, *min_loop; >>>> + cold_loop = min_loop = outmost_loop; >>>> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >>>> + >>>> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) >>> >>> Honza - can you comment on whether we should compare BB counts this way? >>> >>> I would suspect that for, say, >>> >>> for (...) >>> if (a) >>> X; >>> else >>> Y; >>> >>> that the counts for X and Y will be less than that of the preheader of the loop >>> only when the loop is estimated to run once. That is, should we really compare >>> the to the preheader count or maybe better to the _header_ count which >>> would keep the number of iterations out of the equation? 
>>
>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>> loop_preheader_edge (loop)->dest, it will cause many failures in
>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>> reasonable to compare the bb count with preheader count as both gimple lim
>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>> after analysis?
>
> I am not quite sure I understand what you shoot for.  But if you have
> loop invariant inside a loop nest and you get range of loops in the nest
> where you want to move it, you want to pick cheaper preheader count,
> since the statement is going to be executed there.
>
> For
>>> for (...)
>>>   if (a)
>>>     X;
>>>   else
>>>     Y;
>
> You may have frequency of X less than preheader i.e. when probability
> that a is true is lower than the reciprocal of the expected iteration count.
>
> If I understand correctly, you want to compare sum of counts of all
> BBs where invariant evaluates currently to the minimal count of
> preheader where you can move it.
>
> If you have
>
> for A
>   for B
>     for C
>       invariant_computation
>
> Usually you want to move it:
>
> invariant_computation
> for A
>   for B
>     for C
>
> However if for B usually iterates 0 times, it may happen that preheader
> of for C is executed less often than preheaders of for A/B and you want:
>
> for A
>   for B
>     invariant_computation
>     for C

Thanks, this is what I am trying to do in both gimple lim and RTL
loop-invariant motion.

In gimple lim, the newly added function find_coldest_out_loop[1] checks
whether invariant_computation is hotter than C_preheader; if so, it finds the
coldest preheader starting from the outermost nested loop.  If B is the
coldest, outmost_loop is reset to B, which avoids hoisting a cold statement
into hot loops and so reduces execution counts.  The gimple-only change
improves 500.perlbench_r and 548.exchange2_r a bit[2].

The RTL patch needs only a small check like the one below; it improves
500.perlbench_r by ~8% [2] on at least Power and aarch64.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580110.html

This is the other patch for which Richard and I are expecting your review :)


From 468e0b252a6b4a8b648c4a49850ed337ab5e03e1 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luoxhu@linux.ibm.com>
Date: Fri, 8 Oct 2021 22:05:39 -0500
Subject: [PATCH v4 2/2] loop-invariant: Don't move cold bb instructions to
 preheader in RTL

gcc/ChangeLog:

        * loop-invariant.c (find_invariants_bb): Check profile count
        before motion.
        (find_invariants_body): Add argument.
---
 gcc/loop-invariant.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index fca0c2b24be..5c3be7bf0eb 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
    call.  */

 static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+                    bool always_executed)
 {
   rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;

   FOR_BB_INSNS (bb, insn)
     {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
   unsigned i;

   for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-                        bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
                         bitmap_bit_p (always_executed, i));
 }
On Wed, Oct 27, 2021 at 4:40 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > > > > On 2021/10/26 21:20, Richard Biener wrote: > > On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >> > >> > >> > >> On 2021/10/15 16:11, Richard Biener wrote: > >>> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >>>> > >>>> Hi, > >>>> > >>>> On 2021/9/28 20:09, Richard Biener wrote: > >>>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote: > >>>>>> > >>>>>> Update the patch to v3, not sure whether you prefer the paste style > >>>>>> and continue to link the previous thread as Segher dislikes this... > >>>>>> > >>>>>> > >>>>>> [PATCH v3] Don't move cold code out of loop by checking bb count > >>>>>> > >>>>>> > >>>>>> Changes: > >>>>>> 1. Handle max_loop in determine_max_movement instead of > >>>>>> outermost_invariant_loop. > >>>>>> 2. Remove unnecessary changes. > >>>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. > >>>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > >>>>>> infinite loop when implementing v1 and the iteration is missed to be > >>>>>> updated actually. > >>>>>> > >>>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > >>>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > >>>>>> > >>>>>> There was a patch trying to avoid move cold block out of loop: > >>>>>> > >>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >>>>>> > >>>>>> Richard suggested to "never hoist anything from a bb with lower execution > >>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker > >>>>>> before_dom_children". > >>>>>> > >>>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to > >>>>>> expected target loop, if profile count of the loop bb is colder > >>>>>> than target loop preheader, it won't be hoisted out of loop. 
> >>>>>> Likely for store motion, if all locations of the REF in loop is cold, > >>>>>> don't do store motion of it. > >>>>>> > >>>>>> SPEC2017 performance evaluation shows 1% performance improvement for > >>>>>> intrate GEOMEAN and no obvious regression for others. Especially, > >>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >>>>>> on P8LE. > >>>>>> > >>>>>> gcc/ChangeLog: > >>>>>> > >>>>>> * loop-invariant.c (find_invariants_bb): Check profile count > >>>>>> before motion. > >>>>>> (find_invariants_body): Add argument. > >>>>>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > >>>>>> (determine_max_movement): Use find_coldest_out_loop. > >>>>>> (move_computations_worker): Adjust and fix iteration udpate. > >>>>>> (execute_sm_exit): Check pointer validness. > >>>>>> (class ref_in_loop_hot_body): New functor. > >>>>>> (ref_in_loop_hot_body::operator): New. > >>>>>> (can_sm_ref_p): Use for_all_locs_in_loop. > >>>>>> > >>>>>> gcc/testsuite/ChangeLog: > >>>>>> > >>>>>> * gcc.dg/tree-ssa/recip-3.c: Adjust. > >>>>>> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > >>>>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >>>>>> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. 
> >>>>>> --- > >>>>>> gcc/loop-invariant.c | 10 ++-- > >>>>>> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- > >>>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ > >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ > >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ > >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ > >>>>>> 7 files changed, 165 insertions(+), 8 deletions(-) > >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > >>>>>> > >>>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > >>>>>> index fca0c2b24be..5c3be7bf0eb 100644 > >>>>>> --- a/gcc/loop-invariant.c > >>>>>> +++ b/gcc/loop-invariant.c > >>>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > >>>>>> call. 
*/ > >>>>>> > >>>>>> static void > >>>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > >>>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > >>>>>> + bool always_executed) > >>>>>> { > >>>>>> rtx_insn *insn; > >>>>>> + basic_block preheader = loop_preheader_edge (loop)->src; > >>>>>> + > >>>>>> + if (preheader->count > bb->count) > >>>>>> + return; > >>>>>> > >>>>>> FOR_BB_INSNS (bb, insn) > >>>>>> { > >>>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > >>>>>> unsigned i; > >>>>>> > >>>>>> for (i = 0; i < loop->num_nodes; i++) > >>>>>> - find_invariants_bb (body[i], > >>>>>> - bitmap_bit_p (always_reached, i), > >>>>>> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > >>>>>> bitmap_bit_p (always_executed, i)); > >>>>>> } > >>>>>> > >>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >>>>>> index 4b187c2cdaf..655fab03442 100644 > >>>>>> --- a/gcc/tree-ssa-loop-im.c > >>>>>> +++ b/gcc/tree-ssa-loop-im.c > >>>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > >>>>>> return ret; > >>>>>> } > >>>>>> > >>>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ > >>>>>> + > >>>>>> +static class loop * > >>>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >>>>>> + basic_block curr_bb) > >>>>>> +{ > >>>>>> + class loop *cold_loop, *min_loop; > >>>>>> + cold_loop = min_loop = outmost_loop; > >>>>>> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > >>>>>> + > >>>>>> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > >>>>> > >>>>> Honza - can you comment on whether we should compare BB counts this way? > >>>>> > >>>>> I would suspect that for, say, > >>>>> > >>>>> for (...) 
> >>>>> if (a) > >>>>> X; > >>>>> else > >>>>> Y; > >>>>> > >>>>> that the counts for X and Y will be less than that of the preheader of the loop > >>>>> only when the loop is estimated to run once. That is, should we really compare > >>>>> the to the preheader count or maybe better to the _header_ count which > >>>>> would keep the number of iterations out of the equation? > >>>> > >>>> I quickly tried to replace all the loop_preheader_edge (loop)->src with > >>>> loop_preheader_edge (loop)->dest, it will cause many failures in > >>>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems > >>>> reasonable to compare the bb count with preheader count as both gimple lim > >>>> and RTL loop-invariant move instructions to *preheader* instead of *header* > >>>> after analysis? > >>> > >>> Hmm, yeah - guess I was confused here. > >>> > >>>>> > >>>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to > >>>>> compare a count to the "IPA hot", here we're comparing two counts > >>>>> within a function where it actually matters whether we use a<b or > >>>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p > >>>>> function). > >>>>> > >>>>> Xionghu, you error on the side of not hoisting for unordered counts here > >>>>> > >>>>>> + return NULL; > >>>>>> + > >>>>>> + while (min_loop != loop) > >>>>>> + { > >>>>>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > >>>>>> + if (loop_preheader_edge (min_loop)->src->count < min_count) > >>>>> > >>>>> but in the other direction here and on the side of not hoisting > >>>>> in ref_in_loop_hot_body. > >>>>> > >>>>> The three-state relational operator overloads are probably not the > >>>>> very best idea... 
> >>>>> (see profile-count.h for them) > >>>>> > >>>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision, > >>>> if FALSE is returned due to three-state inequality, find_coldest_out_loop > >>>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () > >>>> will return true to continue perform store motion, both preserve the previous > >>>> behavior. > >>> > >>> Thanks. But I don't think the abstraction as written is useful: > >>> > >>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state > >>> + as stated in profile-count.h, FALSE is returned if inequality cannot be > >>> + decided. */ > >>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) > >>> +{ > >>> + if (count1 < count2) > >>> + return true; > >>> + else > >>> + return false; > >>> +} > >>> > >>> given the following seems to pass the preheader count in place of the BB count. > >>> > >>> + if (bb_colder_than_loop_preheader ( > >>> + loop_preheader_edge (min_loop)->src->count, min_count)) > >>> + cold_loop = min_loop; > >>> > >>> find_coldest_out_loop is also a bit weird, I think we want to find > >>> the outermost loop between outmost_loop and loop that has a > >>> lower count than the curr_bb count but > >>> > >>> + while (min_loop != loop) > >>> + { > >>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > >>> + if (bb_colder_than_loop_preheader ( > >>> + loop_preheader_edge (min_loop)->src->count, min_count)) > >>> + cold_loop = min_loop; > >>> > >>> compares the outermost loops count (min_count) against the preheader > >>> count? So we're searching for a cold loop with respect to its enclosing loop > >>> here? 
> >> > >> Let me try to explain how it works :) > >> > >> find_coldest_out_loop does two steps check: > >> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold, > >> just return NULL which means it should not be moved out at all; > >> 2) curr_bb is NOT cold, assuming the current loop L[m] is the coldest first, > >> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]}, > >> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop > >> that has smallest profile_count. > >> > >> > >> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in > >> loop 3, if it is cold, just return NULL, otherwise select the coldest loop in > >> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to > >> be the target hoist loop. The first check could AVOID hoist if curr_bb is colder > >> than loop3, but it is still hot than loop1 and loop2. Not sure whether it is possible > >> to construct such cases? > >> > >> > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> > >> volatile int x; > >> void > >> bar (int, char *, char *); > >> void > >> foo (int *a, int n, int m, int s, int t) > >> { > >> int i; > >> int j; > >> int k; > >> > >> for (i = 0; i < m; i++) // loop 1 > >> { > >> if (__builtin_expect (x, 0)) > >> for (j = 0; j < n; j++) // loop 2 > >> for (k = 0; k < n; k++) // loop 3 > >> { > >> bar (s / 5, "one", "two"); // curr_bb > >> a[t] = s; > >> } > >> a[t] = t; // curr_bb2 > >> } > >> } > >> > >> The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1) > >> with this patch. > >> There are totally 6 combinations when curr_bb is hotter than loop 3. We need > >> to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness", > >> returning the coldest loop for this function find_coldest_out_loop, otherwise > >> unexpected behavior happens. 
> >> > >> L1 > L2 > L3 => return L3 > >> L1 > L3 > L2 => return L2 > >> L2 > L1 > L3 => return L3 > >> L2 > L3 > L1 => return L1 > >> L3 > L1 > L2 => return L2 > >> L3 > L2 > L1 => return L1 > >> > >> So bb_colder_than_loop_preheader does two kind of checks, one is checking > >> L3 preheader count with curr_bb count, another is checking L3 preheader count > >> with L1 preheader count, L2 preheader count, etc... > >> > >> > >> ssa-lim-19.c.138t.lim2: > >> ... > >> <bb 10> [local count: 16057869]: // L1 preheader > >> - _4 = s_22(D) / 5; > >> - _5 = (long unsigned int) t_24(D); > >> - _6 = _5 * 4; > >> - _7 = a_25(D) + _6; > >> _8 = (long unsigned int) t_24(D); > >> _9 = _8 * 4; > >> _10 = a_25(D) + _9; > >> > >> <bb 3> [local count: 145980626]: > >> # i_34 = PHI <i_29(12), 0(10)> > >> x.0_1 ={v} x; > >> if (x.0_1 != 0) > >> goto <bb 4>; [10.00%] > >> else > >> goto <bb 8>; [90.00%] > >> > >> <bb 4> [local count: 14598063]: > >> if (n_20(D) > 0) > >> goto <bb 11>; [89.00%] > >> else > >> goto <bb 8>; [11.00%] > >> > >> <bb 11> [local count: 12992276]: // L2 preheader > >> + _4 = s_22(D) / 5; > >> + _5 = (long unsigned int) t_24(D); > >> + _6 = _5 * 4; > >> + _7 = a_25(D) + _6; > >> goto <bb 7>; [100.00%] > >> > >> <bb 14> [local count: 850510901]: > >> > >> <bb 5> [local count: 955630225]: // curr_bb > >> # k_36 = PHI <k_27(14), 0(7)> > >> bar (_4, "one", "two"); > >> *_7 = s_22(D); > >> k_27 = k_36 + 1; > >> if (n_20(D) > k_27) > >> goto <bb 14>; [89.00%] > >> else > >> goto <bb 6>; [11.00%] > >> > >> <bb 6> [local count: 118111600]: > >> j_21 = j_35 + 1; > >> if (n_20(D) > j_21) > >> goto <bb 13>; [89.00%] > >> else > >> goto <bb 8>; [11.00%] > >> > >> <bb 13> [local count: 105119324]: > >> > >> <bb 7> [local count: 118111600]: // L3 preheader > >> # j_35 = PHI <j_21(13), 0(11)> > >> goto <bb 5>; [100.00%] > >> > >> <bb 8> [local count: 145980626]: > >> *_10 = t_24(D); > >> i_29 = i_34 + 1; > >> > >> Re-paste the bb_colder_than_loop_preheader and 
find_coldest_out_loop: > >> > >> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state > >> + as stated in profile-count.h, FALSE is returned if inequality cannot be > >> + decided. */ > >> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) > >> +{ > >> + if (count1 < count2) > >> + return true; > >> + else > >> + return false; > >> +} > >> + > >> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count. > >> + */ > >> + > >> +static class loop * > >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >> + basic_block curr_bb) > >> +{ > >> + class loop *cold_loop, *min_loop; > >> + cold_loop = min_loop = outmost_loop; > >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > >> + > >> + /* If bb_colder_than_loop_preheader returns false due to three-state > >> + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. > >> + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. 
*/ > >> + if (curr_bb > >> + && bb_colder_than_loop_preheader (curr_bb->count, > >> + loop_preheader_edge (loop)->src->count)) > >> + return NULL; > >> + > >> + while (min_loop != loop) > >> + { > >> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > >> + if (bb_colder_than_loop_preheader ( > >> + loop_preheader_edge (min_loop)->src->count, min_count)) > >> + cold_loop = min_loop; > >> + } > >> + return cold_loop; > >> +} > >> + > >> > >> > >>> > >>> Why is this function not simply > >>> > >>> +static class loop * > >>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >>> + basic_block curr_bb) > >>> +{ > >>> while (bb_colder_than_loop_preheader (curr_bb->count, > >>> loop_preheader_edge (outermost_loop)->src->count)) > >>> { > >>> if (outermost_loop == loop) > >>> return NULL; > >>> outermost_loop = superloop_at_depth (loop, loop_depth > >>> (outermost_loop) + 1); > >>> } > >>> return outermost_loop; > >>> } > >> > >> If change like this, when processing curr_bb(5), outermost_loop will > >> return loop 1 since curr_bb->count > Loop1_prehead->count, the while > >> loop stopped. This doesn't meet what we want. > > > > Why? curr_bb is executed at least as often as loop1 preheader if > > we look at the counts? So either the counts do not really tell us > > anything of help or I am missing something. Are you merely > > looking for a block with a lower count on the path from the outermost > > loop entry to the block in question and deciding you do not want to > > hoist further than that? So it's not about not hoisting to a hot place > > but instead hoist to the coldest place within a loop nest? > > > > So we have > > > > for (i = 0; i < m; i++) // loop 1 > > { > > if (__builtin_expect (x, 0)) > > for (j = 0; j < n; j++) // loop 2 > > > > > > <bb 10> [local count: 16057869]: // L1 preheader > > ... > > <bb 3> [local count: 145980626]: > > # i_34 = PHI <i_29(12), 0(10)> > > ... 
> > <bb 11> [local count: 12992276]: // L2 preheader > > ... > > <bb 7> [local count: 118111600]: // L3 preheader > > # j_35 = PHI <j_21(13), 0(11)> > > goto <bb 5>; [100.00%] > > > > and we want to hoist to the L2 preheader because that's less frequently > > executed than the L1 preheader (which is less frequently executed > > than the L3 preheader or the block we are hoisting from). > > Yes, this is exactly what I want, sorry for not describe it clear before ;( OK, thanks for confirming ;) > The updated patch[1] may reflect find_coldest_out_loop better: > It first check whether curr_bb is hotter than it's preheader, if false, return NULL > which means no need hoist at all; Then find a *coldest* preheader to hoist > within a loop nest from outmost_loop. > > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html > > > +/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count. > + It does two steps check: > + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just > + return NULL which means it should not be moved out at all; > + 2) CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search loops > + from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder > + than L[cold_loop], reset cold_loop to L[i] until get the loop that has > + smallest profile_count. */ > + > +static class loop * > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block curr_bb) > +{ > + class loop *cold_loop; > + > + /* If bb_colder_than_loop_preheader returns false due to three-state > + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. > + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. 
*/ > + if (curr_bb > + && bb_colder_than_loop_preheader (curr_bb, > + loop_preheader_edge (loop)->src)) > + return NULL; > + > + cold_loop = loop; > + while (outmost_loop != loop) > + { > + if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src, > + loop_preheader_edge (cold_loop)->src)) > + cold_loop = outmost_loop; > + outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1); > + } > + return cold_loop; > +} > > > > > > I'm concerned with compile-time complexity re-evaluating counts on the > > loop nest many times. So it looks to me that we can pre-compute > > this lowest-preheader-count loop for a loop nest at least for the > > store-motion case where we know the outermost loop? > > > > > > But the lowest-preheader-count loop may change for a loop/bb with different > outermost loop. For example if, > > L1_preheader_count < L2_preheader_count < L3_preheader_count < L4_preheader_count < curr_bb_count > > then, > > find_coldest_out_loop (L1, loop, curr_bb) => coldest preheader loop is L1 > find_coldest_out_loop (L2, loop, curr_bb) => coldest preheader loop is L2 > > So it will be a 1:N map? I'm talking about the can_sm_ref_p call, in that context 'loop' will be the outermost loop of interest, and we are calling this for all stores in a loop. We're doing +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + class loop *inner_loop = curr_bb->loop_father; + return find_coldest_out_loop (l, inner_loop, curr_bb); for each location the ref is accessed and the intent was to see whether there's at least one that we would like to move to 'loop'. Indeed since we only know the common outer loop but not the inner we are hosting from there's not a single "coldest" loop to cache and so any caching we might want to perform could be applied to the other case as well. 
I suppose the most natural thing to cache is for each loop the outer loop where
its outer loop preheader would be hotter than the outer loops preheader so that

+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
+                                         loop_preheader_edge (cold_loop)->src))
+        cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }

could be instead written as

  coldest_loop = coldest_outermost_loop[loop->num];
  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
    return outermost_loop;
  return coldest_loop;

?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
It should be possible to compute such cache in a DFS walk of the loop tree
(the loop iterator by default visits in such order).

> Pre-compute it in find_coldest_out_loop
> and save it also in lim_data with a new variable
> coldest_preheader_loop[outmost_loop][coldest_preheader_loop]?
> each call of find_coldest_out_loop will check whether that variable is set
> already, only continue the search if
> coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set?
> Seems a bit complicated and not sure whether it helps to reduce
> compile-time complexity or I am misunderstanding...
>
>
> --
> Thanks,
> Xionghu
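For illustration only, the caching idea above can be sketched outside of GCC as follows. This is a toy model under stated assumptions, not code from the patch: `Loop`, `fill_coldest` and `lookup_coldest` are hypothetical names, plain integers stand in for profile_count (so the three-state comparison is not modelled), and the cold-block check on curr_bb is left out. Ties keep the inner loop, which is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a node of the loop tree (not GCC's class loop).
struct Loop {
  int num;                        // index into the cache, like loop->num
  int depth;                      // 1 for an outermost loop
  std::uint64_t preheader_count;  // plain integer instead of profile_count
  std::vector<Loop*> inner;       // directly nested loops
};

// Pre-order (DFS) walk of the loop tree.  COLDEST_OUTER is the cached answer
// of the parent loop, or nullptr for an outermost loop.  A loop keeps itself
// as the answer unless an enclosing loop has a strictly colder preheader.
void fill_coldest(Loop* loop, Loop* coldest_outer, std::vector<Loop*>& cache) {
  assert(loop && static_cast<std::size_t>(loop->num) < cache.size());
  Loop* coldest = loop;
  if (coldest_outer && coldest_outer->preheader_count < loop->preheader_count)
    coldest = coldest_outer;
  cache[loop->num] = coldest;
  for (Loop* child : loop->inner)
    fill_coldest(child, coldest, cache);
}

// Constant-time lookup in the shape suggested above: a cached loop that lies
// outside of OUTERMOST is replaced by OUTERMOST itself.
Loop* lookup_coldest(Loop* outermost, Loop* loop,
                     const std::vector<Loop*>& cache) {
  Loop* coldest = cache[loop->num];
  if (coldest->depth < outermost->depth)
    return outermost;
  return coldest;
}
```

Fed with the three preheader counts from the ssa-lim-19.c dump quoted earlier in the thread (16057869 for loop 1, 12992276 for loop 2, 118111600 for loop 3), the cache maps loop 3 to loop 2, so a lookup with loop 1 as the outermost candidate yields loop 2, the hoisting target described for that test.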
On 2021/10/29 19:48, Richard Biener wrote:
> I'm talking about the can_sm_ref_p call, in that context 'loop' will
> be the outermost loop of interest, and we are calling this for all
> stores in a loop.  We're doing
>
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return find_coldest_out_loop (l, inner_loop, curr_bb);
>
> for each location the ref is accessed and the intent was to see
> whether there's at least one that we would like to move to 'loop'.
> Indeed since we only know the common outer loop but not the inner we
> are hoisting from there's not a single "coldest" loop to cache and so
> any caching we might want to perform could be applied to the other
> case as well.
>
> I suppose the most natural thing to cache is for each loop the outer loop where
> its outer loop preheader would be hotter than the outer loops preheader so that
>
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
> +                                         loop_preheader_edge (cold_loop)->src))
> +        cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
>
> could be instead written as
>
>   coldest_loop = coldest_outermost_loop[loop->num];
>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>     return outermost_loop;
>   return coldest_loop;
>
> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> It should be possible to compute such cache in a DFS walk of the loop tree
> (the loop iterator by default visits in such order).

Thanks.  I updated the patch following your suggestion, though I am not sure
whether it strictly conforms to your comments.  The patch passes all the tests
I added (coverage is not sufficient), but I am still a bit worried about the
case where the pre-computed coldest_loop is outside of outermost_loop while
outermost_loop is not the coldest loop in the remaining range, i.e.

  (outer->inner)
  [loop tree root, coldest_loop, outermost_loop, ..., second_coldest_loop, ..., loop]

In that case find_coldest_out_loop (renamed to get_coldest_out_loop in this
version) returns outermost_loop, which does not match our expectation:
shouldn't it return second_coldest_loop instead?


Changes:
1. Add function fill_coldest_out_loop to pre-compute the coldest
outermost loop for each loop.
2. Rename find_coldest_out_loop to get_coldest_out_loop.
3. Add testcase ssa-lim-22.c to differentiate from ssa-lim-19.c.

v5 changes:
1. Refine comments for new functions.
2. Use basic_block instead of count in bb_colder_than_loop_preheader
to align with the function name.
3. Simplify the implementation of get_coldest_out_loop and
ref_in_loop_hot_body::operator () for better understanding.

v4 changes:
1. Sort out the profile_count comparison into function
bb_colder_than_loop_preheader.
2. Update ref_in_loop_hot_body::operator () to find cold_loop before
comparing.
3. Split the RTL invariant motion part out.
4. Remove aux changes.

v3 changes:
1. Handle max_loop in determine_max_movement instead of
outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept: the iterator
was not advanced before "continue", which caused an infinite loop when
implementing v1.

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html

There was a patch trying to avoid moving cold blocks out of loops:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add get_coldest_out_loop to move invariants to the
expected target loop: if the profile count of the loop bb is colder than the
target loop preheader, it won't be hoisted out of the loop.  Likewise for
store motion, if all locations of the REF in the loop are cold, don't do
store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r +1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
	function.
	(get_coldest_out_loop): New function.
	(determine_max_movement): Use get_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration update.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.
	(fill_coldest_out_loop): New.
	(loop_invariant_motion_in_fun): Call fill_coldest_out_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
--- gcc/tree-ssa-loop-im.c | 111 ++++++++++++++++++++- gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 ++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 29 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 +++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c | 32 ++++++ 7 files changed, 251 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 4b187c2cdaf..d3390385fd9 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -146,6 +146,9 @@ public: enum dep_kind { lim_raw, sm_war, sm_waw }; enum dep_state { dep_unknown, dep_independent, dep_dependent }; +/* coldest outermost loop for given loop. */ +class loop **coldest_outermost_loop; + /* Populate the loop dependence cache of REF for LOOP, KIND with STATE. */ static void @@ -417,6 +420,43 @@ movement_possibility (gimple *stmt) return ret; } +/* Compare the profile count inequality of bb and preheader, it is three-state + as stated in profile-count.h, FALSE is returned if inequality cannot be + decided. */ +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader) +{ + gcc_assert (bb && preheader); + return bb->count < preheader->count; +} + +/* Check coldest loop between OUTMOST_LOOP and LOOP by comparing profile count. + It does two steps check: + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just + return NULL which means it should not be moved out at all; + 2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of + OUTMOST_LOOP. 
*/ + +static class loop * +get_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ + gcc_assert (outmost_loop == loop || flow_loop_nested_p (outmost_loop, loop)); + class loop *cold_loop; + + /* If bb_colder_than_loop_preheader returns false due to three-state + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ + if (curr_bb + && bb_colder_than_loop_preheader (curr_bb, + loop_preheader_edge (loop)->src)) + return NULL; + + class loop *coldest_loop = coldest_outermost_loop[loop->num]; + if (loop_depth (coldest_loop) < loop_depth (outmost_loop)) + return outmost_loop; + return coldest_loop; +} + /* Suppose that operand DEF is used inside the LOOP. Returns the outermost loop to that we could move the expression using DEF if it did not have other operands, i.e. the outermost loop enclosing LOOP in that the value @@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) level = ALWAYS_EXECUTED_IN (bb); else level = superloop_at_depth (loop, 1); - lim_data->max_loop = level; + lim_data->max_loop = get_coldest_out_loop (level, loop, bb); + if (!lim_data->max_loop) + return false; if (gphi *phi = dyn_cast <gphi *> (stmt)) { @@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb) /* We do not really want to move conditionals out of the loop; we just placed it here to force its operands to be moved if necessary. */ if (gimple_code (stmt) == GIMPLE_COND) - continue; + { + gsi_next (&bsi); + continue; + } if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -2887,6 +2932,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) return indep_p; } +class ref_in_loop_hot_body +{ +public: + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} + bool operator () (mem_ref_loc *loc); + class loop *l; +}; + +/* Check the coldest loop between loop L and innermost loop. 
If there is one + cold loop between L and INNER_LOOP, store motion can be performed, otherwise + no cold loop means no store motion. get_coldest_out_loop also handles cases + when l is inner_loop. */ +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + class loop *inner_loop = curr_bb->loop_father; + return get_coldest_out_loop (l, inner_loop, curr_bb); +} + /* Returns true if we can perform store motion of REF from LOOP. */ @@ -2941,6 +3006,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) if (!ref_indep_loop_p (loop, ref, sm_war)) return false; + /* Verify whether the candidate is hot for LOOP. Only do store motion if the + candidate's profile count is hot. Statement in cold BB shouldn't be moved + out of it's loop_father. */ + if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop))) + return false; + return true; } @@ -3153,6 +3224,34 @@ fill_always_executed_in (void) fill_always_executed_in_1 (loop, contains_call); } +/* Find the coldest loop preheader from loop tree root to LOOP. Set LOOP to + cold_loop, then iteratively search loops from {L[outmost_loop], + L[outmost_loop+1], ... L[loop]}, if L[i] is colder than L[cold_loop], reset + cold_loop to L[i] until get the loop that has smallest profile_count. + Then recursively set each inner loop. 
*/ + +void +fill_coldest_out_loop (class loop *loop) +{ + class loop *outmost_loop = current_loops->tree_root->inner; + class loop *cold_loop = loop; + while (outmost_loop != loop) + { + if (bb_colder_than_loop_preheader ( + loop_preheader_edge (outmost_loop)->src, + loop_preheader_edge (cold_loop)->src)) + cold_loop = outmost_loop; + outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1); + } + + coldest_outermost_loop[loop->num] = cold_loop; + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n", + loop->num, cold_loop->num); + + for (loop = loop->inner; loop; loop = loop->next) + fill_coldest_out_loop (loop); +} /* Compute the global information needed by the loop invariant motion pass. */ @@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void) free_affine_expand_cache (&memory_accesses.ttae_cache); free (bb_loop_postorder); + + free (coldest_outermost_loop); } /* Moves invariants from loops. Only "expensive" invariants are moved out -- @@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion) /* Fills ALWAYS_EXECUTED_IN information for basic blocks. */ fill_always_executed_in (); + /* Pre-compute coldest outermost loop of each loop. 
*/ + class loop *loop; + coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun)); + for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next) + fill_coldest_out_loop (loop); + int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun)); int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c new file mode 100644 index 00000000000..7326a230b3f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int k) +{ + int i; + + for (i = 0; i < n; i++) + { + if (__builtin_expect (x, 0)) + bar (k / 5, "one", "two"); + a[i] = k; + } +} + +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..51c1913d003 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int s, int t) +{ + int i; + int j; + int k; + + for (i = 0; i < m; i++) // Loop 1 + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) // Loop 2 + for (k = 0; k < n; k++) // Loop 3 + { + bar (s / 5, "one", "two"); + a[t] = s; + } + a[t] = t; + } +} + +/* { dg-final { 
scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..bc60a040a70 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is not hoisted out of loop when bb is cold. */ + +int count; +volatile int x; + +struct obj { + int data; + struct obj *next; + +} *q; + +void +func (int m) +{ + struct obj *p; + for (int i = 0; i < m; i++) + if (__builtin_expect (x, 0)) + count++; + +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c new file mode 100644 index 00000000000..ffe6f8f699d --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c @@ -0,0 +1,35 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop + when it is in cold loop. */ + +int count; +volatile int x; + +struct obj { + int data; + int data1; + struct obj *next; +}; + +void +func (int m, int n, int k, struct obj *a) +{ + struct obj *q = a; + for (int j = 0; j < m; j++) + if (__builtin_expect (m, 0)) + for (int i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + { + count++; + q->data += 3; /* Not hoisted out to inner loop. */ + } + count += n; + q->data1 += k; /* Not hoisted out to outer loop. 
*/ + } +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c new file mode 100644 index 00000000000..16ba4ceb8ab --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c @@ -0,0 +1,32 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +volatile int y; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int s, int t) +{ + int i; + int j; + int k; + + for (i = 0; i < m; i++) // Loop 1 + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) // Loop 2 + if (__builtin_expect (y, 0)) + for (k = 0; k < n; k++) // Loop 3 + { + bar (s / 5, "one", "two"); + a[t] = s; + } + a[t] = t; + } +} + +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + +
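The worry raised in the cover text of this version (the cached coldest loop lying outside of outermost_loop) can be made concrete with a toy model. The sketch below is not GCC code: `coldest_by_search` and `coldest_by_cache` are hypothetical helpers, loops of a single nest are identified by their nesting index (0 is the outermost), and the counts in the usage note are invented for the example rather than taken from any dump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// counts[i] is a made-up preheader count for the loop at nesting index i of
// one loop nest (index 0 is the outermost loop).  Plain integers stand in
// for profile_count, so the three-state comparison is not modelled.

// Linear search in the style of the earlier find_coldest_out_loop: start
// with LOOP as the candidate and scan from OUTERMOST inwards.
std::size_t coldest_by_search(const std::vector<std::uint64_t>& counts,
                              std::size_t outermost, std::size_t loop) {
  assert(outermost <= loop && loop < counts.size());
  std::size_t cold = loop;
  for (std::size_t i = outermost; i != loop; ++i)
    if (counts[i] < counts[cold])
      cold = i;
  return cold;
}

// Cached variant: the cache entry is the coldest loop as seen from the
// outermost loop of the nest, and it is replaced by OUTERMOST when it lies
// outside of OUTERMOST.
std::size_t coldest_by_cache(const std::vector<std::uint64_t>& counts,
                             std::size_t outermost, std::size_t loop) {
  std::size_t cached = coldest_by_search(counts, 0, loop);
  return cached < outermost ? outermost : cached;
}
```

With the invented counts {10, 100, 50, 200}, outermost index 1 and loop index 3, the search picks index 2 (count 50) while the cached variant returns index 1 (count 100), which is the mismatch described in the cover text. When the outermost candidate is index 0, both variants agree on index 0.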
*/ + +void +fill_coldest_out_loop (class loop *loop) +{ + class loop *outmost_loop = current_loops->tree_root->inner; + class loop *cold_loop = loop; + while (outmost_loop != loop) + { + if (bb_colder_than_loop_preheader ( + loop_preheader_edge (outmost_loop)->src, + loop_preheader_edge (cold_loop)->src)) + cold_loop = outmost_loop; + outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1); + } + + coldest_outermost_loop[loop->num] = cold_loop; + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n", + loop->num, cold_loop->num); + + for (loop = loop->inner; loop; loop = loop->next) + fill_coldest_out_loop (loop); +} /* Compute the global information needed by the loop invariant motion pass. */ @@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void) free_affine_expand_cache (&memory_accesses.ttae_cache); free (bb_loop_postorder); + + free (coldest_outermost_loop); } /* Moves invariants from loops. Only "expensive" invariants are moved out -- @@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion) /* Fills ALWAYS_EXECUTED_IN information for basic blocks. */ fill_always_executed_in (); + /* Pre-compute coldest outermost loop of each loop. 
*/ + class loop *loop; + coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun)); + for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next) + fill_coldest_out_loop (loop); + int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun)); int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c new file mode 100644 index 00000000000..7326a230b3f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int k) +{ + int i; + + for (i = 0; i < n; i++) + { + if (__builtin_expect (x, 0)) + bar (k / 5, "one", "two"); + a[i] = k; + } +} + +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..51c1913d003 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int s, int t) +{ + int i; + int j; + int k; + + for (i = 0; i < m; i++) // Loop 1 + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) // Loop 2 + for (k = 0; k < n; k++) // Loop 3 + { + bar (s / 5, "one", "two"); + a[t] = s; + } + a[t] = t; + } +} + +/* { dg-final { 
scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..bc60a040a70 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is not hoisted out of loop when bb is cold. */ + +int count; +volatile int x; + +struct obj { + int data; + struct obj *next; + +} *q; + +void +func (int m) +{ + struct obj *p; + for (int i = 0; i < m; i++) + if (__builtin_expect (x, 0)) + count++; + +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c new file mode 100644 index 00000000000..ffe6f8f699d --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c @@ -0,0 +1,35 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop + when it is in cold loop. */ + +int count; +volatile int x; + +struct obj { + int data; + int data1; + struct obj *next; +}; + +void +func (int m, int n, int k, struct obj *a) +{ + struct obj *q = a; + for (int j = 0; j < m; j++) + if (__builtin_expect (m, 0)) + for (int i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + { + count++; + q->data += 3; /* Not hoisted out to inner loop. */ + } + count += n; + q->data1 += k; /* Not hoisted out to outer loop. 
*/ + } +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c new file mode 100644 index 00000000000..16ba4ceb8ab --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c @@ -0,0 +1,32 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +volatile int y; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int s, int t) +{ + int i; + int j; + int k; + + for (i = 0; i < m; i++) // Loop 1 + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) // Loop 2 + if (__builtin_expect (y, 0)) + for (k = 0; k < n; k++) // Loop 3 + { + bar (s / 5, "one", "two"); + a[t] = s; + } + a[t] = t; + } +} + +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + +
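For readers following along without a GCC tree at hand, the caching scheme in the patch above can be sketched roughly as below. This is only a toy model: ToyLoop, fill_coldest and the plain integer counts are names invented for this illustration and are not GCC's class loop or profile_count. Unlike the patch, which starts its walk at current_loops->tree_root->inner, the sketch starts at the depth-1 ancestor of the loop being filled.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for GCC's class loop; every name here is hypothetical.
struct ToyLoop {
  ToyLoop *outer;            // enclosing loop, nullptr for a depth-1 loop
  int depth;                 // 1 for an outermost loop
  uint64_t preheader_count;  // execution count of the loop preheader
  ToyLoop *coldest;          // cache filled by fill_coldest
};

// Ancestor of LOOP (or LOOP itself) at nesting depth DEPTH.
static ToyLoop *superloop_at_depth(ToyLoop *loop, int depth) {
  assert(depth >= 1 && depth <= loop->depth);
  while (loop->depth > depth)
    loop = loop->outer;
  return loop;
}

// Cache the loop with the coldest preheader among LOOP and its ancestors,
// visiting the ancestors from the outermost one inwards.
static void fill_coldest(ToyLoop *loop) {
  ToyLoop *cold = loop;
  for (int d = 1; d < loop->depth; ++d) {
    ToyLoop *candidate = superloop_at_depth(loop, d);
    if (candidate->preheader_count < cold->preheader_count)
      cold = candidate;
  }
  loop->coldest = cold;
}

// Returns nullptr when the block is colder than the preheader of its own
// loop (do not hoist at all); otherwise the cached coldest loop, clamped
// to OUTMOST when the cached loop lies outside of it.
static ToyLoop *get_coldest_out_loop(ToyLoop *outmost, ToyLoop *loop,
                                     uint64_t bb_count) {
  if (bb_count < loop->preheader_count)
    return nullptr;
  ToyLoop *coldest = loop->coldest;
  return coldest->depth < outmost->depth ? outmost : coldest;
}
```

With a nest shaped like ssa-lim-19.c (a cold preheader for the middle loop), the innermost loop's cache entry would point at the middle loop, so a hot statement in the innermost loop is allowed to move no further out than that.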
On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> On 2021/10/29 19:48, Richard Biener wrote:
> > I'm talking about the can_sm_ref_p call, in that context 'loop' will
> > be the outermost loop of interest, and we are calling this for all
> > stores in a loop.  We're doing
> >
> > +bool
> > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> > +{
> > +  basic_block curr_bb = gimple_bb (loc->stmt);
> > +  class loop *inner_loop = curr_bb->loop_father;
> > +  return find_coldest_out_loop (l, inner_loop, curr_bb);
> >
> > for each location the ref is accessed and the intent was to see
> > whether there's at least one that we would like to move to 'loop'.
> > Indeed since we only know the common outer loop but not the inner we
> > are hoisting from there's not a single "coldest" loop to cache and so
> > any caching we might want to perform could be applied to the other
> > case as well.
> >
> > I suppose the most natural thing to cache is for each loop the outer
> > loop where its outer loop preheader would be hotter than the outer
> > loops preheader so that
> >
> > +  while (outmost_loop != loop)
> > +    {
> > +      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
> > +                                         loop_preheader_edge (cold_loop)->src))
> > +        cold_loop = outmost_loop;
> > +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> > +    }
> >
> > could be instead written as
> >
> >   coldest_loop = coldest_outermost_loop[loop->num];
> >   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >     return outermost_loop;
> >   return coldest_loop;
> >
> > ?  And in the usual case coldest_outermost_loop[L] would be the loop
> > tree root.  It should be possible to compute such cache in a DFS walk
> > of the loop tree (the loop iterator by default visits in such order).
>
> Thanks.  Updated the patch with your suggestion.  Not sure whether it
> strictly conforms to your comments.  Though the patch passed all my
> added tests (coverage is not enough), I am still a bit worried if the
> pre-computed coldest_loop is outside of outermost_loop, but
> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>
> [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>
> then function find_coldest_out_loop will return a loop that does NOT
> match our expectation: it should return second_coldest_loop instead of
> outermost_loop?

Hmm, interesting - yes.  I guess the common case will be that the
pre-computed outermost loop will be the loop at depth 1 since outer
loops tend to be colder than inner loops?  That would then defeat the
whole exercise.

To optimize the common case without giving up iteration in the cases we
care about, we could instead cache the next outermost loop that is _not_
colder than loop.  So for your [ ... ] example above we'd have
hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
candidate would then be 'second_coldest_loop' and we'd then iterate to
hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
cold candidate we can compare against?  For the common case we'd have
hotter_than_inner_loop[loop] == NULL (no such loop) and we then simply
pick 'outermost_loop'.

One comment on the patch itself below.

>
> Changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
>
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
>
> v4 changes:
> 1. Sort out profile_count comparison to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
>
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
>
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likewise for store motion, if all locations of the REF in loop are cold,
> don't do store motion of it.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r +1.98%, 526.blender_r +1.00%
> on P8LE.
>
> gcc/ChangeLog:
>
> 	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> 	function.
> 	(get_coldest_out_loop): New function.
> 	(determine_max_movement): Use get_coldest_out_loop.
> (move_computations_worker): Adjust and fix iteration udpate. > (class ref_in_loop_hot_body): New functor. > (ref_in_loop_hot_body::operator): New. > (can_sm_ref_p): Use for_all_locs_in_loop. > (fill_coldest_out_loop): New. > (loop_invariant_motion_in_fun): Call fill_coldest_out_loop. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/recip-3.c: Adjust. > * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > * gcc.dg/tree-ssa/ssa-lim-21.c: New test. > * gcc.dg/tree-ssa/ssa-lim-22.c: New test. > --- > gcc/tree-ssa-loop-im.c | 111 ++++++++++++++++++++- > gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 ++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 29 ++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 +++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c | 32 ++++++ > 7 files changed, 251 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index 4b187c2cdaf..d3390385fd9 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -146,6 +146,9 @@ public: > enum dep_kind { lim_raw, sm_war, sm_waw }; > enum dep_state { dep_unknown, dep_independent, dep_dependent }; > > +/* coldest outermost loop for given loop. */ > +class loop **coldest_outermost_loop; > + > /* Populate the loop dependence cache of REF for LOOP, KIND with STATE. 
*/ > > static void > @@ -417,6 +420,43 @@ movement_possibility (gimple *stmt) > return ret; > } > > +/* Compare the profile count inequality of bb and preheader, it is three-state > + as stated in profile-count.h, FALSE is returned if inequality cannot be > + decided. */ > +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader) > +{ > + gcc_assert (bb && preheader); > + return bb->count < preheader->count; > +} > + > +/* Check coldest loop between OUTMOST_LOOP and LOOP by comparing profile count. > + It does two steps check: > + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just > + return NULL which means it should not be moved out at all; > + 2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of > + OUTMOST_LOOP. */ > + > +static class loop * > +get_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block curr_bb) > +{ > + gcc_assert (outmost_loop == loop || flow_loop_nested_p (outmost_loop, loop)); > + class loop *cold_loop; > + > + /* If bb_colder_than_loop_preheader returns false due to three-state > + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. > + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ > + if (curr_bb > + && bb_colder_than_loop_preheader (curr_bb, > + loop_preheader_edge (loop)->src)) > + return NULL; > + > + class loop *coldest_loop = coldest_outermost_loop[loop->num]; > + if (loop_depth (coldest_loop) < loop_depth (outmost_loop)) > + return outmost_loop; > + return coldest_loop; > +} > + > /* Suppose that operand DEF is used inside the LOOP. Returns the outermost > loop to that we could move the expression using DEF if it did not have > other operands, i.e. 
the outermost loop enclosing LOOP in that the value > @@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) > level = ALWAYS_EXECUTED_IN (bb); > else > level = superloop_at_depth (loop, 1); > - lim_data->max_loop = level; > + lim_data->max_loop = get_coldest_out_loop (level, loop, bb); > + if (!lim_data->max_loop) > + return false; > > if (gphi *phi = dyn_cast <gphi *> (stmt)) > { > @@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb) > /* We do not really want to move conditionals out of the loop; we just > placed it here to force its operands to be moved if necessary. */ > if (gimple_code (stmt) == GIMPLE_COND) > - continue; > + { > + gsi_next (&bsi); > + continue; > + } > > if (dump_file && (dump_flags & TDF_DETAILS)) > { > @@ -2887,6 +2932,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) > return indep_p; > } > > +class ref_in_loop_hot_body > +{ > +public: > + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} > + bool operator () (mem_ref_loc *loc); > + class loop *l; > +}; > + > +/* Check the coldest loop between loop L and innermost loop. If there is one > + cold loop between L and INNER_LOOP, store motion can be performed, otherwise > + no cold loop means no store motion. get_coldest_out_loop also handles cases > + when l is inner_loop. */ > +bool > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) > +{ > + basic_block curr_bb = gimple_bb (loc->stmt); > + class loop *inner_loop = curr_bb->loop_father; > + return get_coldest_out_loop (l, inner_loop, curr_bb); > +} > + > > /* Returns true if we can perform store motion of REF from LOOP. */ > > @@ -2941,6 +3006,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) > if (!ref_indep_loop_p (loop, ref, sm_war)) > return false; > > + /* Verify whether the candidate is hot for LOOP. Only do store motion if the > + candidate's profile count is hot. Statement in cold BB shouldn't be moved > + out of it's loop_father. 
*/
> +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +    return false;
> +
>    return true;
>  }
>
> @@ -3153,6 +3224,34 @@ fill_always_executed_in (void)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>
> +/* Find the coldest loop preheader from loop tree root to LOOP.  Set LOOP to
> +   cold_loop, then iteratively search loops from {L[outmost_loop],
> +   L[outmost_loop+1], ... L[loop]}, if L[i] is colder than L[cold_loop], reset
> +   cold_loop to L[i] until get the loop that has smallest profile_count.
> +   Then recursively set each inner loop.  */
> +
> +void
> +fill_coldest_out_loop (class loop *loop)
> +{
> +  class loop *outmost_loop = current_loops->tree_root->inner;

that should be superloop_at_depth (loop, 1), otherwise it's wrong
when the function has more than one loop at the outermost level.

I was also hoping to avoid this loop by passing down the current
coldest loop as the single one to compare against.  But with the
above discussion things will look different anyway I guess.

> +  class loop *cold_loop = loop;
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (
> +            loop_preheader_edge (outmost_loop)->src,
> +            loop_preheader_edge (cold_loop)->src))
> +        cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
> +
> +  coldest_outermost_loop[loop->num] = cold_loop;
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n",
> +                 loop->num, cold_loop->num);
> +
> +  for (loop = loop->inner; loop; loop = loop->next)
> +    fill_coldest_out_loop (loop);
> +}
>
>  /* Compute the global information needed by the loop invariant motion pass.  */
>
> @@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void)
>    free_affine_expand_cache (&memory_accesses.ttae_cache);
>
>    free (bb_loop_postorder);
> +
> +  free (coldest_outermost_loop);
>  }
>
>  /* Moves invariants from loops.
Only "expensive" invariants are moved out -- > @@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion) > /* Fills ALWAYS_EXECUTED_IN information for basic blocks. */ > fill_always_executed_in (); > > + /* Pre-compute coldest outermost loop of each loop. */ > + class loop *loop; > + coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun)); > + for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next) > + fill_coldest_out_loop (loop); > + > int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun)); > int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false); > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > index 638bf38db8c..641c91e719e 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > @@ -23,4 +23,4 @@ float h () > F[0] += E / d; > } > > -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ > +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > new file mode 100644 > index 00000000000..7326a230b3f > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > @@ -0,0 +1,20 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +bar (int, char *, char *); > +void > +foo (int *a, int n, int k) > +{ > + int i; > + > + for (i = 0; i < n; i++) > + { > + if (__builtin_expect (x, 0)) > + bar (k / 5, "one", "two"); > + a[i] = k; > + } > +} > + > +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..51c1913d003 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,29 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 
-fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +bar (int, char *, char *); > +void > +foo (int *a, int n, int m, int s, int t) > +{ > + int i; > + int j; > + int k; > + > + for (i = 0; i < m; i++) // Loop 1 > + { > + if (__builtin_expect (x, 0)) > + for (j = 0; j < n; j++) // Loop 2 > + for (k = 0; k < n; k++) // Loop 3 > + { > + bar (s / 5, "one", "two"); > + a[t] = s; > + } > + a[t] = t; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..bc60a040a70 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,25 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +/* Test that `count' is not hoisted out of loop when bb is cold. */ > + > +int count; > +volatile int x; > + > +struct obj { > + int data; > + struct obj *next; > + > +} *q; > + > +void > +func (int m) > +{ > + struct obj *p; > + for (int i = 0; i < m; i++) > + if (__builtin_expect (x, 0)) > + count++; > + > +} > + > +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > new file mode 100644 > index 00000000000..ffe6f8f699d > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > @@ -0,0 +1,35 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop > + when it is in cold loop. 
*/ > + > +int count; > +volatile int x; > + > +struct obj { > + int data; > + int data1; > + struct obj *next; > +}; > + > +void > +func (int m, int n, int k, struct obj *a) > +{ > + struct obj *q = a; > + for (int j = 0; j < m; j++) > + if (__builtin_expect (m, 0)) > + for (int i = 0; i < m; i++) > + { > + if (__builtin_expect (x, 0)) > + { > + count++; > + q->data += 3; /* Not hoisted out to inner loop. */ > + } > + count += n; > + q->data1 += k; /* Not hoisted out to outer loop. */ > + } > +} > + > +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c > new file mode 100644 > index 00000000000..16ba4ceb8ab > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c > @@ -0,0 +1,32 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +volatile int y; > +void > +bar (int, char *, char *); > +void > +foo (int *a, int n, int m, int s, int t) > +{ > + int i; > + int j; > + int k; > + > + for (i = 0; i < m; i++) // Loop 1 > + { > + if (__builtin_expect (x, 0)) > + for (j = 0; j < n; j++) // Loop 2 > + if (__builtin_expect (y, 0)) > + for (k = 0; k < n; k++) // Loop 3 > + { > + bar (s / 5, "one", "two"); > + a[t] = s; > + } > + a[t] = t; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */ > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ > + > + > -- > 2.27.0.90.geebb51ba8c > > >
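The concern discussed in this reply, that a cached coldest loop lying outside of outermost_loop makes the lookup fall back to outermost_loop even though a colder loop sits in between, can be reproduced with a small toy model. The sketch below is an assumption-laden illustration only: ToyLoop, cached_lookup and restricted_walk are hypothetical names, counts are plain integers, and tie-breaking between equally cold loops is not modelled the way the patch does it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy model of a loop nest; not GCC's class loop.
struct ToyLoop {
  ToyLoop *outer;            // enclosing loop, nullptr for a depth-1 loop
  int depth;                 // 1 for an outermost loop
  uint64_t preheader_count;  // execution count of the loop preheader
};

// What a single per-loop cache gives: the coldest loop on the whole chain
// from LOOP up to depth 1, replaced by OUTMOST when it lies outside of it.
static ToyLoop *cached_lookup(ToyLoop *outmost, ToyLoop *loop) {
  ToyLoop *coldest = loop;
  for (ToyLoop *l = loop->outer; l; l = l->outer)
    if (l->preheader_count < coldest->preheader_count)
      coldest = l;
  return coldest->depth < outmost->depth ? outmost : coldest;
}

// Reference answer: the coldest loop restricted to the range
// [OUTMOST, LOOP].  OUTMOST must be LOOP itself or one of its ancestors.
static ToyLoop *restricted_walk(ToyLoop *outmost, ToyLoop *loop) {
  assert(outmost->depth <= loop->depth);
  ToyLoop *coldest = loop;
  for (ToyLoop *l = loop; l != outmost;) {
    l = l->outer;
    if (l->preheader_count < coldest->preheader_count)
      coldest = l;
  }
  return coldest;
}
```

For a chain [coldest_loop (count 5), outermost_loop (100), second_coldest_loop (20), loop (60)], cached_lookup returns outermost_loop while restricted_walk returns second_coldest_loop, which is the mismatch described above.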
diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index bdc7b59dd5f..7b5d64d11f9 100644 --- a/gcc/loop-invariant.c +++ b/gcc/loop-invariant.c @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) call. */ static void -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, + bool always_executed) { rtx_insn *insn; + basic_block preheader = loop_preheader_edge (loop)->src; + + if (preheader->count > bb->count) + return; FOR_BB_INSNS (bb, insn) { @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, unsigned i; for (i = 0; i < loop->num_nodes; i++) - find_invariants_bb (body[i], - bitmap_bit_p (always_reached, i), + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), bitmap_bit_p (always_executed, i)); } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 7de47edbcb3..2bfb5e8ec15 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb) continue; } + edge e = loop_preheader_edge (level); + if (e->src->count > bb->count) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "PHI node NOT moved to %d from %d:\n", + e->src->index, bb->index); + print_gimple_stmt (dump_file, stmt, 0); + fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost, + level->num); + } + gsi_next (&bsi); + continue; + } + else + { + unsigned i; + bool skip_phi_move = false; + for (i = 0; i < 
gimple_phi_num_args (stmt); i++)
+            {
+              tree def = PHI_ARG_DEF (stmt, i);
+
+              if (TREE_CODE (def) != SSA_NAME)
+                continue;
+
+              gimple *def_stmt = SSA_NAME_DEF_STMT (def);
+
+              if (!gimple_bb (def_stmt))
+                continue;
+
+              if (!dominated_by_p (CDI_DOMINATORS, e->src,
+                                   gimple_bb (def_stmt)))
+                {
+                  if (dump_file && (dump_flags & TDF_DETAILS))
+                    {
+                      fprintf (dump_file,
+                               "PHI node NOT moved to %d [local count:%d] from "
+                               "%d [local count:%d]:\n",
+                               e->src->index, e->src->count.value (), bb->index,
+                               bb->count.value ());
+                      print_gimple_stmt (dump_file, stmt, 0);
+                      fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
+                               level->num);
+                    }
+                  skip_phi_move = true;
+                  break;
+                }
+            }
+          if (skip_phi_move)
+            {
+              gsi_next (&bsi);
+              continue;
+            }
+        }
+
       if (dump_file && (dump_flags & TDF_DETAILS))
         {
           fprintf (dump_file, "Moving PHI node\n");
@@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
           tree lhs = gimple_assign_lhs (new_stmt);
           SSA_NAME_RANGE_INFO (lhs) = NULL;
         }
-      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
+      gsi_insert_on_edge (e, new_stmt);
       remove_phi_node (&bsi, false);
     }

   for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
     {
       edge e;
-
       gimple *stmt = gsi_stmt (bsi);

       lim_data = get_lim_data (stmt);
@@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
          placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-        continue;
+        {
+          gsi_next (&bsi);
+          continue;
+        }
+
+      e = loop_preheader_edge (level);
+      if (e->src->count > bb->count)
+        {
+          if (dump_file && (dump_flags & TDF_DETAILS))
+            {
+              fprintf (dump_file,
+                       "stmt: Statement NOT moved to %d [local count:%d] from "
+                       "%d [local count:%d]:\n",
+                       e->src->index, e->src->count.value (), bb->index,
+                       bb->count.value ());
+              print_gimple_stmt (dump_file, stmt, 0);
+              fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
+                       level->num);
+            }
+          gsi_next (&bsi);
+          continue;
+        }
+      else
+        {
+          if (is_gimple_assign (stmt))
+            {
+              tree rhs1 = gimple_assign_rhs1 (stmt);
+              tree rhs2 = gimple_assign_rhs2 (stmt);
+              if (TREE_CODE (rhs1) == MEM_REF)
+                {
+                  rhs2 = TREE_OPERAND (rhs1, 1);
+                  rhs1 = TREE_OPERAND (rhs1, 0);
+                }
+              gimple *stmt1 = NULL, *stmt2 = NULL;
+              basic_block def_bb;
+              if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
+                {
+                  stmt1 = SSA_NAME_DEF_STMT (rhs1);
+                  def_bb = gimple_bb (stmt1);
+                  if (stmt1
+                      && def_bb
+                      && (def_bb == bb
+                          || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
+                    {
+                      if (dump_file && (dump_flags & TDF_DETAILS))
+                        {
+                          fprintf (dump_file,
+                                   "stmt1: Statement NOT moved to %d [local "
+                                   "count:%d] from %d [local count:%d]:\n",
+                                   e->src->index, e->src->count.value (),
+                                   bb->index, bb->count.value ());
+                          print_gimple_stmt (dump_file, stmt, 0);
+                          fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
+                                   cost, level->num);
+                        }
+                      gsi_next (&bsi);
+                      continue;
+                    }
+                }
+              if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
+                {
+                  stmt2 = SSA_NAME_DEF_STMT (rhs2);
+                  def_bb = gimple_bb (stmt2);
+                  if (stmt2 && def_bb
+                      && (def_bb == bb
+                          || !dominated_by_p (CDI_DOMINATORS, e->src, def_bb)))
+                    {
+                      if (dump_file && (dump_flags & TDF_DETAILS))
+                        {
+                          fprintf (dump_file,
+                                   "stmt2: Statement NOT moved to %d [local "
+                                   "count:%d] from %d [local count:%d]:\n",
+                                   e->src->index, e->src->count.value (),
+                                   bb->index, bb->count.value ());
+                          print_gimple_stmt (dump_file, stmt, 0);
+                          fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
+                                   cost, level->num);
+                        }
+                      gsi_next (&bsi);
+                      continue;
+                    }
+                }
+            }
+        }

       if (dump_file && (dump_flags & TDF_DETAILS))
         {
@@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
                    cost, level->num);
         }

-      e = loop_preheader_edge (level);
       gcc_assert (!gimple_vdef (stmt));
       if (gimple_vuse (stmt))
         {
@@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
   bool multi_threaded_model_p = false;
   gimple_stmt_iterator gsi;
   sm_aux *aux = new sm_aux;
+  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
+
+  edge e = loop_preheader_edge (loop);
+  if (e->src->count > bb->count)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        {
+          fprintf (dump_file, "Don't execute store motion of ");
+          print_generic_expr (dump_file, ref->mem.ref);
+          fprintf (dump_file, " from loop %d\n", loop->num);
+        }
+      return;
+    }

   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
         }
       else
         {
-          sm_aux *aux = *aux_map.get (ref);
+          sm_aux **paux = aux_map.get (ref);
+          sm_aux *aux;
+          if (paux)
+            aux = *paux;
+          else
+            continue;
           if (!aux->store_flag || kind == sm_ord)
             {
               gassign *store;
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index 3a09bbc39e5..4cae82936b9 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -577,14 +577,17 @@ split_loop (class loop *loop1)
         if (!initial_true)
           cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);

+        edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
+                         ? EDGE_SUCC (bbs[i], 0)
+                         : EDGE_SUCC (bbs[i], 1);
         /* Now version the loop, placing loop2 after loop1 connecting
            them, and fix up SSA form for that.  */
         initialize_original_copy_tables ();
         basic_block cond_bb;

         class loop *loop2 = loop_version (loop1, cond, &cond_bb,
-                                          profile_probability::always (),
-                                          profile_probability::always (),
+                                          true_edge->probability,
+                                          true_edge->probability.invert (),
                                           profile_probability::always (),
                                           profile_probability::always (),
                                           true);
@@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
   initialize_original_copy_tables ();

   struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
-                                     profile_probability::always (),
-                                     profile_probability::never (),
+                                     invar_branch->probability,
+                                     invar_branch->probability.invert (),
                                      profile_probability::always (),
                                      profile_probability::always (),
                                      true);
@@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
   to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
   to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;

+  to_loop1->probability = invar_branch->probability.invert ();
+  to_loop2->probability = invar_branch->probability;
+
   /* Due to introduction of a control flow edge from loop1 latch to
      loop2 pre-header, we should update PHIs in loop2 to reflect this
      connection between loop1 and loop2.  */
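
For readers following the hunks above, the two ideas in the patch can be modelled outside of GCC. The sketch below is a toy under stated assumptions, not GCC code: `BlockCount`, `should_hoist`, `VersionedCounts` and `split_counts` are hypothetical names, plain integers stand in for `profile_count`, and a percentage stands in for `profile_probability`. It only mirrors the shape of the `e->src->count > bb->count` test and of passing a probability together with its inverse when a loop is versioned.

```cpp
#include <cstdint>

// Hypothetical stand-in for a basic block's execution count.
// This is NOT GCC's profile_count class.
struct BlockCount {
  std::uint64_t value;
};

// Same shape as the `e->src->count > bb->count` guard in the patch:
// refuse to hoist when the preheader is hotter than the block that
// currently holds the statement.
bool should_hoist(BlockCount preheader, BlockCount bb) {
  return !(preheader.value > bb.value);
}

// Hypothetical result of versioning a loop on a condition.
struct VersionedCounts {
  std::uint64_t then_count;  // entries taken with probability p
  std::uint64_t else_count;  // entries taken with the inverted probability
};

// Split an entry count by a probability given in percent (an assumption
// made to keep the sketch in integer arithmetic). The two parts add up
// to the entry count again, so neither side ends up with a zero or a
// doubled count.
VersionedCounts split_counts(std::uint64_t entry_count,
                             unsigned true_percent) {
  std::uint64_t then_count = entry_count * true_percent / 100;
  return {then_count, entry_count - then_count};
}
```

In this model, a statement in a block executed 10 times is left alone when the preheader executes 1000 times, and splitting 1000 entries at 30% gives 300 and 700 rather than two full-count or zero-count preheaders.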