Message ID | 4C8A90C5.4020306@linux.vnet.ibm.com |
---|---|
State | New |
Headers | show |
On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen <pthaugen@linux.vnet.ibm.com> wrote: > The following patch tries to limit TER such that it will not propagate > expressions across calls, which can extend lifetimes across calls and result > in suboptimal code due to additional save/restore of registers. I tried > allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't > catch things like sqrt/powf which were seen in 435.gromacs benchmark, so > resorted to all builtins (no worse than what was being done before). > > Bootstrap/regtested on powerpc64-linux with no new regressions. OK for > mainline? Hm - I think we are not restricting TER to work inside basic-blocks. As FOR_EACH_BB walks basic blocks in index order your call counter will not properly account for calls in def-use paths that cross basic block boundaries. So I wonder if it makes sense to walk blocks in dominator order instead (still you can end up being confused when working in loops). Richard. > > 2010-09-10 Pat Haugen <pthaugen@us.ibm.com> > > * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field. > (new_temp_expr_table): Allocate call_cnt vector. > (free_temp_expr_table): Free it. > (process_replaceable): Add call_cnt parm and set in vector. > (find_replaceable_in_bb): Skip replacement if def/use span a call. > (debug_ter): Dump call_cnt value, remove stderr uses. > > -Pat > > >
On Sat, Sep 11, 2010 at 12:06 PM, Richard Guenther <richard.guenther@gmail.com> wrote: > On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen > <pthaugen@linux.vnet.ibm.com> wrote: >> The following patch tries to limit TER such that it will not propagate >> expressions across calls, which can extend lifetimes across calls and result >> in suboptimal code due to additional save/restore of registers. I tried >> allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't >> catch things like sqrt/powf which were seen in 435.gromacs benchmark, so >> resorted to all builtins (no worse than what was being done before). >> >> Bootstrap/regtested on powerpc64-linux with no new regressions. OK for >> mainline? > > Hm - I think we are not restricting TER to work inside basic-blocks. If that's true, then this comment in tree-ssa-ter.c needs updating: " A pass is made through the function, one block at a time. No cross block information is tracked. " Ciao! Steven
Hi, On Sat, 11 Sep 2010, Richard Guenther wrote: > On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen > <pthaugen@linux.vnet.ibm.com> wrote: > > The following patch tries to limit TER such that it will not propagate > > expressions across calls, which can extend lifetimes across calls and result > > in suboptimal code due to additional save/restore of registers. I tried > > allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't > > catch things like sqrt/powf which were seen in 435.gromacs benchmark, so > > resorted to all builtins (no worse than what was being done before). > > > > Bootstrap/regtested on powerpc64-linux with no new regressions. OK for > > mainline? > > Hm - I think we are not restricting TER to work inside basic-blocks. We do. Ciao, Michael.
On Mon, Sep 13, 2010 at 12:53 PM, Michael Matz <matz@suse.de> wrote: > Hi, > > On Sat, 11 Sep 2010, Richard Guenther wrote: > >> On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen >> <pthaugen@linux.vnet.ibm.com> wrote: >> > The following patch tries to limit TER such that it will not propagate >> > expressions across calls, which can extend lifetimes across calls and result >> > in suboptimal code due to additional save/restore of registers. I tried >> > allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't >> > catch things like sqrt/powf which were seen in 435.gromacs benchmark, so >> > resorted to all builtins (no worse than what was being done before). >> > >> > Bootstrap/regtested on powerpc64-linux with no new regressions. OK for >> > mainline? >> >> Hm - I think we are not restricting TER to work inside basic-blocks. > > We do. Ok, with two people confirming this the original patch looks ok to me. Thanks, Richard.
On Fri, Sep 10, 2010 at 1:10 PM, Pat Haugen <pthaugen@linux.vnet.ibm.com> wrote: > The following patch tries to limit TER such that it will not propagate > expressions across calls, which can extend lifetimes across calls and result > in suboptimal code due to additional save/restore of registers. I tried > allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't > catch things like sqrt/powf which were seen in 435.gromacs benchmark, so > resorted to all builtins (no worse than what was being done before). > > Bootstrap/regtested on powerpc64-linux with no new regressions. OK for > mainline? > > > 2010-09-10 Pat Haugen <pthaugen@us.ibm.com> > > * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field. > (new_temp_expr_table): Allocate call_cnt vector. > (free_temp_expr_table): Free it. > (process_replaceable): Add call_cnt parm and set in vector. > (find_replaceable_in_bb): Skip replacement if def/use span a call. > (debug_ter): Dump call_cnt value, remove stderr uses. > This caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45663 on Fedora 13.
Hi, > 2010-09-10 Pat Haugen <pthaugen@us.ibm.com> > > * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field. > (new_temp_expr_table): Allocate call_cnt vector. > (free_temp_expr_table): Free it. > (process_replaceable): Add call_cnt parm and set in vector. > (find_replaceable_in_bb): Skip replacement if def/use span a call. > (debug_ter): Dump call_cnt value, remove stderr uses. This has introduced some pessimizations on the SPARC, visible as regressions in the testsuite: FAIL: gcc.target/sparc/fandnot.c scan-assembler-times fandnot1\\t% 6 FAIL: gcc.target/sparc/fandnots.c scan-assembler-times fandnot1s\\t% 4 FAIL: gcc.target/sparc/fornot.c scan-assembler-times fornot1\\t% 6 FAIL: gcc.target/sparc/fornots.c scan-assembler-times fornot1s\\t% 4 For typedef int vec32 __attribute__((vector_size(8))); extern vec32 foo1_32(void); extern vec32 foo2_32(void); vec32 fun32(void) { return ~foo1_32 () & foo2_32 (); } we used to generate a fandnot instruction at -O, combining fand and fnot. The .optimized dump contains: fun32 () { vec32 D.2044; vec32 D.2043; vec32 D.2042; vec32 D.2041; <bb 2>: D.2042_1 = foo1_32 (); D.2043_2 = ~D.2042_1; D.2044_3 = foo2_32 (); D.2041_4 = D.2043_2 & D.2044_3; return D.2041_4; } The NOT operation used to be propagated into the AND operation; it isn't any more after the patch. Now this propagation was a complete win: an SSA name was eliminated and lifetimes didn't change globally. Could the patch be tweaked so as to fix the pessimization this case?
> Why does combine not perform this optimization on RTL?
Because of the call. :-)
On Mon, Oct 18, 2010 at 9:18 AM, Eric Botcazou <ebotcazou@adacore.com> wrote: > Hi, > >> 2010-09-10 Pat Haugen <pthaugen@us.ibm.com> >> >> * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field. >> (new_temp_expr_table): Allocate call_cnt vector. >> (free_temp_expr_table): Free it. >> (process_replaceable): Add call_cnt parm and set in vector. >> (find_replaceable_in_bb): Skip replacement if def/use span a call. >> (debug_ter): Dump call_cnt value, remove stderr uses. > > This has introduced some pessimizations on the SPARC, visible as regressions > in the testsuite: > > FAIL: gcc.target/sparc/fandnot.c scan-assembler-times fandnot1\\t% 6 > FAIL: gcc.target/sparc/fandnots.c scan-assembler-times fandnot1s\\t% 4 > FAIL: gcc.target/sparc/fornot.c scan-assembler-times fornot1\\t% 6 > FAIL: gcc.target/sparc/fornots.c scan-assembler-times fornot1s\\t% 4 > > For > > typedef int vec32 __attribute__((vector_size(8))); > > extern vec32 foo1_32(void); > extern vec32 foo2_32(void); > > vec32 fun32(void) > { > return ~foo1_32 () & foo2_32 (); > } > > we used to generate a fandnot instruction at -O, combining fand and fnot. > > > The .optimized dump contains: > > fun32 () > { > vec32 D.2044; > vec32 D.2043; > vec32 D.2042; > vec32 D.2041; > > <bb 2>: > D.2042_1 = foo1_32 (); > D.2043_2 = ~D.2042_1; > D.2044_3 = foo2_32 (); > D.2041_4 = D.2043_2 & D.2044_3; > return D.2041_4; > > } > > The NOT operation used to be propagated into the AND operation; it isn't any > more after the patch. Now this propagation was a complete win: an SSA name > was eliminated and lifetimes didn't change globally. > > Could the patch be tweaked so as to fix the pessimization this case? We could allow the TER if the operation just has a single operand as that doesn't pessimize register allocation across calls. Why does combine not perform this optimization on RTL? Richard. > -- > Eric Botcazou >
On Mon, Oct 18, 2010 at 11:27 AM, Eric Botcazou <ebotcazou@adacore.com> wrote: >> Why does combine not perform this optimization on RTL? > > Because of the call. :-) Huh? Isn't it two pseudos? Why would it care about the call? Ah, probably because of the same issue ... Richard. > -- > Eric Botcazou >
> Huh? Isn't it two pseudos? Why would it care about the call? Ah, > probably because of the same issue ... Yes, RTL is generally very conservative about calls. In any case, I don't think the pessimization is very significant, but it does exist so if we can easily avoid it...
Index: gcc/tree-ssa-ter.c =================================================================== --- gcc/tree-ssa-ter.c (revision 163779) +++ gcc/tree-ssa-ter.c (working copy) @@ -167,6 +167,7 @@ typedef struct temp_expr_table_d bitmap partition_in_use; /* Partitions with kill entries. */ bitmap new_replaceable_dependencies; /* Holding place for pending dep's. */ int *num_in_part; /* # of ssa_names in a partition. */ + int *call_cnt; /* Call count at definition. */ } *temp_expr_table_p; /* Used to indicate a dependency on VDEFs. */ @@ -209,6 +210,7 @@ new_temp_expr_table (var_map map) if (p != NO_PARTITION) t->num_in_part[p]++; } + t->call_cnt = XCNEWVEC (int, num_ssa_names + 1); return t; } @@ -240,6 +242,7 @@ free_temp_expr_table (temp_expr_table_p free (t->kill_list); free (t->partition_dependencies); free (t->num_in_part); + free (t->call_cnt); if (t->replaceable_expressions) ret = t->replaceable_expressions; @@ -469,7 +472,7 @@ finished_with_expr (temp_expr_table_p ta /* Create an expression entry for a replaceable expression. */ static void -process_replaceable (temp_expr_table_p tab, gimple stmt) +process_replaceable (temp_expr_table_p tab, gimple stmt, int call_cnt) { tree var, def, basevar; int version; @@ -510,6 +513,8 @@ process_replaceable (temp_expr_table_p t make_dependent_on_partition (tab, version, VIRTUAL_PARTITION (tab)); add_to_partition_kill_list (tab, VIRTUAL_PARTITION (tab), version); } + + tab->call_cnt[version] = call_cnt; } @@ -576,11 +581,12 @@ find_replaceable_in_bb (temp_expr_table_ { gimple_stmt_iterator bsi; gimple stmt; - tree def, use; + tree def, use, fndecl; int partition; var_map map = tab->map; ssa_op_iter iter; bool stmt_replaceable; + int cur_call_cnt = 0; for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi)) { @@ -634,10 +640,12 @@ find_replaceable_in_bb (temp_expr_table_ same_root_var = true; } - /* Mark expression as replaceable unless stmt is volatile or the + /* Mark expression as replaceable unless stmt is volatile, or the def variable has the same root variable as something in the - substitution list. */ - if (gimple_has_volatile_ops (stmt) || same_root_var) + substitution list, or the def and use span a call such that + we'll expand lifetimes across a call. */ + if (gimple_has_volatile_ops (stmt) || same_root_var || + tab->call_cnt[ver] != cur_call_cnt) finished_with_expr (tab, ver, true); else mark_replaceable (tab, use, stmt_replaceable); @@ -652,9 +660,17 @@ find_replaceable_in_bb (temp_expr_table_ kill_expr (tab, partition); } + /* Increment counter if this is a non BUILT_IN call. We allow + replacement over BUILT_IN calls since many will expand to inline + insns instead of a true call. */ + if (is_gimple_call (stmt) + && !((fndecl = gimple_call_fndecl (stmt)) + && DECL_BUILT_IN (fndecl))) + cur_call_cnt++; + /* Now see if we are creating a new expression or not. */ if (stmt_replaceable) - process_replaceable (tab, stmt); + process_replaceable (tab, stmt, cur_call_cnt); /* Free any unused dependency lists. */ bitmap_clear (tab->new_replaceable_dependencies); @@ -737,7 +753,7 @@ debug_ter (FILE *f, temp_expr_table_p t) for (x = 1; x < num_ssa_names; x++) if (t->expr_decl_uids[x]) { - print_generic_expr (stderr, ssa_name (x), TDF_SLIM); + print_generic_expr (f, ssa_name (x), TDF_SLIM); fprintf (f, " dep-parts : "); if (t->partition_dependencies[x] && !bitmap_empty_p (t->partition_dependencies[x])) @@ -745,10 +761,11 @@ debug_ter (FILE *f, temp_expr_table_p t) EXECUTE_IF_SET_IN_BITMAP (t->partition_dependencies[x], 0, y, bi) fprintf (f, "P%d ",y); } - fprintf (stderr, " basedecls: "); + fprintf (f, " basedecls: "); EXECUTE_IF_SET_IN_BITMAP (t->expr_decl_uids[x], 0, y, bi) fprintf (f, "%d ",y); - fprintf (stderr, "\n"); + fprintf (f, " call_cnt : %d",t->call_cnt[x]); + fprintf (f, "\n"); } bitmap_print (f, t->partition_in_use, "Partitions in use ",