Patchwork Modify TER to not propagate expressions across call

login
register
mail settings
Submitter Pat Haugen
Date Sept. 10, 2010, 8:10 p.m.
Message ID <4C8A90C5.4020306@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/64454/
State New
Headers show

Comments

Pat Haugen - Sept. 10, 2010, 8:10 p.m.
The following patch tries to limit TER such that it will not propagate 
expressions across calls, which can extend lifetimes across calls and 
result in suboptimal code due to additional save/restore of registers. I 
tried allowing just BUILT_IN_MD builtins to be bypassed, but that still 
didn't catch things like sqrt/powf which were seen in 435.gromacs 
benchmark, so resorted to all builtins (no worse than what was being 
done before).

Bootstrap/regtested on powerpc64-linux with no new regressions.  OK for 
mainline?


2010-09-10  Pat Haugen <pthaugen@us.ibm.com>

     * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field.
     (new_temp_expr_table): Allocate call_cnt vector.
     (free_temp_expr_table): Free it.
     (process_replaceable): Add call_cnt parm and set in vector.
     (find_replaceable_in_bb): Skip replacement if def/use span a call.
     (debug_ter): Dump call_cnt value, remove stderr uses.

-Pat
Richard Guenther - Sept. 11, 2010, 10:06 a.m.
On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen
<pthaugen@linux.vnet.ibm.com> wrote:
>  The following patch tries to limit TER such that it will not propagate
> expressions across calls, which can extend lifetimes across calls and result
> in suboptimal code due to additional save/restore of registers. I tried
> allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't
> catch things like sqrt/powf which were seen in 435.gromacs benchmark, so
> resorted to all builtins (no worse than what was being done before).
>
> Bootstrap/regtested on powerpc64-linux with no new regressions.  OK for
> mainline?

Hm - I think we are not restricting TER to work inside basic-blocks.  As
FOR_EACH_BB walks basic blocks in index order your call counter will
not properly account for calls in def-use paths that cross basic block
boundaries.  So I wonder if it makes sense to walk blocks in dominator
order instead (still you can end up being confused when working in
loops).

Richard.

>
> 2010-09-10  Pat Haugen <pthaugen@us.ibm.com>
>
>    * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field.
>    (new_temp_expr_table): Allocate call_cnt vector.
>    (free_temp_expr_table): Free it.
>    (process_replaceable): Add call_cnt parm and set in vector.
>    (find_replaceable_in_bb): Skip replacement if def/use span a call.
>    (debug_ter): Dump call_cnt value, remove stderr uses.
>
> -Pat
>
>
>
Steven Bosscher - Sept. 11, 2010, 10:10 a.m.
On Sat, Sep 11, 2010 at 12:06 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen
> <pthaugen@linux.vnet.ibm.com> wrote:
>>  The following patch tries to limit TER such that it will not propagate
>> expressions across calls, which can extend lifetimes across calls and result
>> in suboptimal code due to additional save/restore of registers. I tried
>> allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't
>> catch things like sqrt/powf which were seen in 435.gromacs benchmark, so
>> resorted to all builtins (no worse than what was being done before).
>>
>> Bootstrap/regtested on powerpc64-linux with no new regressions.  OK for
>> mainline?
>
> Hm - I think we are not restricting TER to work inside basic-blocks.

If that's true, then this comment in tree-ssa-ter.c needs updating:

"
   A pass is made through the function, one block at a time.  No cross block
   information is tracked.
"

Ciao!
Steven
Michael Matz - Sept. 13, 2010, 10:53 a.m.
Hi,

On Sat, 11 Sep 2010, Richard Guenther wrote:

> On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen
> <pthaugen@linux.vnet.ibm.com> wrote:
> >  The following patch tries to limit TER such that it will not propagate
> > expressions across calls, which can extend lifetimes across calls and result
> > in suboptimal code due to additional save/restore of registers. I tried
> > allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't
> > catch things like sqrt/powf which were seen in 435.gromacs benchmark, so
> > resorted to all builtins (no worse than what was being done before).
> >
> > Bootstrap/regtested on powerpc64-linux with no new regressions.  OK for
> > mainline?
> 
> Hm - I think we are not restricting TER to work inside basic-blocks.

We do.


Ciao,
Michael.
Richard Guenther - Sept. 13, 2010, 10:57 a.m.
On Mon, Sep 13, 2010 at 12:53 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Sat, 11 Sep 2010, Richard Guenther wrote:
>
>> On Fri, Sep 10, 2010 at 10:10 PM, Pat Haugen
>> <pthaugen@linux.vnet.ibm.com> wrote:
>> >  The following patch tries to limit TER such that it will not propagate
>> > expressions across calls, which can extend lifetimes across calls and result
>> > in suboptimal code due to additional save/restore of registers. I tried
>> > allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't
>> > catch things like sqrt/powf which were seen in 435.gromacs benchmark, so
>> > resorted to all builtins (no worse than what was being done before).
>> >
>> > Bootstrap/regtested on powerpc64-linux with no new regressions.  OK for
>> > mainline?
>>
>> Hm - I think we are not restricting TER to work inside basic-blocks.
>
> We do.

Ok, with two people confirming this the original patch looks ok to me.

Thanks,
Richard.
H.J. Lu - Sept. 13, 2010, 7:40 p.m.
On Fri, Sep 10, 2010 at 1:10 PM, Pat Haugen <pthaugen@linux.vnet.ibm.com> wrote:
>  The following patch tries to limit TER such that it will not propagate
> expressions across calls, which can extend lifetimes across calls and result
> in suboptimal code due to additional save/restore of registers. I tried
> allowing just BUILT_IN_MD builtins to be bypassed, but that still didn't
> catch things like sqrt/powf which were seen in 435.gromacs benchmark, so
> resorted to all builtins (no worse than what was being done before).
>
> Bootstrap/regtested on powerpc64-linux with no new regressions.  OK for
> mainline?
>
>
> 2010-09-10  Pat Haugen <pthaugen@us.ibm.com>
>
>    * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field.
>    (new_temp_expr_table): Allocate call_cnt vector.
>    (free_temp_expr_table): Free it.
>    (process_replaceable): Add call_cnt parm and set in vector.
>    (find_replaceable_in_bb): Skip replacement if def/use span a call.
>    (debug_ter): Dump call_cnt value, remove stderr uses.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45663

on Fedora 13.
Eric Botcazou - Oct. 18, 2010, 7:18 a.m.
Hi,

> 2010-09-10  Pat Haugen <pthaugen@us.ibm.com>
>
>      * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field.
>      (new_temp_expr_table): Allocate call_cnt vector.
>      (free_temp_expr_table): Free it.
>      (process_replaceable): Add call_cnt parm and set in vector.
>      (find_replaceable_in_bb): Skip replacement if def/use span a call.
>      (debug_ter): Dump call_cnt value, remove stderr uses.

This has introduced some pessimizations on the SPARC, visible as regressions 
in the testsuite:

FAIL: gcc.target/sparc/fandnot.c scan-assembler-times fandnot1\\t% 6
FAIL: gcc.target/sparc/fandnots.c scan-assembler-times fandnot1s\\t% 4
FAIL: gcc.target/sparc/fornot.c scan-assembler-times fornot1\\t% 6
FAIL: gcc.target/sparc/fornots.c scan-assembler-times fornot1s\\t% 4

For

typedef int   vec32 __attribute__((vector_size(8)));

extern vec32 foo1_32(void);
extern vec32 foo2_32(void);

vec32 fun32(void)
{
  return ~foo1_32 () & foo2_32 ();
}

we used to generate a fandnot instruction at -O, combining fand and fnot.


The .optimized dump contains:

fun32 ()
{
  vec32 D.2044;
  vec32 D.2043;
  vec32 D.2042;
  vec32 D.2041;

<bb 2>:
  D.2042_1 = foo1_32 ();
  D.2043_2 = ~D.2042_1;
  D.2044_3 = foo2_32 ();
  D.2041_4 = D.2043_2 & D.2044_3;
  return D.2041_4;

}

The NOT operation used to be propagated into the AND operation; it isn't any 
more after the patch.  Now this propagation was a complete win: an SSA name 
was eliminated and lifetimes didn't change globally.

Could the patch be tweaked so as to fix the pessimization this case?
Eric Botcazou - Oct. 18, 2010, 9:27 a.m.
> Why does combine not perform this optimization on RTL?

Because of the call. :-)
Richard Guenther - Oct. 18, 2010, 9:33 a.m.
On Mon, Oct 18, 2010 at 9:18 AM, Eric Botcazou <ebotcazou@adacore.com> wrote:
> Hi,
>
>> 2010-09-10  Pat Haugen <pthaugen@us.ibm.com>
>>
>>      * tree-ssa-ter.c (temp_expr_table_d): Add call_cnt field.
>>      (new_temp_expr_table): Allocate call_cnt vector.
>>      (free_temp_expr_table): Free it.
>>      (process_replaceable): Add call_cnt parm and set in vector.
>>      (find_replaceable_in_bb): Skip replacement if def/use span a call.
>>      (debug_ter): Dump call_cnt value, remove stderr uses.
>
> This has introduced some pessimizations on the SPARC, visible as regressions
> in the testsuite:
>
> FAIL: gcc.target/sparc/fandnot.c scan-assembler-times fandnot1\\t% 6
> FAIL: gcc.target/sparc/fandnots.c scan-assembler-times fandnot1s\\t% 4
> FAIL: gcc.target/sparc/fornot.c scan-assembler-times fornot1\\t% 6
> FAIL: gcc.target/sparc/fornots.c scan-assembler-times fornot1s\\t% 4
>
> For
>
> typedef int   vec32 __attribute__((vector_size(8)));
>
> extern vec32 foo1_32(void);
> extern vec32 foo2_32(void);
>
> vec32 fun32(void)
> {
>  return ~foo1_32 () & foo2_32 ();
> }
>
> we used to generate a fandnot instruction at -O, combining fand and fnot.
>
>
> The .optimized dump contains:
>
> fun32 ()
> {
>  vec32 D.2044;
>  vec32 D.2043;
>  vec32 D.2042;
>  vec32 D.2041;
>
> <bb 2>:
>  D.2042_1 = foo1_32 ();
>  D.2043_2 = ~D.2042_1;
>  D.2044_3 = foo2_32 ();
>  D.2041_4 = D.2043_2 & D.2044_3;
>  return D.2041_4;
>
> }
>
> The NOT operation used to be propagated into the AND operation; it isn't any
> more after the patch.  Now this propagation was a complete win: an SSA name
> was eliminated and lifetimes didn't change globally.
>
> Could the patch be tweaked so as to fix the pessimization this case?

We could allow the TER if the operation just has a single operand as
that doesn't pessimize register allocation across calls.

Why does combine not perform this optimization on RTL?

Richard.

> --
> Eric Botcazou
>
Richard Guenther - Oct. 18, 2010, 9:38 a.m.
On Mon, Oct 18, 2010 at 11:27 AM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> Why does combine not perform this optimization on RTL?
>
> Because of the call. :-)

Huh?  Isn't it two pseudos?  Why would it care about the call?  Ah, probably
because of the same issue ...

Richard.

> --
> Eric Botcazou
>
Eric Botcazou - Oct. 18, 2010, 9:43 a.m.
> Huh?  Isn't it two pseudos?  Why would it care about the call?  Ah,
> probably because of the same issue ...

Yes, RTL is generally very conservative about calls.  In any case, I don't 
think the pessimization is very significant, but it does exist so if we can 
easily avoid it...

Patch

Index: gcc/tree-ssa-ter.c
===================================================================
--- gcc/tree-ssa-ter.c	(revision 163779)
+++ gcc/tree-ssa-ter.c	(working copy)
@@ -167,6 +167,7 @@  typedef struct temp_expr_table_d
   bitmap partition_in_use;		/* Partitions with kill entries.  */
   bitmap new_replaceable_dependencies;	/* Holding place for pending dep's.  */
   int *num_in_part;			/* # of ssa_names in a partition.  */
+  int *call_cnt;			/* Call count at definition.  */
 } *temp_expr_table_p;
 
 /* Used to indicate a dependency on VDEFs.  */
@@ -209,6 +210,7 @@  new_temp_expr_table (var_map map)
       if (p != NO_PARTITION)
         t->num_in_part[p]++;
     }
+  t->call_cnt = XCNEWVEC (int, num_ssa_names + 1);
 
   return t;
 }
@@ -240,6 +242,7 @@  free_temp_expr_table (temp_expr_table_p
   free (t->kill_list);
   free (t->partition_dependencies);
   free (t->num_in_part);
+  free (t->call_cnt);
 
   if (t->replaceable_expressions)
     ret = t->replaceable_expressions;
@@ -469,7 +472,7 @@  finished_with_expr (temp_expr_table_p ta
 /* Create an expression entry for a replaceable expression.  */
 
 static void
-process_replaceable (temp_expr_table_p tab, gimple stmt)
+process_replaceable (temp_expr_table_p tab, gimple stmt, int call_cnt)
 {
   tree var, def, basevar;
   int version;
@@ -510,6 +513,8 @@  process_replaceable (temp_expr_table_p t
       make_dependent_on_partition (tab, version, VIRTUAL_PARTITION (tab));
       add_to_partition_kill_list (tab, VIRTUAL_PARTITION (tab), version);
     }
+
+  tab->call_cnt[version] = call_cnt;
 }
 
 
@@ -576,11 +581,12 @@  find_replaceable_in_bb (temp_expr_table_
 {
   gimple_stmt_iterator bsi;
   gimple stmt;
-  tree def, use;
+  tree def, use, fndecl;
   int partition;
   var_map map = tab->map;
   ssa_op_iter iter;
   bool stmt_replaceable;
+  int cur_call_cnt = 0;
 
   for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
     {
@@ -634,10 +640,12 @@  find_replaceable_in_bb (temp_expr_table_
 		    same_root_var = true;
 		}
 
-	      /* Mark expression as replaceable unless stmt is volatile or the
+	      /* Mark expression as replaceable unless stmt is volatile, or the
 		 def variable has the same root variable as something in the
-		 substitution list.  */
-	      if (gimple_has_volatile_ops (stmt) || same_root_var)
+		 substitution list, or the def and use span a call such that
+		 we'll expand lifetimes across a call.  */
+	      if (gimple_has_volatile_ops (stmt) || same_root_var ||
+		  tab->call_cnt[ver] != cur_call_cnt)
 		finished_with_expr (tab, ver, true);
 	      else
 		mark_replaceable (tab, use, stmt_replaceable);
@@ -652,9 +660,17 @@  find_replaceable_in_bb (temp_expr_table_
 	    kill_expr (tab, partition);
 	}
 
+      /* Increment counter if this is a non BUILT_IN call. We allow
+	 replacement over BUILT_IN calls since many will expand to inline
+	 insns instead of a true call.  */
+      if (is_gimple_call (stmt)
+	  && !((fndecl = gimple_call_fndecl (stmt))
+	       && DECL_BUILT_IN (fndecl)))
+	cur_call_cnt++;
+
       /* Now see if we are creating a new expression or not.  */
       if (stmt_replaceable)
-	process_replaceable (tab, stmt);
+	process_replaceable (tab, stmt, cur_call_cnt);
 
       /* Free any unused dependency lists.  */
       bitmap_clear (tab->new_replaceable_dependencies);
@@ -737,7 +753,7 @@  debug_ter (FILE *f, temp_expr_table_p t)
   for (x = 1; x < num_ssa_names; x++)
     if (t->expr_decl_uids[x])
       {
-        print_generic_expr (stderr, ssa_name (x), TDF_SLIM);
+        print_generic_expr (f, ssa_name (x), TDF_SLIM);
         fprintf (f, " dep-parts : ");
 	if (t->partition_dependencies[x]
 	    && !bitmap_empty_p (t->partition_dependencies[x]))
@@ -745,10 +761,11 @@  debug_ter (FILE *f, temp_expr_table_p t)
 	    EXECUTE_IF_SET_IN_BITMAP (t->partition_dependencies[x], 0, y, bi)
 	      fprintf (f, "P%d ",y);
 	  }
-	fprintf (stderr, "   basedecls: ");
+	fprintf (f, "   basedecls: ");
 	EXECUTE_IF_SET_IN_BITMAP (t->expr_decl_uids[x], 0, y, bi)
 	  fprintf (f, "%d ",y);
-	fprintf (stderr, "\n");
+	fprintf (f, "   call_cnt : %d",t->call_cnt[x]);
+	fprintf (f, "\n");
       }
 
   bitmap_print (f, t->partition_in_use, "Partitions in use ",