diff mbox

patch implementing a new pass for register-pressure relief through live range shrinkage

Message ID 5279103F.20906@redhat.com
State New
Headers show

Commit Message

Vladimir Makarov Nov. 5, 2013, 3:35 p.m. UTC
I'd like to add a new experimental optimization to the trunk.  This
optimization was discussed at the RA BOF at this summer's GNU Cauldron.

  It relieves register pressure through live-range shrinkage.  It is
implemented on top of the scheduler and uses the register-pressure
insn scheduling infrastructure.  By rearranging insns we shorten
pseudo live ranges and increase the chance that they will be assigned
to hard registers.

  The code looks pretty simple, but there is a lot of work behind
this patch.  I've tried about ten different versions of this code
(different heuristics for the two currently existing register-pressure
algorithms).

  I think it is *up to target maintainers* to decide whether or not to
use this optimization for their targets.  I'd recommend using it at
least for x86/x86-64.  I think any OOO processor with a small or
moderate register file that does not use the 1st insn scheduling
might benefit from this too.

  On SPEC2000 for x86/x86-64 (I use a Haswell processor, -O3 with
generic tuning), using the optimization results in smaller code size
on average (for floating-point and integer benchmarks in 32- and
64-bit mode).  The improvement is more visible for SPECFP2000 (I see
the same improvement on x86-64 SPECInt2000, but there it might be
attributed mostly to mcf benchmark instability).  It is about 0.5% for
32-bit and 64-bit mode.  That is understandable, as the optimization
has more opportunities to improve the code on longer BBs.  Unlike
other heuristic optimizations, I don't see any significantly worse
performance.  It gives practically the same or better performance (a
few benchmarks improved by 1% or more, up to 3%).

  The single but significant drawback is additional compilation time
(4%-6%), as the 1st insn scheduling pass is quite expensive.  So I'd
recommend that target maintainers switch it on only for -Ofast.  If
somebody finds that the optimization works on processors that use the
1st insn scheduling by default (which I slightly doubt), we could
improve the compilation time by reusing data between this optimization
and the 1st insn scheduling.

  Any comments, questions, thoughts are appreciated.

2013-11-05  Vladimir Makarov  <vmakarov@redhat.com>

        * tree-pass.h (make_pass_live_range_shrinkage): New external.
        * timevar.def (TV_LIVE_RANGE_SHRINKAGE): New.
        * sched-rgn.c (gate_handle_live_range_shrinkage): New.
        (rest_of_handle_live_range_shrinkage): Ditto.
        (class pass_live_range_shrinkage): Ditto.
        (pass_data_live_range_shrinkage): Ditto.
        (make_pass_live_range_shrinkage): Ditto.
        * sched-int.h (sched_relief_p): New external.
        * sched-deps.c (create_insn_reg_set): Make void return value.
        * passes.def: Add pass_live_range_shrinkage.
        * ira.c (update_equiv_regs): Don't move if
        flag_live_range_shrinkage.
        * haifa-sched.c (sched_relief_p): New.
        (rank_for_schedule): Add code for pressure relief through live
        range shrinkage.
        (schedule_insn): Print more debug info.
        (sched_init): Setup SCHED_PRESSURE_WEIGHTED for pressure relief
        through live range shrinkage.
        * doc/invoke.texi (-flive-range-shrinkage): New.
        * common.opt (flive-range-shrinkage): New.

Comments

Richard Biener Nov. 6, 2013, 9:17 a.m. UTC | #1
On Tue, Nov 5, 2013 at 4:35 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> [...]
>   The single but significant drawback is additional compilation time
> (4%-6%) as the 1st insn scheduling pass is quite expensive.  So I'd
> recommend target maintainers to switch it on only for -Ofast.

Generally I'd not recommend viewing -Ofast as -O4 but as -O3
plus generally "unsafe" optimizations.  So I'd not enable it for -Ofast
but for -O3 - possibly also with -Os if indeed the main motivation is
also code-size improvement (-Os is a similar beast to -O3: spend
as much time as you can on optimizing size).

Btw, thanks for working on this.  How does it relate to
-fsched-pressure?  Does it treat all register classes the same?
On x86 mostly the few fixed registers for some of the integer pipeline
instructions hurt, x86_64 has enough general and FP registers?

Richard.

Vladimir Makarov Nov. 6, 2013, 4:54 p.m. UTC | #2
On 11/6/2013, 4:17 AM, Richard Biener wrote:
> On Tue, Nov 5, 2013 at 4:35 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>> [...]
>>    The single but significant drawback is additional compilation time
>> (4%-6%) as the 1st insn scheduling pass is quite expensive.  So I'd
>> recommend target maintainers to switch it on only for -Ofast.
> Generally I'd not recommend viewing -Ofast as -O4 but as -O3
> plus generally "unsafe" optimizations.  So I'd not enable it for -Ofast
> but for -O3 - possibly also with -Os if indeed the main motivation is
> also code-size improvement (-Os is a similar beast to -O3: spend
> as much time as you can on optimizing size).
Ok.  Probably my recommendation is wrong.  It is actually up to target 
maintainers to decide when to use the optimization, or whether to use 
it at all by default (maybe they will just decide to use it only for 
SPEC reporting).

I guess that in some time we will need something like -O4 for greedy 
algorithms (there is a lot of research in this area; e.g. I am reading 
an article about optimal register-pressure-sensitive insn scheduling 
where the optimization can be constrained in time, for example 1ms per 
insn, and still produce better results than the current heuristics).  
I am sure such algorithms will be coming.

> Btw, thanks for working on this.  How does it relate to
> -fsched-pressure?
It is based on the -fsched-pressure infrastructure but has different 
heuristics and goals.  GCC with the 1st insn scheduling, even with 
-fsched-pressure, still produces worse results on mainstream x86/x86-64 
processors than GCC without it.  I've also tried -flive-range-shrinkage 
-fschedule-insns -fsched-pressure, but just -flive-range-shrinkage is 
better for x86/x86-64.

By the way, LLVM uses insn scheduling for x86/x86-64 before RA, but its 
goal is only register-pressure decrease (for x86; for x86-64 it is a 
bit more complicated).  So with this optimization we are just catching 
up with LLVM (which is unusual for us in the optimization area).
>    Does it treat all register classes the same?
> On x86 mostly the few fixed registers for some of the integer pipeline
> instructions hurt, x86_64 has enough general and FP registers?
It treats them the same (although the result differs between classes, 
as they have different numbers of available regs).  It is always some 
kind of approximation, as we use register pressure classes here, not 
the classes which will actually be used for RA.  It is even more 
complicated as IRA actually uses dynamic classes (only the sets of 
regs which are profitable; e.g. they can differ from the classes 
defined in the target file when regs in a class are caller-saved or 
some specific hard regs are used for arg passing).  This makes graph 
coloring better for irregular-register-file architectures.  On the 
whole, as I remember, dynamic classes gave about a 1% improvement even 
for ppc.

I should say that the presence of hard regs in RTL (e.g. for parameter 
passing) is still a challenge for live-range shrinkage and 
register-pressure scheduling.  It should be addressed somehow.

Patch

Index: common.opt
===================================================================
--- common.opt	(revision 204380)
+++ common.opt	(working copy)
@@ -1738,6 +1738,10 @@  fregmove
 Common Ignore
 Does nothing. Preserved for backward compatibility.
 
+flive-range-shrinkage
+Common Report Var(flag_live_range_shrinkage) Init(0) Optimization
+Relief of register pressure through live range shrinkage
+
 frename-registers
 Common Report Var(flag_rename_registers) Init(2) Optimization
 Perform a register renaming optimization pass
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 204216)
+++ doc/invoke.texi	(working copy)
@@ -378,7 +378,7 @@  Objective-C and Objective-C++ Dialects}.
 -fira-region=@var{region} -fira-hoist-pressure @gol
 -fira-loop-pressure -fno-ira-share-save-slots @gol
 -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
--fivopts -fkeep-inline-functions -fkeep-static-consts @gol
+-fivopts -fkeep-inline-functions -fkeep-static-consts -flive-range-shrinkage @gol
 -floop-block -floop-interchange -floop-strip-mine -floop-nest-optimize @gol
 -floop-parallelize-all -flto -flto-compression-level @gol
 -flto-partition=@var{alg} -flto-report -flto-report-wpa -fmerge-all-constants @gol
@@ -7257,6 +7257,12 @@  registers after writing to their lower 3
 
 Enabled for x86 at levels @option{-O2}, @option{-O3}.
 
+@item -flive-range-shrinkage
+@opindex flive-range-shrinkage
+Attempt to decrease register pressure through register live range
+shrinkage.  This is helpful for fast processors with small or moderate
+size register sets.
+
 @item -fira-algorithm=@var{algorithm}
 Use the specified coloring algorithm for the integrated register
 allocator.  The @var{algorithm} argument can be @samp{priority}, which
Index: haifa-sched.c
===================================================================
--- haifa-sched.c	(revision 204380)
+++ haifa-sched.c	(working copy)
@@ -150,6 +150,9 @@  along with GCC; see the file COPYING3.
 
 #ifdef INSN_SCHEDULING
 
+/* True if we do pressure relief pass.  */
+bool sched_relief_p;
+
 /* issue_rate is the number of insns that can be scheduled in the same
    machine cycle.  It can be defined in the config/mach/mach.h file,
    otherwise we set it to 1.  */
@@ -2519,7 +2522,7 @@  rank_for_schedule (const void *x, const
   rtx tmp = *(const rtx *) y;
   rtx tmp2 = *(const rtx *) x;
   int tmp_class, tmp2_class;
-  int val, priority_val, info_val;
+  int val, priority_val, info_val, diff;
 
   if (MAY_HAVE_DEBUG_INSNS)
     {
@@ -2532,6 +2535,20 @@  rank_for_schedule (const void *x, const
 	return INSN_LUID (tmp) - INSN_LUID (tmp2);
     }
 
+  if (sched_relief_p)
+    {
+      gcc_assert (sched_pressure == SCHED_PRESSURE_WEIGHTED);
+      if ((INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp) < 0
+	   || INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp2) < 0)
+	  && (diff = (INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp)
+		      - INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp2))) != 0)
+	return diff;
+      /* Sort by INSN_LUID (original insn order), so that we make the
+	 sort stable.  This minimizes instruction movement, thus
+	 minimizing sched's effect on debugging and cross-jumping.  */
+      return INSN_LUID (tmp) - INSN_LUID (tmp2);
+    }
+
   /* The insn in a schedule group should be issued the first.  */
   if (flag_sched_group_heuristic &&
       SCHED_GROUP_P (tmp) != SCHED_GROUP_P (tmp2))
@@ -2542,8 +2559,6 @@  rank_for_schedule (const void *x, const
 
   if (sched_pressure != SCHED_PRESSURE_NONE)
     {
-      int diff;
-
       /* Prefer insn whose scheduling results in the smallest register
 	 pressure excess.  */
       if ((diff = (INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp)
@@ -3731,7 +3746,10 @@  schedule_insn (rtx insn)
 	{
 	  fputc (':', sched_dump);
 	  for (i = 0; i < ira_pressure_classes_num; i++)
-	    fprintf (sched_dump, "%s%+d(%d)",
+	    fprintf (sched_dump, "%s%s%+d(%d)",
+		     scheduled_insns.length () > 1
+		     && INSN_LUID (insn)
+		     < INSN_LUID (scheduled_insns[scheduled_insns.length () - 2]) ? "@" : "",
 		     reg_class_names[ira_pressure_classes[i]],
 		     pressure_info[i].set_increase, pressure_info[i].change);
 	}
@@ -6578,9 +6596,11 @@  sched_init (void)
   if (targetm.sched.dispatch (NULL_RTX, IS_DISPATCH_ON))
     targetm.sched.dispatch_do (NULL_RTX, DISPATCH_INIT);
 
-  if (flag_sched_pressure
-      && !reload_completed
-      && common_sched_info->sched_pass_id == SCHED_RGN_PASS)
+  if (sched_relief_p)
+    sched_pressure = SCHED_PRESSURE_WEIGHTED;
+  else if (flag_sched_pressure
+	   && !reload_completed
+	   && common_sched_info->sched_pass_id == SCHED_RGN_PASS)
     sched_pressure = ((enum sched_pressure_algorithm)
 		      PARAM_VALUE (PARAM_SCHED_PRESSURE_ALGORITHM));
   else
Index: ira.c
===================================================================
--- ira.c	(revision 204380)
+++ ira.c	(working copy)
@@ -3794,11 +3794,12 @@  update_equiv_regs (void)
 
 		  if (! reg_equiv[regno].replace
 		      || reg_equiv[regno].loop_depth < loop_depth
-		      /* There is no sense to move insns if we did
-			 register pressure-sensitive scheduling was
-			 done because it will not improve allocation
-			 but worsen insn schedule with a big
-			 probability.  */
+		      /* There is no sense to move insns if live range
+			 shrinkage or register pressure-sensitive
+			 scheduling were done because it will not
+			 improve allocation but worsen insn schedule
+			 with a big probability.  */
+		      || flag_live_range_shrinkage
 		      || (flag_sched_pressure && flag_schedule_insns))
 		    continue;
 
Index: passes.def
===================================================================
--- passes.def	(revision 204380)
+++ passes.def	(working copy)
@@ -358,6 +358,7 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_mode_switching);
       NEXT_PASS (pass_match_asm_constraints);
       NEXT_PASS (pass_sms);
+      NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
Index: sched-deps.c
===================================================================
--- sched-deps.c	(revision 204380)
+++ sched-deps.c	(working copy)
@@ -1938,8 +1938,8 @@  create_insn_reg_use (int regno, rtx insn
   return use;
 }
 
-/* Allocate and return reg_set_data structure for REGNO and INSN.  */
-static struct reg_set_data *
+/* Allocate reg_set_data structure for REGNO and INSN.  */
+static void
 create_insn_reg_set (int regno, rtx insn)
 {
   struct reg_set_data *set;
@@ -1949,7 +1949,6 @@  create_insn_reg_set (int regno, rtx insn
   set->insn = insn;
   set->next_insn_set = INSN_REG_SET_LIST (insn);
   INSN_REG_SET_LIST (insn) = set;
-  return set;
 }
 
 /* Set up insn register uses for INSN and dependency context DEPS.  */
Index: sched-int.h
===================================================================
--- sched-int.h	(revision 204380)
+++ sched-int.h	(working copy)
@@ -28,6 +28,9 @@  along with GCC; see the file COPYING3.
 #include "df.h"
 #include "basic-block.h"
 
+/* True if we do pressure relief pass.  */
+extern bool sched_relief_p;
+
 /* Identificator of a scheduler pass.  */
 enum sched_pass_id_t { SCHED_PASS_UNKNOWN, SCHED_RGN_PASS, SCHED_EBB_PASS,
 		       SCHED_SMS_PASS, SCHED_SEL_PASS };
Index: sched-rgn.c
===================================================================
--- sched-rgn.c	(revision 204380)
+++ sched-rgn.c	(working copy)
@@ -3565,6 +3565,33 @@  advance_target_bb (basic_block bb, rtx i
 #endif
 
 static bool
+gate_handle_live_range_shrinkage (void)
+{
+#ifdef INSN_SCHEDULING
+  return flag_live_range_shrinkage;
+#else
+  return 0;
+#endif
+}
+
+/* Run instruction scheduler.  */
+static unsigned int
+rest_of_handle_live_range_shrinkage (void)
+{
+#ifdef INSN_SCHEDULING
+  int saved;
+
+  sched_relief_p = true;
+  saved = flag_schedule_interblock;
+  flag_schedule_interblock = false;
+  schedule_insns ();
+  flag_schedule_interblock = saved;
+  sched_relief_p = false;
+#endif
+  return 0;
+}
+
+static bool
 gate_handle_sched (void)
 {
 #ifdef INSN_SCHEDULING
@@ -3621,6 +3648,45 @@  rest_of_handle_sched2 (void)
 }
 
 namespace {
+
+const pass_data pass_data_live_range_shrinkage =
+{
+  RTL_PASS, /* type */
+  "lr_shrinkage", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  true, /* has_gate */
+  true, /* has_execute */
+  TV_LIVE_RANGE_SHRINKAGE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  ( TODO_df_finish | TODO_verify_rtl_sharing
+    | TODO_verify_flow ), /* todo_flags_finish */
+};
+
+class pass_live_range_shrinkage : public rtl_opt_pass
+{
+public:
+  pass_live_range_shrinkage(gcc::context *ctxt)
+    : rtl_opt_pass(pass_data_live_range_shrinkage, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate () { return gate_handle_live_range_shrinkage (); }
+  unsigned int execute () { return rest_of_handle_live_range_shrinkage (); }
+
+}; // class pass_live_range_shrinkage
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_live_range_shrinkage (gcc::context *ctxt)
+{
+  return new pass_live_range_shrinkage (ctxt);
+}
+
+namespace {
 
 const pass_data pass_data_sched =
 {
Index: timevar.def
===================================================================
--- timevar.def	(revision 204380)
+++ timevar.def	(working copy)
@@ -223,6 +223,7 @@  DEFTIMEVAR (TV_COMBINE               , "
 DEFTIMEVAR (TV_IFCVT		     , "if-conversion")
 DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
+DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE  , "live range shrinkage")
 DEFTIMEVAR (TV_SCHED                 , "scheduling")
 DEFTIMEVAR (TV_IRA		     , "integrated RA")
 DEFTIMEVAR (TV_LRA		     , "LRA non-specific")
Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 204380)
+++ tree-pass.h	(working copy)
@@ -530,6 +530,7 @@  extern rtl_opt_pass *make_pass_lower_sub
 extern rtl_opt_pass *make_pass_mode_switching (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);