Patchwork: Add capability to run several iterations of early optimizations

Submitter Richard Guenther
Date Oct. 28, 2011, 10:43 a.m.
Message ID <CAFiYyc1AHg7u0VA2XiZ50ozY6=BaPQx7Dct9-7M1rUzdgNj9Lw@mail.gmail.com>
Permalink /patch/122386/
State New

Comments

Richard Guenther - Oct. 28, 2011, 10:43 a.m.
On Fri, Oct 28, 2011 at 1:05 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> Richard,
>
> Just as Matt posted his findings about the effect of iterating early optimizations, I've got the new patch ready.  This patch is essentially a complete rewrite and addresses the comments you made.
>
> On 18/10/2011, at 9:56 PM, Richard Guenther wrote:
>
>>>>
>>>> If we'd want to iterate early optimizations we'd want to do it by iterating
>>>> an IPA pass so that we benefit from more precise size estimates
>>>> when trying to inline a function the second time.
>>>
>>> Could you elaborate on this a bit?  Early optimizations are gimple passes, so I'm missing your point here.
>>
>> pass_early_local_passes is an IPA pass, you want to iterate
>> fn1, fn2, fn1, fn2, ..., not fn1, fn1 ..., fn2, fn2 ... precisely for better
>> inlining.  Thus you need to split pass_early_local_passes into pieces
>> so you can iterate one of the IPA pieces.
>
> Early_local_passes are now split into _main, _iter and _late parts.  To avoid changing the default case, the _late part is merged into _main when no iterative optimizations are requested.
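[To make the ordering concrete, here is a minimal self-contained model (plain C for illustration only; `iterate_early_local_passes`, `demo_visits` and the recording array are invented names, not GCC's real pass-manager API). The point of iterating at the IPA level is that the iteration loop sits outside the per-function loop:]

```c
#define MAX_EVENTS 64

/* Records which function each simulated local-pass run visited,
   so the interleaved ordering can be inspected afterwards.  */
int visit_order[MAX_EVENTS];
int visit_count;

static void
run_local_passes_on (int fn)
{
  if (visit_count < MAX_EVENTS)
    visit_order[visit_count++] = fn;
}

/* IPA-style driver: iteration is the outer loop, so the visit order
   is fn0, fn1, fn0, fn1, ... rather than fn0, fn0, ..., fn1, fn1, ...
   On the second round the inliner would see post-optimization size
   estimates for every callee.  */
void
iterate_early_local_passes (int n_functions, int n_iterations)
{
  for (int iter = 0; iter < n_iterations; iter++)
    for (int fn = 0; fn < n_functions; fn++)
      run_local_passes_on (fn);
}

/* Demo helper: run the driver and report which function the
   event-th pass execution visited, or -1 past the end.  */
int
demo_visits (int n_functions, int n_iterations, int event)
{
  visit_count = 0;
  iterate_early_local_passes (n_functions, n_iterations);
  return event < visit_count ? visit_order[event] : -1;
}
```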
>
>>
>>>> Also statically
>>>> scheduling the passes will mess up dump files and you have no
>>>> chance of say, noticing that nothing changed for function f and its
>>>> callees in iteration N and thus you can skip processing them in
>>>> iteration N + 1.
>>>
>>> Yes, these are the shortcomings.  The dump file name changes can be fixed, e.g., by adding a suffix to the passes on iterations after the first one.  The analysis to avoid unnecessary iterations is a more complex problem.
>
> To keep the dump files from clashing, the patch appends an "_iter" suffix to the dumps of the iterative passes.
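[The change-tracking analysis mentioned above could, in principle, key off per-function "changed" bits: a function needs another round only if it or one of its callees changed last round. A hypothetical sketch (plain C; the names and the adjacency-matrix callgraph are invented for illustration, not GCC's cgraph API):]

```c
#define MAX_FNS 16

/* A function needs reprocessing in iteration N + 1 if its own body
   changed in iteration N, or if one of its callees did (its inlining
   opportunities may have improved).  callee[i][j] != 0 means i calls j.  */
int
needs_reprocessing (int fn, int n_fns, const int changed[],
                    const int callee[][MAX_FNS])
{
  if (changed[fn])
    return 1;
  for (int j = 0; j < n_fns; j++)
    if (callee[fn][j] && changed[j])
      return 1;
  return 0;
}

/* Tiny demo: three functions, fn0 calls fn1, and only fn1 changed,
   so fn0 and fn1 need another round while fn2 can be skipped.  */
int
demo_needs_reprocessing (int fn)
{
  static const int changed[MAX_FNS] = { 0, 1, 0 };
  static const int callee[MAX_FNS][MAX_FNS] = { [0] = { [1] = 1 } };
  return needs_reprocessing (fn, 3, changed, callee);
}
```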
>
>>
>> Sure.  I analyzed early passes by manually duplicating them and
>> testing that they do nothing for tramp3d, which they pretty much all did
>> at some point.
>>
>>>>
>>>> So, at least you should split the pass_early_local_passes IPA pass
>>>> into three, you'd iterate over the 2nd (definitely not over pass_split_functions
>>>> though), the third would be pass_profile and pass_split_functions only.
>>>> And you'd iterate from the place the 2nd IPA pass is executed, not
>>>> by scheduling them N times.
>>>
>>> OK, I will look into this.
>
> Done.
>
>>>
>>>>
>>>> Then you'd have to analyze the compile-time impact of the IPA
>>>> splitting on its own when not iterating.
>
> I decided to avoid this and keep the pass pipeline effectively the same when not running iterative optimizations.  This is achieved by scheduling pass_early_optimizations_late in different places in the pipeline depending on whether iterative optimizations are enabled or not.
>
> The patch bootstraps and passes regtest on i686-pc-linux-gnu {-m32/-m64} with 3 iterations enabled by default.  The only failures are 5 scan-dump tests that are due to more functions being inlined than expected.  With iterative optimizations disabled there is no change.
>
> I've kicked off SPEC2000/SPEC2006 benchmark runs to see the performance effect of the patch, and those will be posted in the same Google Docs spreadsheet in several days.
>
> OK for trunk?


@@ -255,7 +255,7 @@ cgraph_process_new_functions (void)
              /* When not optimizing, be sure we run early local passes anyway
                 to expand OMP.  */
              || !optimize)
-           execute_pass_list (pass_early_local_passes.pass.sub);
+           execute_early_local_passes_for_current_function ();

similar.

About all this -suffix stuff, I'd like to have the iterations simply re-use
the existing dump-files, thus statically sub-divide pass_early_local_passes
like

NEXT_PASS (pass_early_local_lowering_passes);
  {
      NEXT_PASS (pass_fixup_cfg);
      NEXT_PASS (pass_init_datastructures);
      NEXT_PASS (pass_expand_omp);

      NEXT_PASS (pass_referenced_vars);
      NEXT_PASS (pass_build_ssa);
      NEXT_PASS (pass_lower_vector);
      NEXT_PASS (pass_early_warn_uninitialized);
  }
/* The following you maybe iterate.  */
NEXT_PASS (pass_early_local_optimizations);
  {
      NEXT_PASS (pass_rebuild_cgraph_edges);
      NEXT_PASS (pass_inline_parameters);
      NEXT_PASS (pass_early_inline);
      NEXT_PASS (pass_all_early_optimizations);
        {
          struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
          NEXT_PASS (pass_remove_cgraph_callee_edges);
          NEXT_PASS (pass_rename_ssa_copies);
          NEXT_PASS (pass_ccp);
          NEXT_PASS (pass_forwprop);
          /* pass_build_ealias is a dummy pass that ensures that we
             execute TODO_rebuild_alias at this point.  Re-building
             alias information also rewrites no longer addressed
             locals into SSA form if possible.  */
          NEXT_PASS (pass_build_ealias);
          NEXT_PASS (pass_sra_early);
          NEXT_PASS (pass_fre);
          NEXT_PASS (pass_copy_prop);
          NEXT_PASS (pass_merge_phi);
          NEXT_PASS (pass_cd_dce);
          NEXT_PASS (pass_early_ipa_sra);
          NEXT_PASS (pass_tail_recursion);
          NEXT_PASS (pass_convert_switch);
          NEXT_PASS (pass_cleanup_eh);
          NEXT_PASS (pass_profile);
          NEXT_PASS (pass_local_pure_const);
       }
   }
NEXT_PASS (pass_late_early_local_optimizations)
  {
          /* Split functions creates parts that are not run through
             early optimizations again.  It is thus a good idea to do this
             late.  */
          NEXT_PASS (pass_split_functions);
      NEXT_PASS (pass_release_ssa_names);
      NEXT_PASS (pass_rebuild_cgraph_edges);
      NEXT_PASS (pass_inline_parameters);
  }

+++ b/gcc/toplev.c
@@ -1228,7 +1228,6 @@ general_init (const char *argv0)
   /* This must be done after global_init_params but before argument
      processing.  */
   init_ggc_heuristics();
-  init_optimization_passes ();
   statistics_early_init ();
   finish_params ();
 }
@@ -1989,6 +1988,8 @@ toplev_main (int argc, char **argv)
                  save_decoded_options, save_decoded_options_count,
                  UNKNOWN_LOCATION, global_dc);

+  init_optimization_passes ();
+
   handle_common_deferred_options ();

   init_local_tick ();

any reason for this?

+  /* Don't recurse or wander on to the next pass when running
+     execute_ipa_pass_list below.  */
+  current->execute = NULL;
+  current->next = NULL;

that looks awkward ;)  Shouldn't instead

+    execute_ipa_pass_list (current);

not do what it does?  Thus, split that into two pieces, one that we
can iterate w/o the fiddling with current->?
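[The split being suggested can be sketched generically (plain C model; `ipa_pass`, `execute_one_ipa_pass`, and `execute_ipa_pass_list_iterated` are invented stand-ins, not GCC's actual pass structures): once running a single pass is factored out, the driver can repeat one list element N times without NULLing its `execute` or `next` fields.]

```c
struct ipa_pass
{
  const char *name;
  int (*execute) (void);
  struct ipa_pass *next;
};

/* Run exactly one pass; the rest of the list is untouched.  */
static int
execute_one_ipa_pass (struct ipa_pass *pass)
{
  return pass->execute ? pass->execute () : 0;
}

/* Walk the list, repeating the designated pass N times in place;
   nothing needs to be cleared to stop recursion or list traversal.  */
int
execute_ipa_pass_list_iterated (struct ipa_pass *list,
                                struct ipa_pass *iterated, int n_iter)
{
  int executions = 0;
  for (struct ipa_pass *p = list; p; p = p->next)
    {
      int reps = (p == iterated) ? n_iter : 1;
      for (int i = 0; i < reps; i++)
        executions += execute_one_ipa_pass (p);
    }
  return executions;
}

static int
count_one (void)
{
  return 1;
}

/* Demo: one ordinary pass followed by one iterated pass; returns the
   total number of pass executions.  */
int
demo_iterated_list (int n_iter)
{
  struct ipa_pass second = { "iterated", count_one, 0 };
  struct ipa_pass first = { "once", count_one, &second };
  return execute_ipa_pass_list_iterated (&first, &second, n_iter);
}
```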

I like this variant a lot better than the last one - still it lacks any
analysis-based justification for iteration (see my reply to Matt on
what I discussed with Honza).  Thus, I don't think we want to
merge this in its current form or in this stage1.

I'd also like to know _which_ early passes are worth iterating
(and if iterating is only worth if the iterated early inlining inlined
something).  I guess that goes back to the point that iterating
should be driven by the early inliner and/or  the fact whether
in the previous iteration new direct calls were discovered.
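[Such an inliner-driven stopping rule might look roughly like this (hypothetical sketch; `run_one_round` stands in for one execution of the iterated early-optimization piece and returns nonzero when it made progress, e.g. inlined a call or discovered a new direct call):]

```c
/* Keep iterating only while the previous round reported progress,
   with a hard cap as a safety net.  Returns the number of rounds run.  */
int
iterate_while_profitable (int (*run_one_round) (void), int max_rounds)
{
  int rounds = 0;
  while (rounds < max_rounds)
    {
      rounds++;
      if (!run_one_round ())
        break;
    }
  return rounds;
}

/* Demo fixture: pretend the first n_profitable rounds make progress.  */
int fake_changes_left;

static int
fake_round (void)
{
  return fake_changes_left-- > 0;
}

int
demo_rounds (int n_profitable, int max_rounds)
{
  fake_changes_left = n_profitable;
  return iterate_while_profitable (fake_round, max_rounds);
}
```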

Richard.


> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics
>
>
Maxim Kuvyrkov - Oct. 28, 2011, 10:21 p.m.
On 28/10/2011, at 11:43 PM, Richard Guenther wrote:

> On Fri, Oct 28, 2011 at 1:05 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>> 
>> OK for trunk?
> 
> diff --git a/gcc/cgraph.c b/gcc/cgraph.c
> index f056d3d..4738b28 100644
> --- a/gcc/cgraph.c
> +++ b/gcc/cgraph.c
> @@ -2416,7 +2416,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
>            tree_lowering_passes (fndecl);
>            bitmap_obstack_initialize (NULL);
>            if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
> -             execute_pass_list (pass_early_local_passes.pass.sub);
> +             execute_early_local_passes_for_current_function ();
>            bitmap_obstack_release (NULL);
>            pop_cfun ();
>            current_function_decl = NULL;
> @@ -2441,7 +2441,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
>        gimple_register_cfg_hooks ();
>        bitmap_obstack_initialize (NULL);
>        if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
> -         execute_pass_list (pass_early_local_passes.pass.sub);
> +         execute_early_local_passes_for_current_function ();
> 
> I think these should only execute the lowering pieces of early local passes,
> let me see if that's properly split ...
> 
> @@ -255,7 +255,7 @@ cgraph_process_new_functions (void)
>              /* When not optimizing, be sure we run early local passes anyway
>                 to expand OMP.  */
>              || !optimize)
> -           execute_pass_list (pass_early_local_passes.pass.sub);
> +           execute_early_local_passes_for_current_function ();
> 
> similar.

In the case of tree_lowering_passes, I agree.  But with the calls to early_local_passes from cgraph_add_new_function/cgraph_process_new_functions we ought to run the full list of optimizations to make bodies of new functions ready for inlining into existing functions.  Note, cgraph_add_new_function() explicitly calls tree_lowering_passes and then invokes early_local_passes.

What do you think?  If we want to pursue this, it should be in a separate patch anyway.

> 
> About all this -suffix stuff, I'd like to have the iterations simply re-use
> the existing dump-files, thus statically sub-divide pass_early_local_passes
> like

I considered this when I was implementing and debugging iterative support, and it is better to keep dumps for main and iterative parts separate.  These are two different IPA passes, and their sub-passes should use separate dumps.

Given that the set of sub-passes of early_local_passes_main is substantially different from the set of sub-passes of early_local_passes_iter (the former has a bunch of init/expand/lower passes in it), mixing the dumps will be mighty confusing.  Without knowing the exact ordering of passes one will not be able to construct a chronological picture of optimizations.  Separating dumps of sub-passes of different IPA passes simplifies understanding of optimization ordering.

> 
> NEXT_PASS (pass_early_local_lowering_passes);
>  {
>      NEXT_PASS (pass_fixup_cfg);
>      NEXT_PASS (pass_init_datastructures);
>      NEXT_PASS (pass_expand_omp);
> 
>      NEXT_PASS (pass_referenced_vars);
>      NEXT_PASS (pass_build_ssa);
>      NEXT_PASS (pass_lower_vector);
>      NEXT_PASS (pass_early_warn_uninitialized);
>  }
> /* The following you maybe iterate.  */
> NEXT_PASS (pass_early_local_optimizations);
>  {
>      NEXT_PASS (pass_rebuild_cgraph_edges);
>      NEXT_PASS (pass_inline_parameters);
>      NEXT_PASS (pass_early_inline);
>      NEXT_PASS (pass_all_early_optimizations);
>        {
>          struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
>          NEXT_PASS (pass_remove_cgraph_callee_edges);
>          NEXT_PASS (pass_rename_ssa_copies);
>          NEXT_PASS (pass_ccp);
>          NEXT_PASS (pass_forwprop);
>          /* pass_build_ealias is a dummy pass that ensures that we
>             execute TODO_rebuild_alias at this point.  Re-building
>             alias information also rewrites no longer addressed
>             locals into SSA form if possible.  */
>          NEXT_PASS (pass_build_ealias);
>          NEXT_PASS (pass_sra_early);
>          NEXT_PASS (pass_fre);
>          NEXT_PASS (pass_copy_prop);
>          NEXT_PASS (pass_merge_phi);
>          NEXT_PASS (pass_cd_dce);
>          NEXT_PASS (pass_early_ipa_sra);
>          NEXT_PASS (pass_tail_recursion);
>          NEXT_PASS (pass_convert_switch);
>          NEXT_PASS (pass_cleanup_eh);
>          NEXT_PASS (pass_profile);
>          NEXT_PASS (pass_local_pure_const);
>       }
>   }
> NEXT_PASS (pass_late_early_local_optimizations)
>  {
>          /* Split functions creates parts that are not run through
>             early optimizations again.  It is thus a good idea to do this
>             late.  */
>          NEXT_PASS (pass_split_functions);
>      NEXT_PASS (pass_release_ssa_names);
>      NEXT_PASS (pass_rebuild_cgraph_edges);
>      NEXT_PASS (pass_inline_parameters);
>  }

This simple division into 3 stages will not work.  On one hand, one needs to add fixup/cleanup passes at the beginning and end of every SIMPLE_IPA pass; without those cgraph_verify() gets upset very easily.  On the other hand, when iterative optimizations are not enabled, and there is no need to decouple main and late parts, these additional fixup/cleanup passes would be unnecessary overhead.

Also, in the above division you effectively make 3 separate SIMPLE_IPA passes, and I don't know what the effect of that will be.

The pass formation and scheduling in the patch I posted may not look as pretty as what you describe above, but it is reasonably clean and easy to maintain.  One needs to change only passes in the main part, and those changes will be automatically picked up in iterative and late parts.

> 
> +++ b/gcc/toplev.c
> @@ -1228,7 +1228,6 @@ general_init (const char *argv0)
>   /* This must be done after global_init_params but before argument
>      processing.  */
>   init_ggc_heuristics();
> -  init_optimization_passes ();
>   statistics_early_init ();
>   finish_params ();
> }
> @@ -1989,6 +1988,8 @@ toplev_main (int argc, char **argv)
>                  save_decoded_options, save_decoded_options_count,
>                  UNKNOWN_LOCATION, global_dc);
> 
> +  init_optimization_passes ();
> +
>   handle_common_deferred_options ();
> 
>   init_local_tick ();
> 
> any reason for this?

Yes, this moves init_optimization_passes() after processing of --param options.  This is needed to be able to use PARAM_VALUE in passes initialization.

> 
> +  /* Don't recurse or wander on to the next pass when running
> +     execute_ipa_pass_list below.  */
> +  current->execute = NULL;
> +  current->next = NULL;
> 
> that looks awkward ;)  Shouldn't instead
> 
> +    execute_ipa_pass_list (current);
> 
> not do what it does?  Thus, split that into two pieces, one that we
> can iterate w/o the fiddling with current->?

The choice is between the above and duplicating post-processing logic of execute_ipa_pass_list (i.e., call to cgraph_process_new_functions) in the iteration loop.  I agree that the above is a tad awkward, but it is cleaner than copy-paste of the final part of execute_ipa_pass_list.

> 
> I like this variant a lot better than the last one - still it lacks any
> analysis-based justification for iteration (see my reply to Matt on
> what I discussed with Honza).

Yes, having a way to tell whether a function has significantly changed would be awesome.  My approach here would be to make inline_parameters output feedback on how much the size/time metrics have changed for a function since the previous run.  If the change is above X%, then queue the function's callers for more optimizations.  Similarly, Martin's rebuild_cgraph_edges_and_devirt (when that goes into trunk) could queue new direct callees and the current function for another iteration if new direct edges were resolved.

But such analysis is way out of scope of this patch, which adds infrastructure for iterative optimizations to be possible and reliable.
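[The X% trigger described above might be sketched like this (illustrative only; the threshold value and function names are invented, and the real inline_parameters tracks both size and time estimates):]

```c
/* Hypothetical requeue threshold: callers are queued for another
   round only if a function's size estimate moved by more than X%.  */
#define REQUEUE_THRESHOLD_PERCENT 10

int
metrics_changed_significantly (int prev_size, int new_size)
{
  if (prev_size == 0)
    return new_size != 0;
  int delta = new_size > prev_size ? new_size - prev_size
                                   : prev_size - new_size;
  return delta * 100 / prev_size > REQUEUE_THRESHOLD_PERCENT;
}
```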

>  Thus, I don't think we want to
> merge this in its current form or in this stage1.

What is the benefit of pushing this to a later release?  If anything, merging the support for iterative optimizations now will allow us to consider adding the wonderful smartness to it later.  In the meantime, substituting that smartness with a knob is still a great alternative.

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics
Matt - Oct. 28, 2011, 11:06 p.m.
On Sat, 29 Oct 2011, Maxim Kuvyrkov wrote:

>> I like this variant a lot better than the last one - still it lacks any
>> analysis-based justification for iteration (see my reply to Matt on
>> what I discussed with Honza).
>
> Yes, having a way to tell whether a function has significantly changed 
> would be awesome.  My approach here would be to make inline_parameters 
> output feedback of how much the size/time metrics have changed for a 
> function since previous run.  If the change is above X%, then queue 
> functions callers for more optimizations.  Similarly, Martin's 
> rebuild_cgraph_edges_and_devirt (when that goes into trunk) could queue 
> new direct callees and current function for another iteration if new 
> direct edges were resolved.

Figuring out the heuristic will need decent testing on a few projects to 
figure out what the "sweet spot" is (smallest binary for time/passes 
spent) for that given codebase. With a few data points, a reasonable stab 
at the metrics you mention can be had that would not terminate the 
iterations before the known optimal number of passes. Without those data 
points, it seems like making sure the metrics allow those "sweet spots" to 
be attained will be difficult.
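[The "sweet spot" search described above is essentially an argmin over measured results; a trivial sketch (plain C; names invented, and a real harness would of course also weigh compile time against binary size):]

```c
/* Given the binary size measured for 1..n iterations (sizes[0] is the
   size with one iteration), return the smallest iteration count that
   achieves the minimum size.  */
int
best_iteration_count (const int sizes[], int n)
{
  int best = 0;
  for (int i = 1; i < n; i++)
    if (sizes[i] < sizes[best])
      best = i;
  return best + 1;  /* iteration counts are 1-based */
}

/* Demo data: sizes for 1..4 iterations of a hypothetical build; the
   minimum is first reached at 2 iterations.  */
int
demo_best_iteration_count (void)
{
  static const int sizes[] = { 100, 90, 90, 95 };
  return best_iteration_count (sizes, 4);
}
```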

>>  Thus, I don't think we want to
>> merge this in its current form or in this stage1.
>
> What is the benefit of pushing this to a later release?  If anything, 
> merging the support for iterative optimizations now will allow us to 
> consider adding the wonderful smartness to it later.  In the meantime, 
> substituting that smartness with a knob is still a great alternative.

I agree (of course). Having the knob will be very useful for testing and 
determining the acceptance criteria for the later "smartness". While 
terminating early would be a nice optimization, the feature is still 
intrinsically useful and deployable without it. In addition, when using 
LTO on nearly all the projects/modules I tested on, 3+ passes were 
always productive. To be fair, when not using LTO, beyond 2-3 passes did 
not often produce improvements unless individual compilation units were 
enormous.

There was also the question of whether some of the improvements seen with 
multiple passes were indicative of deficiencies in early inlining, CFG, 
SRA, etc. If the knob is available, I'm happy to continue testing on the 
same projects I've filed recent LTO/graphite bugs against (glib, zlib, 
openssl, scummvm, binutils, etc) and write a report on what I observe as 
"suspicious" improvements that perhaps should be caught/made in a single 
pass.

It's worth noting again that while this is a useful feature in and of 
itself (especially when combined with LTO), it's *extremely* useful when 
coupled with the de-virtualization improvements submitted in other 
threads. The examples submitted for inclusion in the test suite aren't 
academic -- they are reductions of real-world performance issues from a 
mature (and shipping) C++-based networking product. Any C++ codebase that 
employs physical separation in their designs via Factory patterns, 
Interface Segregation, and/or Dependency Inversion will likely see 
improvements. To me, these enhancements combine to form one of the biggest 
leaps I've seen in C++ code optimization -- code that can be clean, OO, 
*and* fast.

Richard: If there's any additional testing or information I can reasonably 
provide to help get this in for this stage1, let me know.

Thanks!


--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt
Martin Jambor - Nov. 1, 2011, 8:22 p.m.
Hi,

On Fri, Oct 28, 2011 at 04:06:20PM -0700, Matt wrote:
>
...

> 
> I agree (of course). Having the knob will be very useful for testing
> and determining the acceptance criteria for the later "smartness".
> While terminating early would be a nice optimization, the feature is
> still intrinsically useful and deployable without it. In addition,
> when using LTO on nearly all the projects/modules I tested on, 3+
> passes were always productive. To be fair, when not using LTO,
> beyond 2-3 passes did not often produce improvements unless
> individual compilation units were enormous.

I'm quite surprised you get extra benefit with LTO since early
optimizations are exactly the part of middle-end which should produce
the same results, LTO or not.  So the only way I can imagine this can
happen is that inlining analysis gets somehow a much better input and
then can make much bigger use of it.  If this is because of extra
early inlining, we might try to be able to catch these situations when
doing IPA inlining decisions which would work regardless of any
iteration number cut-off.  If it is because of something else, it's
probably better to (at least try to) tweak the passes and/or inlining
analysis to understand each other straight away.

> 
> There was also the question of if some of the improvements seen with
> multiple passes were indicative of deficiencies in early inlining,
> CFG, SRA, 

SRA, because it is not flow-sensitive in any way, unfortunately
sometimes produces useless statements which then need to be cleaned up
by forwprop (and possibly dse and others).  We've already talked with
Richi about this and agreed the early one should be dumbed down a
little to produce much less of these.  I'm afraid I won't be able to
submit a patch doing that during this stage 1, though.

> etc. If the knob is available, I'm happy to continue
> testing on the same projects I've filed recent LTO/graphite bugs
> against (glib, zlib, openssl, scummvm, binutils, etc) and write a
> report on what I observe as "suspicious" improvements that perhaps
> should be caught/made in a single pass.
> 
> It's worth noting again that while this is a useful feature in and
> of itself (especially when combined with LTO), it's *extremely*
> useful when coupled with the de-virtualization improvements
> submitted in other threads. The examples submitted for inclusion in
> the test suite aren't academic -- they are reductions of real-world
> performance issues from a mature (and shipping) C++-based networking
> product. Any C++ codebase that employs physical separation in their
> designs via Factory patterns, Interface Segregation, and/or
> Dependency Inversion will likely see improvements. To me, these
> enahncements combine to form one of the biggest leaps I've seen in
> C++ code optimization -- code that can be clean, OO, *and* fast.

Well, while I'd understand that whenever there is a new direct call
graph edge, trying early inlining again might help or save some work
for the full inlining, I think that we should try to enhance
the current IPA infrastructure rather than grow another one from the
early optimizations, especially if we aim at LTO - iterating early
optimizations will not help reduce abstraction if it is spread across
a number of compilation units.

Martin
Richard Guenther - Nov. 1, 2011, 9:18 p.m.
On Sat, Oct 29, 2011 at 1:06 AM, Matt <matt@use.net> wrote:
> On Sat, 29 Oct 2011, Maxim Kuvyrkov wrote:
>
>>> I like this variant a lot better than the last one - still it lacks any
>>> analysis-based justification for iteration (see my reply to Matt on
>>> what I discussed with Honza).
>>
>> Yes, having a way to tell whether a function has significantly changed
>> would be awesome.  My approach here would be to make inline_parameters
>> output feedback of how much the size/time metrics have changed for a
>> function since previous run.  If the change is above X%, then queue
>> functions callers for more optimizations.  Similarly, Martin's
>> rebuild_cgraph_edges_and_devirt (when that goes into trunk) could queue new
>> direct callees and current function for another iteration if new direct
>> edges were resolved.
>
> Figuring out the heuristic will need decent testing on a few projects to
> figure out what the "sweet spot" is (smallest binary for time/passes spent)
> for that given codebase. With a few data points, a reasonable stab at the
> metrics you mention can be had that would not terminate the iterations
> before the known optimal number of passes. Without those data points, it
> seems like making sure the metrics allow those "sweet spots" to be attained
> will be difficult.

Well, sure - the same like with inlining heuristics.

>>>  Thus, I don't think we want to
>>> merge this in its current form or in this stage1.
>>
>> What is the benefit of pushing this to a later release?  If anything,
>> merging the support for iterative optimizations now will allow us to
>> consider adding the wonderful smartness to it later.  In the meantime,
>> substituting that smartness with a knob is still a great alternative.

The benefit?  The benefit is to not have another magic knob in there
that doesn't make too much sense and papers over real conceptual/algorithmic
issues.  Brute-force iterating is a hack, not a solution. (sorry)

> I agree (of course). Having the knob will be very useful for testing and
> determining the acceptance criteria for the later "smartness". While
> terminating early would be a nice optimization, the feature is still
> intrinsically useful and deployable without it. In addition, when using LTO
> on nearly all the projects/modules I tested on, 3+ passes were always
> productive.

If that is true then I'd really like to see testcases.  Because I am sure you
are just papering over (maybe even easy to fix) issues by the brute-force
iterating approach.  We also do not have a switch to run every pass twice
in succession, just because that would be as stupid as this.

> To be fair, when not using LTO, beyond 2-3 passes did not often
> produce improvements unless individual compilation units were enormous.
>
> There was also the question of if some of the improvements seen with
> multiple passes were indicative of deficiencies in early inlining, CFG, SRA,
> etc. If the knob is available, I'm happy to continue testing on the same
> projects I've filed recent LTO/graphite bugs against (glib, zlib, openssl,
> scummvm, binutils, etc) and write a report on what I observe as "suspicious"
> improvements that perhaps should be caught/made in a single pass.
>
> It's worth noting again that while this is a useful feature in and of itself
> (especially when combined with LTO), it's *extremely* useful when coupled
> with the de-virtualization improvements submitted in other threads. The
> examples submitted for inclusion in the test suite aren't academic -- they
> are reductions of real-world performance issues from a mature (and shipping)
> C++-based networking product. Any C++ codebase that employs physical
> separation in their designs via Factory patterns, Interface Segregation,
> and/or Dependency Inversion will likely see improvements. To me, these
> enhancements combine to form one of the biggest leaps I've seen in C++ code
> optimization -- code that can be clean, OO, *and* fast.

But iterating the whole early optimization pipeline is not a sensible approach
to attacking these.

Richard.

> Richard: If there's any additional testing or information I can reasonably
> provide to help get this in for this stage1, let me know.
>
> Thanks!
>
>
> --
> tangled strands of DNA explain the way that I behave.
> http://www.clock.org/~matt
>
Maxim Kuvyrkov - Nov. 8, 2011, 6:34 a.m.
On 2/11/2011, at 10:18 AM, Richard Guenther wrote:
> 
>>>>  Thus, I don't think we want to
>>>> merge this in its current form or in this stage1.
>>> 
>>> What is the benefit of pushing this to a later release?  If anything,
>>> merging the support for iterative optimizations now will allow us to
>>> consider adding the wonderful smartness to it later.  In the meantime,
>>> substituting that smartness with a knob is still a great alternative.
> 
> The benefit?  The benefit is to not have another magic knob in there
> that doesn't make too much sense and papers over real conceptual/algorithmic
> issues.  Brute-force iterating is a hack, not a solution. (sorry)
> 
>> I agree (of course). Having the knob will be very useful for testing and
>> determining the acceptance criteria for the later "smartness". While
>> terminating early would be a nice optimization, the feature is still
>> intrinsically useful and deployable without it. In addition, when using LTO
>> on nearly all the projects/modules I tested on, 3+ passes were always
>> productive.
> 
> If that is true then I'd really like to see testcases.  Because I am sure you
> are just papering over (maybe even easy to fix) issues by the brute-force
> iterating approach.  We also do not have a switch to run every pass twice
> in succession, just because that would be as stupid as this.
> 
>> To be fair, when not using LTO, beyond 2-3 passes did not often
>> produce improvements unless individual compilation units were enormous.
>> 
>> There was also the question of if some of the improvements seen with
>> multiple passes were indicative of deficiencies in early inlining, CFG, SRA,
>> etc. If the knob is available, I'm happy to continue testing on the same
>> projects I've filed recent LTO/graphite bugs against (glib, zlib, openssl,
>> scummvm, binutils, etc) and write a report on what I observe as "suspicious"
>> improvements that perhaps should be caught/made in a single pass.
>> 
>> It's worth noting again that while this is a useful feature in and of itself
>> (especially when combined with LTO), it's *extremely* useful when coupled
>> with the de-virtualization improvements submitted in other threads. The
>> examples submitted for inclusion in the test suite aren't academic -- they
>> are reductions of real-world performance issues from a mature (and shipping)
>> C++-based networking product. Any C++ codebase that employs physical
>> separation in their designs via Factory patterns, Interface Segregation,
>> and/or Dependency Inversion will likely see improvements. To me, these
>> enhancements combine to form one of the biggest leaps I've seen in C++ code
>> optimization -- code that can be clean, OO, *and* fast.
> 
> But iterating the whole early optimization pipeline is not a sensible approach
> to attacking these.

Richard,

You are making valid points.  Ideally, we would have a compiler with an optimal pass pipeline that would generate no better or worse code on second and subsequent iterations.  I disagree, however, that the patch discussed in this thread is a hack.  It is a feature.  It allows developers and power-users to experiment with the pass pipeline and see what comes out of it.

The benchmarks for the 2nd version of the patch have finally finished [*].  It shows that the additional iterations do make a difference of 2% for a number of benchmarks.  There are also trends like "additional iterations improve -m64 and hurt -m32" -- 254.gap, 256.bzip2, 300.twolf, 171.swim, 173.applu -- and "additional iterations improve -O2 and hurt -O3" -- 179.art, 400.perlbench.  There also are enough benchmarks that show straight improvement or straight regression.  How are we going to investigate and fix these missed optimization opportunities and regressions without the feature that we are discussing here?

We already provide certain features to let users tinker with the optimization pipeline:
1. -O0/-O1/-O2/-O3 options for entry-level users,
2. -f<pass>/-fno-<pass> for advanced users,
3. --param <obscure>=MAGIC for developers and compiler-savvy users.

Iterative optimizations should be no different.  "Yes", without this feature the world would be a simpler place for us to live in, as we would not have to know and care about regressions and missed opportunities in GCC's optimizations.  But, really, that would've been a boring world :-).

I feel that dropping this feature will cause the problems to stay unnoticed and never get fixed.

[*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics
Richard Guenther - Nov. 8, 2011, 11:12 a.m.
On Tue, Nov 8, 2011 at 7:34 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> On 2/11/2011, at 10:18 AM, Richard Guenther wrote:
>>
>>>>>  Thus, I don't think we want to
>>>>> merge this in its current form or in this stage1.
>>>>
>>>> What is the benefit of pushing this to a later release?  If anything,
>>>> merging the support for iterative optimizations now will allow us to
>>>> consider adding the wonderful smartness to it later.  In the meantime,
>>>> substituting that smartness with a knob is still a great alternative.
>>
>> The benefit?  The benefit is to not have another magic knob in there
>> that doesn't make too much sense and papers over real conceptual/algorithmic
>> issues.  Brute-force iterating is a hack, not a solution. (sorry)
>>
>>> I agree (of course). Having the knob will be very useful for testing and
>>> determining the acceptance criteria for the later "smartness". While
>>> terminating early would be a nice optimization, the feature is still
>>> intrinsically useful and deployable without it. In addition, when using LTO
>>> on nearly all the projects/modules I tested on, 3+ passes were always
>>> productive.
>>
>> If that is true then I'd really like to see testcases.  Because I am sure you
>> are just papering over (maybe even easy to fix) issues by the brute-force
>> iterating approach.  We also do not have a switch to run every pass twice
>> in succession, just because that would be as stupid as this.
>>
>>> To be fair, when not using LTO, beyond 2-3 passes did not often
>>> produce improvements unless individual compilation units were enormous.
>>>
>>> There was also the question of if some of the improvements seen with
>>> multiple passes were indicative of deficiencies in early inlining, CFG, SRA,
>>> etc. If the knob is available, I'm happy to continue testing on the same
>>> projects I've filed recent LTO/graphite bugs against (glib, zlib, openssl,
>>> scummvm, binutils, etc) and write a report on what I observe as "suspicious"
>>> improvements that perhaps should be caught/made in a single pass.
>>>
>>> It's worth noting again that while this is a useful feature in and of itself
>>> (especially when combined with LTO), it's *extremely* useful when coupled
>>> with the de-virtualization improvements submitted in other threads. The
>>> examples submitted for inclusion in the test suite aren't academic -- they
>>> are reductions of real-world performance issues from a mature (and shipping)
>>> C++-based networking product. Any C++ codebase that employs physical
>>> separation in their designs via Factory patterns, Interface Segregation,
>>> and/or Dependency Inversion will likely see improvements. To me, these
>>> enhancements combine to form one of the biggest leaps I've seen in C++ code
>>> optimization -- code that can be clean, OO, *and* fast.
>>
>> But iterating the whole early optimization pipeline is not a sensible approach
>> of attacking these.
>
> Richard,
>
> You are making valid points.  Ideally, we would have a compiler with an optimal pass pipeline that would not generate any better or worse code on second and subsequent iterations.  I disagree, however, that the patch discussed in this thread is a hack.  It is a feature: it allows developers and power users to experiment with the pass pipeline and see what comes out of it.
>
> The benchmarks for the 2nd version of the patch have finally finished [*].  They show that the additional iterations do make a difference of 2% on a number of benchmarks.  There are also trends like "additional iterations improve -m64 and hurt -m32" -- 254.gap, 256.bzip2, 300.twolf, 171.swim, 173.applu -- and "additional iterations improve -O2 and hurt -O3" -- 179.art, 400.perlbench.  There are also enough benchmarks that show a straight improvement or a straight regression.  How are we going to investigate and fix these missed optimization opportunities and regressions without the feature we are discussing here?
>
> We already provide several features that let users tinker with the optimization pipeline:
> 1. -O0/-O1/-O2/-O3 options for entry-level users,
> 2. -f<pass>/-fno-<pass> for advanced users,
> 3. --param <obscure>=MAGIC for developers and compiler-savvy users.
>
> Iterative optimizations should be no different.  "Yes", without this feature the world would be a simpler place for us to live in, as we would not have to know and care about regressions and missed opportunities in GCC's optimizations.  But, really, that would've been a boring world :-).
>
> I feel that dropping this feature will cause the problems to stay unnoticed and never get fixed.

How?  You are not even _analyzing_ why you are getting benefits, or
filing bugs.  Did you experiment with reducing the set of passes you
iterate?  That is, which passes really do help when iterating?  I've
asked you that before, and I showed the proper analysis that could be
done to detect the indirect inlining benefit you might see (surely
not on SPEC 2000, though).  I also said that yes, it _would_ make
sense to do something about this specific case, and yes, iterating
for a selected set of functions is a valid solution for the indirect
inlining case.  It is never a valid solution, IMHO, when no inlining
happened between the iterations.

You are basically saying: let's iterate the optimizers a few times
and hope they catch something.  Sure they will.  Show the important
pieces and analyze why they need the iteration.

You should probably realize that with each of these "power-user" features
the GCC testing matrix becomes larger.  If I were selling commercial
support for GCC, I'd make everything but -O* unsupported.

Thanks,
Richard.

Patch

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index f056d3d..4738b28 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2416,7 +2416,7 @@  cgraph_add_new_function (tree fndecl, bool lowered)
            tree_lowering_passes (fndecl);
            bitmap_obstack_initialize (NULL);
            if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
-             execute_pass_list (pass_early_local_passes.pass.sub);
+             execute_early_local_passes_for_current_function ();
            bitmap_obstack_release (NULL);
            pop_cfun ();
            current_function_decl = NULL;
@@ -2441,7 +2441,7 @@  cgraph_add_new_function (tree fndecl, bool lowered)
        gimple_register_cfg_hooks ();
        bitmap_obstack_initialize (NULL);
        if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
-         execute_pass_list (pass_early_local_passes.pass.sub);
+         execute_early_local_passes_for_current_function ();

I think these should only execute the lowering pieces of early local passes,
let me see if that's properly split ...