Message ID | CAAe5K+V=F8SF0K0uFtu=f46TLzS-XLTn7vVVZvaqyi+GP05HfQ@mail.gmail.com |
---|---|
State | New |
> Increasing the number of early inliner iterations from 1 to 2 enables more
> indirect calls to be promoted/inlined before instrumentation. This in turn
> reduces the instrumentation overhead, particularly for more expensive indirect
> call topn profiling.

How much of a difference do you get here? One possibility would also be to run a
specialized ipa-cp before profile instrumentation.

Honza

>
> Passes internal testing and regression tests. Ok for google/4_9?
>
> 2014-10-18  Teresa Johnson  <tejohnson@google.com>
>
>         Google ref b/17934523
>         * opts.c (finish_options): Increase max-early-inliner-iterations to 2
>         for profile-gen and profile-use builds.
>
> Index: opts.c
> ===================================================================
> --- opts.c (revision 216286)
> +++ opts.c (working copy)
> @@ -870,6 +869,14 @@ finish_options (struct gcc_options *opts, struct g
>                              opts->x_param_values, opts_set->x_param_values);
>      }
>
> +  if (opts->x_profile_arc_flag
> +      || opts->x_flag_branch_probabilities)
> +    {
> +      maybe_set_param_value
> +        (PARAM_EARLY_INLINER_MAX_ITERATIONS, 2,
> +         opts->x_param_values, opts_set->x_param_values);
> +    }
> +
>    if (!(opts->x_flag_auto_profile
>          || (opts->x_profile_arc_flag || opts->x_flag_branch_probabilities)))
>      {
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

The difference in instrumentation runtime is huge -- the topn profiler
is pretty expensive to run.

With FDO, it is probably better to make early inlining more aggressive
in order to get more context-sensitive profiling.

David

On Sat, Oct 18, 2014 at 10:05 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Increasing the number of early inliner iterations from 1 to 2 enables more
>> indirect calls to be promoted/inlined before instrumentation. This in turn
>> reduces the instrumentation overhead, particularly for more expensive indirect
>> call topn profiling.
>
> How much of a difference do you get here? One possibility would also be to run a
> specialized ipa-cp before profile instrumentation.
>
> Honza

ok.

David

On Sat, Oct 18, 2014 at 9:26 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Increasing the number of early inliner iterations from 1 to 2 enables more
> indirect calls to be promoted/inlined before instrumentation. This in turn
> reduces the instrumentation overhead, particularly for more expensive indirect
> call topn profiling.
>
> Passes internal testing and regression tests. Ok for google/4_9?

> The difference in instrumentation runtime is huge -- as topn profiler
> is pretty expensive to run.
>
> With FDO, it is probably better to make early inlining more aggressive
> in order to get more context sensitive profiling.

I agree with that. I would just like to understand where increasing the number of
iterations helps, and whether we can handle those cases without iterating (Richi
originally requested that the iteration be dropped because of correctness issues).
Do you have some examples?

Honza

On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The difference in instrumentation runtime is huge -- as topn profiler
>> is pretty expensive to run.
>>
>> With FDO, it is probably better to make early inlining more aggressive
>> in order to get more context sensitive profiling.
>
> I agree with that, I just would like to understand where increasing the iterations
> helps and if we can handle it without iterating (because Richi originally requested to
> drop the iteration for correctness issues)
> Do you have some examples?

We can run an FDO experiment with einline shut off. (Note that
increasing the iteration count to 2 did not actually improve performance on
our benchmarks.)

David

> We can do FDO experiment by shutting down einline. (Note that
> increasing iteration to 2 did not actually improve performance with
> our benchmarks).

I would be more interested in a case where increasing the iteration count to 2
actually improves train run performance. (einline was originally invented to make
profiling usable on tramp3d ;)

It seems to me that the cases handled by iteration are rather rare, so I am
surprised you get an important benefit from them. Perhaps we are missing something
obvious here.

Honza

On Sat, Oct 18, 2014 at 4:26 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> We can do FDO experiment by shutting down einline. (Note that
>> increasing iteration to 2 did not actually improve performance with
>> our benchmarks).
>
> I would be more interested in case where increasing iteration to 2 actually
> improves train run performance. (einline was originally invented to make
> profiling usable on tramp3d ;)

What is special about tramp3d?

> It seems to me that the cases handled by iteration are rather rare, so I am
> surprised you get important benefit from these. Perhaps we miss something
> obvious here.

For training run performance, as in this case, einline helps by reducing the number of
indirect calls, which reduces instrumentation overhead. Instrumentation
has another side effect: it changes function body size, and so can
reduce the amount of ipa-inline done later.

David

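[Annotation] A toy model may help show why fewer indirect calls means less instrumentation work in the training run. The sketch below is not GCC's actual topn counter layout or update policy; `TopNProfile`, `record_target`, and both call helpers are hypothetical names invented for illustration. It only shows that an instrumented indirect call does target bookkeeping on every execution, while a call whose target was resolved and inlined earlier has no target left to record.

```cpp
#include <cassert>

using Fn = int (*)(int);

int twice(int x) { return 2 * x; }
int square(int x) { return x * x; }

// Number of targets tracked per call site in this toy model (assumption).
constexpr int kSlots = 2;

// Hypothetical per-call-site profile: a few (target, count) slots.
struct TopNProfile {
  Fn target[kSlots] = {nullptr, nullptr};
  long count[kSlots] = {0, 0};
  long dropped = 0;  // executions whose target found no matching or free slot
};

// Bookkeeping executed on every instrumented indirect call.
void record_target(TopNProfile &p, Fn f) {
  for (int i = 0; i < kSlots; ++i) {
    if (p.target[i] == f) { ++p.count[i]; return; }
    if (p.target[i] == nullptr) { p.target[i] = f; p.count[i] = 1; return; }
  }
  ++p.dropped;
}

// Indirect call site as it might look in an instrumented build.
int call_indirect_instrumented(TopNProfile &p, Fn f, int x) {
  record_target(p, f);  // extra work per call
  return f(x);          // the indirect call itself
}

// Same call after the target became known before instrumentation:
// nothing about the callee address needs to be recorded.
int call_direct(int x) { return twice(x); }
```

The real profiler also maintains edge counters for direct calls; the point here is only the additional per-call target tracking that indirect call sites incur.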
On Sat, Oct 18, 2014 at 4:26 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> I would be more interested in case where increasing iteration to 2 actually
> improves train run performance. (einline was originally invented to make
> profiling usable on tramp3d ;)
> It seems to me that the cases handled by iteration are rather rare, so I am
> surprised you get important benefit from these. Perhaps we miss something
> obvious here.

The specific case was actually a call to upper_bound in
bits/stl_algo.h with a specialized compare function. In the more
recent versions of upper_bound, the call to the comparator was
outlined into __upper_bound. With only one iteration of early
inlining, we were inlining __upper_bound into upper_bound and into the
caller. But the indirect call to the comparator was not promoted until
the fre2 pass, so it didn't get early inlined. With 2 iterations of
early inlining, enough optimization is apparently done between
iterations to propagate the actual target once __upper_bound and
upper_bound have been inlined. The indirect call is then promoted to a
direct call and inlined in the second iteration.

Thanks,
Teresa

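[Annotation] The shape of the case described above can be sketched in a few lines. This is not the libstdc++ source; `upper_bound_impl`, `upper_bound_wrapper`, `less_int`, and `count_not_greater` are hypothetical stand-ins for `__upper_bound`, `upper_bound`, the specialized compare function, and the caller. The comparator is reached through a pointer inside the helper, and only in the outermost caller is that pointer a known constant, so the call can become direct only after both layers have been inlined and the constant propagated.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the specialized compare function passed by the caller.
using LessFn = bool (*)(int, int);

bool less_int(int a, int b) { return a < b; }

// Stand-in for the outlined helper: comp is called indirectly here.
const int *upper_bound_impl(const int *first, const int *last,
                            int value, LessFn comp) {
  std::ptrdiff_t len = last - first;
  while (len > 0) {
    std::ptrdiff_t half = len / 2;
    const int *mid = first + half;
    if (comp(value, *mid)) {  // indirect call until comp is known
      len = half;
    } else {
      first = mid + 1;
      len = len - half - 1;
    }
  }
  return first;
}

// Stand-in for the public wrapper that forwards to the helper.
const int *upper_bound_wrapper(const int *first, const int *last,
                               int value, LessFn comp) {
  return upper_bound_impl(first, last, value, comp);
}

// The caller: the only place where comp is a compile-time constant.
// Returns how many elements of the sorted range are <= value.
std::size_t count_not_greater(const int *data, std::size_t n, int value) {
  const int *pos = upper_bound_wrapper(data, data + n, value, less_int);
  return static_cast<std::size_t>(pos - data);
}
```

In this model, a first inlining round can fold the wrapper and helper into `count_not_greater`, but `comp(value, *mid)` stays indirect until a later cleanup replaces `comp` with `less_int`. A second round is then what gets `less_int` itself inlined before instrumentation.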
On Sat, Oct 18, 2014 at 4:19 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Do you have some examples?
>
> We can do FDO experiment by shutting down einline. (Note that
> increasing iteration to 2 did not actually improve performance with
> our benchmarks).

Early inlining itself has a large performance impact for FDO (on the
runtime of the profile-use build). With it disabled, FDO
performance drops by >2% on average. The degradation is seen across
all benchmarks except for one.

David

On Mon, Oct 20, 2014 at 12:02 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Sat, Oct 18, 2014 at 4:19 PM, Xinliang David Li <davidxl@google.com> wrote:
>> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> I agree with that, I just would like to understand where increasing the iterations
>>> helps and if we can handle it without iterating (because Richi originally requested to
>>> drop the iteration for correctness issues)

Well, I requested that any iteration be done with an IPA view in mind. That is,
iterate for cgraph cycles, for example, where currently we face the situation
that at least one function is inlined unoptimized. For this we'd like to
first optimize without inlining (well, maybe inlining doesn't hurt) and then
inline (and re-optimize if we inlined).

Indirect edges are more interesting, but basically you'd want to re-inline
once you discover new direct calls during early opts (but then make
sure to do that only after the direct callee was early-optimized first).

Thus it would be nice if somebody could improve on the currently very
simple function ordering in which we apply early opts, integrating "iteration"
in a better way (not iterating over all functions but only where it
might make a difference, focused on inlining).

>>> Do you have some examples?
>>
>> We can do FDO experiment by shutting down einline. (Note that
>> increasing iteration to 2 did not actually improve performance with
>> our benchmarks).
>
> Early inlining itself has large performance impact for FDO (the
> runtime of the profile-use build). With it disabled, the FDO
> performance drops by >2% on average. The degradation is seen across
> all benchmarks except for one.

Only 2%? You are lucky ;) For tramp3d introducing early inlining
made a difference of 100000% ;) (yes, statistically for tramp3d
we have for each assembler instruction generated 100 calls in the
initial code ... wheee C++ template metaprogramming!)

So indeed early inlining was absolutely required to make FDO usable at all.

Richard.

On Mon, Oct 20, 2014 at 1:32 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> Well, I requested to do any iteration with an IPA view in mind. That is,
> iterate for cgraph cycles for example where currently we face the situation
> that at least one function is inlined unoptimized. For this we'd like to
> first optimize without inlining (well, maybe inlining doesn't hurt)

Yes -- an inlining decision made without callee cleanup is more
conservative and should not hurt.

> and then
> inline (and re-optimize if we inlined).
>
> Indirect edges are more interesting, but basically you'd want to re-inline
> once you discover new direct calls during early opts (but then make
> sure to do that only after the direct callee was early-optimized first).

It would be interesting to inline the newly introduced direct calls if
the callsites also have function pointer arguments that are known in
the call context.

> Thus it would be nice if somebody could improve on the currently very
> simple function ordering we apply early opts, integrating "iteration"
> in a better way (not iterating over all functions but only where it
> might make a difference, focused on inlining).
>
>> Early inlining itself has large performance impact for FDO (the
>> runtime of the profile-use build). With it disabled, the FDO
>> performance drops by >2% on average. The degradation is seen across
>> all benchmarks except for one.
>
> Only 2%? You are lucky ;)

A 2% average is considered pretty significant for optimized build
runtime performance.

> For tramp3d introducing early inlining
> made a difference of 100000% ;) (yes, statistically for tramp3d
> we have for each assembler instruction generated 100 calls in the
> initial code ... wheee C++ template metaprogramming!)

Is this 100000% difference in the instrumentation build or in the optimized
build runtime?

> So indeed early inlining was absolutely required to make FDO usable at all.

thanks,

David

On Mon, Oct 20, 2014 at 5:53 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Mon, Oct 20, 2014 at 1:32 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> For tramp3d introducing early inlining
>> made a difference of 100000% ;) (yes, statistically for tramp3d
>> we have for each assembler instruction generated 100 calls in the
>> initial code ... wheee C++ template metaprogramming!)
>
> Is this 100000% difference from instrumentation build or optimized
> build runtime?

It's from the instrumentation build. I don't remember any numbers for
the improvement of the optimized build with FDO vs. non-FDO.

Richard.

Index: opts.c
===================================================================
--- opts.c (revision 216286)
+++ opts.c (working copy)
@@ -870,6 +869,14 @@ finish_options (struct gcc_options *opts, struct g
                             opts->x_param_values, opts_set->x_param_values);
     }
 
+  if (opts->x_profile_arc_flag
+      || opts->x_flag_branch_probabilities)
+    {
+      maybe_set_param_value
+        (PARAM_EARLY_INLINER_MAX_ITERATIONS, 2,
+         opts->x_param_values, opts_set->x_param_values);
+    }
+
   if (!(opts->x_flag_auto_profile
         || (opts->x_profile_arc_flag || opts->x_flag_branch_probabilities)))
     {
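
[Annotation] The effect of the hunk can be modelled in isolation. The sketch below does not use GCC's real declarations: `Options`, `Param`, `maybe_set`, and `finish` are hypothetical stand-ins for `struct gcc_options`, the param enum, `maybe_set_param_value`, and `finish_options`. It assumes, as the helper's name suggests, that the new default applies only when the user has not set the parameter explicitly, and that the branch default is 1 iteration as stated in the patch description.

```cpp
#include <cassert>

// Hypothetical stand-ins, not the real GCC types.
enum Param { EARLY_INLINER_MAX_ITERATIONS, NUM_PARAMS };

struct Options {
  bool profile_gen = false;              // models x_profile_arc_flag
  bool profile_use = false;              // models x_flag_branch_probabilities
  int param_values[NUM_PARAMS] = {1};    // 1 iteration by default, per the thread
  bool param_set[NUM_PARAMS] = {false};  // true when given explicitly by the user
};

// Apply a new default unless the user already chose a value.
void maybe_set(Options &o, Param p, int value) {
  if (!o.param_set[p])
    o.param_values[p] = value;
}

// Models the added block: raise the iteration count for profile builds only.
void finish(Options &o) {
  if (o.profile_gen || o.profile_use)
    maybe_set(o, EARLY_INLINER_MAX_ITERATIONS, 2);
}
```

Under these assumptions a profile-gen or profile-use build gets 2 iterations, a plain build keeps 1, and an explicit user setting is left untouched in either case.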