[GOOGLE] Increase max-early-inliner-iterations to 2 for profile-gen and use

Message ID CAAe5K+V=F8SF0K0uFtu=f46TLzS-XLTn7vVVZvaqyi+GP05HfQ@mail.gmail.com
State New

Commit Message

Teresa Johnson Oct. 18, 2014, 4:26 p.m. UTC
Increasing the number of early inliner iterations from 1 to 2 enables more
indirect calls to be promoted/inlined before instrumentation. This in turn
reduces the instrumentation overhead, particularly for more expensive indirect
call topn profiling.

Passes internal testing and regression tests. Ok for google/4_9?

2014-10-18  Teresa Johnson  <tejohnson@google.com>

        Google ref b/17934523
        * opts.c (finish_options): Increase max-early-inliner-iterations to 2
        for profile-gen and profile-use builds.
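
For readers unfamiliar with the helper the patch calls (the diff is quoted in
the first reply below): maybe_set_param_value, as I understand it, installs a
new default only when the user has not set the parameter explicitly, so an
explicit --param max-early-inliner-iterations=N should still take precedence.
A minimal C++ sketch of that behaviour, assuming those semantics; ParamTable,
maybe_set and finish_options_sketch are hypothetical stand-ins, not GCC's
actual data structures:

```cpp
#include <array>
#include <cstddef>

// Toy stand-ins for GCC's parameter tables; all names here are hypothetical.
enum Param : std::size_t { EARLY_INLINER_MAX_ITERATIONS, NUM_PARAMS };

struct ParamTable {
  // Current value of each parameter (default of 1, as stated in the thread).
  std::array<int, NUM_PARAMS> values{{1}};
  // True when the user gave the parameter explicitly, e.g. via --param.
  std::array<bool, NUM_PARAMS> user_set{{false}};
};

// Apply a new default only when the user has not set the parameter.
void maybe_set(ParamTable &t, Param p, int value) {
  if (!t.user_set[p])
    t.values[p] = value;
}

// Mirrors the condition in the patch: profile-gen or profile-use builds.
void finish_options_sketch(ParamTable &t, bool profile_gen, bool profile_use) {
  if (profile_gen || profile_use)
    maybe_set(t, EARLY_INLINER_MAX_ITERATIONS, 2);
}
```

With this model, a plain profile-gen build ends up with 2 iterations, while a
build that passes an explicit value keeps whatever the user asked for.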

Comments

Jan Hubicka Oct. 18, 2014, 5:05 p.m. UTC | #1
> Increasing the number of early inliner iterations from 1 to 2 enables more
> indirect calls to be promoted/inlined before instrumentation. This in turn
> reduces the instrumentation overhead, particularly for more expensive indirect
> call topn profiling.

How much difference do you get here? One possibility would also be to run a
specialized ipa-cp before profile instrumentation.

Honza
> 
> Passes internal testing and regression tests. Ok for google/4_9?
> 
> 2014-10-18  Teresa Johnson  <tejohnson@google.com>
> 
>         Google ref b/17934523
>         * opts.c (finish_options): Increase max-early-inliner-iterations to 2
>         for profile-gen and profile-use builds.
> 
> Index: opts.c
> ===================================================================
> --- opts.c      (revision 216286)
> +++ opts.c      (working copy)
> @@ -870,6 +869,14 @@ finish_options (struct gcc_options *opts, struct g
>          opts->x_param_values, opts_set->x_param_values);
>      }
> 
> +  if (opts->x_profile_arc_flag
> +      || opts->x_flag_branch_probabilities)
> +    {
> +      maybe_set_param_value
> +       (PARAM_EARLY_INLINER_MAX_ITERATIONS, 2,
> +        opts->x_param_values, opts_set->x_param_values);
> +    }
> +
>    if (!(opts->x_flag_auto_profile
>          || (opts->x_profile_arc_flag || opts->x_flag_branch_probabilities)))
>      {
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Xinliang David Li Oct. 18, 2014, 9:51 p.m. UTC | #2
The difference in instrumentation runtime is huge -- as topn profiler
is pretty expensive to run.

With FDO, it is probably better to make early inlining more aggressive
in order to get more context sensitive profiling.

David

Xinliang David Li Oct. 18, 2014, 9:52 p.m. UTC | #3
ok.

David

Jan Hubicka Oct. 18, 2014, 10:27 p.m. UTC | #4
> The difference in instrumentation runtime is huge -- as topn profiler
> is pretty expensive to run.
> 
> With FDO, it is probably better to make early inlining more aggressive
> in order to get more context sensitive profiling.

I agree with that, I just would like to understand where increasing the iterations
helps and if we can handle it without iterating (because Richi originally requested to
drop the iteration for correctness issues)
Do you have some examples?

Honza
Xinliang David Li Oct. 18, 2014, 11:19 p.m. UTC | #5
On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The difference in instrumentation runtime is huge -- as topn profiler
>> is pretty expensive to run.
>>
>> With FDO, it is probably better to make early inlining more aggressive
>> in order to get more context sensitive profiling.
>
> I agree with that, I just would like to understand where increasing the iterations
> helps and if we can handle it without iterating (because Richi originally requested to
> drop the iteration for correctness issues)
> Do you have some examples?

We can do FDO experiment by shutting down einline. (Note that
increasing iteration to 2 did not actually improve performance with
our benchmarks).

David

Jan Hubicka Oct. 18, 2014, 11:26 p.m. UTC | #6
> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> The difference in instrumentation runtime is huge -- as topn profiler
> >> is pretty expensive to run.
> >>
> >> With FDO, it is probably better to make early inlining more aggressive
> >> in order to get more context sensitive profiling.
> >
> > I agree with that, I just would like to understand where increasing the iterations
> > helps and if we can handle it without iterating (because Richi originally requested to
> > drop the iteration for correctness issues)
> > Do you have some examples?
> 
> We can do FDO experiment by shutting down einline. (Note that
> increasing iteration to 2 did not actually improve performance with
> our benchmarks).

I would be more interested in a case where increasing iteration to 2 actually
improves train run performance. (einline was originally invented to make
profiling useable on tramp3d ;)
It seems to me that the cases handled by iteration are rather rare, so I am
surprised you get important benefit from these. Perhaps we miss something
obvious here.

Honza
Xinliang David Li Oct. 18, 2014, 11:51 p.m. UTC | #7
On Sat, Oct 18, 2014 at 4:26 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> The difference in instrumentation runtime is huge -- as topn profiler
>> >> is pretty expensive to run.
>> >>
>> >> With FDO, it is probably better to make early inlining more aggressive
>> >> in order to get more context sensitive profiling.
>> >
>> > I agree with that, I just would like to understand where increasing the iterations
>> > helps and if we can handle it without iterating (because Richi originally requested to
>> > drop the iteration for correctness issues)
>> > Do you have some examples?
>>
>> We can do FDO experiment by shutting down einline. (Note that
>> increasing iteration to 2 did not actually improve performance with
>> our benchmarks).
>
> I would be more interested in a case where increasing iteration to 2 actually
> improves train run performance. (einline was originally invented to make
> profiling useable on tramp3d ;)

What is special about tramp3d?

> It seems to me that the cases handled by iteration are rather rare, so I am
> surprised you get important benefit from these. Perhaps we miss something
> obvious here.

For training run performance, as in this case, einline helps reduce
indirect calls and thus reduces instrumentation overhead. Instrumentation
has another side effect: it changes function body size, and thus can
reduce the amount of ipa-inline later.

David

Teresa Johnson Oct. 18, 2014, 11:55 p.m. UTC | #8
On Sat, Oct 18, 2014 at 4:26 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> The difference in instrumentation runtime is huge -- as topn profiler
>> >> is pretty expensive to run.
>> >>
>> >> With FDO, it is probably better to make early inlining more aggressive
>> >> in order to get more context sensitive profiling.
>> >
>> > I agree with that, I just would like to understand where increasing the iterations
>> > helps and if we can handle it without iterating (because Richi originally requested to
>> > drop the iteration for correctness issues)
>> > Do you have some examples?
>>
>> We can do FDO experiment by shutting down einline. (Note that
>> increasing iteration to 2 did not actually improve performance with
>> our benchmarks).
>
> I would be more interested in a case where increasing iteration to 2 actually
> improves train run performance. (einline was originally invented to make
> profiling useable on tramp3d ;)
> It seems to me that the cases handled by iteration are rather rare, so I am
> surprised you get important benefit from these. Perhaps we miss something
> obvious here.

The specific case was actually a call to upper_bound in
bits/stl_algo.h with a specialized compare function. In the more
recent versions of upper_bound, the call to the comparator was
outlined into __upper_bound. With only one iteration of early
inlining, we were inlining __upper_bound into upper_bound and into the
caller. But the indirect call to the comparator was not promoted until
the fre2 pass, so it didn't get early inlined. With 2 iterations of
early inlining, enough optimization is apparently done between
iterations to propagate the actual target and promote the indirect
call after we inline __upper_bound and upper_bound that it is inlined
in the second iteration.
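
To make the shape of that case concrete, here is a minimal C++ sketch of a
call to std::upper_bound with a comparator passed as a function pointer. The
Entry type, key_less and count_not_greater are made-up names for illustration,
not the code from the internal benchmark. Inside the library helper the call
through the comparator is an indirect call until the compiler propagates the
pointer's actual target, which is the promotion being discussed:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical record type, for illustration only.
struct Entry {
  int key;
  int payload;
};

// Comparator for std::upper_bound, which invokes it as comp(value, element).
static bool key_less(int value, const Entry &e) { return value < e.key; }

// Returns how many entries of a key-sorted vector have key <= `key`.
// The comparator travels into the library as a plain function pointer, so
// the call to it is indirect until the library helpers are inlined into
// this caller and the pointer's target becomes known.
std::size_t count_not_greater(const std::vector<Entry> &sorted, int key) {
  bool (*comp)(int, const Entry &) = key_less;
  auto it = std::upper_bound(sorted.begin(), sorted.end(), key, comp);
  return static_cast<std::size_t>(it - sorted.begin());
}
```

If the call to key_less is promoted and inlined before instrumentation, there
is no indirect call left at that site for the topn profiler to instrument.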

Thanks,
Teresa

Xinliang David Li Oct. 19, 2014, 10:02 p.m. UTC | #9
On Sat, Oct 18, 2014 at 4:19 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> The difference in instrumentation runtime is huge -- as topn profiler
>>> is pretty expensive to run.
>>>
>>> With FDO, it is probably better to make early inlining more aggressive
>>> in order to get more context sensitive profiling.
>>
>> I agree with that, I just would like to understand where increasing the iterations
>> helps and if we can handle it without iterating (because Richi originally requested to
>> drop the iteration for correctness issues)
>> Do you have some examples?
>
> We can do FDO experiment by shutting down einline. (Note that
> increasing iteration to 2 did not actually improve performance with
> our benchmarks).

Early inlining itself has large performance impact for FDO (the
runtime of the profile-use build). With it disabled, the FDO
performance drops by >2% on average. The degradation is seen across
all benchmarks except for one.

David


Richard Biener Oct. 20, 2014, 8:32 a.m. UTC | #10
On Mon, Oct 20, 2014 at 12:02 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Sat, Oct 18, 2014 at 4:19 PM, Xinliang David Li <davidxl@google.com> wrote:
>> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> The difference in instrumentation runtime is huge -- as topn profiler
>>>> is pretty expensive to run.
>>>>
>>>> With FDO, it is probably better to make early inlining more aggressive
>>>> in order to get more context sensitive profiling.
>>>
>>> I agree with that, I just would like to understand where increasing the iterations
>>> helps and if we can handle it without iterating (because Richi originally requested to
>>> drop the iteration for correctness issues)

Well, I requested to do any iteration with an IPA view in mind.  That is,
iterate for cgraph cycles for example where currently we face the situation
that at least one function is inlined unoptimized.  For this we'd like to
first optimize without inlining (well, maybe inlining doesn't hurt) and then
inline (and re-optimize if we inlined).

Indirect edges are more interesting, but basically you'd want to re-inline
once you discover new direct calls during early opts (but then make
sure to do that only after the direct callee was early-optimized first).

Thus it would be nice if somebody could improve on the currently very
simple function ordering we apply early opts, integrating "iteration"
in a better way (not iterating over all functions but only where it
might make a difference, focused on inlining).
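
One way to picture this kind of targeted iteration is a worklist that visits
every function once and then revisits only the functions whose early
optimizations exposed new direct calls. The sketch below is a toy model under
that assumption, not GCC's pass manager; run_early_opts and the EarlyOpt
callback are hypothetical names. Revisits are queued behind the initial pass,
so every function (including any newly discovered direct callee) has been
early-optimized once before anything is revisited:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical hook: runs early inlining and early opts on function `f` and
// returns true if that exposed new direct calls in `f`.
using EarlyOpt = std::function<bool(std::size_t)>;

// Visit every function once in `order`, then revisit only the functions
// that reported new direct calls, at most `max_revisits` times each.
// Returns the sequence of visits so the schedule can be inspected.
std::vector<std::size_t> run_early_opts(const std::vector<std::size_t> &order,
                                        const EarlyOpt &opt,
                                        int max_revisits) {
  std::vector<std::size_t> visits;
  std::deque<std::pair<std::size_t, int>> work;  // (function, revisits so far)
  for (std::size_t f : order)
    work.push_back(std::make_pair(f, 0));

  while (!work.empty()) {
    std::pair<std::size_t, int> item = work.front();
    work.pop_front();
    visits.push_back(item.first);
    // Requeue at the back only when this visit exposed new direct calls.
    if (opt(item.first) && item.second < max_revisits)
      work.push_back(std::make_pair(item.first, item.second + 1));
  }
  return visits;
}
```

Compared with a global iteration count, functions that gain nothing from a
second round are processed only once.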

>>> Do you have some examples?
>>
>> We can do FDO experiment by shutting down einline. (Note that
>> increasing iteration to 2 did not actually improve performance with
>> our benchmarks).
>
> Early inlining itself has large performance impact for FDO (the
> runtime of the profile-use build). With it disabled, the FDO
> performance drops by >2% on average. The degradation is seen across
> all benchmarks except for one.

Only 2%?  You are lucky ;)  For tramp3d introducing early inlining
made a difference of 100000% ;)  (yes, statistically for tramp3d
we have for each assembler instruction generated 100 calls in the
initial code ... wheee C++ template metaprogramming!)

So indeed early inlining was absolutely required to make FDO usable at all.

Richard.

Xinliang David Li Oct. 20, 2014, 3:53 p.m. UTC | #11
On Mon, Oct 20, 2014 at 1:32 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Mon, Oct 20, 2014 at 12:02 AM, Xinliang David Li <davidxl@google.com> wrote:
>> On Sat, Oct 18, 2014 at 4:19 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>> The difference in instrumentation runtime is huge -- as topn profiler
>>>>> is pretty expensive to run.
>>>>>
>>>>> With FDO, it is probably better to make early inlining more aggressive
>>>>> in order to get more context sensitive profiling.
>>>>
>>>> I agree with that, I just would like to understand where increasing the iterations
>>>> helps and if we can handle it without iterating (because Richi originally requested to
>>>> drop the iteration for correctness issues)
>
> Well, I requested to do any iteration with an IPA view in mind.  That is,
> iterate for cgraph cycles for example where currently we face the situation
> that at least one function is inlined unoptimized.  For this we'd like to
> first optimize without inlining (well, maybe inlining doesn't hurt)

yes -- inlining decision made without callee cleanup is more
conservative and should not hurt.

>and then
> inline (and re-optimize if we inlined).
>
> Indirect edges are more interesting, but basically you'd want to re-inline
> once you discover new direct calls during early opts (but then make
> sure to do that only after the direct callee was early-optimized first).
>

It would be interesting to inline the newly introduced direct calls if
the callsites also have function pointer arguments that are known in
the call context.

> Thus it would be nice if somebody could improve on the currently very
> simple function ordering we apply early opts, integrating "iteration"
> in a better way (not iterating over all functions but only where it
> might make a difference, focused on inlining).
>
>>>> Do you have some examples?
>>>
>>> We can do FDO experiment by shutting down einline. (Note that
>>> increasing iteration to 2 did not actually improve performance with
>>> our benchmarks).
>>
>> Early inlining itself has large performance impact for FDO (the
>> runtime of the profile-use build). With it disabled, the FDO
>> performance drops by >2% on average. The degradation is seen across
>> all benchmarks except for one.
>
> Only 2%?  You are lucky ;)

2% average is considered pretty significant for optimized build
runtime performance.


> For tramp3d introducing early inlining
> made a difference of 100000% ;)  (yes, statistically for tramp3d
> we have for each assembler instruction generated 100 calls in the
> initial code ... wheee C++ template metaprogramming!)

Is this 100000% difference from instrumentation build or optimized
build runtime?

>
> So indeed early inlining was absolutely required to make FDO usable at all.

thanks,

David
Richard Biener Oct. 21, 2014, 7:53 a.m. UTC | #12
On Mon, Oct 20, 2014 at 5:53 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Mon, Oct 20, 2014 at 1:32 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Mon, Oct 20, 2014 at 12:02 AM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Sat, Oct 18, 2014 at 4:19 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> On Sat, Oct 18, 2014 at 3:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>>> The difference in instrumentation runtime is huge -- as topn profiler
>>>>>> is pretty expensive to run.
>>>>>>
>>>>>> With FDO, it is probably better to make early inlining more aggressive
>>>>>> in order to get more context sensitive profiling.
>>>>>
>>>>> I agree with that, I just would like to understand where increasing the iterations
>>>>> helps and if we can handle it without iterating (because Richi originally requested to
>>>>> drop the iteration for correctness issues)
>>
>> Well, I requested to do any iteration with an IPA view in mind.  That is,
>> iterate for cgraph cycles for example where currently we face the situation
>> that at least one function is inlined unoptimized.  For this we'd like to
>> first optimize without inlining (well, maybe inlining doesn't hurt)
>
> yes -- inlining decision made without callee cleanup is more
> conservative and should not hurt.
>
>> and then
>> inline (and re-optimize if we inlined).
>>
>> Indirect edges are more interesting, but basically you'd want to re-inline
>> once you discover new direct calls during early opts (but then make
>> sure to do that only after the direct callee was early-optimized first).
>>
>
> It would be interesting to inline the newly introduced direct calls if
> the callsites also have function pointer arguments that are known in
> the call context.
>
>> Thus it would be nice if somebody could improve on the currently very
>> simple function ordering we apply early opts, integrating "iteration"
>> in a better way (not iterating over all functions but only where it
>> might make a difference, focused on inlining).
>>
>>>>> Do you have some examples?
>>>>
>>>> We can do FDO experiment by shutting down einline. (Note that
>>>> increasing iteration to 2 did not actually improve performance with
>>>> our benchmarks).
>>>
>>> Early inlining itself has a large performance impact for FDO (the
>>> runtime of the profile-use build). With it disabled, the FDO
>>> performance drops by >2% on average. The degradation is seen across
>>> all benchmarks except for one.
>>
>> Only 2%?  You are lucky ;)
>
> 2% average is considered pretty significant for optimized build
> runtime performance.
>
>
>> For tramp3d introducing early inlining
>> made a difference of 100000% ;)  (yes, statistically for tramp3d
>> we have for each assembler instruction generated 100 calls in the
>> initial code ... wheee C++ template metaprogramming!)
>
> Is this 100000% difference from instrumentation build or optimized
> build runtime?

It's from instrumentation build.  I don't remember any numbers for the
improvement on optimized build with FDO vs. non-FDO.

Richard.

>>
>> So indeed early inlining was absolutely required to make FDO usable at all.
>
> thanks,
>
> David

Patch

Index: opts.c
===================================================================
--- opts.c      (revision 216286)
+++ opts.c      (working copy)
@@ -870,6 +869,14 @@  finish_options (struct gcc_options *opts, struct g
         opts->x_param_values, opts_set->x_param_values);
     }

+  if (opts->x_profile_arc_flag
+      || opts->x_flag_branch_probabilities)
+    {
+      maybe_set_param_value
+       (PARAM_EARLY_INLINER_MAX_ITERATIONS, 2,
+        opts->x_param_values, opts_set->x_param_values);
+    }
+
   if (!(opts->x_flag_auto_profile
         || (opts->x_profile_arc_flag || opts->x_flag_branch_probabilities)))
     {
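
The hunk above relies on maybe_set_param_value only changing a parameter that the user has not set explicitly, so a --param max-early-inliner-iterations=N given on the command line should still win over the new default of 2. A minimal stand-alone sketch of that behaviour follows. Every sketch_* name is hypothetical and only models the real opts->x_param_values / opts_set->x_param_values tables; the default of 1 is taken from the commit message ("from 1 to 2").

```c
#include <stdbool.h>

/* Hypothetical stand-ins for GCC's parameter tables; the real data lives in
   opts->x_param_values and opts_set->x_param_values.  */
enum sketch_param
{
  SKETCH_EARLY_INLINER_MAX_ITERATIONS,
  SKETCH_NUM_PARAMS
};

struct sketch_options
{
  int param_values[SKETCH_NUM_PARAMS]; /* effective parameter values */
  bool param_set[SKETCH_NUM_PARAMS];   /* true if the user passed --param */
  bool profile_arc_flag;               /* models a profile-gen build */
  bool flag_branch_probabilities;      /* models a profile-use build */
};

/* Models maybe_set_param_value: change the value only when the user did
   not set the parameter explicitly.  */
static void
sketch_maybe_set_param (struct sketch_options *o, enum sketch_param p,
                        int value)
{
  if (!o->param_set[p])
    o->param_values[p] = value;
}

/* Models the new hunk in finish_options and returns the effective
   iteration count.  */
static int
sketch_finish_options (struct sketch_options *o)
{
  if (o->profile_arc_flag || o->flag_branch_probabilities)
    sketch_maybe_set_param (o, SKETCH_EARLY_INLINER_MAX_ITERATIONS, 2);
  return o->param_values[SKETCH_EARLY_INLINER_MAX_ITERATIONS];
}
```

Under these assumptions a profile-gen or profile-use build ends up with 2 iterations, a plain build keeps the default of 1, and an explicit user setting is left untouched in either case.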