diff mbox series

[v2] Add inline growth bias param

Message ID 20191125124417.706-1-graham.markall@embecosm.com
State New
Headers show
Series [v2] Add inline growth bias param | expand

Commit Message

Graham Markall Nov. 25, 2019, 12:44 p.m. UTC
Hi Richard,

Many thanks for the suggestion of an alternative implementation. I tried
implementing the suggestion, and I had a couple of observations:

1. As well as applying the bias in compute_fn_summary, it seemed to also
be necessary to apply it in ip_update_overall_fn_summary to avoid an ICE
resulting from size_info->size and size_info->self_size differing at the
end of compute_fn_summary.

2. The results that it produced were exactly the same as those using the
param uninlined-function-insns, so it would probably be redundant to add
the additional parameter implemented this way.

The implementation I tried is at:
https://github.com/gmarkall/gcc/commit/1d22f65b8a392cffc235ecb49143a93d1720b91c

When I benchmarked the code size again using Embench with both the
inline-growth-bias param implemented this way, and the
uninlined-function-insns param the results were, in both cases:

Benchmark       0       1       2       3       4
---------       ------  ------  ------  ------  ------
aha-mont64       99.05   99.05   99.05   99.05   99.05
crc32           100.00  100.00  100.00  100.00  100.00
cubic            94.66   94.66   94.66   94.66   94.66
edn              96.14   96.14   96.14   96.14   96.14
huffbench        99.88   99.88   99.88   99.88   99.88
matmult-int     100.00  100.00  100.00  100.00  100.00
minver          100.56  100.56  100.56  100.56  100.56
nbody           101.13  101.13  101.13  101.13  101.13
nettle-aes       99.03   99.03   99.03   99.03   99.03
nettle-sha256    99.89   99.89   99.89   99.89   99.89
nsichneu        100.00  100.00  100.00  100.00  100.00
picojpeg         99.35   99.35  100.10  100.10  100.10
qrduino         101.81  101.81  101.81  101.81  101.81
sglib-combined  100.00  100.00  100.00  100.00  100.00
slre             99.09   99.42   99.42  100.49  100.49
st               96.36   96.36   96.36   96.36   96.36
statemate       100.22  100.22  100.22  100.22  100.22
ud              100.00  100.00  100.00  100.00  100.00
wikisort         99.48   99.48   99.48   99.48   99.48
Mean             99.28   99.30   99.34   99.40   99.40

These results show that:

1. Setting the uninlined-function-insns value to 0 rather than 2
generally seems to yield an improvement in code size on RISC-V, without
making any individual benchmark larger than the current default of 2.

2. There are two cases where the patch in the original version of the
patch (v1) makes a slightly greater improvement than using
uninlined-function-insns or the patch linked above (ufi). These are:

Benchmark       ufi      v1
---------       ---      --
qrduino         101.81   101.61
sglib-combined  100.00    95.87

(note that the v1 values are slightly different to those posted in the
original patch - I rebased the original patch on a more recent version
of the master branch and the values changed slightly.)

Additionally, for a proprietary application that we tested the flags
with, the code sizes were for using neither flag (Original) for
using uninlined-function-insns, and using the original patch, with the
parameter values that resulted in the smallest code size, were:

Original   ufi (val=0)  v1 (val=-2)
57380      57184        56756
(100.0%)   (99.66%)     (98.91%)


Patch v2
--------

In conclusion, it seems that in some cases there is some additional code
size saving that can be gained in specific cases using the original
implementation of the patch. I have tidied up the original version of
the patch and adjusted the option so that it is a param instead of a
flag as suggested by both Jeff and yourself. I chose a default param
value of 5 to allow for a reasonable scope for setting negative values -
although I haven't seen any result where setting the value beneath an
effective setting of -2 yielded an improvement, this does allow a little
more room in case there are cases where going below -2 would help.

Is there sufficient data to indicate that the additional parameter as
implemented in the attached patch is worth adding?


Many thanks,
Graham.



---
 gcc/ipa-inline.h | 21 ++++++++++++++++++++-
 gcc/params.def   |  5 +++++
 2 files changed, 25 insertions(+), 1 deletion(-)

Comments

Richard Biener Nov. 27, 2019, 8:10 a.m. UTC | #1
On Mon, Nov 25, 2019 at 1:44 PM Graham Markall
<graham.markall@embecosm.com> wrote:
>
> Hi Richard,
>
> Many thanks for the suggestion of an alternative implementation. I tried
> implementing the suggestion, and I had a couple of observations:
>
> 1. As well as applying the bias in compute_fn_summary, it seemed to also
> be necessary to apply it in ip_update_overall_fn_summary to avoid an ICE
> resulting from size_info->size and size_info->self_size differing at the
> end of compute_fn_summary.
>
> 2. The results that it produced were exactly the same as those using the
> param uninlined-function-insns, so it would probably be redundant to add
> the additional parameter implemented this way.
>
> The implementation I tried is at:
> https://github.com/gmarkall/gcc/commit/1d22f65b8a392cffc235ecb49143a93d1720b91c
>
> When I benchmarked the code size again using Embench with both the
> inline-growth-bias param implemented this way, and the
> uninlined-function-insns param the results were, in both cases:
>
> Benchmark       0       1       2       3       4
> ---------       ------  ------  ------  ------  ------
> aha-mont64       99.05   99.05   99.05   99.05   99.05
> crc32           100.00  100.00  100.00  100.00  100.00
> cubic            94.66   94.66   94.66   94.66   94.66
> edn              96.14   96.14   96.14   96.14   96.14
> huffbench        99.88   99.88   99.88   99.88   99.88
> matmult-int     100.00  100.00  100.00  100.00  100.00
> minver          100.56  100.56  100.56  100.56  100.56
> nbody           101.13  101.13  101.13  101.13  101.13
> nettle-aes       99.03   99.03   99.03   99.03   99.03
> nettle-sha256    99.89   99.89   99.89   99.89   99.89
> nsichneu        100.00  100.00  100.00  100.00  100.00
> picojpeg         99.35   99.35  100.10  100.10  100.10
> qrduino         101.81  101.81  101.81  101.81  101.81
> sglib-combined  100.00  100.00  100.00  100.00  100.00
> slre             99.09   99.42   99.42  100.49  100.49
> st               96.36   96.36   96.36   96.36   96.36
> statemate       100.22  100.22  100.22  100.22  100.22
> ud              100.00  100.00  100.00  100.00  100.00
> wikisort         99.48   99.48   99.48   99.48   99.48
> Mean             99.28   99.30   99.34   99.40   99.40
>
> These results show that:
>
> 1. Setting the uninlined-function-insns value to 0 rather than 2
> generally seems to yield an improvement in code size on RISC-V, without
> making any individual benchmark larger than the current default of 2.
>
> 2. There are two cases where the patch in the original version of the
> patch (v1) makes a slightly greater improvement than using
> uninlined-function-insns or the patch linked above (ufi). These are:
>
> Benchmark       ufi      v1
> ---------       ---      --
> qrduino         101.81   101.61
> sglib-combined  100.00    95.87
>
> (note that the v1 values are slightly different to those posted in the
> original patch - I rebased the original patch on a more recent version
> of the master branch and the values changed slightly.)
>
> Additionally, for a proprietary application that we tested the flags
> with, the code sizes were for using neither flag (Original) for
> using uninlined-function-insns, and using the original patch, with the
> parameter values that resulted in the smallest code size, were:
>
> Original   ufi (val=0)  v1 (val=-2)
> 57380      57184        56756
> (100.0%)   (99.66%)     (98.91%)
>
>
> Patch v2
> --------
>
> In conclusion, it seems that in some cases there is some additional code
> size saving that can be gained in specific cases using the original
> implementation of the patch. I have tidied up the original version of
> the patch and adjusted the option so that it is a param instead of a
> flag as suggested by both Jeff and yourself. I chose a default param
> value of 5 to allow for a reasonable scope for setting negative values -
> although I haven't seen any result where setting the value beneath an
> effective setting of -2 yielded an improvement, this does allow a little
> more room in case there are cases where going below -2 would help.
>
> Is there sufficient data to indicate that the additional parameter as
> implemented in the attached patch is worth adding?

There needs to be more explanation on what this really achieves
and as I said IIRC we should already have a --param that accounts
for this, --param uninlined-function-insns which maybe only misses
the ability to become negative.

Richard.

>
> Many thanks,
> Graham.
>
>
>
> ---
>  gcc/ipa-inline.h | 21 ++++++++++++++++++++-
>  gcc/params.def   |  5 +++++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/ipa-inline.h b/gcc/ipa-inline.h
> index 18c8e1eebd0..fb10677ea8f 100644
> --- a/gcc/ipa-inline.h
> +++ b/gcc/ipa-inline.h
> @@ -21,6 +21,8 @@ along with GCC; see the file COPYING3.  If not see
>  #ifndef GCC_IPA_INLINE_H
>  #define GCC_IPA_INLINE_H
>
> +#include "params.h"
> +
>  /* Data we cache about callgraph edges during inlining to avoid expensive
>     re-computations during the greedy algorithm.  */
>  class edge_growth_cache_entry
> @@ -84,7 +86,24 @@ estimate_edge_growth (struct cgraph_edge *edge)
>  {
>    ipa_call_summary *s = ipa_call_summaries->get (edge);
>    gcc_checking_assert (s->call_stmt_size || !edge->callee->analyzed);
> -  return (estimate_edge_size (edge) - s->call_stmt_size);
> +
> +  int growth = (estimate_edge_size (edge) - s->call_stmt_size);
> +
> +  /* Bias function growth according to the bias parameter. The default is
> +     parameter value is 5 to allow for slight negative biases, so we subtract 5
> +     to allow an effective default value of 0.  */
> +  growth += PARAM_VALUE (PARAM_INLINE_GROWTH_BIAS) - 5;
> +
> +  /* Function size cannot be negative, so if the growth is negative to the point
> +     that it will reduce function size below 0, then we cap the growth such that
> +     it makes the function size exactly zero.  */
> +  struct cgraph_node *caller = edge->caller;
> +  ipa_size_summary *fs = ipa_size_summaries->get (caller);
> +  if (fs->size + growth < 0) {
> +    growth = -fs->size;
> +  }
> +
> +  return growth;
>  }
>
>  /* Return estimated callee runtime increase after inlining
> diff --git a/gcc/params.def b/gcc/params.def
> index 7928f6f071e..4274d8f36e0 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -112,6 +112,11 @@ DEFPARAM (PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2,
>           "The scale (in percents) applied to inline-insns-single and auto limits when heuristics hints that inlining is very profitable.",
>           200, 100, 1000000)
>
> +DEFPARAM (PARAM_INLINE_GROWTH_BIAS,
> +         "inline-growth-bias",
> +         "Bias of inlining growth overhead (default 5 is equal to no bias).",
> +         5, 0, 1000000)
> +
>  DEFPARAM (PARAM_MAX_INLINE_INSNS_SIZE,
>           "max-inline-insns-size",
>           "The maximum number of instructions when inlining for size.",
> --
> 2.21.0
>
Jan Hubicka Nov. 27, 2019, 8:26 a.m. UTC | #2
> > 2. The results that it produced were exactly the same as those using the
> > param uninlined-function-insns, so it would probably be redundant to add
> > the additional parameter implemented this way.
> >
> > The implementation I tried is at:
> > https://github.com/gmarkall/gcc/commit/1d22f65b8a392cffc235ecb49143a93d1720b91c
> >
> > When I benchmarked the code size again using Embench with both the
> > inline-growth-bias param implemented this way, and the
> > uninlined-function-insns param the results were, in both cases:
> >
> > Benchmark       0       1       2       3       4
> > ---------       ------  ------  ------  ------  ------
> > aha-mont64       99.05   99.05   99.05   99.05   99.05
> > crc32           100.00  100.00  100.00  100.00  100.00
> > cubic            94.66   94.66   94.66   94.66   94.66
> > edn              96.14   96.14   96.14   96.14   96.14
> > huffbench        99.88   99.88   99.88   99.88   99.88
> > matmult-int     100.00  100.00  100.00  100.00  100.00
> > minver          100.56  100.56  100.56  100.56  100.56
> > nbody           101.13  101.13  101.13  101.13  101.13
> > nettle-aes       99.03   99.03   99.03   99.03   99.03
> > nettle-sha256    99.89   99.89   99.89   99.89   99.89
> > nsichneu        100.00  100.00  100.00  100.00  100.00
> > picojpeg         99.35   99.35  100.10  100.10  100.10
> > qrduino         101.81  101.81  101.81  101.81  101.81
> > sglib-combined  100.00  100.00  100.00  100.00  100.00
> > slre             99.09   99.42   99.42  100.49  100.49
> > st               96.36   96.36   96.36   96.36   96.36
> > statemate       100.22  100.22  100.22  100.22  100.22
> > ud              100.00  100.00  100.00  100.00  100.00
> > wikisort         99.48   99.48   99.48   99.48   99.48
> > Mean             99.28   99.30   99.34   99.40   99.40
> >
> > These results show that:
> >
> > 1. Setting the uninlined-function-insns value to 0 rather than 2
> > generally seems to yield an improvement in code size on RISC-V, without
> > making any individual benchmark larger than the current default of 2.

It would be interesting to understand this better - I am generally
trying the inline metrics to match reality and the constant of "2"
basically accounts the fact that offline copy of function has some
prologue/epilogue/return instructions which we do not see at gimple
level and thus we want to account it. These are always optimized out
after inliing.

I made this param mostly becased on presentaiton of s390 performance
where function calls, prologues, epilogues etc are significantly more
expensive.

So rather than biassing code size estimates to be less realistic I would
try to fix it on the inline heuristics side.

Do you have small examples for to look at?
> >
> > 2. There are two cases where the patch in the original version of the
> > patch (v1) makes a slightly greater improvement than using
> > uninlined-function-insns or the patch linked above (ufi). These are:
> >
> > Benchmark       ufi      v1
> > ---------       ---      --
> > qrduino         101.81   101.61
> > sglib-combined  100.00    95.87
> >
> > (note that the v1 values are slightly different to those posted in the
> > original patch - I rebased the original patch on a more recent version
> > of the master branch and the values changed slightly.)
> >
> > Additionally, for a proprietary application that we tested the flags
> > with, the code sizes were for using neither flag (Original) for
> > using uninlined-function-insns, and using the original patch, with the
> > parameter values that resulted in the smallest code size, were:
> >
> > Original   ufi (val=0)  v1 (val=-2)
> > 57380      57184        56756
> > (100.0%)   (99.66%)     (98.91%)
> >
> >
> > Patch v2
> > --------
> >
> > In conclusion, it seems that in some cases there is some additional code
> > size saving that can be gained in specific cases using the original
> > implementation of the patch. I have tidied up the original version of
> > the patch and adjusted the option so that it is a param instead of a
> > flag as suggested by both Jeff and yourself. I chose a default param
> > value of 5 to allow for a reasonable scope for setting negative values -
> > although I haven't seen any result where setting the value beneath an
> > effective setting of -2 yielded an improvement, this does allow a little
> > more room in case there are cases where going below -2 would help.
> >
> > Is there sufficient data to indicate that the additional parameter as
> > implemented in the attached patch is worth adding?
> 
> There needs to be more explanation on what this really achieves
> and as I said IIRC we should already have a --param that accounts
> for this, --param uninlined-function-insns which maybe only misses
> the ability to become negative.
> 
> Richard.
> 
> >
> > Many thanks,
> > Graham.
> >
> >
> >
> > ---
> >  gcc/ipa-inline.h | 21 ++++++++++++++++++++-
> >  gcc/params.def   |  5 +++++
> >  2 files changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/ipa-inline.h b/gcc/ipa-inline.h
> > index 18c8e1eebd0..fb10677ea8f 100644
> > --- a/gcc/ipa-inline.h
> > +++ b/gcc/ipa-inline.h
> > @@ -21,6 +21,8 @@ along with GCC; see the file COPYING3.  If not see
> >  #ifndef GCC_IPA_INLINE_H
> >  #define GCC_IPA_INLINE_H
> >
> > +#include "params.h"
> > +
> >  /* Data we cache about callgraph edges during inlining to avoid expensive
> >     re-computations during the greedy algorithm.  */
> >  class edge_growth_cache_entry
> > @@ -84,7 +86,24 @@ estimate_edge_growth (struct cgraph_edge *edge)
> >  {
> >    ipa_call_summary *s = ipa_call_summaries->get (edge);
> >    gcc_checking_assert (s->call_stmt_size || !edge->callee->analyzed);
> > -  return (estimate_edge_size (edge) - s->call_stmt_size);
> > +
> > +  int growth = (estimate_edge_size (edge) - s->call_stmt_size);
> > +
> > +  /* Bias function growth according to the bias parameter. The default is
> > +     parameter value is 5 to allow for slight negative biases, so we subtract 5
> > +     to allow an effective default value of 0.  */
> > +  growth += PARAM_VALUE (PARAM_INLINE_GROWTH_BIAS) - 5;
> > +
> > +  /* Function size cannot be negative, so if the growth is negative to the point
> > +     that it will reduce function size below 0, then we cap the growth such that
> > +     it makes the function size exactly zero.  */
> > +  struct cgraph_node *caller = edge->caller;
> > +  ipa_size_summary *fs = ipa_size_summaries->get (caller);
> > +  if (fs->size + growth < 0) {
> > +    growth = -fs->size;
> > +  }
> > +
> > +  return growth;

The esitmated growth is used in the following ways

1) it is compared to max-inline-insns-* and early-inlining-insns parameters
2) it is used to estimate overall code growth for inlining for size
3) it affects badness metrics.

Again instead of biassing growth, perhaps we should work out what of
1, 2, 3 needs returning.

Honza
> >  }
> >
> >  /* Return estimated callee runtime increase after inlining
> > diff --git a/gcc/params.def b/gcc/params.def
> > index 7928f6f071e..4274d8f36e0 100644
> > --- a/gcc/params.def
> > +++ b/gcc/params.def
> > @@ -112,6 +112,11 @@ DEFPARAM (PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2,
> >           "The scale (in percents) applied to inline-insns-single and auto limits when heuristics hints that inlining is very profitable.",
> >           200, 100, 1000000)
> >
> > +DEFPARAM (PARAM_INLINE_GROWTH_BIAS,
> > +         "inline-growth-bias",
> > +         "Bias of inlining growth overhead (default 5 is equal to no bias).",
> > +         5, 0, 1000000)
> > +
> >  DEFPARAM (PARAM_MAX_INLINE_INSNS_SIZE,
> >           "max-inline-insns-size",
> >           "The maximum number of instructions when inlining for size.",
> > --
> > 2.21.0
> >
diff mbox series

Patch

diff --git a/gcc/ipa-inline.h b/gcc/ipa-inline.h
index 18c8e1eebd0..fb10677ea8f 100644
--- a/gcc/ipa-inline.h
+++ b/gcc/ipa-inline.h
@@ -21,6 +21,8 @@  along with GCC; see the file COPYING3.  If not see
 #ifndef GCC_IPA_INLINE_H
 #define GCC_IPA_INLINE_H
 
+#include "params.h"
+
 /* Data we cache about callgraph edges during inlining to avoid expensive
    re-computations during the greedy algorithm.  */
 class edge_growth_cache_entry
@@ -84,7 +86,24 @@  estimate_edge_growth (struct cgraph_edge *edge)
 {
   ipa_call_summary *s = ipa_call_summaries->get (edge);
   gcc_checking_assert (s->call_stmt_size || !edge->callee->analyzed);
-  return (estimate_edge_size (edge) - s->call_stmt_size);
+
+  int growth = (estimate_edge_size (edge) - s->call_stmt_size);
+
+  /* Bias function growth according to the bias parameter. The default is
+     parameter value is 5 to allow for slight negative biases, so we subtract 5
+     to allow an effective default value of 0.  */
+  growth += PARAM_VALUE (PARAM_INLINE_GROWTH_BIAS) - 5;
+
+  /* Function size cannot be negative, so if the growth is negative to the point
+     that it will reduce function size below 0, then we cap the growth such that
+     it makes the function size exactly zero.  */
+  struct cgraph_node *caller = edge->caller;
+  ipa_size_summary *fs = ipa_size_summaries->get (caller);
+  if (fs->size + growth < 0) {
+    growth = -fs->size;
+  }
+
+  return growth;
 }
 
 /* Return estimated callee runtime increase after inlining
diff --git a/gcc/params.def b/gcc/params.def
index 7928f6f071e..4274d8f36e0 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -112,6 +112,11 @@  DEFPARAM (PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2,
 	  "The scale (in percents) applied to inline-insns-single and auto limits when heuristics hints that inlining is very profitable.",
 	  200, 100, 1000000)
 
+DEFPARAM (PARAM_INLINE_GROWTH_BIAS,
+	  "inline-growth-bias",
+	  "Bias of inlining growth overhead (default 5 is equal to no bias).",
+	  5, 0, 1000000)
+
 DEFPARAM (PARAM_MAX_INLINE_INSNS_SIZE,
 	  "max-inline-insns-size",
 	  "The maximum number of instructions when inlining for size.",