Increase lto-min-partition

Message ID AM5PR0802MB26108DF4E8A836E0701A1F1483C90@AM5PR0802MB2610.eurprd08.prod.outlook.com
State New

Commit Message

Wilco Dijkstra Sept. 22, 2016, 1:13 p.m. UTC
Increase the lto-min-partition size to 50000 to reduce the number of partitions.
See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise 
explanation why 10000 is too small for modern CPU/memory size.  Additionally,
larger values increase optimization opportunities and reduce bad decisions in the
layout of global variables across partitions (anchors do not work well with LTO).
Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
is the most optimal.  Build time with LTO increases only slightly, eg. SPEC2006
now takes 2% more time on an 8-core ARM server.

ChangeLog:
2016-09-22  Wilco Dijkstra  <wdijkstr@arm.com>

    gcc/
	* params.def (MIN_PARTITION_SIZE): Increase to 50000.

--

Comments

Richard Biener Sept. 22, 2016, 1:36 p.m. UTC | #1
On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Increase the lto-min-partition size to 50000 to reduce the number of partitions.
> See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
> explanation why 10000 is too small for modern CPU/memory size.  Additionally,
> larger values increase optimization opportunities and reduce bad decisions in the
> layout of global variables across partitions (anchors do not work well with LTO).
> Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
> is the most optimal.  Build time with LTO increases only slightly, eg. SPEC2006
> now takes 2% more time on an 8-core ARM server.

Ok.  Markus, how many partitions do we get with libreoffice/firefox currently
(I suppose they all hit lto-max-partition now?)

Thanks,
Richard.

> ChangeLog:
> 2016-09-22  Wilco Dijkstra  <wdijkstr@arm.com>
>
>     gcc/
>         * params.def (MIN_PARTITION_SIZE): Increase to 50000.
>
> --
> diff --git a/gcc/params.def b/gcc/params.def
> index 79b7dd4cca9ec1bb67a64725fb1a596b6e937419..da8fd1825e15f2aa800b1c8b680985776c1080ed 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1045,7 +1045,7 @@ DEFPARAM (PARAM_LTO_PARTITIONS,
>  DEFPARAM (MIN_PARTITION_SIZE,
>           "lto-min-partition",
>           "Minimal size of a partition for LTO (in estimated instructions).",
> -         10000, 0, 0)
> +         50000, 0, 0)
>
>  DEFPARAM (MAX_PARTITION_SIZE,
>           "lto-max-partition",
>
Markus Trippelsdorf Sept. 22, 2016, 1:42 p.m. UTC | #2
On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
> On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
> > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
> > explanation why 10000 is too small for modern CPU/memory size.  Additionally,
> > larger values increase optimization opportunities and reduce bad decisions in the
> > layout of global variables across partitions (anchors do not work well with LTO).
> > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
> > is the most optimal.  Build time with LTO increases only slightly, eg. SPEC2006
> > now takes 2% more time on an 8-core ARM server.
> 
> Ok.  Marcus, how many partitions do we get with libreoffice/firefox currently
> (I suppose they all hit lto-max-partition now?)

Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
reduced to 20.
And I guess bigger projects like Firefox are unchanged at 32.
Markus Trippelsdorf Sept. 23, 2016, 1:02 p.m. UTC | #3
On 2016.09.22 at 15:42 +0200, Markus Trippelsdorf wrote:
> On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
> > On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
> > > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
> > > explanation why 10000 is too small for modern CPU/memory size.  Additionally,
> > > larger values increase optimization opportunities and reduce bad decisions in the
> > > layout of global variables across partitions (anchors do not work well with LTO).
> > > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
> > > is the most optimal.  Build time with LTO increases only slightly, eg. SPEC2006
> > > now takes 2% more time on an 8-core ARM server.
> > 
> > Ok.  Marcus, how many partitions do we get with libreoffice/firefox currently
> > (I suppose they all hit lto-max-partition now?)
> 
> Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
> reduced to 20.
> And I guess bigger projects like Firefox are unchanged at 32.

Sorry I've reported wrong numbers above.

lto-min-partition was already increased from 1000 to 10000 on trunk by
Prathamesh in April.
And tramp3d only uses ten partitions (lto-min-partition=10000).
With lto-min-partition=50000 (current patch) this decreases to only two
partitions. As a result we lose the possible speedup on many-core
machines (-flto=n).

E.g. on my 4-core machine I get the following tramp3d compile times with
-flto=4:

lto-min-partition=50000: 20.146 total
lto-min-partition=10000: 16.299 total
lto-min-partition=1000 : 16.093 total

So 50000 looks too big to me. 

Also the "increased optimization opportunities" with fewer partitions
were unmeasurable in the past. If I recall correctly Honza once said
that there should be no difference between single vs. many partitions.
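
For reference, the relative cost of the three settings can be worked out from the wall-clock totals reported above. This is only a minimal sketch in Python; the dictionary restates the tramp3d numbers from this message, and the variable names are illustrative.

```python
# Wall-clock totals (seconds) for tramp3d with -flto=4, as reported above.
totals = {50000: 20.146, 10000: 16.299, 1000: 16.093}

# Use the current trunk default (10000) as the baseline.
baseline = totals[10000]
slowdown = {size: t / baseline for size, t in totals.items()}

for size, ratio in sorted(slowdown.items()):
    print(f"lto-min-partition={size}: {ratio:.3f}x of the 10000 build")
```

On these numbers the 50000 build takes roughly 24% longer than the 10000 build, while the 1000 and 10000 builds differ by less than 1.5%.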
Richard Biener Sept. 23, 2016, 1:29 p.m. UTC | #4
On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2016.09.22 at 15:42 +0200, Markus Trippelsdorf wrote:
>> On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
>> > On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> > > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
>> > > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
>> > > explanation why 10000 is too small for modern CPU/memory size.  Additionally,
>> > > larger values increase optimization opportunities and reduce bad decisions in the
>> > > layout of global variables across partitions (anchors do not work well with LTO).
>> > > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
>> > > is the most optimal.  Build time with LTO increases only slightly, eg. SPEC2006
>> > > now takes 2% more time on an 8-core ARM server.
>> >
>> > Ok.  Marcus, how many partitions do we get with libreoffice/firefox currently
>> > (I suppose they all hit lto-max-partition now?)
>>
>> Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
>> reduced to 20.
>> And I guess bigger projects like Firefox are unchanged at 32.
>
> Sorry I've reported wrong numbers above.
>
> lto-min-partition was already increased from 1000 to 10000 on trunk by
> Prathamesh in April.

Ah, I forgot about this.  10000 is equal to large-unit-insns btw and about
four times of large-function-insns.

> And tramp3d only uses ten partitions (lto-min-partition=10000).
> With lto-min-partition=50000 (current patch) this decrease to only two
> partitions. As a result we loose the possible speedup on many core
> machines (-flto=n).
>
> E.g. on my 4-core machine I get the following tramp3d compile times with
> -flto=4:
>
> lto-min-partition=50000: 20.146 total
> lto-min-partition=10000: 16.299 total
> lto-min-partition=1000 : 16.093 total
>
> So 50000 looks too big to me.

I think the issue is that the default number of partitions is too high
(32) which pessimizes 4-core machines if the units are too small.

Maybe we can tune the triplet lto-partitions, lto-min-partition and
lto-max-partition in a way that it roughly scales the number of
partitions produced with program size rather than quickly raising
to 32 and then hovering there until the first unit hits lto-max-partition?

> Also the "increased optimization opportunities" with fewer partitions
> were unmeasurable in the past. If I recall correctly Honza once said
> that there should be no difference between single vs. many partitions.

Well, it definitely makes a difference for late IPA passes (that's mainly
IPA PTA).

Richard.

> --
> Markus
Richard Biener Sept. 23, 2016, 1:31 p.m. UTC | #5
On Fri, Sep 23, 2016 at 3:29 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
>> On 2016.09.22 at 15:42 +0200, Markus Trippelsdorf wrote:
>>> On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
>>> > On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> > > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
>>> > > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
>>> > > explanation why 10000 is too small for modern CPU/memory size.  Additionally,
>>> > > larger values increase optimization opportunities and reduce bad decisions in the
>>> > > layout of global variables across partitions (anchors do not work well with LTO).
>>> > > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
>>> > > is the most optimal.  Build time with LTO increases only slightly, eg. SPEC2006
>>> > > now takes 2% more time on an 8-core ARM server.
>>> >
>>> > Ok.  Marcus, how many partitions do we get with libreoffice/firefox currently
>>> > (I suppose they all hit lto-max-partition now?)
>>>
>>> Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
>>> reduced to 20.
>>> And I guess bigger projects like Firefox are unchanged at 32.
>>
>> Sorry I've reported wrong numbers above.
>>
>> lto-min-partition was already increased from 1000 to 10000 on trunk by
>> Prathamesh in April.
>
> Ah, I forgot about this.  10000 is equal to large-unit-insns btw and about
> four times of large-function-insns.
>
>> And tramp3d only uses ten partitions (lto-min-partition=10000).
>> With lto-min-partition=50000 (current patch) this decrease to only two
>> partitions. As a result we loose the possible speedup on many core
>> machines (-flto=n).
>>
>> E.g. on my 4-core machine I get the following tramp3d compile times with
>> -flto=4:
>>
>> lto-min-partition=50000: 20.146 total
>> lto-min-partition=10000: 16.299 total
>> lto-min-partition=1000 : 16.093 total
>>
>> So 50000 looks too big to me.
>
> I think the issue is that the default number of partitions is too high
> (32) which pessimizes 4-core machines if the units are too small.
>
> Maybe we can tune the triplet lto-partitions, lto-min-partition and
> lto-max-partition in a way that it roughly scales the number of
> partitions produced with program size rather than quickly raising
> to 32 and then hovering there until the first unit hits lto-max-partition?

Which would imply lto-max-partition being on the order of
lto-partitions * lto-min-partition
or simply only having a single lto-partition-size param.

I suppose making all this runtime dependent on # cores isn't something we can do
as this will lead to code-generation changes.

Richard.

>
>> Also the "increased optimization opportunities" with fewer partitions
>> were unmeasurable in the past. If I recall correctly Honza once said
>> that there should be no difference between single vs. many partitions.
>
> Well, it definitely makes a difference for late IPA passes (that's mainly
> IPA PTA).
>
> Richard.
>
>> --
>> Markus
Wilco Dijkstra Sept. 23, 2016, 2:19 p.m. UTC | #6
Richard Biener wrote:
>On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
> > And tramp3d only uses ten partitions (lto-min-partition=10000).
> > With lto-min-partition=50000 (current patch) this decrease to only two
> > partitions. As a result we loose the possible speedup on many core
> > machines (-flto=n).

Only if the size is close to the lto-min-partition. For larger applications there is
little difference.

> > E.g. on my 4-core machine I get the following tramp3d compile times with
> > -flto=4:
> >
> > lto-min-partition=50000: 20.146 total
> > lto-min-partition=10000: 16.299 total
> > lto-min-partition=1000 : 16.093 total
> >
> > So 50000 looks too big to me.

That's only 16 seconds? Seems like it's small so ideally it should have
used a single partition...

> I think the issue is that the default number of partitions is too high
> (32) which pessimizes 4-core machines if the units are too small.

Yes, 8 might be a better value as 32 core machines are rare.

> Maybe we can tune the triplet lto-partitions, lto-min-partition and
> lto-max-partition in a way that it roughly scales the number of
> partitions produced with program size rather than quickly raising
> to 32 and then hovering there until the first unit hits lto-max-partition?

Or use a single partition size rather than have the maximum size 
a hundred times the minimum size (which doesn't make sense at all).

> > Also the "increased optimization opportunities" with fewer partitions
> > were unmeasurable in the past. If I recall correctly Honza once said
> > that there should be no difference between single vs. many partitions.
>
> Well, it definitely makes a difference for late IPA passes (that's mainly
> IPA PTA).

Also anchors don't work with multiple partitions. I get around 1% gain
from using a single partition.

Wilco
Markus Trippelsdorf Sept. 23, 2016, 2:37 p.m. UTC | #7
On 2016.09.23 at 14:19 +0000, Wilco Dijkstra wrote:
> Richard Biener wrote:
> >On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
> > > And tramp3d only uses ten partitions (lto-min-partition=10000).
> > > With lto-min-partition=50000 (current patch) this decrease to only two
> > > partitions. As a result we loose the possible speedup on many core
> > > machines (-flto=n).
> 
> Only if the size is close to the lto-min-partition. For larger applications there is
> little difference.
> 
> > > E.g. on my 4-core machine I get the following tramp3d compile times with
> > > -flto=4:
> > >
> > > lto-min-partition=50000: 20.146 total
> > > lto-min-partition=10000: 16.299 total
> > > lto-min-partition=1000 : 16.093 total
> > >
> > > So 50000 looks too big to me.
> 
> That's only 16 seconds? Seems like it's small so ideally it should have
> used a single partition...

What I wanted to point out is that you of course lose the speedup you'll
get from parallel running backends with only a single partition.

 % time g++ -w -Ofast tramp3d-v4.cpp
g++ -w -Ofast tramp3d-v4.cpp  25.61s user 0.31s system 99% cpu 25.944 total

 % time g++ -flto=4 -w -Ofast tramp3d-v4.cpp
g++ -flto=4 -w -Ofast tramp3d-v4.cpp  28.15s user 1.02s system 181% cpu 16.075 total

 % time g++ --param=lto-partitions=1 -flto=4 -w -Ofast tramp3d-v4.cpp
g++ --param=lto-partitions=1 -flto=4 -w -Ofast tramp3d-v4.cpp  26.98s user 0.57s system 99% cpu 27.629 total
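
The "cpu" percentages in the timings above follow from (user + system) / elapsed, which gives a rough idea of how many cores each build kept busy. A small Python sketch under that assumption; the tuples restate the numbers printed by the shell, and `cpu_utilization` is a hypothetical helper name.

```python
# (user, system, elapsed) in seconds, copied from the three runs above.
runs = {
    "no LTO": (25.61, 0.31, 25.944),
    "-flto=4": (28.15, 1.02, 16.075),
    "-flto=4, lto-partitions=1": (26.98, 0.57, 27.629),
}


def cpu_utilization(user, system, elapsed):
    """CPU time divided by wall-clock time; 1.0 means one core kept busy."""
    return (user + system) / elapsed


for name, (user, system, elapsed) in runs.items():
    print(f"{name}: {cpu_utilization(user, system, elapsed):.2f} cores busy")
```

Only the default -flto=4 run keeps more than one core busy (about 1.8); with a single partition the build is effectively serial again.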
Wilco Dijkstra Sept. 23, 2016, 3:11 p.m. UTC | #8
Markus Trippelsdorf wrote:
> What I wanted to point out is that you of course loose the speedup you'll
> get from parallel running backends with only a single partition.

Absolutely. For every possible value of lto-min-partition you can find an
application that will build with more parallelism if you reduce the partition size.

So the question is whether it's the goal of LTO to build as parallel as possible
at all times? Or should it be set to a fairly large value that keeps plenty of
parallelism for large projects?

Wilco
Prathamesh Kulkarni Sept. 23, 2016, 3:18 p.m. UTC | #9
On 23 September 2016 at 19:49, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
>>On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
>> > And tramp3d only uses ten partitions (lto-min-partition=10000).
>> > With lto-min-partition=50000 (current patch) this decrease to only two
>> > partitions. As a result we loose the possible speedup on many core
>> > machines (-flto=n).
>
> Only if the size is close to the lto-min-partition. For larger applications there is
> little difference.
>
>> > E.g. on my 4-core machine I get the following tramp3d compile times with
>> > -flto=4:
>> >
>> > lto-min-partition=50000: 20.146 total
>> > lto-min-partition=10000: 16.299 total
>> > lto-min-partition=1000 : 16.093 total
>> >
>> > So 50000 looks too big to me.
>
> That's only 16 seconds? Seems like it's small so ideally it should have
> used a single partition...
>
>> I think the issue is that the default number of partitions is too high
>> (32) which pessimizes 4-core machines if the units are too small.
>
> Yes, 8 might be a better value as 32 core machines are rare.
>
>> Maybe we can tune the triplet lto-partitions, lto-min-partition and
>> lto-max-partition in a way that it roughly scales the number of
>> partitions produced with program size rather than quickly raising
>> to 32 and then hovering there until the first unit hits lto-max-partition?
>
> Or use a single partition size rather than have the maximum size
> a hundred times the minimum size (which doesn't make sense at all).
>
>> > Also the "increased optimization opportunities" with fewer partitions
>> > were unmeasurable in the past. If I recall correctly Honza once said
>> > that there should be no difference between single vs. many partitions.
>>
>> Well, it definitely makes a difference for late IPA passes (that's mainly
>> IPA PTA).
>
> Also anchors don't work with multiple partitions. I get around 1% gain
> from using a single partition.
Hi Wilco,
I am working on LTO varpool partitioning to improve performance for
section anchors.
I posted a preliminary patch at:
https://gcc.gnu.org/ml/gcc/2016-07/msg00033.html
Unfortunately I haven't been able to benchmark it on ARM yet.
I am planning to resume work on it soon.

Building with a single partition is not scalable. LTO build of chromium
with x86->arm cross with a single partition results in "branch out of
range" assembler error. I added lto-max-partition primarily to work
around that limitation.

Thanks,
Prathamesh
>
> Wilco
>
Markus Trippelsdorf Sept. 24, 2016, 8:52 a.m. UTC | #10
On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
> >
> > So 50000 looks too big to me.
> 
> I think the issue is that the default number of partitions is too high
> (32) which pessimizes 4-core machines if the units are too small.

The more partitions are used the less memory is required at LTRANS time.

If for example you limit partitions to 4 on a 4-core machine with 8GB
memory, you would start swapping when building Firefox.

And even lto-partitions=8 is slower than the default of 32:

(Firefox libxul build times with gcc-6.)

--param=lto-partitions=8 -flto=4:
1670.19s user 23.39s system 305% cpu 9:14.13 total

default -flto=4:
1668.94s user 32.51s system 320% cpu 8:50.36 total

If someone wants fewer partitions he can use -flto-partition=one/none 
or --param=lto-partitions=1.
Richard Biener Sept. 26, 2016, 7:42 a.m. UTC | #11
On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
>> >
>> > So 50000 looks too big to me.
>>
>> I think the issue is that the default number of partitions is too high
>> (32) which pessimizes 4-core machines if the units are too small.
>
> The more partitions are used the less memory is required at LTRANS time.
>
> If for example you limit partitions to 4 on a 4-core machine with 8GB
> memory, you would start swapping when building Firefox.
>
> And even lto-partitions=8 is slower than the default of 32:
>
> (Firefox libxul build times with gcc-6.)
>
> --param=lto-partitions=8 -flto=4:
> 1670.19s user 23.39s system 305% cpu 9:14.13 total
>
> default -flto=4:
> 1668.94s user 32.51s system 320% cpu 8:50.36 total
>
> If someone wants fewer partitions he can use -flto-partition=one/none
> or --param=lto-partitions=1.

I know all this.  But then we seem to be stuck at 32 partitions from
an input size of 32 * lto-min-partition up to 32 * lto-max-partition
which is currently two orders of magnitude of difference in input size!

That can't be a good heuristic.

It's also about temporary disk space of which we use more the more
partitions we use (because we essentially duplicate the whole global
types/decls section for each partition).

I'm not saying increasing lto-min-partition is the best solution but it
certainly looks like the most appealing one to me.

Richard.

> --
> Markus
Markus Trippelsdorf Sept. 26, 2016, 9:31 a.m. UTC | #12
On 2016.09.26 at 09:42 +0200, Richard Biener wrote:
> On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
> >> >
> >> > So 50000 looks too big to me.
> >>
> >> I think the issue is that the default number of partitions is too high
> >> (32) which pessimizes 4-core machines if the units are too small.
> >
> > The more partitions are used the less memory is required at LTRANS time.
> >
> > If for example you limit partitions to 4 on a 4-core machine with 8GB
> > memory, you would start swapping when building Firefox.
> >
> > And even lto-partitions=8 is slower than the default of 32:
> >
> > (Firefox libxul build times with gcc-6.)
> >
> > --param=lto-partitions=8 -flto=4:
> > 1670.19s user 23.39s system 305% cpu 9:14.13 total
> >
> > default -flto=4:
> > 1668.94s user 32.51s system 320% cpu 8:50.36 total
> >
> > If someone wants fewer partitions he can use -flto-partition=one/none
> > or --param=lto-partitions=1.
>
> I know all this.  But then we seem to be stuck at 32 partitions from
> an input size of 32 * lto-partition-min up to 32 * lto-partition-max
> which is currently two orders of magnitude of difference in input size!
>
> That can't be a good heuristic.
>
> It's also about temporary disk space of which we use more the more
> partitions we use (because we essentially duplicate the whole global
> types/decls section for each partition).
>
> I'm not saying increasing lto-partition-min is the best solution but it
> certainly looks like the most appealing one to me.

I think the current lto-min-partition value of 10000 is reasonable, and
the proposed value of 50000 seems excessive.

Also see the comment in gcc/lto/lto-partition.c:

   We compute the expected size of a partition as:

     max (total_size / lto_partitions, min_partition_size)

   We use dynamic expected size of partition so small programs are partitioned
   into enough partitions to allow use of multiple CPUs, while large programs
   are not partitioned too much.  Creating too many partitions significantly
   increases the streaming overhead.
...
   The function implements a simple greedy algorithm.  Nodes are being added
   to the current partition until after 3/4 of the expected partition size is
   reached.  Past this threshold, we keep track of boundary size (number of
   edges going to other partitions) and continue adding functions until after
   the current partition has grown to twice the expected partition size,
   or is bigger than max_partition_size.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ : this sentence should be added.

   Then the process is undone to the point where the minimal ratio of boundary size
   and in-partition calls was reached.  */


--
Markus
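
The expected-size formula in the quoted comment can be modelled in a few lines. This is a rough Python sketch, not the GCC implementation: it ignores the greedy boundary adjustment and lto-max-partition, and the function names as well as the total size of 100000 used below are hypothetical, chosen only because they reproduce the tramp3d partition counts reported earlier in the thread.

```python
def expected_partition_size(total_size, lto_partitions=32, min_partition_size=10000):
    """max (total_size / lto_partitions, min_partition_size), per the comment."""
    return max(total_size // lto_partitions, min_partition_size)


def approx_partition_count(total_size, lto_partitions=32, min_partition_size=10000):
    """Rough partition count; ignores the greedy pass and lto-max-partition."""
    size = expected_partition_size(total_size, lto_partitions, min_partition_size)
    count = -(-total_size // size)  # ceiling division
    return max(1, min(lto_partitions, count))


# Hypothetical unit of 100000 estimated instructions.
print(approx_partition_count(100000))                            # min = 10000
print(approx_partition_count(100000, min_partition_size=50000))  # min = 50000
```

Under this simplified model a unit of that size gets ten partitions with lto-min-partition=10000 and two with 50000, while anything above 32 * min_partition_size stays at 32.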
Wilco Dijkstra Sept. 26, 2016, 12:14 p.m. UTC | #13
Markus Trippelsdorf wrote: 
> On 2016.09.26 at 09:42 +0200, Richard Biener wrote:
> > On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf
> > <markus@trippelsdorf.de> wrote:
> > > On 2016.09.23 at 15:29 +0200, Richard Biener wrote:

> > > If for example you limit partitions to 4 on a 4-core machine with 8GB
> > > memory, you would start swapping when building Firefox.
> >
> > > And even lto-partitions=8 is slower than the default of 32:

If certain applications swap with 8 partitions, other applications that are
4 times larger will still swap with 32 partitions, agreed?

Ie. it implies the max partition size is way too large, not that 32 partitions
is best. You'd set it as large as possible to avoid the overhead of having
lots of partitions, but small enough so that a typical machine wouldn't swap.

> Also see the comment in gcc/lto/lto-partition.c:

   We compute the expected size of a partition as:

     max (total_size / lto_partitions, min_partition_size)

That looks a bit too simplistic with current default settings... So up to
320000 instructions (ie. binary size of ~1.3MB) it uses as many partitions
as possible of 10000 insns, after that it uses 32 partitions until 32000000
instructions...

Wilco
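
The range over which the partition count stays at 32 follows directly from the formula: it starts at lto-partitions * lto-min-partition and ends at lto-partitions * lto-max-partition. A short Python sketch of that arithmetic; the value used for lto-max-partition is an assumption inferred from the 32000000 figure above and from the "hundred times the minimum" remark earlier in the thread.

```python
LTO_PARTITIONS = 32             # default quoted in the thread
MIN_PARTITION_SIZE = 10_000     # current trunk default
MAX_PARTITION_SIZE = 1_000_000  # assumption: 32000000 / 32, i.e. 100x the minimum

# Below plateau_start the count grows with program size; between the two
# bounds it stays at LTO_PARTITIONS while each partition keeps growing.
plateau_start = LTO_PARTITIONS * MIN_PARTITION_SIZE
plateau_end = LTO_PARTITIONS * MAX_PARTITION_SIZE

print(plateau_start, plateau_end, plateau_end // plateau_start)
```

The factor of 100 between the two bounds is the "two orders of magnitude" of input size over which the number of partitions does not change.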
Wilco Dijkstra Sept. 26, 2016, 1:07 p.m. UTC | #14
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> wrote:

> Hi Wilco,
> I am working on LTO varpool partitioning to improve performance for
> section anchors.
> I posted a preliminary patch posted at:
> https://gcc.gnu.org/ml/gcc/2016-07/msg00033.html
> Unfortunately I haven't yet been able to benchmark it on ARM yet.
> I am planning to restart working on it again soon.

Thanks, I'll have a look. However I'm not 100% convinced smarter symbol
partitioning is the best way forward. Although it should help, it doesn't take into
account which symbols are currently suitable as anchors (-fcommon
is still the default, and big arrays are not suitable). And you still have to make
difficult choices for symbols that are frequently used across most partitions.

So I believe the best solution is to assign anchors early on so that all partitions
can make use of anchors. Assuming we sort symbols on size and frequency,
it should be feasible to use a single anchor for all simple integer global variables
across the whole application. Assigning early should also allow common
variables to be used in anchors, further increasing the benefit.

Do you think that is feasible?

> Building with a single partition is not scalable. LTO build of
> chromium with x86->arm
> cross with a single partition results in "branch out of range"
> assembler error. I added lto-max-partition
> primarily to work around that limitation.

Yes, GCC doesn't split huge compilation units into multiple text sections
so that the linker can insert long branch veneers. So it's a workaround
for LTO but most RISC targets can still hit the same issue with a single
huge file.

Wilco

Patch

diff --git a/gcc/params.def b/gcc/params.def
index 79b7dd4cca9ec1bb67a64725fb1a596b6e937419..da8fd1825e15f2aa800b1c8b680985776c1080ed 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1045,7 +1045,7 @@  DEFPARAM (PARAM_LTO_PARTITIONS,
 DEFPARAM (MIN_PARTITION_SIZE,
 	  "lto-min-partition",
 	  "Minimal size of a partition for LTO (in estimated instructions).",
-	  10000, 0, 0)
+	  50000, 0, 0)
 
 DEFPARAM (MAX_PARTITION_SIZE,
 	  "lto-max-partition",