Message ID | AM5PR0802MB26108DF4E8A836E0701A1F1483C90@AM5PR0802MB2610.eurprd08.prod.outlook.com |
---|---|
State | New |
On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Increase the lto-min-partition size to 50000 to reduce the number of partitions.
> See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
> explanation why 10000 is too small for modern CPU/memory size. Additionally,
> larger values increase optimization opportunities and reduce bad decisions in the
> layout of global variables across partitions (anchors do not work well with LTO).
> Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
> is the most optimal. Build time with LTO increases only slightly, eg. SPEC2006
> now takes 2% more time on an 8-core ARM server.

Ok. Markus, how many partitions do we get with libreoffice/firefox currently
(I suppose they all hit lto-max-partition now?)

Thanks,
Richard.

> ChangeLog:
> 2016-09-22  Wilco Dijkstra  <wdijkstr@arm.com>
>
>     gcc/
>         * params.def (MIN_PARTITION_SIZE): Increase to 50000.
>
> --
> diff --git a/gcc/params.def b/gcc/params.def
> index 79b7dd4cca9ec1bb67a64725fb1a596b6e937419..da8fd1825e15f2aa800b1c8b680985776c1080ed 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1045,7 +1045,7 @@ DEFPARAM (PARAM_LTO_PARTITIONS,
>  DEFPARAM (MIN_PARTITION_SIZE,
>           "lto-min-partition",
>           "Minimal size of a partition for LTO (in estimated instructions).",
> -         10000, 0, 0)
> +         50000, 0, 0)
>
>  DEFPARAM (MAX_PARTITION_SIZE,
>           "lto-max-partition",
>
On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
> On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
> > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
> > explanation why 10000 is too small for modern CPU/memory size. Additionally,
> > larger values increase optimization opportunities and reduce bad decisions in the
> > layout of global variables across partitions (anchors do not work well with LTO).
> > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
> > is the most optimal. Build time with LTO increases only slightly, eg. SPEC2006
> > now takes 2% more time on an 8-core ARM server.
>
> Ok. Markus, how many partitions do we get with libreoffice/firefox currently
> (I suppose they all hit lto-max-partition now?)

Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
reduced to 20.
And I guess bigger projects like Firefox are unchanged at 32.
On 2016.09.22 at 15:42 +0200, Markus Trippelsdorf wrote:
> On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
> > On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
> > > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
> > > explanation why 10000 is too small for modern CPU/memory size. Additionally,
> > > larger values increase optimization opportunities and reduce bad decisions in the
> > > layout of global variables across partitions (anchors do not work well with LTO).
> > > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
> > > is the most optimal. Build time with LTO increases only slightly, eg. SPEC2006
> > > now takes 2% more time on an 8-core ARM server.
> >
> > Ok. Markus, how many partitions do we get with libreoffice/firefox currently
> > (I suppose they all hit lto-max-partition now?)
>
> Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
> reduced to 20.
> And I guess bigger projects like Firefox are unchanged at 32.

Sorry, I reported wrong numbers above.

lto-min-partition was already increased from 1000 to 10000 on trunk by
Prathamesh in April.
And tramp3d only uses ten partitions (lto-min-partition=10000).
With lto-min-partition=50000 (current patch) this decreases to only two
partitions. As a result we lose the possible speedup on many core
machines (-flto=n).

E.g. on my 4-core machine I get the following tramp3d compile times with
-flto=4:

lto-min-partition=50000: 20.146 total
lto-min-partition=10000: 16.299 total
lto-min-partition=1000 : 16.093 total

So 50000 looks too big to me.

Also the "increased optimization opportunities" with fewer partitions
were unmeasurable in the past. If I recall correctly Honza once said
that there should be no difference between single vs. many partitions.
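To put the three tramp3d timings above on a common scale, here is a rough sketch in Python (not part of the original message; the `TOTALS` table and the `percent_change` helper are illustrative names). It uses only the wall-clock totals quoted above and expresses each one as a percentage change relative to the lto-min-partition=10000 run.

```python
# Wall-clock totals in seconds, copied from the tramp3d -flto=4 runs quoted above.
# Keys are the lto-min-partition values that were tried.
TOTALS = {50000: 20.146, 10000: 16.299, 1000: 16.093}


def percent_change(candidate: float, baseline: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


if __name__ == "__main__":
    baseline = TOTALS[10000]
    for param in sorted(TOTALS):
        change = percent_change(TOTALS[param], baseline)
        print(f"lto-min-partition={param}: {change:+.1f}% vs. 10000")
```

On these numbers the 50000 setting is a bit under a quarter slower than 10000 on this 4-core machine, while 1000 and 10000 differ by roughly one percent. This is a single measurement per setting, so small differences should be read with caution.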
On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
> On 2016.09.22 at 15:42 +0200, Markus Trippelsdorf wrote:
>> On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
>> > On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> > > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
>> > > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
>> > > explanation why 10000 is too small for modern CPU/memory size. Additionally,
>> > > larger values increase optimization opportunities and reduce bad decisions in the
>> > > layout of global variables across partitions (anchors do not work well with LTO).
>> > > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
>> > > is the most optimal. Build time with LTO increases only slightly, eg. SPEC2006
>> > > now takes 2% more time on an 8-core ARM server.
>> >
>> > Ok. Markus, how many partitions do we get with libreoffice/firefox currently
>> > (I suppose they all hit lto-max-partition now?)
>>
>> Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
>> reduced to 20.
>> And I guess bigger projects like Firefox are unchanged at 32.
>
> Sorry, I reported wrong numbers above.
>
> lto-min-partition was already increased from 1000 to 10000 on trunk by
> Prathamesh in April.

Ah, I forgot about this. 10000 is equal to large-unit-insns btw and about
four times large-function-insns.

> And tramp3d only uses ten partitions (lto-min-partition=10000).
> With lto-min-partition=50000 (current patch) this decreases to only two
> partitions. As a result we lose the possible speedup on many core
> machines (-flto=n).
>
> E.g. on my 4-core machine I get the following tramp3d compile times with
> -flto=4:
>
> lto-min-partition=50000: 20.146 total
> lto-min-partition=10000: 16.299 total
> lto-min-partition=1000 : 16.093 total
>
> So 50000 looks too big to me.

I think the issue is that the default number of partitions is too high
(32) which pessimizes 4-core machines if the units are too small.

Maybe we can tune the triplet lto-partitions, lto-min-partition and
lto-max-partition in a way that it roughly scales the number of
partitions produced with program size rather than quickly raising
to 32 and then hovering there until the first unit hits lto-max-partition?

> Also the "increased optimization opportunities" with fewer partitions
> were unmeasurable in the past. If I recall correctly Honza once said
> that there should be no difference between single vs. many partitions.

Well, it definitely makes a difference for late IPA passes (that's mainly
IPA PTA).

Richard.

> --
> Markus
On Fri, Sep 23, 2016 at 3:29 PM, Richard Biener <richard.guenther@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
>> On 2016.09.22 at 15:42 +0200, Markus Trippelsdorf wrote:
>>> On 2016.09.22 at 15:36 +0200, Richard Biener wrote:
>>> > On Thu, Sep 22, 2016 at 3:13 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> > > Increase the lto-min-partition size to 50000 to reduce the number of partitions.
>>> > > See eg. https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00235.html for a concise
>>> > > explanation why 10000 is too small for modern CPU/memory size. Additionally,
>>> > > larger values increase optimization opportunities and reduce bad decisions in the
>>> > > layout of global variables across partitions (anchors do not work well with LTO).
>>> > > Looking at SPEC2000, 8 more benchmarks now use a single LTO partition which
>>> > > is the most optimal. Build time with LTO increases only slightly, eg. SPEC2006
>>> > > now takes 2% more time on an 8-core ARM server.
>>> >
>>> > Ok. Markus, how many partitions do we get with libreoffice/firefox currently
>>> > (I suppose they all hit lto-max-partition now?)
>>>
>>> Yes. Even tramp3d currently gets 30 partitions. With this patch it gets
>>> reduced to 20.
>>> And I guess bigger projects like Firefox are unchanged at 32.
>>
>> Sorry, I reported wrong numbers above.
>>
>> lto-min-partition was already increased from 1000 to 10000 on trunk by
>> Prathamesh in April.
>
> Ah, I forgot about this. 10000 is equal to large-unit-insns btw and about
> four times large-function-insns.
>
>> And tramp3d only uses ten partitions (lto-min-partition=10000).
>> With lto-min-partition=50000 (current patch) this decreases to only two
>> partitions. As a result we lose the possible speedup on many core
>> machines (-flto=n).
>>
>> E.g. on my 4-core machine I get the following tramp3d compile times with
>> -flto=4:
>>
>> lto-min-partition=50000: 20.146 total
>> lto-min-partition=10000: 16.299 total
>> lto-min-partition=1000 : 16.093 total
>>
>> So 50000 looks too big to me.
>
> I think the issue is that the default number of partitions is too high
> (32) which pessimizes 4-core machines if the units are too small.
>
> Maybe we can tune the triplet lto-partitions, lto-min-partition and
> lto-max-partition in a way that it roughly scales the number of
> partitions produced with program size rather than quickly raising
> to 32 and then hovering there until the first unit hits lto-max-partition?

Which would imply lto-max-partition being on the order of
lto-partitions * lto-min-partition or simply only having a
single lto-partition-size param.

I suppose making all this runtime dependent on # cores isn't something
we can do as this will lead to code-generation changes.

Richard.

>
>> Also the "increased optimization opportunities" with fewer partitions
>> were unmeasurable in the past. If I recall correctly Honza once said
>> that there should be no difference between single vs. many partitions.
>
> Well, it definitely makes a difference for late IPA passes (that's mainly
> IPA PTA).
>
> Richard.
>
>> --
>> Markus
Richard Biener wrote:
>On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:

> > And tramp3d only uses ten partitions (lto-min-partition=10000).
> > With lto-min-partition=50000 (current patch) this decreases to only two
> > partitions. As a result we lose the possible speedup on many core
> > machines (-flto=n).

Only if the size is close to the lto-min-partition. For larger applications there is
little difference.

> > E.g. on my 4-core machine I get the following tramp3d compile times with
> > -flto=4:
> >
> > lto-min-partition=50000: 20.146 total
> > lto-min-partition=10000: 16.299 total
> > lto-min-partition=1000 : 16.093 total
> >
> > So 50000 looks too big to me.

That's only 16 seconds? Seems like it's small so ideally it should have
used a single partition...

> I think the issue is that the default number of partitions is too high
> (32) which pessimizes 4-core machines if the units are too small.

Yes, 8 might be a better value as 32 core machines are rare.

> Maybe we can tune the triplet lto-partitions, lto-min-partition and
> lto-max-partition in a way that it roughly scales the number of
> partitions produced with program size rather than quickly raising
> to 32 and then hovering there until the first unit hits lto-max-partition?

Or use a single partition size rather than have the maximum size
a hundred times the minimum size (which doesn't make sense at all).

> > Also the "increased optimization opportunities" with fewer partitions
> > were unmeasurable in the past. If I recall correctly Honza once said
> > that there should be no difference between single vs. many partitions.
>
> Well, it definitely makes a difference for late IPA passes (that's mainly
> IPA PTA).

Also anchors don't work with multiple partitions. I get around 1% gain
from using a single partition.

Wilco
On 2016.09.23 at 14:19 +0000, Wilco Dijkstra wrote:
> Richard Biener wrote:
> >On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
>
> > > And tramp3d only uses ten partitions (lto-min-partition=10000).
> > > With lto-min-partition=50000 (current patch) this decreases to only two
> > > partitions. As a result we lose the possible speedup on many core
> > > machines (-flto=n).
>
> Only if the size is close to the lto-min-partition. For larger applications there is
> little difference.
>
> > > E.g. on my 4-core machine I get the following tramp3d compile times with
> > > -flto=4:
> > >
> > > lto-min-partition=50000: 20.146 total
> > > lto-min-partition=10000: 16.299 total
> > > lto-min-partition=1000 : 16.093 total
> > >
> > > So 50000 looks too big to me.
>
> That's only 16 seconds? Seems like it's small so ideally it should have
> used a single partition...

What I wanted to point out is that you of course lose the speedup you'll
get from parallel running backends with only a single partition.

% time g++ -w -Ofast tramp3d-v4.cpp
g++ -w -Ofast tramp3d-v4.cpp  25.61s user 0.31s system 99% cpu 25.944 total

% time g++ -flto=4 -w -Ofast tramp3d-v4.cpp
g++ -flto=4 -w -Ofast tramp3d-v4.cpp  28.15s user 1.02s system 181% cpu 16.075 total

% time g++ --param=lto-partitions=1 -flto=4 -w -Ofast tramp3d-v4.cpp
g++ --param=lto-partitions=1 -flto=4 -w -Ofast tramp3d-v4.cpp  26.98s user 0.57s system 99% cpu 27.629 total
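The three `time` results above can be compared more directly. The following is a small Python sketch (not from the thread; `Run`, `cpu_utilisation` and `wall_speedup` are illustrative names) that recomputes the CPU utilisation from the user, system and wall-clock figures, and the wall-clock speedup of the default partitioning over a single partition.

```python
from typing import NamedTuple


class Run(NamedTuple):
    label: str
    user: float    # seconds of user CPU time
    system: float  # seconds of system CPU time
    wall: float    # elapsed wall-clock seconds (the "total" column)


# Figures copied from the three `time` invocations quoted above.
RUNS = [
    Run("no LTO", 25.61, 0.31, 25.944),
    Run("-flto=4, default partitions", 28.15, 1.02, 16.075),
    Run("-flto=4, lto-partitions=1", 26.98, 0.57, 27.629),
]


def cpu_utilisation(run: Run) -> float:
    """CPU seconds consumed per wall-clock second (1.0 means one busy core)."""
    return (run.user + run.system) / run.wall


def wall_speedup(run: Run, baseline: Run) -> float:
    """How many times faster `run` finished than `baseline`, by wall-clock time."""
    return baseline.wall / run.wall


if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.label}: {cpu_utilisation(run):.2f} cores busy on average")
    print(f"default vs. single partition: {wall_speedup(RUNS[1], RUNS[2]):.2f}x")
```

The recomputed utilisation agrees with the percentages `time` reported (about one core for the non-LTO and single-partition runs, about 1.8 cores for the default -flto=4 run), which is consistent with the point that a single partition serialises the backend.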
Markus Trippelsdorf wrote:

> What I wanted to point out is that you of course lose the speedup you'll
> get from parallel running backends with only a single partition.

Absolutely. For every possible value of lto-min-partition you can find an
application that will build with more parallelism if you reduce the partition
size. So the question is whether it's the goal of LTO to build as parallel as
possible at all times? Or should it be set to a fairly large value that keeps
plenty of parallelism for large projects?

Wilco
On 23 September 2016 at 19:49, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
>>On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
>
>> > And tramp3d only uses ten partitions (lto-min-partition=10000).
>> > With lto-min-partition=50000 (current patch) this decreases to only two
>> > partitions. As a result we lose the possible speedup on many core
>> > machines (-flto=n).
>
> Only if the size is close to the lto-min-partition. For larger applications there is
> little difference.
>
>> > E.g. on my 4-core machine I get the following tramp3d compile times with
>> > -flto=4:
>> >
>> > lto-min-partition=50000: 20.146 total
>> > lto-min-partition=10000: 16.299 total
>> > lto-min-partition=1000 : 16.093 total
>> >
>> > So 50000 looks too big to me.
>
> That's only 16 seconds? Seems like it's small so ideally it should have
> used a single partition...
>
>> I think the issue is that the default number of partitions is too high
>> (32) which pessimizes 4-core machines if the units are too small.
>
> Yes, 8 might be a better value as 32 core machines are rare.
>
>> Maybe we can tune the triplet lto-partitions, lto-min-partition and
>> lto-max-partition in a way that it roughly scales the number of
>> partitions produced with program size rather than quickly raising
>> to 32 and then hovering there until the first unit hits lto-max-partition?
>
> Or use a single partition size rather than have the maximum size
> a hundred times the minimum size (which doesn't make sense at all).
>
>> > Also the "increased optimization opportunities" with fewer partitions
>> > were unmeasurable in the past. If I recall correctly Honza once said
>> > that there should be no difference between single vs. many partitions.
>>
>> Well, it definitely makes a difference for late IPA passes (that's mainly
>> IPA PTA).
>
> Also anchors don't work with multiple partitions. I get around 1% gain
> from using a single partition.

Hi Wilco,
I am working on LTO varpool partitioning to improve performance for
section anchors.
I posted a preliminary patch at:
https://gcc.gnu.org/ml/gcc/2016-07/msg00033.html
Unfortunately I haven't been able to benchmark it on ARM yet.
I am planning to resume working on it soon.

Building with a single partition is not scalable. LTO build of
chromium with x86->arm
cross with a single partition results in "branch out of range"
assembler error. I added lto-max-partition
primarily to work around that limitation.

Thanks,
Prathamesh
>
> Wilco
>
On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
> >
> > So 50000 looks too big to me.
>
> I think the issue is that the default number of partitions is too high
> (32) which pessimizes 4-core machines if the units are too small.

The more partitions are used the less memory is required at LTRANS time.

If for example you limit partitions to 4 on a 4-core machine with 8GB
memory, you would start swapping when building Firefox.

And even lto-partitions=8 is slower than the default of 32:

(Firefox libxul build times with gcc-6.)

--param=lto-partitions=8 -flto=4:
1670.19s user 23.39s system 305% cpu 9:14.13 total

default -flto=4:
1668.94s user 32.51s system 320% cpu 8:50.36 total

If someone wants fewer partitions he can use -flto-partition=one/none
or --param=lto-partitions=1.
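The two Firefox totals above are printed in minutes:seconds form, which makes the gap hard to read at a glance. A minimal Python sketch (not from the thread; `wall_seconds` is an illustrative helper name) converts them to seconds and reports the difference.

```python
def wall_seconds(total: str) -> float:
    """Convert a `time` total such as '9:14.13' or '16.075' to seconds.

    Fields are separated by ':' and read most-significant first, so
    'M:SS.ss' and 'H:MM:SS.ss' both work.
    """
    seconds = 0.0
    for field in total.split(":"):
        seconds = seconds * 60.0 + float(field)
    return seconds


# Totals copied from the two libxul builds quoted above.
EIGHT_PARTITIONS = wall_seconds("9:14.13")
DEFAULT_PARTITIONS = wall_seconds("8:50.36")

if __name__ == "__main__":
    delta = EIGHT_PARTITIONS - DEFAULT_PARTITIONS
    percent = (EIGHT_PARTITIONS / DEFAULT_PARTITIONS - 1.0) * 100.0
    print(f"lto-partitions=8 took {delta:.2f}s longer ({percent:.1f}%)")
```

With these figures the lto-partitions=8 build is about 24 seconds (roughly 4 to 5 percent) slower in wall-clock time, while the user CPU time is nearly identical, which fits the higher CPU utilisation reported for the default run.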
On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
> On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
>> >
>> > So 50000 looks too big to me.
>>
>> I think the issue is that the default number of partitions is too high
>> (32) which pessimizes 4-core machines if the units are too small.
>
> The more partitions are used the less memory is required at LTRANS time.
>
> If for example you limit partitions to 4 on a 4-core machine with 8GB
> memory, you would start swapping when building Firefox.
>
> And even lto-partitions=8 is slower than the default of 32:
>
> (Firefox libxul build times with gcc-6.)
>
> --param=lto-partitions=8 -flto=4:
> 1670.19s user 23.39s system 305% cpu 9:14.13 total
>
> default -flto=4:
> 1668.94s user 32.51s system 320% cpu 8:50.36 total
>
> If someone wants fewer partitions he can use -flto-partition=one/none
> or --param=lto-partitions=1.

I know all this. But then we seem to be stuck at 32 partitions from
an input size of 32 * lto-min-partition up to 32 * lto-max-partition
which is currently two orders of magnitude of difference in input size!

That can't be a good heuristic.

It's also about temporary disk space of which we use more the more
partitions we use (because we essentially duplicate the whole global
types/decls section for each partition).

I'm not saying increasing lto-min-partition is the best solution but it
certainly looks like the most appealing one to me.

Richard.

> --
> Markus
On 2016.09.26 at 09:42 +0200, Richard Biener wrote:
> On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
> >> >
> >> > So 50000 looks too big to me.
> >>
> >> I think the issue is that the default number of partitions is too high
> >> (32) which pessimizes 4-core machines if the units are too small.
> >
> > The more partitions are used the less memory is required at LTRANS time.
> >
> > If for example you limit partitions to 4 on a 4-core machine with 8GB
> > memory, you would start swapping when building Firefox.
> >
> > And even lto-partitions=8 is slower than the default of 32:
> >
> > (Firefox libxul build times with gcc-6.)
> >
> > --param=lto-partitions=8 -flto=4:
> > 1670.19s user 23.39s system 305% cpu 9:14.13 total
> >
> > default -flto=4:
> > 1668.94s user 32.51s system 320% cpu 8:50.36 total
> >
> > If someone wants fewer partitions he can use -flto-partition=one/none
> > or --param=lto-partitions=1.
>
> I know all this. But then we seem to be stuck at 32 partitions from
> an input size of 32 * lto-min-partition up to 32 * lto-max-partition
> which is currently two orders of magnitude of difference in input size!
>
> That can't be a good heuristic.
>
> It's also about temporary disk space of which we use more the more
> partitions we use (because we essentially duplicate the whole global
> types/decls section for each partition).
>
> I'm not saying increasing lto-min-partition is the best solution but it
> certainly looks like the most appealing one to me.

I think the current lto-min-partition value of 10000 is reasonable, and
the proposed value of 50000 seems excessive.

Also see the comment in gcc/lto/lto-partition.c:

   We compute the expected size of a partition as:

     max (total_size / lto_partitions, min_partition_size)

   We use dynamic expected size of partition so small programs are partitioned
   into enough partitions to allow use of multiple CPUs, while large programs
   are not partitioned too much. Creating too many partitions significantly
   increases the streaming overhead.
   ...
   The function implements a simple greedy algorithm. Nodes are being added
   to the current partition until after 3/4 of the expected partition size is
   reached. Past this threshold, we keep track of boundary size (number of
   edges going to other partitions) and continue adding functions until after
   the current partition has grown to twice the expected partition size,
   or is bigger than max_partition_size.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ : this sentence should be added.
   Then the process is undone to the point where the minimal ratio of boundary size
   and in-partition calls was reached. */

--
Markus
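The expected-size formula in the quoted comment can be tried out with a few lines of Python. This is only a sketch under stated assumptions: it models the formula, not the greedy balancing that the comment goes on to describe, so real partition counts can differ. The function names are illustrative, and the total of 100000 estimated instructions is a hypothetical input chosen for the example, not a measured size of any program in this thread.

```python
import math


def expected_partition_size(total_size: int, lto_partitions: int,
                            min_partition_size: int) -> int:
    """max (total_size / lto_partitions, min_partition_size), per the quoted comment."""
    return max(total_size // lto_partitions, min_partition_size)


def estimated_partition_count(total_size: int, lto_partitions: int = 32,
                              min_partition_size: int = 10000) -> int:
    """Idealised count: assumes every partition ends up exactly at the expected size."""
    size = expected_partition_size(total_size, lto_partitions, min_partition_size)
    return max(1, math.ceil(total_size / size))


# Hypothetical program size in estimated instructions (an assumption for illustration).
HYPOTHETICAL_TOTAL = 100_000

if __name__ == "__main__":
    for min_size in (1000, 10000, 50000):
        count = estimated_partition_count(HYPOTHETICAL_TOTAL, 32, min_size)
        print(f"lto-min-partition={min_size}: about {count} partitions")
```

For this hypothetical input the model gives 32, 10 and 2 partitions for minimum sizes of 1000, 10000 and 50000. The last two figures happen to match the ten and two partitions reported for tramp3d earlier in the thread, but that agreement depends entirely on the assumed total.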
Markus Trippelsdorf wrote:
> On 2016.09.26 at 09:42 +0200, Richard Biener wrote:
> > On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf
> > <markus@trippelsdorf.de> wrote:
> > > On 2016.09.23 at 15:29 +0200, Richard Biener wrote:

> > > If for example you limit partitions to 4 on a 4-core machine with 8GB
> > > memory, you would start swapping when building Firefox.
> > >
> > > And even lto-partitions=8 is slower than the default of 32:

If certain applications swap with 8 partitions, other applications that are
4 times larger will still swap with 32 partitions, agreed? Ie. it implies the
max partition size is way too large, not that 32 partitions is best. You'd set
it as large as possible to avoid the overhead of having lots of partitions, but
small enough so that a typical machine wouldn't swap.

> Also see the comment in gcc/lto/lto-partition.c:

   We compute the expected size of a partition as:

     max (total_size / lto_partitions, min_partition_size)

That looks a bit too simplistic with current default settings... So up to
320000 instructions (ie. binary size of ~1.3MB) it uses as many partitions
as possible of 10000 insns, after that it uses 32 partitions until 32000000
instructions...

Wilco
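The range over which the partition count sits at its cap follows directly from the formula: the count reaches lto-partitions once the input is lto-partitions times the minimum size, and stays there until partitions hit the maximum size. A small Python sketch of that arithmetic follows (not from the thread). The maximum of 1000000 is an assumption inferred from the "hundred times the minimum" remark earlier in the discussion, and 4 bytes per instruction is an assumed fixed-width encoding used only to turn estimated instruction counts into a rough code size.

```python
LTO_PARTITIONS = 32             # default partition count mentioned in the thread
MIN_PARTITION_SIZE = 10_000     # lto-min-partition on trunk, per the thread
MAX_PARTITION_SIZE = 1_000_000  # assumption: a hundred times the minimum
BYTES_PER_INSN = 4              # assumption: fixed-width 4-byte instructions


def plateau_bounds(partitions: int, min_size: int, max_size: int) -> tuple[int, int]:
    """Input sizes (estimated insns) between which the count stays at `partitions`."""
    return partitions * min_size, partitions * max_size


def approx_code_bytes(insns: int, bytes_per_insn: int = BYTES_PER_INSN) -> int:
    """Very rough code size for a given estimated instruction count."""
    return insns * bytes_per_insn


if __name__ == "__main__":
    low, high = plateau_bounds(LTO_PARTITIONS, MIN_PARTITION_SIZE, MAX_PARTITION_SIZE)
    print(f"32 partitions from {low} insns (~{approx_code_bytes(low) / 1e6:.2f} MB)")
    print(f"            up to {high} insns (~{approx_code_bytes(high) / 1e6:.0f} MB)")
    print(f"ratio between the two ends: {high // low}x")
```

Under these assumptions the count stays at 32 from about 320000 to 32000000 estimated instructions, a factor of one hundred, which is the "two orders of magnitude" span criticised earlier in the thread.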
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> wrote:

> Hi Wilco,
> I am working on LTO varpool partitioning to improve performance for
> section anchors.
> I posted a preliminary patch at:
> https://gcc.gnu.org/ml/gcc/2016-07/msg00033.html
> Unfortunately I haven't been able to benchmark it on ARM yet.
> I am planning to resume working on it soon.

Thanks, I'll have a look. However I'm not 100% convinced smarter symbol
partitioning is the best way forward. Although it should help, it doesn't take
into account which symbols are currently suitable as anchors (-fcommon is still
the default, and big arrays are not suitable). And you still have to make
difficult choices for symbols that are frequently used across most partitions.

So I believe the best solution is to assign anchors early on so that all
partitions can make use of anchors. Assuming we sort symbols on size and
frequency, it should be feasible to use a single anchor for all simple integer
global variables across the whole application. Assigning early should also allow
common variables to be used in anchors, further increasing the benefit. Do you
think that is feasible?

> Building with a single partition is not scalable. LTO build of
> chromium with x86->arm
> cross with a single partition results in "branch out of range"
> assembler error. I added lto-max-partition
> primarily to work around that limitation.

Yes, GCC doesn't split huge compilation units into multiple text sections so
that the linker can insert long branch veneers. So it's a workaround for LTO but
most RISC targets can still hit the same issue with a single huge file.

Wilco
diff --git a/gcc/params.def b/gcc/params.def
index 79b7dd4cca9ec1bb67a64725fb1a596b6e937419..da8fd1825e15f2aa800b1c8b680985776c1080ed 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1045,7 +1045,7 @@ DEFPARAM (PARAM_LTO_PARTITIONS,
 DEFPARAM (MIN_PARTITION_SIZE,
          "lto-min-partition",
          "Minimal size of a partition for LTO (in estimated instructions).",
-         10000, 0, 0)
+         50000, 0, 0)
 
 DEFPARAM (MAX_PARTITION_SIZE,
          "lto-max-partition",