Message ID | CAAe5K+WL3a1uMUrX-K7vgrr5T9RTrZH1rCYsej-mTdC_HtfigQ@mail.gmail.com |
---|---|
State | New |
Headers | show |
On 11/19/13 10:24, Teresa Johnson wrote: > On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Martin, >> can you, please, generate the updated systemtap with >> -freorder-blocks-and-partition enabled? >> >> I am in favour of enabling this - it is usefull pass and it is pointless ot >> have passes that are not enabled by default. >> Is there reason why this would not work on other ELF target? Is it working >> with Darwin and Windows? > > I don't know how to test these (I don't see any machines listed in the > gcc compile farm of those types). For Windows, I assume you mean > MinGW, which should be enabled as it is under i386. Should I disable > it there and for Darwin? > >> >>> This patch enables -freorder-blocks-and-partition by default for x86 >>> at -O2 and up. It is showing some modest gains in cpu2006 performance >>> with profile feedback and -O2 on an Intel Westmere system. Specifically, >>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk >>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions. >> >> This actually sounds very good ;) >> >> Lets see how the systemtap graphs goes. If we will end up with problem >> of too many accesses to cold section, I would suggest making cold section >> subdivided into .unlikely and .unlikely.part (we could have better name) >> with the second consisting only of unlikely parts of hot&normal functions. >> >> This should reduce the problems we are seeing with mistakely identifying >> code to be cold because of roundoff errors (and it probably makes sense >> in general, too). >> We will however need to update gold and ld for that. > > Note that I don't think this would help much unless the linker is > changed to move the cold split section close to the hot section. There > is probably some fine-tuning we could do eventually in the linker > under -ffunction-sections without putting the split portions in a > separate section. I.e. clump the split parts together within unlikely. > But hopefully this can all be done later on as follow-on work to boost > the performance further. > >>> >>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal >>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were >>> configured with --enable-languages=all,obj-c++ and tested for both >>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}". >>> >>> It would be good to enable this for additional targets as a follow on, >>> but it needs more testing for both correctness and performance on those >>> other targets (i.e for correctness because I see a number of places >>> in other config/*/*.c files that do some special handling under this >>> option for different targets or simply disable it, so I am not sure >>> how well-tested it is under different architectural constraints). >>> >>> Ok for trunk? >>> >>> Thanks, >>> Teresa >>> >>> 2013-11-19 Teresa Johnson <tejohnson@google.com> >>> >>> * common/config/i386/i386-common.c: Enable >>> -freorder-blocks-and-partition at -O2 and up for x86. >>> * opts.c (finish_options): Only warn if -freorder-blocks-and- >>> partition was set on command line. >> >> You probably mis doc/invoke.texi update. >> Thank you for working on this! > > Yes, thanks. Here is the patch with the invoke.texi update. > > Teresa > > > 2013-11-19 Teresa Johnson <tejohnson@google.com> > > * common/config/i386/i386-common.c: Enable > -freorder-blocks-and-partition at -O2 and up for x86. > * doc/invoke.texi: Update -freorder-blocks-and-partition default. > * opts.c (finish_options): Only warn if -freorder-blocks-and- > partition was set on command line. This is fine. Glad to see we're getting some good benefits from this code. FWIW, I would expect the PA to show benefits from this work -- HP showed good results for block positioning and procedure splitting eons ago. A secondary effect you would expect to see would be an increase in the number of long branch stubs (static count), but a decrease in the number of those stubs that actually execute. Measuring those effects would obviously be out-of-scope. I totally understand that the PA is irrelevant these days, but seeing good results on another architecture would go a long way to answering the question "should this be enabled by default across the board?" jeff
On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <law@redhat.com> wrote: > On 11/19/13 10:24, Teresa Johnson wrote: >> >> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >>> >>> Martin, >>> can you, please, generate the updated systemtap with >>> -freorder-blocks-and-partition enabled? >>> >>> I am in favour of enabling this - it is usefull pass and it is pointless >>> ot >>> have passes that are not enabled by default. >>> Is there reason why this would not work on other ELF target? Is it >>> working >>> with Darwin and Windows? >> >> >> I don't know how to test these (I don't see any machines listed in the >> gcc compile farm of those types). For Windows, I assume you mean >> MinGW, which should be enabled as it is under i386. Should I disable >> it there and for Darwin? >> >>> >>>> This patch enables -freorder-blocks-and-partition by default for x86 >>>> at -O2 and up. It is showing some modest gains in cpu2006 performance >>>> with profile feedback and -O2 on an Intel Westmere system. Specifically, >>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk >>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions. >>> >>> >>> This actually sounds very good ;) >>> >>> Lets see how the systemtap graphs goes. If we will end up with problem >>> of too many accesses to cold section, I would suggest making cold section >>> subdivided into .unlikely and .unlikely.part (we could have better name) >>> with the second consisting only of unlikely parts of hot&normal >>> functions. >>> >>> This should reduce the problems we are seeing with mistakely identifying >>> code to be cold because of roundoff errors (and it probably makes sense >>> in general, too). >>> We will however need to update gold and ld for that. >> >> >> Note that I don't think this would help much unless the linker is >> changed to move the cold split section close to the hot section. There >> is probably some fine-tuning we could do eventually in the linker >> under -ffunction-sections without putting the split portions in a >> separate section. I.e. clump the split parts together within unlikely. >> But hopefully this can all be done later on as follow-on work to boost >> the performance further. >> >>>> >>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal >>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were >>>> configured with --enable-languages=all,obj-c++ and tested for both >>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}". >>>> >>>> It would be good to enable this for additional targets as a follow on, >>>> but it needs more testing for both correctness and performance on those >>>> other targets (i.e for correctness because I see a number of places >>>> in other config/*/*.c files that do some special handling under this >>>> option for different targets or simply disable it, so I am not sure >>>> how well-tested it is under different architectural constraints). >>>> >>>> Ok for trunk? >>>> >>>> Thanks, >>>> Teresa >>>> >>>> 2013-11-19 Teresa Johnson <tejohnson@google.com> >>>> >>>> * common/config/i386/i386-common.c: Enable >>>> -freorder-blocks-and-partition at -O2 and up for x86. >>>> * opts.c (finish_options): Only warn if -freorder-blocks-and- >>>> partition was set on command line. >>> >>> >>> You probably mis doc/invoke.texi update. >>> Thank you for working on this! >> >> >> Yes, thanks. Here is the patch with the invoke.texi update. >> >> Teresa >> >> >> 2013-11-19 Teresa Johnson <tejohnson@google.com> >> >> * common/config/i386/i386-common.c: Enable >> -freorder-blocks-and-partition at -O2 and up for x86. >> * doc/invoke.texi: Update -freorder-blocks-and-partition default. >> * opts.c (finish_options): Only warn if -freorder-blocks-and- >> partition was set on command line. > > This is fine. Glad to see we're getting some good benefits from this code. Ok, thanks. It is in as r205058. > > FWIW, I would expect the PA to show benefits from this work -- HP showed > good results for block positioning and procedure splitting eons ago. A > secondary effect you would expect to see would be an increase in the number > of long branch stubs (static count), but a decrease in the number of those > stubs that actually execute. Measuring those effects would obviously be > out-of-scope. > > I totally understand that the PA is irrelevant these days, but seeing good > results on another architecture would go a long way to answering the > question "should this be enabled by default across the board?" Yep, we saw benefits for IPF on hp-ux as well. It would be great to eventually enable this across the board and only selectively disable. I know there were people interested in trying this with ARM, that would be a good relevant arch to try this with now to see if it can be enabled more widely. Teresa > > jeff > >
> Dear Teresa and Jan, > I tried to test Teresa's patch, but I've encountered two bugs > during usage of -fprofile-generate/use (one in SPEC CPU 2006 and > Inkscape). Thanks, this is non-LTO run. Is there a chance to get -flto version, too? So we see how things combine with -freorder-function > > This will be probably for Jan: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59266 > > second one: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59265 > > There are numbers I recorded for GIMP with and without block reordering. > > GIMP (-freorder-blocks-and-partition) > pages read (no readahead): 597 pages (4K) > > GIMP (-no-freorder-blocks-and-partition) > pages read (no readahead): 596 pages (4K) The graphs themselves seems bit odd however, why do we have so many accesses to cold section with -fno-reorder-blocks-and-partition again? Honza > > Martin > > On 19 November 2013 23:18, Teresa Johnson <tejohnson@google.com> wrote: > > On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <law@redhat.com> wrote: > >> On 11/19/13 10:24, Teresa Johnson wrote: > >>> > >>> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > >>>> > >>>> Martin, > >>>> can you, please, generate the updated systemtap with > >>>> -freorder-blocks-and-partition enabled? > >>>> > >>>> I am in favour of enabling this - it is usefull pass and it is pointless > >>>> ot > >>>> have passes that are not enabled by default. > >>>> Is there reason why this would not work on other ELF target? Is it > >>>> working > >>>> with Darwin and Windows? > >>> > >>> > >>> I don't know how to test these (I don't see any machines listed in the > >>> gcc compile farm of those types). For Windows, I assume you mean > >>> MinGW, which should be enabled as it is under i386. Should I disable > >>> it there and for Darwin? > >>> > >>>> > >>>>> This patch enables -freorder-blocks-and-partition by default for x86 > >>>>> at -O2 and up. It is showing some modest gains in cpu2006 performance > >>>>> with profile feedback and -O2 on an Intel Westmere system. Specifically, > >>>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk > >>>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions. > >>>> > >>>> > >>>> This actually sounds very good ;) > >>>> > >>>> Lets see how the systemtap graphs goes. If we will end up with problem > >>>> of too many accesses to cold section, I would suggest making cold section > >>>> subdivided into .unlikely and .unlikely.part (we could have better name) > >>>> with the second consisting only of unlikely parts of hot&normal > >>>> functions. > >>>> > >>>> This should reduce the problems we are seeing with mistakely identifying > >>>> code to be cold because of roundoff errors (and it probably makes sense > >>>> in general, too). > >>>> We will however need to update gold and ld for that. > >>> > >>> > >>> Note that I don't think this would help much unless the linker is > >>> changed to move the cold split section close to the hot section. There > >>> is probably some fine-tuning we could do eventually in the linker > >>> under -ffunction-sections without putting the split portions in a > >>> separate section. I.e. clump the split parts together within unlikely. > >>> But hopefully this can all be done later on as follow-on work to boost > >>> the performance further. > >>> > >>>>> > >>>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal > >>>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were > >>>>> configured with --enable-languages=all,obj-c++ and tested for both > >>>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}". > >>>>> > >>>>> It would be good to enable this for additional targets as a follow on, > >>>>> but it needs more testing for both correctness and performance on those > >>>>> other targets (i.e for correctness because I see a number of places > >>>>> in other config/*/*.c files that do some special handling under this > >>>>> option for different targets or simply disable it, so I am not sure > >>>>> how well-tested it is under different architectural constraints). > >>>>> > >>>>> Ok for trunk? > >>>>> > >>>>> Thanks, > >>>>> Teresa > >>>>> > >>>>> 2013-11-19 Teresa Johnson <tejohnson@google.com> > >>>>> > >>>>> * common/config/i386/i386-common.c: Enable > >>>>> -freorder-blocks-and-partition at -O2 and up for x86. > >>>>> * opts.c (finish_options): Only warn if -freorder-blocks-and- > >>>>> partition was set on command line. > >>>> > >>>> > >>>> You probably mis doc/invoke.texi update. > >>>> Thank you for working on this! > >>> > >>> > >>> Yes, thanks. Here is the patch with the invoke.texi update. > >>> > >>> Teresa > >>> > >>> > >>> 2013-11-19 Teresa Johnson <tejohnson@google.com> > >>> > >>> * common/config/i386/i386-common.c: Enable > >>> -freorder-blocks-and-partition at -O2 and up for x86. > >>> * doc/invoke.texi: Update -freorder-blocks-and-partition default. > >>> * opts.c (finish_options): Only warn if -freorder-blocks-and- > >>> partition was set on command line. > >> > >> This is fine. Glad to see we're getting some good benefits from this code. > > > > Ok, thanks. It is in as r205058. > > > >> > >> FWIW, I would expect the PA to show benefits from this work -- HP showed > >> good results for block positioning and procedure splitting eons ago. A > >> secondary effect you would expect to see would be an increase in the number > >> of long branch stubs (static count), but a decrease in the number of those > >> stubs that actually execute. Measuring those effects would obviously be > >> out-of-scope. > >> > >> I totally understand that the PA is irrelevant these days, but seeing good > >> results on another architecture would go a long way to answering the > >> question "should this be enabled by default across the board?" > > > > Yep, we saw benefits for IPF on hp-ux as well. It would be great to > > eventually enable this across the board and only selectively disable. > > I know there were people interested in trying this with ARM, that > > would be a good relevant arch to try this with now to see if it can be > > enabled more widely. > > > > Teresa > > > >> > >> jeff > >> > >> > > > > > > > > -- > > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Thu, Nov 28, 2013 at 6:06 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Dear Teresa and Jan, >> I tried to test Teresa's patch, but I've encountered two bugs >> during usage of -fprofile-generate/use (one in SPEC CPU 2006 and >> Inkscape). > > Thanks, this is non-LTO run. Is there a chance to get -flto version, too? > So we see how things combine with -freorder-function >> >> This will be probably for Jan: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59266 >> >> second one: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59265 >> >> There are numbers I recorded for GIMP with and without block reordering. >> >> GIMP (-freorder-blocks-and-partition) >> pages read (no readahead): 597 pages (4K) >> >> GIMP (-no-freorder-blocks-and-partition) >> pages read (no readahead): 596 pages (4K) > > The graphs themselves seems bit odd however, why do we have so many accesses > to cold section with -fno-reorder-blocks-and-partition again? Comparing the two graphs I don't see additional accesses in the cold section from -freorder-blocks-and-partition. For the most part the graphs look identical. In contrast, the graphs Martin had generated with and without -freorder-blocks-and-partition back in August had a significant increase in execution out of text.unlikely. I'm wondering if the -fno-reorder-blocks-and-partition graph really had that disabled. I am surprised that the size of the .text and .text.hot did not shrink from splitting. And the accesses in the cold section in both graphs look suspiciously like the accesses we ended up with in the cold section when enabling -freorder-blocks-and-partition back in Aug (although there are certainly a lot fewer than before, which is good news). Martin, can you check that the binary used for -fno-reorder-blocks-and-partition really doesn't have any splitting? Thanks, Teresa > > Honza >> >> Martin >> >> On 19 November 2013 23:18, Teresa Johnson <tejohnson@google.com> wrote: >> > On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <law@redhat.com> wrote: >> >> On 11/19/13 10:24, Teresa Johnson wrote: >> >>> >> >>> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >>>> >> >>>> Martin, >> >>>> can you, please, generate the updated systemtap with >> >>>> -freorder-blocks-and-partition enabled? >> >>>> >> >>>> I am in favour of enabling this - it is usefull pass and it is pointless >> >>>> ot >> >>>> have passes that are not enabled by default. >> >>>> Is there reason why this would not work on other ELF target? Is it >> >>>> working >> >>>> with Darwin and Windows? >> >>> >> >>> >> >>> I don't know how to test these (I don't see any machines listed in the >> >>> gcc compile farm of those types). For Windows, I assume you mean >> >>> MinGW, which should be enabled as it is under i386. Should I disable >> >>> it there and for Darwin? >> >>> >> >>>> >> >>>>> This patch enables -freorder-blocks-and-partition by default for x86 >> >>>>> at -O2 and up. It is showing some modest gains in cpu2006 performance >> >>>>> with profile feedback and -O2 on an Intel Westmere system. Specifically, >> >>>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk >> >>>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions. >> >>>> >> >>>> >> >>>> This actually sounds very good ;) >> >>>> >> >>>> Lets see how the systemtap graphs goes. If we will end up with problem >> >>>> of too many accesses to cold section, I would suggest making cold section >> >>>> subdivided into .unlikely and .unlikely.part (we could have better name) >> >>>> with the second consisting only of unlikely parts of hot&normal >> >>>> functions. >> >>>> >> >>>> This should reduce the problems we are seeing with mistakely identifying >> >>>> code to be cold because of roundoff errors (and it probably makes sense >> >>>> in general, too). >> >>>> We will however need to update gold and ld for that. >> >>> >> >>> >> >>> Note that I don't think this would help much unless the linker is >> >>> changed to move the cold split section close to the hot section. There >> >>> is probably some fine-tuning we could do eventually in the linker >> >>> under -ffunction-sections without putting the split portions in a >> >>> separate section. I.e. clump the split parts together within unlikely. >> >>> But hopefully this can all be done later on as follow-on work to boost >> >>> the performance further. >> >>> >> >>>>> >> >>>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal >> >>>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were >> >>>>> configured with --enable-languages=all,obj-c++ and tested for both >> >>>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}". >> >>>>> >> >>>>> It would be good to enable this for additional targets as a follow on, >> >>>>> but it needs more testing for both correctness and performance on those >> >>>>> other targets (i.e for correctness because I see a number of places >> >>>>> in other config/*/*.c files that do some special handling under this >> >>>>> option for different targets or simply disable it, so I am not sure >> >>>>> how well-tested it is under different architectural constraints). >> >>>>> >> >>>>> Ok for trunk? >> >>>>> >> >>>>> Thanks, >> >>>>> Teresa >> >>>>> >> >>>>> 2013-11-19 Teresa Johnson <tejohnson@google.com> >> >>>>> >> >>>>> * common/config/i386/i386-common.c: Enable >> >>>>> -freorder-blocks-and-partition at -O2 and up for x86. >> >>>>> * opts.c (finish_options): Only warn if -freorder-blocks-and- >> >>>>> partition was set on command line. >> >>>> >> >>>> >> >>>> You probably mis doc/invoke.texi update. >> >>>> Thank you for working on this! >> >>> >> >>> >> >>> Yes, thanks. Here is the patch with the invoke.texi update. >> >>> >> >>> Teresa >> >>> >> >>> >> >>> 2013-11-19 Teresa Johnson <tejohnson@google.com> >> >>> >> >>> * common/config/i386/i386-common.c: Enable >> >>> -freorder-blocks-and-partition at -O2 and up for x86. >> >>> * doc/invoke.texi: Update -freorder-blocks-and-partition default. >> >>> * opts.c (finish_options): Only warn if -freorder-blocks-and- >> >>> partition was set on command line. >> >> >> >> This is fine. Glad to see we're getting some good benefits from this code. >> > >> > Ok, thanks. It is in as r205058. >> > >> >> >> >> FWIW, I would expect the PA to show benefits from this work -- HP showed >> >> good results for block positioning and procedure splitting eons ago. A >> >> secondary effect you would expect to see would be an increase in the number >> >> of long branch stubs (static count), but a decrease in the number of those >> >> stubs that actually execute. Measuring those effects would obviously be >> >> out-of-scope. >> >> >> >> I totally understand that the PA is irrelevant these days, but seeing good >> >> results on another architecture would go a long way to answering the >> >> question "should this be enabled by default across the board?" >> > >> > Yep, we saw benefits for IPF on hp-ux as well. It would be great to >> > eventually enable this across the board and only selectively disable. >> > I know there were people interested in trying this with ARM, that >> > would be a good relevant arch to try this with now to see if it can be >> > enabled more widely. >> > >> > Teresa >> > >> >> >> >> jeff >> >> >> >> >> > >> > >> > >> > -- >> > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > >
On 12/02/13 08:16, Teresa Johnson wrote: > > I'm wondering if the -fno-reorder-blocks-and-partition graph really > had that disabled. I am surprised that the size of the .text and > .text.hot did not shrink from splitting. Could be due to needing longer jump opcodes to reach the unlikely sections. jeff
Dear Teresa, I will today double check if the graphs are correct :) Martin On 2 December 2013 17:16, Jeff Law <law@redhat.com> wrote: > On 12/02/13 08:16, Teresa Johnson wrote: >> >> >> I'm wondering if the -fno-reorder-blocks-and-partition graph really >> had that disabled. I am surprised that the size of the .text and >> .text.hot did not shrink from splitting. > > Could be due to needing longer jump opcodes to reach the unlikely sections. > jeff >
Hello, I prepared a collection of systemtap graphs for GIMP. 1) just my profile-based function reordering: 550 pages 2) just -freorder-blocks-and-partitions: 646 pages 3) just -fno-reorder-blocks-and-partitions: 638 pages Please see attached data. Martin On 2 December 2013 17:52, Martin Liška <marxin.liska@gmail.com> wrote: > Dear Teresa, > I will today double check if the graphs are correct :) > > Martin > > On 2 December 2013 17:16, Jeff Law <law@redhat.com> wrote: >> On 12/02/13 08:16, Teresa Johnson wrote: >>> >>> >>> I'm wondering if the -fno-reorder-blocks-and-partition graph really >>> had that disabled. I am surprised that the size of the .text and >>> .text.hot did not shrink from splitting. >> >> Could be due to needing longer jump opcodes to reach the unlikely sections. >> jeff >>
On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: > Hello, > I prepared a collection of systemtap graphs for GIMP. > > 1) just my profile-based function reordering: 550 pages > 2) just -freorder-blocks-and-partitions: 646 pages > 3) just -fno-reorder-blocks-and-partitions: 638 pages > > Please see attached data. Thanks for the data. A few observations/questions: With both 1) (your (time-based?) reordering) and 2) (-freorder-blocks-and-partitions) there are a fair amount of accesses out of the cold section. I'm not seeing so many accesses out of the cold section in the apps I am looking at with splitting enabled. In the case of splitting, it could either be non-representative profile data or profile data that isn't being maintained properly and lost, although I think I fixed most of those. If you have identified any of the cold split routines that are being executed in the case of 2) it would be interesting to look at the dumps. With 2) there is also a big clump towards the end which is being executed out of the cold section, which again would be interesting to investigate. Why is your function reordering in 1) accessing more out of the cold section than 3) (-fno-reorder-blocks-and-partitions)? Both 2) and 3) have both normal .text and .text.hot, whereas 1) only has .text. I wonder if that is contributing to the higher number of pages either of these has compared to 1), since the non-cold addresses are distributed across both sections? Teresa > > Martin > > On 2 December 2013 17:52, Martin Liška <marxin.liska@gmail.com> wrote: >> Dear Teresa, >> I will today double check if the graphs are correct :) >> >> Martin >> >> On 2 December 2013 17:16, Jeff Law <law@redhat.com> wrote: >>> On 12/02/13 08:16, Teresa Johnson wrote: >>>> >>>> >>>> I'm wondering if the -fno-reorder-blocks-and-partition graph really >>>> had that disabled. I am surprised that the size of the .text and >>>> .text.hot did not shrink from splitting. >>> >>> Could be due to needing longer jump opcodes to reach the unlikely sections. >>> jeff >>>
> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: > > Hello, > > I prepared a collection of systemtap graphs for GIMP. > > > > 1) just my profile-based function reordering: 550 pages > > 2) just -freorder-blocks-and-partitions: 646 pages > > 3) just -fno-reorder-blocks-and-partitions: 638 pages > > > > Please see attached data. > > Thanks for the data. A few observations/questions: > > With both 1) (your (time-based?) reordering) and 2) > (-freorder-blocks-and-partitions) there are a fair amount of accesses > out of the cold section. I'm not seeing so many accesses out of the Good point, I misread the description and assumed that 1) is time profiling + reorder-blcoks-and-partition. Martin, what version of GCC you used? Rong introduced bug into libgcov that made gcov streaming around fork to split summaries (so the number of runs did not match). I fixed it by 2013-11-18 Jan Hubicka <jh@suse.cz> * libgcov-driver.c (run_accounted): Make global level static. (gcov_exit_merge_summary): Silence warning; do not clear run_accounted here. (gcov_exit): Clear it here. * libgcov-driver.c (gcov_exit_merge_summary): Fix setting run_accounted. * libgcov-driver.c (get_gcov_dump_complete): Update comments. (all_prg, crc32): Remove static vars. (gcov_exit_compute_summary): Rewrite to return crc32; do not clear all_prg. (gcov_exit_merge_gcda): Add crc32 parameter. (gcov_exit_merge_summary): Add crc32 and all_prg parameter; do not account run if it was already accounted. (gcov_exit_dump_gcov): Add crc32 and all_prg parameters. (gcov_exit): Initialize all_prg; update. so please be sure you have this one in tree. If you do, can you please repeat the trick with locked unlikely section so we see why we get there even with -fno-reorder-blcoks-and-partition? > cold section in the apps I am looking at with splitting enabled. In > the case of splitting, it could either be non-representative profile > data or profile data that isn't being maintained properly and lost, > although I think I fixed most of those. If you have identified any of > the cold split routines that are being executed in the case of 2) it > would be interesting to look at the dumps. > > With 2) there is also a big clump towards the end which is being > executed out of the cold section, which again would be interesting to > investigate. I think the data towards the end comes from fact that Martin is manually quiting gimp and at that time he may do it differently (and after different delay) each time. This makes the graphs harder to read, but one should basically ignore everything after the huge gap that indicate that the app has started. > > Why is your function reordering in 1) accessing more out of the cold > section than 3) (-fno-reorder-blocks-and-partitions)? This seems like a scale differnce here (i.e. Martin took longer to quit gimp in gimp_reorder_blocks_and_partition). In the first gap you can see several red dots aligned vertically. I did not go into manually counting the dots, but it seems that they should be about the same. > > Both 2) and 3) have both normal .text and .text.hot, whereas 1) only > has .text. I wonder if that is contributing to the higher number of > pages either of these has compared to 1), since the non-cold addresses > are distributed across both sections? Looking at Martin's tree https://github.com/marxin/gcc/blob/time-profiler-patch2/gcc/varasm.c#L536 he has a hack in disables startup/hot sections, but keeps unlikely. I agree that for more comparable results we should keep all. But first lets work out why we have unlikely section accesses in again... Martin, is there any chance you can test these things on mainline rather than patched trees? Honza
> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: > > Hello, > > I prepared a collection of systemtap graphs for GIMP. > > > > 1) just my profile-based function reordering: 550 pages > > 2) just -freorder-blocks-and-partitions: 646 pages > > 3) just -fno-reorder-blocks-and-partitions: 638 pages > > > > Please see attached data. > > Thanks for the data. A few observations/questions: > > With both 1) (your (time-based?) reordering) and 2) > (-freorder-blocks-and-partitions) there are a fair amount of accesses > out of the cold section. I'm not seeing so many accesses out of the > cold section in the apps I am looking at with splitting enabled. In I see you already comitted the patch, so perhaps Martin's measurement assume the pass is off by default? I rebuilded GCC with profiledboostrap and with the linkerscript unmapping text.unlikely. I get ICE in: (gdb) bt #0 diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108 #1 0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135 #2 0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110 #3 toplev_main(int, char**) () at ../../gcc/toplev.c:1922 #4 0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6 #5 0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122 That is relatively early in startup process. The function seems inlined and it fails only on second invocation, did not have time to investigate further, yet while without -fprofile-use it starts... On our periodic testers I see off-noise improvement in crafty 2200->2300 and regression on Vortex, 2900->2800, plus code size increase. Honza
On 15 December 2013 23:17, Martin Liška <marxin.liska@gmail.com> wrote: > Dear Jan and Teresa, > Jan was right that I've been using changes which were commited by > Teresa and do live in trunk. So the graph with time profile presented > in my previous post was really with enabled > -freorder-blocks-and-partition. I removed the hack in varasm.c and I > do use classic section layout. Please open the following dump > (includes PDF graph+html report that shows functions with time profile > located in cold section and all -fdump-ipa-all dumps): > > https://drive.google.com/file/d/0B0pisUJ80pO1YW1QWUFkZjdqME0/edit?usp=sharing > > Apart from that, I created also PDF graph (https://drive.google.com/file/d/0B0pisUJ80pO1aHhPWW56dXpLVTQ/edit?usp=sharing) that > shows that time profile is almost perfect for GIMP. I miss just some > examples that do not have profile in generate phase. > > I will merge current trunk and prepare final patch. > > Are there any other data that you want to be prepared? > > Martin > > > On 13 December 2013 02:13, Jan Hubicka <hubicka@ucw.cz> wrote: >>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: >>> > Hello, >>> > I prepared a collection of systemtap graphs for GIMP. >>> > >>> > 1) just my profile-based function reordering: 550 pages >>> > 2) just -freorder-blocks-and-partitions: 646 pages >>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages >>> > >>> > Please see attached data. >>> >>> Thanks for the data. A few observations/questions: >>> >>> With both 1) (your (time-based?) reordering) and 2) >>> (-freorder-blocks-and-partitions) there are a fair amount of accesses >>> out of the cold section. I'm not seeing so many accesses out of the >>> cold section in the apps I am looking at with splitting enabled. In >> >> I see you already comitted the patch, so perhaps Martin's measurement assume >> the pass is off by default? >> >> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping >> text.unlikely. I get ICE in: >> (gdb) bt >> #0 diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108 >> #1 0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135 >> #2 0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110 >> #3 toplev_main(int, char**) () at ../../gcc/toplev.c:1922 >> #4 0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6 >> #5 0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122 >> >> That is relatively early in startup process. The function seems inlined and >> it fails only on second invocation, did not have time to investigate further, >> yet while without -fprofile-use it starts... >> >> On our periodic testers I see off-noise improvement in crafty 2200->2300 >> and regression on Vortex, 2900->2800, plus code size increase. >> >> Honza
Thanks for the data. A few questions: - Do you have the raw data used to generate your pdfs available? Since you gave me the binaries, if I have the data in terms of exactly what addresses are being plotted I can correlate with the specific cold functions via nm. Once I know what cold functions are being hit, I would then need the .i files and the .gcda files to reproduce the build. - I tried running the binaries, but don't have the necessary shared library dependencies installed on my system: $ ldd gimp-2.8 | grep found libgimpwidgets-2.0.so.0 => not found libgimpconfig-2.0.so.0 => not found libgimpcolor-2.0.so.0 => not found libgimpmath-2.0.so.0 => not found libgimpthumb-2.0.so.0 => not found libgimpmodule-2.0.so.0 => not found libgimpbase-2.0.so.0 => not found libgegl-0.2.so.0 => not found libbabl-0.1.so.0 => not found I'll try to get these installed, but the last time I did that in an attempt to build gimp I had a lot of trouble trying to get the right versions and get them to build for me - any chance you could build an archive version of the gimp binary? Thanks, Teresa On Sun, Dec 15, 2013 at 2:19 PM, Martin Liška <marxin.liska@gmail.com> wrote: > On 15 December 2013 23:17, Martin Liška <marxin.liska@gmail.com> wrote: >> Dear Jan and Teresa, >> Jan was right that I've been using changes which were commited by >> Teresa and do live in trunk. So the graph with time profile presented >> in my previous post was really with enabled >> -freorder-blocks-and-partition. I removed the hack in varasm.c and I >> do use classic section layout. Please open the following dump >> (includes PDF graph+html report that shows functions with time profile >> located in cold section and all -fdump-ipa-all dumps): >> >> https://drive.google.com/file/d/0B0pisUJ80pO1YW1QWUFkZjdqME0/edit?usp=sharing >> >> Apart from that, I created also PDF graph (https://drive.google.com/file/d/0B0pisUJ80pO1aHhPWW56dXpLVTQ/edit?usp=sharing) that >> shows that time profile is almost perfect for GIMP. I miss just some >> examples that do not have profile in generate phase. >> >> I will merge current trunk and prepare final patch. >> >> Are there any other data that you want to be prepared? >> >> Martin >> >> >> On 13 December 2013 02:13, Jan Hubicka <hubicka@ucw.cz> wrote: >>>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: >>>> > Hello, >>>> > I prepared a collection of systemtap graphs for GIMP. >>>> > >>>> > 1) just my profile-based function reordering: 550 pages >>>> > 2) just -freorder-blocks-and-partitions: 646 pages >>>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages >>>> > >>>> > Please see attached data. >>>> >>>> Thanks for the data. A few observations/questions: >>>> >>>> With both 1) (your (time-based?) reordering) and 2) >>>> (-freorder-blocks-and-partitions) there are a fair amount of accesses >>>> out of the cold section. I'm not seeing so many accesses out of the >>>> cold section in the apps I am looking at with splitting enabled. In >>> >>> I see you already comitted the patch, so perhaps Martin's measurement assume >>> the pass is off by default? >>> >>> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping >>> text.unlikely. I get ICE in: >>> (gdb) bt >>> #0 diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108 >>> #1 0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135 >>> #2 0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110 >>> #3 toplev_main(int, char**) () at ../../gcc/toplev.c:1922 >>> #4 0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6 >>> #5 0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122 >>> >>> That is relatively early in startup process. The function seems inlined and >>> it fails only on second invocation, did not have time to investigate further, >>> yet while without -fprofile-use it starts... >>> >>> On our periodic testers I see off-noise improvement in crafty 2200->2300 >>> and regression on Vortex, 2900->2800, plus code size increase. >>> >>> Honza
Index: common/config/i386/i386-common.c =================================================================== --- common/config/i386/i386-common.c (revision 205001) +++ common/config/i386/i386-common.c (working copy) @@ -789,6 +789,8 @@ static const struct default_options ix86_option_op { /* Enable redundant extension instructions removal at -O2 and higher. */ { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 }, + /* Enable function splitting at -O2 and higher. */ + { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_and_partition, NULL, 1 }, /* Turn off -fschedule-insns by default. It tends to make the problem with not enough registers even worse. */ { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 }, Index: doc/invoke.texi =================================================================== --- doc/invoke.texi (revision 205001) +++ doc/invoke.texi (working copy) @@ -8172,6 +8172,8 @@ exception handling, for linkonce sections, for fun section attribute and on any architecture that does not support named sections. +Enabled for x86 at levels @option{-O2}, @option{-O3}. + @item -freorder-functions @opindex freorder-functions Reorder functions in the object file in order to Index: opts.c =================================================================== --- opts.c (revision 205001) +++ opts.c (working copy) @@ -737,9 +737,10 @@ finish_options (struct gcc_options *opts, struct g && opts->x_flag_reorder_blocks_and_partition && (ui_except == UI_SJLJ || ui_except >= UI_TARGET)) { - inform (loc, - "-freorder-blocks-and-partition does not work " - "with exceptions on this architecture"); + if (opts_set->x_flag_reorder_blocks_and_partition) + inform (loc, + "-freorder-blocks-and-partition does not work " + "with exceptions on this architecture"); opts->x_flag_reorder_blocks_and_partition = 0; opts->x_flag_reorder_blocks = 1; } @@ -752,9 +753,10 @@ finish_options (struct gcc_options *opts, struct g && opts->x_flag_reorder_blocks_and_partition && (ui_except == UI_SJLJ || ui_except >= UI_TARGET)) { - inform (loc, - "-freorder-blocks-and-partition does not support " - "unwind info on this architecture"); + if (opts_set->x_flag_reorder_blocks_and_partition) + inform (loc, + "-freorder-blocks-and-partition does not support " + "unwind info on this architecture"); opts->x_flag_reorder_blocks_and_partition = 0; opts->x_flag_reorder_blocks = 1; } @@ -769,9 +771,10 @@ finish_options (struct gcc_options *opts, struct g && targetm_common.unwind_tables_default && (ui_except == UI_SJLJ || ui_except >= UI_TARGET)))) { - inform (loc, - "-freorder-blocks-and-partition does not work " - "on this architecture"); + if (opts_set->x_flag_reorder_blocks_and_partition) + inform (loc, + "-freorder-blocks-and-partition does not work " + "on this architecture"); opts->x_flag_reorder_blocks_and_partition = 0; opts->x_flag_reorder_blocks = 1; }