Patchwork [RFC] Old school parallelization of WPA streaming

login
register
mail settings
Submitter Jan Hubicka
Date Aug. 21, 2013, 2:17 p.m.
Message ID <20130821141747.GD24782@kam.mff.cuni.cz>
Download mbox | patch
Permalink /patch/268816/
State New
Headers show

Comments

Jan Hubicka - Aug. 21, 2013, 2:17 p.m.
Hi,
this is my attempt to bring GCC into wonderful era of multicore CPUs :)
It is a hack, but it seems to help quite a lot.  About 50% of WPA time is spent
by streaming the individual ltrans .o files.  This can be easily parallelized
by fork - we do nothing afterwards, just exit and pass the list to the linker.

So until we are thread safe, perhaps this may be a solution? (or on unixish
systems probably it can be solution forever)  I added a logic parsing -flto=24
and do number of streaming processes user asked for.

For -flto=jobserver I simply fork all 32 processes.  It may not be a disaster,
but perhaps we should figure out how to communicate with jobserver.  At first
glance on document on how it works, it seems easy to add. Perhaps we can even
convicne GNU Make folks to put simple helpers to libiberty?

We also may figure out number of CPUs (is it available i.e. from libgomp)
and use it by default even if user do not care to pass number of processes.
Naturally these streaming forks should be cheap memory wise. I hope Martin
will get me some actual numbers.

With the patch the WPA time of firefox goes down to 2 minutes (4.8 needs about
30 minutes and without the hack one needs about 5 minutes)

Before:
Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1398 kB ( 0%) ggc
 phase opt and generate  :  39.73 (17%) usr   0.49 ( 3%) sys  40.26 (16%) wall  347726 kB ( 5%) ggc
 phase stream in         :  82.43 (35%) usr   2.15 (14%) sys  84.62 (34%) wall 5970152 kB (94%) ggc
 phase stream out        : 114.05 (48%) usr  12.86 (83%) sys 127.26 (50%) wall    6868 kB ( 0%) ggc
 garbage collection      :   3.07 ( 1%) usr   0.00 ( 0%) sys   3.08 ( 1%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.34 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall      30 kB ( 0%) ggc
 ipa dead code removal   :   4.91 ( 2%) usr   0.11 ( 1%) sys   5.16 ( 2%) wall     113 kB ( 0%) ggc
 ipa inheritance graph   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall     927 kB ( 0%) ggc
 ipa virtual call target :   5.11 ( 2%) usr   0.05 ( 0%) sys   4.99 ( 2%) wall   55296 kB ( 1%) ggc
 ipa cp                  :   2.65 ( 1%) usr   0.17 ( 1%) sys   2.80 ( 1%) wall  188629 kB ( 3%) ggc
 ipa inlining heuristics :  18.49 ( 8%) usr   0.29 ( 2%) sys  18.79 ( 7%) wall  439981 kB ( 7%) ggc
 ipa lto gimple in       :   0.12 ( 0%) usr   0.01 ( 0%) sys   0.15 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  16.66 ( 7%) usr   1.26 ( 8%) sys  17.97 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  68.70 (29%) usr   1.50 (10%) sys  70.23 (28%) wall 5181795 kB (82%) ggc
 ipa lto decl out        :  93.09 (39%) usr   4.93 (32%) sys  98.07 (39%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.65 ( 1%) usr   0.27 ( 2%) sys   1.92 ( 1%) wall  428974 kB ( 7%) ggc
 ipa lto decl merge      :   3.66 ( 2%) usr   0.00 ( 0%) sys   3.65 ( 1%) wall    8288 kB ( 0%) ggc
 ipa lto cgraph merge    :   3.42 ( 1%) usr   0.00 ( 0%) sys   3.42 ( 1%) wall   13725 kB ( 0%) ggc
 whopr wpa               :   3.58 ( 2%) usr   0.02 ( 0%) sys   3.59 ( 1%) wall    6871 kB ( 0%) ggc
 whopr wpa I/O           :   0.99 ( 0%) usr   6.65 (43%) sys   7.92 ( 3%) wall       0 kB ( 0%) ggc 
 whopr partitioning      :   2.63 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall       0 kB ( 0%) ggc
 ipa reference           :   3.08 ( 1%) usr   0.08 ( 1%) sys   3.18 ( 1%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   2.63 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall       0 kB ( 0%) ggc
 ipa reference           :   3.08 ( 1%) usr   0.08 ( 1%) sys   3.18 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.43 ( 0%) usr   0.05 ( 0%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.00 ( 1%) usr   0.06 ( 0%) sys   3.07 ( 1%) wall       0 kB ( 0%) ggc
 varconst                :   0.03 ( 0%) usr   0.04 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 236.22            15.50           252.15            6326146 kB

after:
Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  35.49 (28%) usr   0.44 ( 6%) sys  35.95 (26%) wall  313971 kB ( 5%) ggc
 phase stream in         :  82.98 (64%) usr   2.10 (30%) sys  85.13 (61%) wall 5969191 kB (95%) ggc
 phase stream out        :  10.37 ( 8%) usr   4.49 (64%) sys  17.33 (13%) wall    5813 kB ( 0%) ggc
 garbage collection      :   3.00 ( 2%) usr   0.00 ( 0%) sys   2.99 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall      30 kB ( 0%) ggc
 ipa dead code removal   :   4.91 ( 4%) usr   0.10 ( 1%) sys   5.04 ( 4%) wall     114 kB ( 0%) ggc
 ipa inheritance graph   :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall     792 kB ( 0%) ggc
 ipa virtual call target :   2.14 ( 2%) usr   0.01 ( 0%) sys   2.15 ( 2%) wall   21661 kB ( 0%) ggc
 ipa cp                  :   2.34 ( 2%) usr   0.18 ( 3%) sys   2.52 ( 2%) wall  188629 kB ( 3%) ggc
 ipa inlining heuristics :  18.43 (14%) usr   0.26 ( 4%) sys  18.68 (13%) wall  439993 kB ( 7%) ggc
 ipa lto gimple in       :   0.05 ( 0%) usr   0.05 ( 1%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.44 ( 0%) usr   0.06 ( 1%) sys   0.50 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  69.27 (54%) usr   1.52 (22%) sys  70.87 (51%) wall 5180837 kB (82%) ggc
 ipa lto decl out        :   7.77 ( 6%) usr   0.51 ( 7%) sys   8.28 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.71 ( 1%) usr   0.19 ( 3%) sys   1.90 ( 1%) wall  428974 kB ( 7%) ggc
 ipa lto decl merge      :   3.66 ( 3%) usr   0.00 ( 0%) sys   3.67 ( 3%) wall    8288 kB ( 0%) ggc
 ipa lto cgraph merge    :   3.40 ( 3%) usr   0.00 ( 0%) sys   3.39 ( 2%) wall   13725 kB ( 0%) ggc
 whopr wpa               :   3.19 ( 2%) usr   0.00 ( 0%) sys   3.19 ( 2%) wall    5816 kB ( 0%) ggc
 whopr wpa I/O           :   0.00 ( 0%) usr   3.92 (56%) sys   6.39 ( 5%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   1.44 ( 1%) usr   0.02 ( 0%) sys   1.45 ( 1%) wall       0 kB ( 0%) ggc 
 ipa reference           :   2.64 ( 2%) usr   0.08 ( 1%) sys   2.74 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.46 ( 0%) usr   0.02 ( 0%) sys   0.47 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.02 ( 2%) usr   0.05 ( 1%) sys   3.08 ( 2%) wall       0 kB ( 0%) ggc
 varconst                :   0.06 ( 0%) usr   0.06 ( 1%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 128.85             7.05           138.45            6290376 kB

real 7m42.126s user 53m13.816s sys 2m16.993s

Seems almost like we are at level WPA does something useful.  IPA inlining has definitely room
for improvement. I think to significantly speedup LTO decl in, I think we need to implement
Richard's original idea of compoaring the SCC in pickled version and materializing only
theose that are unique.

Looks resonable?

Honza

	* lto-cgraph.c (asm_nodes_output): Make global.
	* lto-streamer.h (asm_nodes_output): Declare.
	* lto-wrapper.c (parallel, jobserver): Make global.
	(run_gcc): Pass down -fparallelism

	* lto.c (lto_parallelism): New variable.
	(do_stream_out): New function.
	(stream_out): New function.
	(lto_wpa_write_files): Use it.
	* lang.opt (fparallelism): New.
	* lto.h (lto_parallelism): Declare.
	* lto-lang.c (lto_handle_option): Add fparalelism.
Andi Kleen - Aug. 21, 2013, 2:58 p.m.
On Wed, Aug 21, 2013 at 04:17:48PM +0200, Jan Hubicka wrote:
> Hi,
> this is my attempt to bring GCC into wonderful era of multicore CPUs :)
> It is a hack, but it seems to help quite a lot.  About 50% of WPA time is spent
> by streaming the individual ltrans .o files.  This can be easily parallelized
> by fork - we do nothing afterwards, just exit and pass the list to the linker.

One risk is if someone streams to a spinning disk it may add more seeks for 
the parallel IO. But I think it's a reasonable tradeoffs.

We should also use a faster compressor

> For -flto=jobserver I simply fork all 32 processes.  It may not be a disaster,
> but perhaps we should figure out how to communicate with jobserver.  At first
> glance on document on how it works, it seems easy to add. Perhaps we can even
> convicne GNU Make folks to put simple helpers to libiberty?

lto=jobserver is still broken and confuses tokens on large builds (ends
with a 0 read) I did some debugging recently, and I suspect a Linux kernel
bug now. Still haven't tracked it down.

Any workarounds would need make changs unfortunately.

> 
> We also may figure out number of CPUs (is it available i.e. from libgomp)

sysconf(_SC_NPROCESSORS_ONLN) ? 

> and use it by default even if user do not care to pass number of processes.
> Naturally these streaming forks should be cheap memory wise. I hope Martin
> will get me some actual numbers.
> 
> With the patch the WPA time of firefox goes down to 2 minutes (4.8 needs about
> 30 minutes and without the hack one needs about 5 minutes)

Cool!

I'll try it on my builds
>  
> +fparallelism=
> +LTO Joined
> +Run the link-time optimizer in whole program analysis (WPA) mode.

The description does not make sense

Rest of patch looks good from a quick read, although I would prefer to 
do the waiting for children in the "parent", not the "last one"

-Andi
Richard Guenther - Aug. 21, 2013, 3:55 p.m.
Andi Kleen <ak@linux.intel.com> wrote:
>On Wed, Aug 21, 2013 at 04:17:48PM +0200, Jan Hubicka wrote:
>> Hi,
>> this is my attempt to bring GCC into wonderful era of multicore CPUs
>:)
>> It is a hack, but it seems to help quite a lot.  About 50% of WPA
>time is spent
>> by streaming the individual ltrans .o files.  This can be easily
>parallelized
>> by fork - we do nothing afterwards, just exit and pass the list to
>the linker.
>
>One risk is if someone streams to a spinning disk it may add more seeks
>for 
>the parallel IO. But I think it's a reasonable tradeoffs.

It'll also wreck all WPA dump files.

>We should also use a faster compressor

And we should avoid uncompressing the function sections...

That said, the patch is enough of a hack that I don't ever want to debug a bug in it....

I also fail to see why threads should not work here.  Maybe simply annotate gcc with openmp?

Richard.

>> For -flto=jobserver I simply fork all 32 processes.  It may not be a
>disaster,?
>> but perhaps we should figure out how to communicate with jobserver. 
>At first
>> glance on document on how it works, it seems easy to add. Perhaps we
>can even
>> convicne GNU Make folks to put simple helpers to libiberty?
>
>lto=jobserver is still broken and confuses tokens on large builds (ends
>with a 0 read) I did some debugging recently, and I suspect a Linux
>kernel
>bug now. Still haven't tracked it down.
>
>Any workarounds would need make changs unfortunately.
>
>> 
>> We also may figure out number of CPUs (is it available i.e. from
>libgomp)
>
>sysconf(_SC_NPROCESSORS_ONLN) ? 
>
>> and use it by default even if user do not care to pass number of
>processes.
>> Naturally these streaming forks should be cheap memory wise. I hope
>Martin
>> will get me some actual numbers.
>> 
>> With the patch the WPA time of firefox goes down to 2 minutes (4.8
>needs about
>> 30 minutes and without the hack one needs about 5 minutes)
>
>Cool!
>
>I'll try it on my builds
>>  
>> +fparallelism=
>> +LTO Joined
>> +Run the link-time optimizer in whole program analysis (WPA) mode.
>
>The description does not make sense
>
>Rest of patch looks good from a quick read, although I would prefer to 
>do the waiting for children in the "parent", not the "last one"
>
>-Andi
Andi Kleen - Aug. 21, 2013, 4:15 p.m.
> I also fail to see why threads should not work here.  Maybe simply annotate gcc with openmp?

Don't you have to set a environment variable to set the number of threads
for openmp?

Otherwise it sounds like a reasonable way to do it.

-Andi
Jan Hubicka - Aug. 21, 2013, 9:53 p.m.
> >
> >One risk is if someone streams to a spinning disk it may add more seeks
> >for 
> >the parallel IO. But I think it's a reasonable tradeoffs.
> 
> It'll also wreck all WPA dump files.

We do not dump anything during the main streaming.  If we now stream 2GB for firefox,
I think we can hope to mostly fit in cache with the whole machinery.

We will need to flush cgraph file prior forking and close it in forked process.
It is only one that remains cross fork boundary IMO.
> 
> >We should also use a faster compressor
> 
> And we should avoid uncompressing the function sections...

Yep, we also need to avoid carring whole tree stream of the original source
unit whenever we stream out function from it.  I think function sections should
have two parts - the references to global trees that is uncompressed and
transleted during WPA streaming plus compressed binary blob with the body that
is copied over.
> 
> That said, the patch is enough of a hack that I don't ever want to debug a bug in it....
> 
> I also fail to see why threads should not work here.  Maybe simply annotate gcc with openmp?

It means pushing global state of lto-streamer into a context variable + moving
it out of GGC or making GGC thread safe.  I would hope that David Malcolm would
be interested in doing this, but it is bit more I have time for right now during
the labs conference.

To be honest I fail to see how bug in openmp annotated program would be easier
to debug than the fork variant.

Honza
Jan Hubicka - Aug. 21, 2013, 9:55 p.m.
> 
> We should also use a faster compressor

Yep, at least once it arrives higher in profiles. So far other stuff is a lot slower.
> 
> > For -flto=jobserver I simply fork all 32 processes.  It may not be a disaster,
> > but perhaps we should figure out how to communicate with jobserver.  At first
> > glance on document on how it works, it seems easy to add. Perhaps we can even
> > convicne GNU Make folks to put simple helpers to libiberty?
> 
> lto=jobserver is still broken and confuses tokens on large builds (ends
> with a 0 read) I did some debugging recently, and I suspect a Linux kernel
> bug now. Still haven't tracked it down.
> 
> Any workarounds would need make changs unfortunately.
> 
> > 
> > We also may figure out number of CPUs (is it available i.e. from libgomp)
> 
> sysconf(_SC_NPROCESSORS_ONLN) ? 

OK, thanks :)

> >  
> > +fparallelism=
> > +LTO Joined
> > +Run the link-time optimizer in whole program analysis (WPA) mode.
> 
> The description does not make sense

Yup, a psto.
> 
> Rest of patch looks good from a quick read, although I would prefer to 
> do the waiting for children in the "parent", not the "last one"

The parent process does all the forking + waiting.  Only the last section
is streamed by the parent process since I do not see reason for forking for
it.

Honza
Michael Matz - Aug. 28, 2013, 3:41 p.m.
Hi,

On Wed, 21 Aug 2013, Richard Biener wrote:

> I also fail to see why threads should not work here.  Maybe simply 
> annotate gcc with openmp?

Threads simply don't work here, because the whole streamer infrastructure 
(or anything else in GCC for that matter) isn't thread safe (you'd have to 
have multiple streamer objects, multiple SCC finder objects, and you'd 
have to audit everything for not using any other shared resources).  
Fork-fire-forget is really a much simpler choice here IMO; no worries 
about shared resources, less debug hassle.


Ciao,
Michael.
Richard Guenther - Aug. 29, 2013, 7:33 a.m.
On Wed, 28 Aug 2013, Michael Matz wrote:

> Hi,
> 
> On Wed, 21 Aug 2013, Richard Biener wrote:
> 
> > I also fail to see why threads should not work here.  Maybe simply 
> > annotate gcc with openmp?
> 
> Threads simply don't work here, because the whole streamer infrastructure 
> (or anything else in GCC for that matter) isn't thread safe (you'd have to 
> have multiple streamer objects, multiple SCC finder objects, and you'd 
> have to audit everything for not using any other shared resources).

Hm, yeah, of course.

> Fork-fire-forget is really a much simpler choice here IMO; no worries 
> about shared resources, less debug hassle.

It might be not as cheap as it is on Linux hosts on other hosts of
course.  Also I'd rather try to avoid I/O than solving the issue
by parallelizing it.  Of course we can always come back to this
kind of hack later.

Richard.
Jan Hubicka - Aug. 29, 2013, 12:51 p.m.
Jakub,
I am adding you to CC since I put my current toughts on LTO and debug info
in here.
> > Fork-fire-forget is really a much simpler choice here IMO; no worries 
> > about shared resources, less debug hassle.
> 
> It might be not as cheap as it is on Linux hosts on other hosts of
> course.  Also I'd rather try to avoid I/O than solving the issue

I still have some items on list here
 1) avoid function sections to be decompressed by WPA
    (this won't cause much compile time improvements as decompression is
     well bellow 10% of runtime)
 2) put variable initializers into named sections just as function bodies
    are.
    Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
    initializers are actually about as big as the text segment.  While
    it seems bit wasteful to pust single integer_cst there (and we can
    special case this), it seems that there is a promise for vtables
    and other stuff.

    To make devirt work, we will need to load vtables into memory (or
    invent representation to stream them other way that would be similarly
    big). Still we will avoid need to load them in 5000 copies and merge
    them.
 3) I think good part of function/partitioning overhead is because abstract
    origin streaming is utterly broken.

    Currently we can have DECL_ABSTRACT_ORIGIN on a function.  This I can now
    track by used_as_abstract_origin flag and I can stream those functions
    into partitins using them.

    This is still wrong for multitude of reasons

    1) we really want DECL_INITIAL tree of the functions used as abstract
       origins in the form before any gimple optimizations happened on them.
       (that is when debug hook is called)
       This is not what happens - we stream the tree as it looks during
       TLO streaming time - i.e. after early optimizations.

       I think we may just (at a time calling the debug hook) duplicate DECL_INITIAL
       same way we duplicate decls for save_function_body and saving it elsewhere.
       Making this tree to be abstract origin of the offline copy of the function itself.

    2) dwarf2out doesn't really the DECL_INITIAL tree so it does something useful
       only when it is already there. 
       It can simply call cgraph_get_body when it needs the DECL_INITIAL, but it
       doesn't becuase push_cfun causes ICE.
       If we really can't push_cfun from middle of RTL queueu, I suppose I can
       just save it elsewhere

    3) It is not only toplevel decl that has origin, but all local vars in the
       function.

       I think this goes terribly wrong - these decls are not indexable so they
       are stored into function section of every function referring to them.
       They are then read in many duplicates and never merged with the DECL_INITIAL
       tree of the actual abstract origin. For some reason dwarf2out doesn't
       seem to ICE, but I also do not see how this can produce working debug.
       Moreover I think the duplicates contribute to our current debug info
       size problems with LTO.

       If we solve 1) as discussed by above (i.e. by having separate
       block trees for functions that are abstract origins), we can then solve 3)
       by streaming those into global decl stream and make cross-function_context
       tree references to become global.

    4) Of course after early inlining function may need abstract origins from
       multiple other functions.  I do not track this at all.
       May be easy to just collect a vector of functions that are needed into
       cgraph_node.

    Of course solving 1)-4) is bit of early debug info without actually going to
    stream the dwarf dies, but by using the BLOCK trees as a temporary representation.
    Incrementally we can have this saved BLOCK tree to be a dwarf DIE and have
    origins to point to them instead of decls.

    To get resonable streaming performance it would be nice to have way to get
    abstract origin references cross-partition that debug info can accomplish.

Said that, I now have the fork() patch in all my trees and enjoy 50% faster
WPA times.  I changed my mind about claim that stremaing should be disk bound -
it is hard to hope for disk boundness for something that should fit in cache.

We went down from 5GB to 2GB of streaming for Firefox that is good.  But we will
see again 4GB once Martin's code layout work will land.  I think it is from good
part because of the origin fun above.

Honza

> by parallelizing it.  Of course we can always come back to this
> kind of hack later.
> 
> Richard.
Michael Matz - Aug. 29, 2013, 1:15 p.m.
Hi,

On Thu, 29 Aug 2013, Richard Biener wrote:

> > Fork-fire-forget is really a much simpler choice here IMO; no worries 
> > about shared resources, less debug hassle.
> 
> It might be not as cheap as it is on Linux hosts on other hosts of
> course.

Sure.  Don't use it there then.  Not a reason for not having the 
improvement on linux.

> Also I'd rather try to avoid I/O than solving the issue by parallelizing 
> it.

Of course.  There's always something still better.

> Of course we can always come back to this kind of hack later.

For 4.9 latest, if we don't have anything nicer by then.  OTOH we could 
also remove Honzas patch when and if something better comes around ;)


Ciao,
Michael.
Richard Guenther - Aug. 29, 2013, 1:39 p.m.
On Thu, 29 Aug 2013, Jan Hubicka wrote:

> Jakub,
> I am adding you to CC since I put my current toughts on LTO and debug info
> in here.
> > > Fork-fire-forget is really a much simpler choice here IMO; no worries 
> > > about shared resources, less debug hassle.
> > 
> > It might be not as cheap as it is on Linux hosts on other hosts of
> > course.  Also I'd rather try to avoid I/O than solving the issue
> 
> I still have some items on list here
>  1) avoid function sections to be decompressed by WPA
>     (this won't cause much compile time improvements as decompression is
>      well bellow 10% of runtime)

still low-hanging

finally get a LTO section header!  (with a flag telling whether the
section is compressed)

>  2) put variable initializers into named sections just as function bodies
>     are.
>     Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
>     initializers are actually about as big as the text segment.  While
>     it seems bit wasteful to pust single integer_cst there (and we can
>     special case this), it seems that there is a promise for vtables
>     and other stuff.
> 
>     To make devirt work, we will need to load vtables into memory (or
>     invent representation to stream them other way that would be similarly
>     big). Still we will avoid need to load them in 5000 copies and merge
>     them.
>  3) I think good part of function/partitioning overhead is because abstract
>     origin streaming is utterly broken.
> 
>     Currently we can have DECL_ABSTRACT_ORIGIN on a function.  This I can now
>     track by used_as_abstract_origin flag and I can stream those functions
>     into partitins using them.
> 
>     This is still wrong for multitude of reasons
> 
>     1) we really want DECL_INITIAL tree of the functions used as abstract
>        origins in the form before any gimple optimizations happened on them.
>        (that is when debug hook is called)
>        This is not what happens - we stream the tree as it looks during
>        TLO streaming time - i.e. after early optimizations.
> 
>        I think we may just (at a time calling the debug hook) duplicate DECL_INITIAL
>        same way we duplicate decls for save_function_body and saving it elsewhere.
>        Making this tree to be abstract origin of the offline copy of the function itself.
> 
>     2) dwarf2out doesn't really the DECL_INITIAL tree so it does something useful
>        only when it is already there. 
>        It can simply call cgraph_get_body when it needs the DECL_INITIAL, but it
>        doesn't becuase push_cfun causes ICE.
>        If we really can't push_cfun from middle of RTL queueu, I suppose I can
>        just save it elsewhere
> 
>     3) It is not only toplevel decl that has origin, but all local vars in the
>        function.
> 
>        I think this goes terribly wrong - these decls are not indexable so they
>        are stored into function section of every function referring to them.
>        They are then read in many duplicates and never merged with the DECL_INITIAL
>        tree of the actual abstract origin. For some reason dwarf2out doesn't
>        seem to ICE, but I also do not see how this can produce working debug.
>        Moreover I think the duplicates contribute to our current debug info
>        size problems with LTO.
> 
>        If we solve 1) as discussed by above (i.e. by having separate
>        block trees for functions that are abstract origins), we can then solve 3)
>        by streaming those into global decl stream and make cross-function_context
>        tree references to become global.
> 
>     4) Of course after early inlining function may need abstract origins from
>        multiple other functions.  I do not track this at all.
>        May be easy to just collect a vector of functions that are needed into
>        cgraph_node.
> 
>     Of course solving 1)-4) is bit of early debug info without actually going to
>     stream the dwarf dies, but by using the BLOCK trees as a temporary representation.
>     Incrementally we can have this saved BLOCK tree to be a dwarf DIE and have
>     origins to point to them instead of decls.
> 
>     To get resonable streaming performance it would be nice to have way to get
>     abstract origin references cross-partition that debug info can accomplish.

Most of the abstract origin stuff is dropped on the floor by streaming
because you cannot really stream that stuff.  And yes, we need early
debug info to generate the offline abstract origin copy of later inlined
functions, and yes, we have to handle streaming / referencing those in
some way.  But OTOH abstract origins are an optimization for debug info
size, so we can as well not have them.

> Said that, I now have the fork() patch in all my trees and enjoy 50% faster
> WPA times.  I changed my mind about claim that stremaing should be disk bound -
> it is hard to hope for disk boundness for something that should fit in cache.

It should at least limit its fork rate according to -flto=N or jobserver.

> We went down from 5GB to 2GB of streaming for Firefox that is good.  But we will
> see again 4GB once Martin's code layout work will land.  I think it is from good
> part because of the origin fun above.

Ugh.

Richard.
Jan Hubicka - Aug. 29, 2013, 1:58 p.m.
> > Said that, I now have the fork() patch in all my trees and enjoy 50% faster
> > WPA times.  I changed my mind about claim that stremaing should be disk bound -
> > it is hard to hope for disk boundness for something that should fit in cache.
> 
> It should at least limit its fork rate according to -flto=N or jobserver.
It limits forks to -flto=N.
If the patch seems resonable, I will look into posiblity of adding my jobserver client
based on GNU make code.

I also think with -flto we want wrapper to figure out number of threads and suppy
default =N (i.e. nonparallel lto would be -flto=0).  Most people don't want to worry
about =n/=jobserv parameters and those few projects that don't want to start too many
processes to not explode in memory use can get their build machinery right.

Honza
> 
> > We went down from 5GB to 2GB of streaming for Firefox that is good.  But we will
> > see again 4GB once Martin's code layout work will land.  I think it is from good
> > part because of the origin fun above.
> 
> Ugh.
> 
> Richard.
Andi Kleen - Aug. 30, 2013, 2:36 a.m.
On Thu, Aug 29, 2013 at 03:58:45PM +0200, Jan Hubicka wrote:
> > > Said that, I now have the fork() patch in all my trees and enjoy 50% faster
> > > WPA times.  I changed my mind about claim that stremaing should be disk bound -
> > > it is hard to hope for disk boundness for something that should fit in cache.
> > 
> > It should at least limit its fork rate according to -flto=N or jobserver.
> It limits forks to -flto=N.
> If the patch seems resonable, I will look into posiblity of adding my jobserver client
> based on GNU make code.
> 
> I also think with -flto we want wrapper to figure out number of threads and suppy
> default =N (i.e. nonparallel lto would be -flto=0).  Most people don't want to worry
> about =n/=jobserv parameters and those few projects that don't want to start too many
> processes to not explode in memory use can get their build machinery right.
> 

Job server should do that already. You get whatever the user specifies with -j
on the top level make. That's imho the right area to control this.

The only problem is we need to work around the jobserver pipe bug first
I suspect this may need a change in make :-/

-Andi

Patch

Index: lto-cgraph.c
===================================================================
--- lto-cgraph.c	(revision 201891)
+++ lto-cgraph.c	(working copy)
@@ -50,6 +50,9 @@  along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* True when asm nodes has been output.  */
+bool asm_nodes_output = false;
+
 static void output_cgraph_opt_summary (void);
 static void input_cgraph_opt_summary (vec<symtab_node>  nodes);
 
@@ -852,7 +855,6 @@  output_symtab (void)
   lto_symtab_encoder_iterator lsei;
   int i, n_nodes;
   lto_symtab_encoder_t encoder;
-  static bool asm_nodes_output = false;
 
   if (flag_wpa)
     output_cgraph_opt_summary ();
Index: lto-streamer.h
===================================================================
--- lto-streamer.h	(revision 201891)
+++ lto-streamer.h	(working copy)
@@ -870,6 +870,7 @@  void lto_output_location (struct output_
 
 
 /* In lto-cgraph.c  */
+extern bool asm_nodes_output;
 lto_symtab_encoder_t lto_symtab_encoder_new (bool);
 int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node);
 void lto_symtab_encoder_delete (lto_symtab_encoder_t);
Index: lto-wrapper.c
===================================================================
--- lto-wrapper.c	(revision 201891)
+++ lto-wrapper.c	(working copy)
@@ -56,6 +56,9 @@  along with GCC; see the file COPYING3.
 
 int debug;				/* true if -save-temps.  */
 int verbose;				/* true if -v.  */
+int parallel = 0;			/* number of parallel builds specified
+					   by -flto=N  */
+int jobserver = 0;			/* true if -flto=jobserver was used.  */
 
 enum lto_mode_d {
   LTO_MODE_NONE,			/* Not doing LTO.  */
@@ -445,8 +448,6 @@  run_gcc (unsigned argc, char *argv[])
   char *list_option_full = NULL;
   const char *linker_output = NULL;
   const char *collect_gcc, *collect_gcc_options;
-  int parallel = 0;
-  int jobserver = 0;
   bool no_partition = false;
   struct cl_decoded_option *fdecoded_options = NULL;
   unsigned int fdecoded_options_count = 0;
@@ -630,6 +631,16 @@  run_gcc (unsigned argc, char *argv[])
 	      if (parallel <= 1)
 		parallel = 0;
 	    }
+	  if (jobserver)
+	    {
+	      obstack_ptr_grow (&argv_obstack, xstrdup ("-fparallelism=jobserver"));
+	    }
+	  else if (parallel > 1)
+	    {
+	      char buf[256];
+	      sprintf (buf, "-fparallelism=%i", parallel);
+	      obstack_ptr_grow (&argv_obstack, xstrdup (buf));
+	    }
 	  /* Fallthru.  */
 
 	case OPT_flto:
Index: lto/lto.c
===================================================================
--- lto/lto.c	(revision 201891)
+++ lto/lto.c	(working copy)
@@ -49,6 +49,9 @@  along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
+int lto_parallelism;
+
 static GTY(()) tree first_personality_decl;
 
 /* Returns a hash code for P.  */
@@ -3002,6 +3005,98 @@  cmp_partitions_order (const void *a, con
   return orderb - ordera;
 }
 
+/* Actually stream out ENCODER into TEMP_FILENAME.  */
+
+void
+do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
+{
+  lto_file *file = lto_obj_file_open (temp_filename, true);
+  if (!file)
+    fatal_error ("lto_obj_file_open() failed");
+  lto_set_current_out_file (file);
+
+  ipa_write_optimization_summaries (encoder);
+
+  lto_set_current_out_file (NULL);
+  lto_obj_file_close (file);
+  free (file);
+}
+
+/* Wait for forked process and signal errors.  */
+#ifdef HAVE_WORKING_FORK
+void
+wait_for_child ()
+{
+  int status;
+  do
+    {
+      int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
+      if (w == -1)
+	fatal_error ("waitpid failed");
+
+      if (WIFEXITED (status) && WEXITSTATUS (status))
+	fatal_error ("streaming subprocess failed");
+      else if (WIFSIGNALED (status))
+	fatal_error ("streaming subprocess was killed by signal");
+    }
+  while (!WIFEXITED(status) && !WIFSIGNALED(status));
+}
+#endif
+
+/* Stream out ENCODER into TEMP_FILENAME
+   Fork if that seems to help.  */
+
+void
+stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
+{
+#ifdef HAVE_WORKING_FORK
+  static int nruns;
+
+  if (!lto_parallelism || lto_parallelism == 1)
+    {
+      do_stream_out (temp_filename, encoder);
+      return;
+    }
+
+  /* Do not run more than LTO_PARALLELISM streamings
+     FIXME: we ignore limits on jobserver.  */
+  if (lto_parallelism > 0 && nruns >= lto_parallelism)
+    {
+      wait_for_child ();
+      nruns --;
+    }
+  /* If this is not the last parallel partition, execute new
+     streaming process.  */
+  if (!last)
+    {
+      pid_t cpid = fork ();
+
+      if (!cpid)
+	{
+	  setproctitle ("lto1-wpa-streaming");
+	  do_stream_out (temp_filename, encoder);
+	  exit (0);
+	}
+      /* Fork failed; lets do the job ourseleves.  */
+      else if (cpid == -1)
+        do_stream_out (temp_filename, encoder);
+      else
+	nruns++;
+    }
+  /* Last partition; stream it and wait for all children to die.  */
+  else
+    {
+      int i;
+      do_stream_out (temp_filename, encoder);
+      for (i = 0; i < nruns; i++)
+	wait_for_child ();
+    }
+  asm_nodes_output = true;
+#else
+  do_stream_out (temp_filename, encoder);
+#endif
+}
+
 /* Write all output files in WPA mode and the file with the list of
    LTRANS units.  */
 
@@ -3009,18 +3104,15 @@  static void
 lto_wpa_write_files (void)
 {
   unsigned i, n_sets;
-  lto_file *file;
   ltrans_partition part;
   FILE *ltrans_output_list_stream;
   char *temp_filename;
+  vec <char *>temp_filenames = vNULL;
   size_t blen;
 
   /* Open the LTRANS output list.  */
   if (!ltrans_output_list)
     fatal_error ("no LTRANS output list filename provided");
-  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
-  if (ltrans_output_list_stream == NULL)
-    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
 
   timevar_push (TV_WHOPR_WPA);
 
@@ -3056,14 +3148,10 @@  lto_wpa_write_files (void)
 			   : cmp_partitions_order);
   for (i = 0; i < n_sets; i++)
     {
-      size_t len;
       ltrans_partition part = ltrans_partitions[i];
 
       /* Write all the nodes in SET.  */
       sprintf (temp_filename + blen, "%u.o", i);
-      file = lto_obj_file_open (temp_filename, true);
-      if (!file)
-	fatal_error ("lto_obj_file_open() failed");
 
       if (!quiet_flag)
 	fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
@@ -3105,21 +3193,25 @@  lto_wpa_write_files (void)
 	}
       gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
 
-      lto_set_current_out_file (file);
-
-      ipa_write_optimization_summaries (part->encoder);
+      stream_out (temp_filename, part->encoder, i == n_sets - 1);
 
-      lto_set_current_out_file (NULL);
-      lto_obj_file_close (file);
-      free (file);
       part->encoder = NULL;
 
-      len = strlen (temp_filename);
-      if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
+      temp_filenames.safe_push (xstrdup (temp_filename));
+    }
+  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
+  if (ltrans_output_list_stream == NULL)
+    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
+  for (i = 0; i < n_sets; i++)
+    {
+      unsigned int len = strlen (temp_filenames[i]);
+      if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
 	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
 	fatal_error ("writing to LTRANS output list %s: %m",
 		     ltrans_output_list);
+     free (temp_filenames[i]);
     }
+  temp_filenames.release();
 
   lto_stats.num_output_files += n_sets;
 
Index: lto/lang.opt
===================================================================
--- lto/lang.opt	(revision 201891)
+++ lto/lang.opt	(working copy)
@@ -32,6 +32,10 @@  fltrans-output-list=
 LTO Joined Var(ltrans_output_list)
 Specify a file to which a list of files output by LTRANS is written.
 
+fparallelism=
+LTO Joined
+Run the link-time optimizer in whole program analysis (WPA) mode.
+
 fwpa
 LTO Driver Report Var(flag_wpa)
 Run the link-time optimizer in whole program analysis (WPA) mode.
Index: lto/lto.h
===================================================================
--- lto/lto.h	(revision 201891)
+++ lto/lto.h	(working copy)
@@ -39,6 +39,7 @@  extern const char *resolution_file_name;
 extern tree lto_eh_personality (void);
 extern void lto_main (void);
 extern void lto_read_all_file_options (void);
+extern int lto_parallelism;
 
 /* In lto-elf.c or lto-coff.c  */
 extern lto_file *lto_obj_file_open (const char *filename, bool writable);
Index: lto/lto-lang.c
===================================================================
--- lto/lto-lang.c	(revision 201891)
+++ lto/lto-lang.c	(working copy)
@@ -735,6 +735,19 @@  lto_handle_option (size_t scode, const c
       warn_psabi = value;
       break;
 
+    case OPT_fparallelism_:
+      if (!arg)
+	lto_parallelism = 1;
+      else if (!strcmp (arg, "jobserver"))
+	lto_parallelism = -1;
+      else
+	{
+	  lto_parallelism = atoi (arg);
+	  if (lto_parallelism <= 0)
+	    lto_parallelism = 0;
+	}
+      break;
+
     default:
       break;
     }