[RFC] Old school parallelization of WPA streaming

Hi,
this is my attempt to bring GCC into the wonderful era of multicore CPUs :)
It is a hack, but it seems to help quite a lot.  About 50% of WPA time is spent
streaming the individual ltrans .o files.  This can be easily parallelized
with fork: we do nothing afterwards, just exit and pass the list to the linker.

So until we are thread safe, perhaps this may be a solution?  (On unixish
systems it can probably be the solution forever.)  I added logic that parses
-flto=24 and starts the number of streaming processes the user asked for.
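
To illustrate the idea, here is a minimal sketch of fork-based streaming.  It
is not the code from the patch: stream_partition is a hypothetical stand-in
for writing one ltrans partition, and the round-robin split of partitions
across children is an assumption made for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for writing one ltrans partition to disk.  */
static void
stream_partition (int part)
{
  (void) part;
}

/* Stream partitions 0..NPARTS-1 using at most NPROC forked children.
   Child I handles partitions I, I + NPROC, I + 2 * NPROC, ...
   Returns 0 if every child exited successfully, -1 otherwise.  */
static int
stream_out_parallel (int nparts, int nproc)
{
  int started = 0, failed = 0;

  if (nproc < 1)
    nproc = 1;
  if (nproc > nparts)
    nproc = nparts;

  for (int i = 0; i < nproc; i++)
    {
      pid_t pid = fork ();
      if (pid < 0)
        {
          /* Could not start a child; report failure after reaping.  */
          failed = 1;
          break;
        }
      if (pid == 0)
        {
          /* Child: write its share of partitions and exit right away.
             Nothing needs to be sent back to the parent.  */
          for (int p = i; p < nparts; p += nproc)
            stream_partition (p);
          _exit (0);
        }
      started++;
    }

  /* Parent: wait for all children that were started.  */
  while (started-- > 0)
    {
      int status;
      if (wait (&status) < 0
          || !WIFEXITED (status)
          || WEXITSTATUS (status) != 0)
        failed = 1;
    }
  return failed ? -1 : 0;
}
```

Calling stream_out_parallel (32, 24) would correspond to 32 partitions and
-flto=24.  Since the children only write files and exit, the copy-on-write
view of the parent's memory is all the sharing that is needed.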

For -flto=jobserver I simply fork all 32 processes.  It may not be a disaster,
but perhaps we should figure out how to communicate with the jobserver.  At first
glance at the documentation on how it works, it seems easy to add.  Perhaps we
can even convince the GNU Make folks to put simple helpers into libiberty?
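
For reference, a rough sketch of what such helpers might look like, based on
the pipe protocol described in the GNU Make manual: the descriptors are
advertised in MAKEFLAGS as --jobserver-auth=R,W (or --jobserver-fds=R,W in
older releases), a job slot is taken by reading one byte from R, and given
back by writing that byte to W.  The function names are hypothetical, and the
newer fifo: form of --jobserver-auth is deliberately not handled here.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Find the jobserver pipe descriptors in a MAKEFLAGS-style string.
   Returns 1 and fills *RFD and *WFD on success, 0 if no pipe pair
   is advertised.  */
static int
parse_jobserver_fds (const char *makeflags, int *rfd, int *wfd)
{
  static const char *const keys[] =
    { "--jobserver-auth=", "--jobserver-fds=" };

  if (!makeflags)
    return 0;
  for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
    {
      const char *p = strstr (makeflags, keys[i]);
      int r, w;
      if (p && sscanf (p + strlen (keys[i]), "%d,%d", &r, &w) == 2)
        {
          *rfd = r;
          *wfd = w;
          return 1;
        }
    }
  return 0;
}

/* Block until a job token is available and store it in *TOKEN.
   Returns 0 on success, -1 on error (EINTR is not retried here).  */
static int
jobserver_acquire (int rfd, char *token)
{
  return read (rfd, token, 1) == 1 ? 0 : -1;
}

/* Hand the token back so that another job can run.  */
static int
jobserver_release (int wfd, char token)
{
  return write (wfd, &token, 1) == 1 ? 0 : -1;
}
```

Each process started by make already owns one implicit slot, so a streaming
parent would only need to acquire a token for every additional child it wants
to fork and release it when that child exits.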

We may also figure out the number of CPUs (is it available, e.g., from libgomp?)
and use it by default even if the user does not care to pass the number of
processes.  Naturally these streaming forks should be cheap memory-wise.  I hope
Martin will get me some actual numbers.
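
One possible way to pick such a default, sketched here with sysconf rather
than libgomp (an assumption for the example; _SC_NPROCESSORS_ONLN is a common
extension, not something every host provides, hence the fallback to 1).  The
function name is hypothetical.

```c
#include <unistd.h>

/* Return REQUESTED if the user gave a positive count, otherwise the
   number of online CPUs, otherwise 1.  */
static int
default_parallelism (int requested)
{
  if (requested > 0)
    return requested;
#ifdef _SC_NPROCESSORS_ONLN
  long n = sysconf (_SC_NPROCESSORS_ONLN);
  if (n > 0)
    return (int) n;
#endif
  return 1;
}
```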

With the patch the WPA time of Firefox goes down to 2 minutes (4.8 needs about
30 minutes, and without the hack one needs about 5 minutes).

Before:
Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1398 kB ( 0%) ggc
 phase opt and generate  :  39.73 (17%) usr   0.49 ( 3%) sys  40.26 (16%) wall  347726 kB ( 5%) ggc
 phase stream in         :  82.43 (35%) usr   2.15 (14%) sys  84.62 (34%) wall 5970152 kB (94%) ggc
 phase stream out        : 114.05 (48%) usr  12.86 (83%) sys 127.26 (50%) wall    6868 kB ( 0%) ggc
 garbage collection      :   3.07 ( 1%) usr   0.00 ( 0%) sys   3.08 ( 1%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.34 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall      30 kB ( 0%) ggc
 ipa dead code removal   :   4.91 ( 2%) usr   0.11 ( 1%) sys   5.16 ( 2%) wall     113 kB ( 0%) ggc
 ipa inheritance graph   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall     927 kB ( 0%) ggc
 ipa virtual call target :   5.11 ( 2%) usr   0.05 ( 0%) sys   4.99 ( 2%) wall   55296 kB ( 1%) ggc
 ipa cp                  :   2.65 ( 1%) usr   0.17 ( 1%) sys   2.80 ( 1%) wall  188629 kB ( 3%) ggc
 ipa inlining heuristics :  18.49 ( 8%) usr   0.29 ( 2%) sys  18.79 ( 7%) wall  439981 kB ( 7%) ggc
 ipa lto gimple in       :   0.12 ( 0%) usr   0.01 ( 0%) sys   0.15 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  16.66 ( 7%) usr   1.26 ( 8%) sys  17.97 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  68.70 (29%) usr   1.50 (10%) sys  70.23 (28%) wall 5181795 kB (82%) ggc
 ipa lto decl out        :  93.09 (39%) usr   4.93 (32%) sys  98.07 (39%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.65 ( 1%) usr   0.27 ( 2%) sys   1.92 ( 1%) wall  428974 kB ( 7%) ggc
 ipa lto decl merge      :   3.66 ( 2%) usr   0.00 ( 0%) sys   3.65 ( 1%) wall    8288 kB ( 0%) ggc
 ipa lto cgraph merge    :   3.42 ( 1%) usr   0.00 ( 0%) sys   3.42 ( 1%) wall   13725 kB ( 0%) ggc
 whopr wpa               :   3.58 ( 2%) usr   0.02 ( 0%) sys   3.59 ( 1%) wall    6871 kB ( 0%) ggc
 whopr wpa I/O           :   0.99 ( 0%) usr   6.65 (43%) sys   7.92 ( 3%) wall       0 kB ( 0%) ggc 
 whopr partitioning      :   2.63 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall       0 kB ( 0%) ggc
 ipa reference           :   3.08 ( 1%) usr   0.08 ( 1%) sys   3.18 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.43 ( 0%) usr   0.05 ( 0%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.00 ( 1%) usr   0.06 ( 0%) sys   3.07 ( 1%) wall       0 kB ( 0%) ggc
 varconst                :   0.03 ( 0%) usr   0.04 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 236.22            15.50           252.15            6326146 kB

After:
Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  35.49 (28%) usr   0.44 ( 6%) sys  35.95 (26%) wall  313971 kB ( 5%) ggc
 phase stream in         :  82.98 (64%) usr   2.10 (30%) sys  85.13 (61%) wall 5969191 kB (95%) ggc
 phase stream out        :  10.37 ( 8%) usr   4.49 (64%) sys  17.33 (13%) wall    5813 kB ( 0%) ggc
 garbage collection      :   3.00 ( 2%) usr   0.00 ( 0%) sys   2.99 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall      30 kB ( 0%) ggc
 ipa dead code removal   :   4.91 ( 4%) usr   0.10 ( 1%) sys   5.04 ( 4%) wall     114 kB ( 0%) ggc
 ipa inheritance graph   :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall     792 kB ( 0%) ggc
 ipa virtual call target :   2.14 ( 2%) usr   0.01 ( 0%) sys   2.15 ( 2%) wall   21661 kB ( 0%) ggc
 ipa cp                  :   2.34 ( 2%) usr   0.18 ( 3%) sys   2.52 ( 2%) wall  188629 kB ( 3%) ggc
 ipa inlining heuristics :  18.43 (14%) usr   0.26 ( 4%) sys  18.68 (13%) wall  439993 kB ( 7%) ggc
 ipa lto gimple in       :   0.05 ( 0%) usr   0.05 ( 1%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.44 ( 0%) usr   0.06 ( 1%) sys   0.50 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  69.27 (54%) usr   1.52 (22%) sys  70.87 (51%) wall 5180837 kB (82%) ggc
 ipa lto decl out        :   7.77 ( 6%) usr   0.51 ( 7%) sys   8.28 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.71 ( 1%) usr   0.19 ( 3%) sys   1.90 ( 1%) wall  428974 kB ( 7%) ggc
 ipa lto decl merge      :   3.66 ( 3%) usr   0.00 ( 0%) sys   3.67 ( 3%) wall    8288 kB ( 0%) ggc
 ipa lto cgraph merge    :   3.40 ( 3%) usr   0.00 ( 0%) sys   3.39 ( 2%) wall   13725 kB ( 0%) ggc
 whopr wpa               :   3.19 ( 2%) usr   0.00 ( 0%) sys   3.19 ( 2%) wall    5816 kB ( 0%) ggc
 whopr wpa I/O           :   0.00 ( 0%) usr   3.92 (56%) sys   6.39 ( 5%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   1.44 ( 1%) usr   0.02 ( 0%) sys   1.45 ( 1%) wall       0 kB ( 0%) ggc 
 ipa reference           :   2.64 ( 2%) usr   0.08 ( 1%) sys   2.74 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.46 ( 0%) usr   0.02 ( 0%) sys   0.47 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.02 ( 2%) usr   0.05 ( 1%) sys   3.08 ( 2%) wall       0 kB ( 0%) ggc
 varconst                :   0.06 ( 0%) usr   0.06 ( 1%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 128.85             7.05           138.45            6290376 kB

real 7m42.126s user 53m13.816s sys 2m16.993s

Seems almost like we are at the level where WPA does something useful.  IPA inlining
definitely has room for improvement.  To significantly speed up LTO decl in, I think
we need to implement Richard's original idea of comparing the SCCs in their pickled
version and materializing only those that are unique.
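
A toy sketch of that idea, under the simplifying assumption that two SCCs are
duplicates exactly when their pickled bytes are identical (real pickled SCCs
would likely need a smarter comparison).  The struct and function names are
hypothetical, and the quadratic scan stands in for a proper hash table.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One pickled SCC: a byte range plus its cached hash.  */
struct pickled_scc
{
  const unsigned char *data;
  size_t len;
  uint32_t hash;
};

/* 32-bit FNV-1a hash of N bytes at P.  */
static uint32_t
fnv1a (const unsigned char *p, size_t n)
{
  uint32_t h = 2166136261u;
  while (n--)
    {
      h ^= *p++;
      h *= 16777619u;
    }
  return h;
}

/* Set UNIQUE[i] to 1 if SCCS[i] is the first occurrence of its bytes
   and to 0 if an identical SCC appeared earlier.  Only entries marked
   1 would need to be materialized.  Returns the number of unique SCCs.  */
static size_t
mark_unique_sccs (struct pickled_scc *sccs, size_t n, unsigned char *unique)
{
  size_t count = 0;

  for (size_t i = 0; i < n; i++)
    {
      sccs[i].hash = fnv1a (sccs[i].data, sccs[i].len);
      unique[i] = 1;
      for (size_t j = 0; j < i; j++)
        if (unique[j]
            && sccs[j].hash == sccs[i].hash
            && sccs[j].len == sccs[i].len
            && memcmp (sccs[j].data, sccs[i].data, sccs[i].len) == 0)
          {
            unique[i] = 0;
            break;
          }
      count += unique[i];
    }
  return count;
}
```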

Looks reasonable?

Honza

	* lto-cgraph.c (asm_nodes_output): Make global.
	* lto-streamer.h (asm_nodes_output): Declare.
	* lto-wrapper.c (parallel, jobserver): Make global.
	(run_gcc): Pass down -fparallelism.

	* lto.c (lto_parallelism): New variable.
	(do_stream_out): New function.
	(stream_out): New function.
	(lto_wpa_write_files): Use it.
	* lang.opt (fparallelism): New.
	* lto.h (lto_parallelism): Declare.
	* lto-lang.c (lto_handle_option): Add fparallelism.
