Message ID | 20130821141747.GD24782@kam.mff.cuni.cz |
---|---|
State | New |
On Wed, Aug 21, 2013 at 04:17:48PM +0200, Jan Hubicka wrote:
> Hi,
> this is my attempt to bring GCC into wonderful era of multicore CPUs :)
> It is a hack, but it seems to help quite a lot. About 50% of WPA time is spent
> by streaming the individual ltrans .o files. This can be easily parallelized
> by fork - we do nothing afterwards, just exit and pass the list to the linker.

One risk is if someone streams to a spinning disk it may add more seeks for
the parallel IO. But I think it's a reasonable tradeoff.

We should also use a faster compressor

> For -flto=jobserver I simply fork all 32 processes. It may not be a disaster,
> but perhaps we should figure out how to communicate with jobserver. At first
> glance on document on how it works, it seems easy to add. Perhaps we can even
> convince GNU Make folks to put simple helpers to libiberty?

lto=jobserver is still broken and confuses tokens on large builds (ends
with a 0 read). I did some debugging recently, and I suspect a Linux kernel
bug now. Still haven't tracked it down.

Any workarounds would need make changes unfortunately.

> We also may figure out number of CPUs (is it available i.e. from libgomp)

sysconf(_SC_NPROCESSORS_ONLN) ?

> and use it by default even if user do not care to pass number of processes.
> Naturally these streaming forks should be cheap memory wise. I hope Martin
> will get me some actual numbers.
>
> With the patch the WPA time of firefox goes down to 2 minutes (4.8 needs about
> 30 minutes and without the hack one needs about 5 minutes)

Cool!

I'll try it on my builds

> +fparallelism=
> +LTO Joined
> +Run the link-time optimizer in whole program analysis (WPA) mode.

The description does not make sense

Rest of patch looks good from a quick read, although I would prefer to
do the waiting for children in the "parent", not the "last one"

-Andi
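The `sysconf` suggestion above could be turned into a default for the number of streaming processes roughly as follows. This is only a sketch: `default_parallelism` and its `fallback` parameter are hypothetical names that do not appear in the patch, and `_SC_NPROCESSORS_ONLN` is a widely available extension rather than something every host guarantees, so the code guards its use.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not part of the patch): return the number of
   CPUs currently online, or FALLBACK when the host cannot report it.  */
int
default_parallelism (int fallback)
{
  assert (fallback >= 1);
#ifdef _SC_NPROCESSORS_ONLN
  long n = sysconf (_SC_NPROCESSORS_ONLN);
  if (n >= 1)
    return (int) n;
#endif
  return fallback;
}
```

A wrapper could call `default_parallelism (1)` when the user passes plain `-flto` and forward the result as the `=N` value, which is the behaviour proposed in the quoted text.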
Andi Kleen <ak@linux.intel.com> wrote:
>On Wed, Aug 21, 2013 at 04:17:48PM +0200, Jan Hubicka wrote:
>> Hi,
>> this is my attempt to bring GCC into wonderful era of multicore CPUs :)
>> It is a hack, but it seems to help quite a lot. About 50% of WPA time is spent
>> by streaming the individual ltrans .o files. This can be easily parallelized
>> by fork - we do nothing afterwards, just exit and pass the list to the linker.
>
>One risk is if someone streams to a spinning disk it may add more seeks for
>the parallel IO. But I think it's a reasonable tradeoff.

It'll also wreck all WPA dump files.

>We should also use a faster compressor

And we should avoid uncompressing the function sections...

That said, the patch is enough of a hack that I don't ever want to debug a
bug in it...

I also fail to see why threads should not work here. Maybe simply annotate
gcc with openmp?

Richard.

>> For -flto=jobserver I simply fork all 32 processes. It may not be a disaster,
>> but perhaps we should figure out how to communicate with jobserver. At first
>> glance on document on how it works, it seems easy to add. Perhaps we can even
>> convince GNU Make folks to put simple helpers to libiberty?
>
>lto=jobserver is still broken and confuses tokens on large builds (ends
>with a 0 read). I did some debugging recently, and I suspect a Linux kernel
>bug now. Still haven't tracked it down.
>
>Any workarounds would need make changes unfortunately.
>
>> We also may figure out number of CPUs (is it available i.e. from libgomp)
>
>sysconf(_SC_NPROCESSORS_ONLN) ?
>
>> and use it by default even if user do not care to pass number of processes.
>> Naturally these streaming forks should be cheap memory wise. I hope Martin
>> will get me some actual numbers.
>>
>> With the patch the WPA time of firefox goes down to 2 minutes (4.8 needs about
>> 30 minutes and without the hack one needs about 5 minutes)
>
>Cool!
>
>I'll try it on my builds
>
>> +fparallelism=
>> +LTO Joined
>> +Run the link-time optimizer in whole program analysis (WPA) mode.
>
>The description does not make sense
>
>Rest of patch looks good from a quick read, although I would prefer to
>do the waiting for children in the "parent", not the "last one"
>
>-Andi
> I also fail to see why threads should not work here. Maybe simply annotate gcc with openmp?
Don't you have to set an environment variable to set the number of threads
for openmp?
Otherwise it sounds like a reasonable way to do it.
-Andi
> >One risk is if someone streams to a spinning disk it may add more seeks for
> >the parallel IO. But I think it's a reasonable tradeoff.
>
> It'll also wreck all WPA dump files.

We do not dump anything during the main streaming. If we now stream 2GB for
firefox, I think we can hope to mostly fit in cache with the whole machinery.
We will need to flush the cgraph file prior to forking and close it in the
forked process. It is the only one that remains across the fork boundary IMO.

> >We should also use a faster compressor
>
> And we should avoid uncompressing the function sections...

Yep, we also need to avoid carrying the whole tree stream of the original
source unit whenever we stream out a function from it. I think function
sections should have two parts: the references to global trees, which are
uncompressed and translated during WPA streaming, plus a compressed binary
blob with the body that is copied over.

> That said, the patch is enough of a hack that I don't ever want to debug a
> bug in it...
>
> I also fail to see why threads should not work here. Maybe simply annotate
> gcc with openmp?

It means pushing the global state of lto-streamer into a context variable and
moving it out of GGC or making GGC thread safe. I would hope that David
Malcolm would be interested in doing this, but it is a bit more than I have
time for right now during the labs conference.

To be honest, I fail to see how a bug in an openmp-annotated program would be
easier to debug than the fork variant.

Honza
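The point about flushing the cgraph file before forking can be illustrated in isolation. A stdio buffer lives in process memory, so `fork` duplicates any unwritten bytes and both processes may later write them. The sketch below is an assumption-laden illustration, not GCC code: `fork_with_clean_stream` and `child_work` are hypothetical names, and the child simply exits after its callback.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child only after SHARED has no pending
   buffered output, so the child cannot write the parent's data again.
   Returns the child's exit status, or -1 on failure.  */
int
fork_with_clean_stream (FILE *shared, void (*child_work) (void))
{
  assert (shared != NULL);

  /* Push buffered bytes to the kernel before the buffer is duplicated.  */
  if (fflush (shared) != 0)
    return -1;

  pid_t pid = fork ();
  if (pid == -1)
    return -1;

  if (pid == 0)
    {
      /* Child: drop its copy of the stream; the buffer is empty, so
         closing it writes nothing.  */
      fclose (shared);
      if (child_work != NULL)
        child_work ();
      _exit (0);
    }

  /* Parent: wait for the child and report how it ended.  */
  int status;
  if (waitpid (pid, &status, 0) == -1)
    return -1;
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```

Without the `fflush` call, the child's `fclose` would write its copy of the buffered bytes to the shared file description, and the file would end up with the data twice once the parent flushes as well.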
> > We should also use a faster compressor

Yep, at least once it shows up higher in profiles. So far other stuff is a
lot slower.

> > > For -flto=jobserver I simply fork all 32 processes. It may not be a disaster,
> > > but perhaps we should figure out how to communicate with jobserver. At first
> > > glance on document on how it works, it seems easy to add. Perhaps we can even
> > > convince GNU Make folks to put simple helpers to libiberty?
> >
> > lto=jobserver is still broken and confuses tokens on large builds (ends
> > with a 0 read). I did some debugging recently, and I suspect a Linux kernel
> > bug now. Still haven't tracked it down.
> >
> > Any workarounds would need make changes unfortunately.
> >
> > > We also may figure out number of CPUs (is it available i.e. from libgomp)
> >
> > sysconf(_SC_NPROCESSORS_ONLN) ?

OK, thanks :)

> > > +fparallelism=
> > > +LTO Joined
> > > +Run the link-time optimizer in whole program analysis (WPA) mode.
> >
> > The description does not make sense

Yup, a pasto.

> > Rest of patch looks good from a quick read, although I would prefer to
> > do the waiting for children in the "parent", not the "last one"

The parent process does all the forking and waiting. Only the last section is
streamed by the parent process itself, since I do not see a reason to fork
for it.

Honza
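The scheme described here (the parent forks and waits, at most N children run at once, and the last piece of work is done by the parent itself) can be sketched outside GCC as below. This is an illustration under stated assumptions, not the patch: `run_jobs`, `reap_one`, and `sample_job` are hypothetical names, and `sample_job` is a stand-in for streaming one partition.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for "stream partition I": fails only for I == 2.  */
int
sample_job (int i)
{
  return i == 2 ? 1 : 0;
}

/* Reap one child; return 0 if it exited with status 0, -1 otherwise.  */
int
reap_one (void)
{
  int status;
  pid_t pid;
  do
    pid = waitpid (-1, &status, 0);
  while (pid == -1 && errno == EINTR);
  if (pid == -1)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}

/* Run JOB (i) for every i in [0, NJOBS) with at most MAX_CHILDREN
   concurrent child processes.  The last job runs in the parent.
   Returns the number of jobs that failed.  */
int
run_jobs (int njobs, int max_children, int (*job) (int))
{
  assert (njobs >= 0 && max_children >= 1 && job != NULL);
  int running = 0, failures = 0;

  for (int i = 0; i < njobs; i++)
    {
      if (i == njobs - 1)
        {
          /* Last job: no point in forking for it.  */
          if (job (i) != 0)
            failures++;
          break;
        }
      if (running >= max_children)
        {
          /* Limit reached: wait for one child before starting another.  */
          if (reap_one () != 0)
            failures++;
          running--;
        }
      pid_t pid = fork ();
      if (pid == 0)
        _exit (job (i) == 0 ? 0 : 1);
      else if (pid == -1)
        {
          /* Fork failed: do the work in the parent instead.  */
          if (job (i) != 0)
            failures++;
        }
      else
        running++;
    }

  /* Collect every child that is still running.  */
  while (running-- > 0)
    if (reap_one () != 0)
      failures++;
  return failures;
}
```

Counting failures instead of aborting on the first one is a choice made for this sketch only; the patch calls `fatal_error` as soon as a child reports a problem.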
Hi,

On Wed, 21 Aug 2013, Richard Biener wrote:

> I also fail to see why threads should not work here. Maybe simply
> annotate gcc with openmp?

Threads simply don't work here, because the whole streamer infrastructure
(or anything else in GCC for that matter) isn't thread safe (you'd have to
have multiple streamer objects, multiple SCC finder objects, and you'd
have to audit everything for not using any other shared resources).

Fork-fire-forget is really a much simpler choice here IMO; no worries
about shared resources, less debug hassle.

Ciao,
Michael.
On Wed, 28 Aug 2013, Michael Matz wrote:

> Hi,
>
> On Wed, 21 Aug 2013, Richard Biener wrote:
>
> > I also fail to see why threads should not work here. Maybe simply
> > annotate gcc with openmp?
>
> Threads simply don't work here, because the whole streamer infrastructure
> (or anything else in GCC for that matter) isn't thread safe (you'd have to
> have multiple streamer objects, multiple SCC finder objects, and you'd
> have to audit everything for not using any other shared resources).

Hm, yeah, of course.

> Fork-fire-forget is really a much simpler choice here IMO; no worries
> about shared resources, less debug hassle.

It might be not as cheap as it is on Linux hosts on other hosts of
course. Also I'd rather try to avoid I/O than solving the issue by
parallelizing it. Of course we can always come back to this
kind of hack later.

Richard.
Jakub,
I am adding you to CC since I put my current thoughts on LTO and debug info
in here.

> > Fork-fire-forget is really a much simpler choice here IMO; no worries
> > about shared resources, less debug hassle.
>
> It might be not as cheap as it is on Linux hosts on other hosts of
> course. Also I'd rather try to avoid I/O than solving the issue

I still have some items on list here
 1) avoid function sections to be decompressed by WPA
    (this won't cause much compile time improvements as decompression is
    well below 10% of runtime)
 2) put variable initializers into named sections just as function bodies
    are.
    Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
    initializers are actually about as big as the text segment. While
    it seems bit wasteful to put single integer_cst there (and we can
    special case this), it seems that there is a promise for vtables
    and other stuff.

    To make devirt work, we will need to load vtables into memory (or
    invent representation to stream them other way that would be similarly
    big). Still we will avoid need to load them in 5000 copies and merge
    them.
 3) I think good part of function/partitioning overhead is because abstract
    origin streaming is utterly broken.

    Currently we can have DECL_ABSTRACT_ORIGIN on a function. This I can now
    track by used_as_abstract_origin flag and I can stream those functions
    into partitions using them.

    This is still wrong for multitude of reasons

    1) we really want DECL_INITIAL tree of the functions used as abstract
       origins in the form before any gimple optimizations happened on them.
       (that is when debug hook is called)
       This is not what happens - we stream the tree as it looks during
       LTO streaming time - i.e. after early optimizations.

       I think we may just (at a time calling the debug hook) duplicate
       DECL_INITIAL same way we duplicate decls for save_function_body and
       saving it elsewhere. Making this tree to be abstract origin of the
       offline copy of the function itself.

    2) dwarf2out doesn't really the DECL_INITIAL tree so it does something
       useful only when it is already there.
       It can simply call cgraph_get_body when it needs the DECL_INITIAL,
       but it doesn't because push_cfun causes ICE.
       If we really can't push_cfun from middle of RTL queue, I suppose I
       can just save it elsewhere

    3) It is not only toplevel decl that has origin, but all local vars in
       the function.

       I think this goes terribly wrong - these decls are not indexable so
       they are stored into function section of every function referring to
       them. They are then read in many duplicates and never merged with the
       DECL_INITIAL tree of the actual abstract origin. For some reason
       dwarf2out doesn't seem to ICE, but I also do not see how this can
       produce working debug.
       Moreover I think the duplicates contribute to our current debug info
       size problems with LTO.

       If we solve 1) as discussed by above (i.e. by having separate
       block trees for functions that are abstract origins), we can then
       solve 3) by streaming those into global decl stream and make
       cross-function_context tree references to become global.

    4) Of course after early inlining function may need abstract origins
       from multiple other functions. I do not track this at all.
       May be easy to just collect a vector of functions that are needed
       into cgraph_node.

    Of course solving 1)-4) is bit of early debug info without actually
    going to stream the dwarf dies, but by using the BLOCK trees as a
    temporary representation. Incrementally we can have this saved BLOCK
    tree to be a dwarf DIE and have origins to point to them instead of
    decls.

    To get reasonable streaming performance it would be nice to have way to
    get abstract origin references cross-partition that debug info can
    accomplish.

Said that, I now have the fork() patch in all my trees and enjoy 50% faster
WPA times. I changed my mind about claim that streaming should be disk bound -
it is hard to hope for disk boundness for something that should fit in cache.

We went down from 5GB to 2GB of streaming for Firefox that is good. But we will
see again 4GB once Martin's code layout work will land. I think it is from good
part because of the origin fun above.

Honza

> by parallelizing it. Of course we can always come back to this
> kind of hack later.
>
> Richard.
Hi,

On Thu, 29 Aug 2013, Richard Biener wrote:

> > Fork-fire-forget is really a much simpler choice here IMO; no worries
> > about shared resources, less debug hassle.
>
> It might be not as cheap as it is on Linux hosts on other hosts of
> course.

Sure. Don't use it there then. Not a reason for not having the improvement
on linux.

> Also I'd rather try to avoid I/O than solving the issue by parallelizing
> it.

Of course. There's always something still better.

> Of course we can always come back to this kind of hack later.

For 4.9 latest, if we don't have anything nicer by then. OTOH we could also
remove Honza's patch when and if something better comes around ;)

Ciao,
Michael.
On Thu, 29 Aug 2013, Jan Hubicka wrote:

> Jakub,
> I am adding you to CC since I put my current thoughts on LTO and debug info
> in here.
> > > Fork-fire-forget is really a much simpler choice here IMO; no worries
> > > about shared resources, less debug hassle.
> >
> > It might be not as cheap as it is on Linux hosts on other hosts of
> > course. Also I'd rather try to avoid I/O than solving the issue
>
> I still have some items on list here
>  1) avoid function sections to be decompressed by WPA
>     (this won't cause much compile time improvements as decompression is
>     well below 10% of runtime)

still low-hanging

finally get a LTO section header! (with a flag telling whether the
section is compressed)

>  2) put variable initializers into named sections just as function bodies
>     are.
>     Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
>     initializers are actually about as big as the text segment. While
>     it seems bit wasteful to put single integer_cst there (and we can
>     special case this), it seems that there is a promise for vtables
>     and other stuff.
>
>     To make devirt work, we will need to load vtables into memory (or
>     invent representation to stream them other way that would be similarly
>     big). Still we will avoid need to load them in 5000 copies and merge
>     them.
>  3) I think good part of function/partitioning overhead is because abstract
>     origin streaming is utterly broken.
>
>     Currently we can have DECL_ABSTRACT_ORIGIN on a function. This I can now
>     track by used_as_abstract_origin flag and I can stream those functions
>     into partitions using them.
>
>     This is still wrong for multitude of reasons
>
>     1) we really want DECL_INITIAL tree of the functions used as abstract
>        origins in the form before any gimple optimizations happened on them.
>        (that is when debug hook is called)
>        This is not what happens - we stream the tree as it looks during
>        LTO streaming time - i.e. after early optimizations.
>
>        I think we may just (at a time calling the debug hook) duplicate
>        DECL_INITIAL same way we duplicate decls for save_function_body and
>        saving it elsewhere. Making this tree to be abstract origin of the
>        offline copy of the function itself.
>
>     2) dwarf2out doesn't really the DECL_INITIAL tree so it does something
>        useful only when it is already there.
>        It can simply call cgraph_get_body when it needs the DECL_INITIAL,
>        but it doesn't because push_cfun causes ICE.
>        If we really can't push_cfun from middle of RTL queue, I suppose I
>        can just save it elsewhere
>
>     3) It is not only toplevel decl that has origin, but all local vars in
>        the function.
>
>        I think this goes terribly wrong - these decls are not indexable so
>        they are stored into function section of every function referring to
>        them. They are then read in many duplicates and never merged with the
>        DECL_INITIAL tree of the actual abstract origin. For some reason
>        dwarf2out doesn't seem to ICE, but I also do not see how this can
>        produce working debug.
>        Moreover I think the duplicates contribute to our current debug info
>        size problems with LTO.
>
>        If we solve 1) as discussed by above (i.e. by having separate
>        block trees for functions that are abstract origins), we can then
>        solve 3) by streaming those into global decl stream and make
>        cross-function_context tree references to become global.
>
>     4) Of course after early inlining function may need abstract origins
>        from multiple other functions. I do not track this at all.
>        May be easy to just collect a vector of functions that are needed
>        into cgraph_node.
>
>     Of course solving 1)-4) is bit of early debug info without actually
>     going to stream the dwarf dies, but by using the BLOCK trees as a
>     temporary representation. Incrementally we can have this saved BLOCK
>     tree to be a dwarf DIE and have origins to point to them instead of
>     decls.
>
>     To get reasonable streaming performance it would be nice to have way to
>     get abstract origin references cross-partition that debug info can
>     accomplish.

Most of the abstract origin stuff is dropped on the floor by streaming
because you cannot really stream that stuff. And yes, we need early
debug info to generate the offline abstract origin copy of later inlined
functions, and yes, we have to handle streaming / referencing those
in some way. But OTOH abstract origins are an optimization for debug
info size, so we can as well not have them.

> Said that, I now have the fork() patch in all my trees and enjoy 50% faster
> WPA times. I changed my mind about claim that streaming should be disk bound -
> it is hard to hope for disk boundness for something that should fit in cache.

It should at least limit its fork rate according to -flto=N or jobserver.

> We went down from 5GB to 2GB of streaming for Firefox that is good. But we will
> see again 4GB once Martin's code layout work will land. I think it is from good
> part because of the origin fun above.

Ugh.

Richard.
> > Said that, I now have the fork() patch in all my trees and enjoy 50% faster
> > WPA times. I changed my mind about claim that streaming should be disk bound -
> > it is hard to hope for disk boundness for something that should fit in cache.
>
> It should at least limit its fork rate according to -flto=N or jobserver.

It limits forks to -flto=N.
If the patch seems reasonable, I will look into the possibility of adding my
jobserver client based on GNU make code.

I also think with -flto we want the wrapper to figure out the number of threads
and supply a default =N (i.e. nonparallel lto would be -flto=0). Most people
don't want to worry about =n/=jobserver parameters, and those few projects that
don't want to start too many processes, so as not to explode in memory use,
can get their build machinery right.

Honza

> > We went down from 5GB to 2GB of streaming for Firefox that is good. But we will
> > see again 4GB once Martin's code layout work will land. I think it is from good
> > part because of the origin fun above.
>
> Ugh.
>
> Richard.
On Thu, Aug 29, 2013 at 03:58:45PM +0200, Jan Hubicka wrote:
> > > Said that, I now have the fork() patch in all my trees and enjoy 50% faster
> > > WPA times. I changed my mind about claim that streaming should be disk bound -
> > > it is hard to hope for disk boundness for something that should fit in cache.
> >
> > It should at least limit its fork rate according to -flto=N or jobserver.
> It limits forks to -flto=N.
> If the patch seems reasonable, I will look into the possibility of adding my
> jobserver client based on GNU make code.
>
> I also think with -flto we want the wrapper to figure out the number of threads
> and supply a default =N (i.e. nonparallel lto would be -flto=0). Most people
> don't want to worry about =n/=jobserver parameters, and those few projects that
> don't want to start too many processes, so as not to explode in memory use,
> can get their build machinery right.

Job server should do that already. You get whatever the user
specifies with -j on the top level make. That's imho the right area
to control this.

The only problem is we need to work around the jobserver pipe bug first.
I suspect this may need a change in make :-/

-Andi
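For readers unfamiliar with the jobserver being discussed: GNU make hands sub-processes a pair of pipe descriptors through `MAKEFLAGS`, and a client takes a job slot by reading one byte from the pipe and gives it back by writing the same byte. The sketch below shows what a minimal client could look like. The three function names are hypothetical and are not taken from the patch or from GNU make. The sketch only handles the `R,W` descriptor-pair form, spelled `--jobserver-fds=` by make releases of that period and `--jobserver-auth=` by later ones; any other form is reported as "not found".

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: find the jobserver descriptors in a MAKEFLAGS
   string.  Returns 1 and fills *RFD / *WFD on success, 0 otherwise.  */
int
parse_jobserver_fds (const char *makeflags, int *rfd, int *wfd)
{
  static const char *const keys[] = { "--jobserver-auth=", "--jobserver-fds=" };
  assert (rfd != NULL && wfd != NULL);
  if (makeflags == NULL)
    return 0;
  for (size_t k = 0; k < sizeof keys / sizeof keys[0]; k++)
    {
      const char *p = strstr (makeflags, keys[k]);
      if (p != NULL
          && sscanf (p + strlen (keys[k]), "%d,%d", rfd, wfd) == 2)
        return 1;
    }
  return 0;
}

/* Take one job token: blocks until a byte is available.  Returns 0 on
   success, -1 on error or end of file.  */
int
acquire_token (int rfd, char *token)
{
  ssize_t n;
  do
    n = read (rfd, token, 1);
  while (n == -1 && errno == EINTR);
  return n == 1 ? 0 : -1;
}

/* Give a token back by writing the same byte that was read.  */
int
release_token (int wfd, char token)
{
  ssize_t n;
  do
    n = write (wfd, &token, 1);
  while (n == -1 && errno == EINTR);
  return n == 1 ? 0 : -1;
}
```

A streaming parent would call `acquire_token` before each `fork` and `release_token` after reaping the corresponding child. Note that `acquire_token` treats a zero-byte read as a failure, which is the symptom ("ends with a 0 read") mentioned earlier in the thread.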
Index: lto-cgraph.c
===================================================================
--- lto-cgraph.c (revision 201891)
+++ lto-cgraph.c (working copy)
@@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* True when asm nodes has been output.  */
+bool asm_nodes_output = false;
+
 static void output_cgraph_opt_summary (void);
 static void input_cgraph_opt_summary (vec<symtab_node> nodes);
 
@@ -852,7 +855,6 @@ output_symtab (void)
   lto_symtab_encoder_iterator lsei;
   int i, n_nodes;
   lto_symtab_encoder_t encoder;
-  static bool asm_nodes_output = false;
 
   if (flag_wpa)
     output_cgraph_opt_summary ();
Index: lto-streamer.h
===================================================================
--- lto-streamer.h (revision 201891)
+++ lto-streamer.h (working copy)
@@ -870,6 +870,7 @@ void lto_output_location (struct output_
 
 
 /* In lto-cgraph.c  */
+extern bool asm_nodes_output;
 lto_symtab_encoder_t lto_symtab_encoder_new (bool);
 int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node);
 void lto_symtab_encoder_delete (lto_symtab_encoder_t);
Index: lto-wrapper.c
===================================================================
--- lto-wrapper.c (revision 201891)
+++ lto-wrapper.c (working copy)
@@ -56,6 +56,9 @@ along with GCC; see the file COPYING3.
 
 int debug;                             /* true if -save-temps.  */
 int verbose;                           /* true if -v.  */
+int parallel = 0;                      /* number of parallel builds specified
+                                          by -flto=N  */
+int jobserver = 0;                     /* true if -flto=jobserver was used.  */
 
 enum lto_mode_d {
   LTO_MODE_NONE,                       /* Not doing LTO.  */
@@ -445,8 +448,6 @@ run_gcc (unsigned argc, char *argv[])
   char *list_option_full = NULL;
   const char *linker_output = NULL;
   const char *collect_gcc, *collect_gcc_options;
-  int parallel = 0;
-  int jobserver = 0;
   bool no_partition = false;
   struct cl_decoded_option *fdecoded_options = NULL;
   unsigned int fdecoded_options_count = 0;
@@ -630,6 +631,16 @@ run_gcc (unsigned argc, char *argv[])
               if (parallel <= 1)
                 parallel = 0;
             }
+          if (jobserver)
+            {
+              obstack_ptr_grow (&argv_obstack, xstrdup ("-fparallelism=jobserver"));
+            }
+          else if (parallel > 1)
+            {
+              char buf[256];
+              sprintf (buf, "-fparallelism=%i", parallel);
+              obstack_ptr_grow (&argv_obstack, xstrdup (buf));
+            }
           /* Fallthru.  */
 
         case OPT_flto:
Index: lto/lto.c
===================================================================
--- lto/lto.c (revision 201891)
+++ lto/lto.c (working copy)
@@ -49,6 +49,9 @@ along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
+int lto_parallelism;
+
 static GTY(()) tree first_personality_decl;
 
 /* Returns a hash code for P.  */
@@ -3002,6 +3005,98 @@ cmp_partitions_order (const void *a, con
   return orderb - ordera;
 }
 
+/* Actually stream out ENCODER into TEMP_FILENAME.  */
+
+void
+do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
+{
+  lto_file *file = lto_obj_file_open (temp_filename, true);
+  if (!file)
+    fatal_error ("lto_obj_file_open() failed");
+  lto_set_current_out_file (file);
+
+  ipa_write_optimization_summaries (encoder);
+
+  lto_set_current_out_file (NULL);
+  lto_obj_file_close (file);
+  free (file);
+}
+
+/* Wait for forked process and signal errors.  */
+#ifdef HAVE_WORKING_FORK
+void
+wait_for_child ()
+{
+  int status;
+  do
+    {
+      int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
+      if (w == -1)
+        fatal_error ("waitpid failed");
+
+      if (WIFEXITED (status) && WEXITSTATUS (status))
+        fatal_error ("streaming subprocess failed");
+      else if (WIFSIGNALED (status))
+        fatal_error ("streaming subprocess was killed by signal");
+    }
+  while (!WIFEXITED(status) && !WIFSIGNALED(status));
+}
+#endif
+
+/* Stream out ENCODER into TEMP_FILENAME
+   Fork if that seems to help.  */
+
+void
+stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
+{
+#ifdef HAVE_WORKING_FORK
+  static int nruns;
+
+  if (!lto_parallelism || lto_parallelism == 1)
+    {
+      do_stream_out (temp_filename, encoder);
+      return;
+    }
+
+  /* Do not run more than LTO_PARALLELISM streamings
+     FIXME: we ignore limits on jobserver.  */
+  if (lto_parallelism > 0 && nruns >= lto_parallelism)
+    {
+      wait_for_child ();
+      nruns --;
+    }
+  /* If this is not the last parallel partition, execute new
+     streaming process.  */
+  if (!last)
+    {
+      pid_t cpid = fork ();
+
+      if (!cpid)
+        {
+          setproctitle ("lto1-wpa-streaming");
+          do_stream_out (temp_filename, encoder);
+          exit (0);
+        }
+      /* Fork failed; lets do the job ourseleves.  */
+      else if (cpid == -1)
+        do_stream_out (temp_filename, encoder);
+      else
+        nruns++;
+    }
+  /* Last partition; stream it and wait for all children to die.  */
+  else
+    {
+      int i;
+      do_stream_out (temp_filename, encoder);
+      for (i = 0; i < nruns; i++)
+        wait_for_child ();
+    }
+  asm_nodes_output = true;
+#else
+  do_stream_out (temp_filename, encoder);
+#endif
+}
+
 /* Write all output files in WPA mode and the file with the list of
    LTRANS units.  */
 
@@ -3009,18 +3104,15 @@ static void
 lto_wpa_write_files (void)
 {
   unsigned i, n_sets;
-  lto_file *file;
   ltrans_partition part;
   FILE *ltrans_output_list_stream;
   char *temp_filename;
+  vec <char *>temp_filenames = vNULL;
   size_t blen;
 
   /* Open the LTRANS output list.  */
   if (!ltrans_output_list)
     fatal_error ("no LTRANS output list filename provided");
-  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
-  if (ltrans_output_list_stream == NULL)
-    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
 
   timevar_push (TV_WHOPR_WPA);
 
@@ -3056,14 +3148,10 @@ lto_wpa_write_files (void)
                       : cmp_partitions_order);
   for (i = 0; i < n_sets; i++)
     {
-      size_t len;
       ltrans_partition part = ltrans_partitions[i];
 
       /* Write all the nodes in SET.  */
       sprintf (temp_filename + blen, "%u.o", i);
-      file = lto_obj_file_open (temp_filename, true);
-      if (!file)
-        fatal_error ("lto_obj_file_open() failed");
 
       if (!quiet_flag)
         fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
@@ -3105,21 +3193,25 @@ lto_wpa_write_files (void)
         }
       gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
 
-      lto_set_current_out_file (file);
-
-      ipa_write_optimization_summaries (part->encoder);
+      stream_out (temp_filename, part->encoder, i == n_sets - 1);
 
-      lto_set_current_out_file (NULL);
-      lto_obj_file_close (file);
-      free (file);
       part->encoder = NULL;
 
-      len = strlen (temp_filename);
-      if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
+      temp_filenames.safe_push (xstrdup (temp_filename));
+    }
+  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
+  if (ltrans_output_list_stream == NULL)
+    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
+  for (i = 0; i < n_sets; i++)
+    {
+      unsigned int len = strlen (temp_filenames[i]);
+      if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
           || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
         fatal_error ("writing to LTRANS output list %s: %m",
                      ltrans_output_list);
+      free (temp_filenames[i]);
     }
+  temp_filenames.release();
 
   lto_stats.num_output_files += n_sets;
 
Index: lto/lang.opt
===================================================================
--- lto/lang.opt (revision 201891)
+++ lto/lang.opt (working copy)
@@ -32,6 +32,10 @@ fltrans-output-list=
 LTO Joined Var(ltrans_output_list)
 Specify a file to which a list of files output by LTRANS is written.
 
+fparallelism=
+LTO Joined
+Run the link-time optimizer in whole program analysis (WPA) mode.
+
 fwpa
 LTO Driver Report Var(flag_wpa)
 Run the link-time optimizer in whole program analysis (WPA) mode.
Index: lto/lto.h
===================================================================
--- lto/lto.h (revision 201891)
+++ lto/lto.h (working copy)
@@ -39,6 +39,7 @@ extern const char *resolution_file_name;
 extern tree lto_eh_personality (void);
 extern void lto_main (void);
 extern void lto_read_all_file_options (void);
+extern int lto_parallelism;
 
 /* In lto-elf.c or lto-coff.c  */
 extern lto_file *lto_obj_file_open (const char *filename, bool writable);
Index: lto/lto-lang.c
===================================================================
--- lto/lto-lang.c (revision 201891)
+++ lto/lto-lang.c (working copy)
@@ -735,6 +735,19 @@ lto_handle_option (size_t scode, const c
       warn_psabi = value;
       break;
 
+    case OPT_fparallelism_:
+      if (!arg)
+        lto_parallelism = 1;
+      else if (!strcmp (arg, "jobserver"))
+        lto_parallelism = -1;
+      else
+        {
+          lto_parallelism = atoi (arg);
+          if (lto_parallelism <= 0)
+            lto_parallelism = 0;
+        }
+      break;
+
     default:
       break;
     }
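The last hunk maps the text after `-fparallelism=` to a task count with `atoi`, which silently accepts input such as `4x`. A stricter variant could look like the sketch below. It is not part of the patch: `parse_parallelism` is a hypothetical name, and it deliberately differs from the hunk by treating a missing or empty argument as "no parallel streaming" (0) rather than 1.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: map the text after "-fparallelism=" to a task
   count.  Returns -1 for "jobserver", N for a positive decimal N, and
   0 (no parallel streaming) for anything missing or malformed.  */
int
parse_parallelism (const char *arg)
{
  if (arg == NULL || *arg == '\0')
    return 0;
  if (strcmp (arg, "jobserver") == 0)
    return -1;

  char *end;
  errno = 0;
  long n = strtol (arg, &end, 10);
  /* Reject overflow, trailing junk, and non-positive values.  */
  if (errno != 0 || *end != '\0' || n <= 0 || n > INT_MAX)
    return 0;

  assert (n >= 1);
  return (int) n;
}
```

With this mapping, `stream_out` in the patch would still take its serial path for both 0 and 1, so only values of 2 or more (or the jobserver marker) lead to forking.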