[RFC] Old school parallelization of WPA streaming

Message ID 20131120230239.GA28683@atrey.karlin.mff.cuni.cz
State New

Commit Message

Jan Hubicka Nov. 20, 2013, 11:02 p.m. UTC
Hi,
I am not sure where we converged concerning the fork trick.  I have been using
it in my tree for months and it does cut my waiting time for WPA compilations,
so I am re-attaching the patch.

Does it seem reasonable for mainline?
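
For readers new to the thread, the fork trick boils down to the pattern below.
This is only a minimal standalone sketch, not the patch itself: write_partition,
reap_one_child and write_all_partitions are hypothetical names, and the real
streaming call is replaced by a trivial file write.  The parent forks one child
per partition, never keeps more than MAX_JOBS children alive, and falls back to
doing the work itself if fork fails.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for streaming one partition to disk.  */
static int
write_partition (const char *filename, int index)
{
  FILE *f = fopen (filename, "w");
  if (!f)
    return -1;
  fprintf (f, "partition %d\n", index);
  return fclose (f) == 0 ? 0 : -1;
}

/* Reap one child; return 0 if it exited with status 0, -1 otherwise.  */
static int
reap_one_child (void)
{
  int status;
  if (waitpid (-1, &status, 0) == -1)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}

/* Write N partitions named PREFIX<i>.o with at most MAX_JOBS children
   running at once.  Return the number of partitions that failed.  */
static int
write_all_partitions (const char *prefix, int n, int max_jobs)
{
  int running = 0, failures = 0;
  char name[256];

  for (int i = 0; i < n; i++)
    {
      snprintf (name, sizeof name, "%s%d.o", prefix, i);

      /* No parallelism requested: do the work in this process.  */
      if (max_jobs <= 1)
        {
          if (write_partition (name, i) != 0)
            failures++;
          continue;
        }

      /* Throttle: wait for one child before starting another.  */
      if (running >= max_jobs)
        {
          if (reap_one_child () != 0)
            failures++;
          running--;
        }

      fflush (NULL);  /* Do not let the child inherit unflushed stdio data.  */
      pid_t pid = fork ();
      if (pid == 0)
        _exit (write_partition (name, i) == 0 ? 0 : 1);
      else if (pid == -1)
        {
          /* Fork failed; do the job ourselves.  */
          if (write_partition (name, i) != 0)
            failures++;
        }
      else
        running++;
    }

  /* Wait for the remaining children before anyone consumes the files.  */
  while (running > 0)
    {
      if (reap_one_child () != 0)
        failures++;
      running--;
    }
  return failures;
}
```

The final loop is the reason the patch writes the LTRANS output list only after
all partitions are streamed: no consumer should see a file name before the
child producing that file has exited.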

As for the other plans mentioned on this thread:
> > 
> > I still have some items on list here
> >  1) avoid function sections to be decompressed by WPA
> >     (this won't cause much compile time improvements as decompression is
> >      well below 10% of runtime)
> 
> still low-hanging
> 
> finally get a LTO section header!  (with a flag telling whether the
> section is compressed)

I have a patch for it somewhere (not particularly clean; we need to dig more
into the basic section handling code in LTO).  The benefit, however, was quite
small (we are still dominated by decls and types), so perhaps this can wait for
the next stage1 or a development branch.
> 
> >  2) put variable initializers into named sections just as function bodies
> >     are.
> >     Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
> >     initializers are actually about as big as the text segment.  While
> >     it seems a bit wasteful to put a single integer_cst there (and we can
> >     special case this), it seems that there is a promise for vtables
> >     and other stuff.
> > 
> >     To make devirt work, we will need to load vtables into memory (or
> >     invent representation to stream them other way that would be similarly
> >     big). Still we will avoid need to load them in 5000 copies and merge
> >     them.

Did not finish this, unfortunately (devirtualization was more involved and
I lost track of this one).  I had a prototype working where the savings were
about 15% of WPA memory.  I will try to get a cleaner implementation soon.

> >  3) I think good part of function/partitioning overhead is because abstract
> >     origin streaming is utterly broken.

Yep, this is definitely still in longer term plans only.

Honza

	* lto-cgraph.c (asm_nodes_output): Make global.
	* lto-streamer.h (asm_nodes_output): Declare.
	* lto-wrapper.c (parallel, jobserver): Make global.
	(run_gcc): Pass down -fparallelism.

	* lto.c (lto_parallelism): New variable.
	(do_stream_out): New function.
	(stream_out): New function.
	(lto_wpa_write_files): Use it.
	* lang.opt (fparallelism): New.
	* lto.h (lto_parallelism): Declare.
	* lto-lang.c (lto_handle_option): Add fparallelism.

Comments

Richard Biener Nov. 21, 2013, 9:24 a.m. UTC | #1
On Thu, 21 Nov 2013, Jan Hubicka wrote:

> Hi,
> I am not sure where we converged concerning the fork trick.  I have been using
> it in my tree for months and it does cut my waiting time for WPA compilations,
> so I am re-attaching the patch.
> 
> Does it seem reasonable for mainline?
> 
> As for other plans mentioned on this thread
> > > 
> > > I still have some items on list here
> > >  1) avoid function sections to be decompressed by WPA
> > >     (this won't cause much compile time improvements as decompression is
> > >      well below 10% of runtime)
> > 
> > still low-hanging
> > 
> > finally get a LTO section header!  (with a flag telling whether the
> > section is compressed)
> 
> I have a patch for it somewhere (not particularly clean; we need to dig more
> into the basic section handling code in LTO).  The benefit, however, was quite
> small (we are still dominated by decls and types), so perhaps this can wait for
> the next stage1 or a development branch.
> > 
> > >  2) put variable initializers into named sections just as function bodies
> > >     are.
> > >     Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
> > >     initializers are actually about as big as the text segment.  While
> > >     it seems a bit wasteful to put a single integer_cst there (and we can
> > >     special case this), it seems that there is a promise for vtables
> > >     and other stuff.
> > > 
> > >     To make devirt work, we will need to load vtables into memory (or
> > >     invent representation to stream them other way that would be similarly
> > >     big). Still we will avoid need to load them in 5000 copies and merge
> > >     them.
> 
> Did not finish this, unfortunately (devirtualization was more involved and
> I lost track of this one).  I had a prototype working where the savings were
> about 15% of WPA memory.  I will try to get a cleaner implementation soon.
> 
> > >  3) I think good part of function/partitioning overhead is because abstract
> > >     origin streaming is utterly broken.
> 
> Yep, this is definitely still in longer term plans only.

Why do you need an additional -fparallelism?  Wouldn't
-fwpa=... be a better match, matching -flto=...?  As we already
pass down a -fwpa option to WPA this would make things easier, no?

Thanks,
Richard.

> Honza
> 
> 	* lto-cgraph.c (asm_nodes_output): Make global.
> 	* lto-streamer.h (asm_nodes_output): Declare.
> 	* lto-wrapper.c (parallel, jobserver): Make global.
> 	(run_gcc): Pass down -fparallelism
> 
> 	* lto.c (lto_parallelism): New variable.
> 	(do_stream_out): New function.
> 	(stream_out): New function.
> 	(lto_wpa_write_files): Use it.
> 	* lang.opt (fparallelism): New.
> 	* lto.h (lto_parallelism): Declare.
> 	* lto-lang.c (lto_handle_option): Add fparallelism.
> 
> Index: lto-cgraph.c
> ===================================================================
> --- lto-cgraph.c	(revision 201891)
> +++ lto-cgraph.c	(working copy)
> @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.
>  #include "context.h"
>  #include "pass_manager.h"
>  
> +/* True when asm nodes have been output.  */
> +bool asm_nodes_output = false;
> +
>  static void output_cgraph_opt_summary (void);
>  static void input_cgraph_opt_summary (vec<symtab_node>  nodes);
>  
> @@ -852,7 +855,6 @@ output_symtab (void)
>    lto_symtab_encoder_iterator lsei;
>    int i, n_nodes;
>    lto_symtab_encoder_t encoder;
> -  static bool asm_nodes_output = false;
>  
>    if (flag_wpa)
>      output_cgraph_opt_summary ();
> Index: lto-streamer.h
> ===================================================================
> --- lto-streamer.h	(revision 201891)
> +++ lto-streamer.h	(working copy)
> @@ -870,6 +870,7 @@ void lto_output_location (struct output_
>  
>  
>  /* In lto-cgraph.c  */
> +extern bool asm_nodes_output;
>  lto_symtab_encoder_t lto_symtab_encoder_new (bool);
>  int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node);
>  void lto_symtab_encoder_delete (lto_symtab_encoder_t);
> Index: lto-wrapper.c
> ===================================================================
> --- lto-wrapper.c	(revision 201891)
> +++ lto-wrapper.c	(working copy)
> @@ -56,6 +56,9 @@ along with GCC; see the file COPYING3.
>  
>  int debug;				/* true if -save-temps.  */
>  int verbose;				/* true if -v.  */
> +int parallel = 0;			/* number of parallel builds specified
> +					   by -flto=N  */
> +int jobserver = 0;			/* true if -flto=jobserver was used.  */
>  
>  enum lto_mode_d {
>    LTO_MODE_NONE,			/* Not doing LTO.  */
> @@ -445,8 +448,6 @@ run_gcc (unsigned argc, char *argv[])
>    char *list_option_full = NULL;
>    const char *linker_output = NULL;
>    const char *collect_gcc, *collect_gcc_options;
> -  int parallel = 0;
> -  int jobserver = 0;
>    bool no_partition = false;
>    struct cl_decoded_option *fdecoded_options = NULL;
>    unsigned int fdecoded_options_count = 0;
> @@ -630,6 +631,16 @@ run_gcc (unsigned argc, char *argv[])
>  	      if (parallel <= 1)
>  		parallel = 0;
>  	    }
> +	  if (jobserver)
> +	    {
> +	      obstack_ptr_grow (&argv_obstack, xstrdup ("-fparallelism=jobserver"));
> +	    }
> +	  else if (parallel > 1)
> +	    {
> +	      char buf[256];
> +	      sprintf (buf, "-fparallelism=%i", parallel);
> +	      obstack_ptr_grow (&argv_obstack, xstrdup (buf));
> +	    }
>  	  /* Fallthru.  */
>  
>  	case OPT_flto:
> Index: lto/lto.c
> ===================================================================
> --- lto/lto.c	(revision 201891)
> +++ lto/lto.c	(working copy)
> @@ -49,6 +49,9 @@ along with GCC; see the file COPYING3.
>  #include "context.h"
>  #include "pass_manager.h"
>  
> +/* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
> +int lto_parallelism;
> +
>  static GTY(()) tree first_personality_decl;
>  
>  /* Returns a hash code for P.  */
> @@ -3002,6 +3005,98 @@ cmp_partitions_order (const void *a, con
>    return orderb - ordera;
>  }
>  
> +/* Actually stream out ENCODER into TEMP_FILENAME.  */
> +
> +void
> +do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
> +{
> +  lto_file *file = lto_obj_file_open (temp_filename, true);
> +  if (!file)
> +    fatal_error ("lto_obj_file_open() failed");
> +  lto_set_current_out_file (file);
> +
> +  ipa_write_optimization_summaries (encoder);
> +
> +  lto_set_current_out_file (NULL);
> +  lto_obj_file_close (file);
> +  free (file);
> +}
> +
> +/* Wait for forked process and signal errors.  */
> +#ifdef HAVE_WORKING_FORK
> +void
> +wait_for_child ()
> +{
> +  int status;
> +  do
> +    {
> +      int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
> +      if (w == -1)
> +	fatal_error ("waitpid failed");
> +
> +      if (WIFEXITED (status) && WEXITSTATUS (status))
> +	fatal_error ("streaming subprocess failed");
> +      else if (WIFSIGNALED (status))
> +	fatal_error ("streaming subprocess was killed by signal");
> +    }
> +  while (!WIFEXITED(status) && !WIFSIGNALED(status));
> +}
> +#endif
> +
> +/* Stream out ENCODER into TEMP_FILENAME
> +   Fork if that seems to help.  */
> +
> +void
> +stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
> +{
> +#ifdef HAVE_WORKING_FORK
> +  static int nruns;
> +
> +  if (!lto_parallelism || lto_parallelism == 1)
> +    {
> +      do_stream_out (temp_filename, encoder);
> +      return;
> +    }
> +
> +  /* Do not run more than LTO_PARALLELISM streamings
> +     FIXME: we ignore limits on jobserver.  */
> +  if (lto_parallelism > 0 && nruns >= lto_parallelism)
> +    {
> +      wait_for_child ();
> +      nruns --;
> +    }
> +  /* If this is not the last parallel partition, execute new
> +     streaming process.  */
> +  if (!last)
> +    {
> +      pid_t cpid = fork ();
> +
> +      if (!cpid)
> +	{
> +	  setproctitle ("lto1-wpa-streaming");
> +	  do_stream_out (temp_filename, encoder);
> +	  exit (0);
> +	}
> +      /* Fork failed; let's do the job ourselves.  */
> +      else if (cpid == -1)
> +        do_stream_out (temp_filename, encoder);
> +      else
> +	nruns++;
> +    }
> +  /* Last partition; stream it and wait for all children to die.  */
> +  else
> +    {
> +      int i;
> +      do_stream_out (temp_filename, encoder);
> +      for (i = 0; i < nruns; i++)
> +	wait_for_child ();
> +    }
> +  asm_nodes_output = true;
> +#else
> +  do_stream_out (temp_filename, encoder);
> +#endif
> +}
> +
>  /* Write all output files in WPA mode and the file with the list of
>     LTRANS units.  */
>  
> @@ -3009,18 +3104,15 @@ static void
>  lto_wpa_write_files (void)
>  {
>    unsigned i, n_sets;
> -  lto_file *file;
>    ltrans_partition part;
>    FILE *ltrans_output_list_stream;
>    char *temp_filename;
> +  vec <char *>temp_filenames = vNULL;
>    size_t blen;
>  
>    /* Open the LTRANS output list.  */
>    if (!ltrans_output_list)
>      fatal_error ("no LTRANS output list filename provided");
> -  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
> -  if (ltrans_output_list_stream == NULL)
> -    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
>  
>    timevar_push (TV_WHOPR_WPA);
>  
> @@ -3056,14 +3148,10 @@ lto_wpa_write_files (void)
>  			   : cmp_partitions_order);
>    for (i = 0; i < n_sets; i++)
>      {
> -      size_t len;
>        ltrans_partition part = ltrans_partitions[i];
>  
>        /* Write all the nodes in SET.  */
>        sprintf (temp_filename + blen, "%u.o", i);
> -      file = lto_obj_file_open (temp_filename, true);
> -      if (!file)
> -	fatal_error ("lto_obj_file_open() failed");
>  
>        if (!quiet_flag)
>  	fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
> @@ -3105,21 +3193,25 @@ lto_wpa_write_files (void)
>  	}
>        gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
>  
> -      lto_set_current_out_file (file);
> -
> -      ipa_write_optimization_summaries (part->encoder);
> +      stream_out (temp_filename, part->encoder, i == n_sets - 1);
>  
> -      lto_set_current_out_file (NULL);
> -      lto_obj_file_close (file);
> -      free (file);
>        part->encoder = NULL;
>  
> -      len = strlen (temp_filename);
> -      if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
> +      temp_filenames.safe_push (xstrdup (temp_filename));
> +    }
> +  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
> +  if (ltrans_output_list_stream == NULL)
> +    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
> +  for (i = 0; i < n_sets; i++)
> +    {
> +      unsigned int len = strlen (temp_filenames[i]);
> +      if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
>  	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
>  	fatal_error ("writing to LTRANS output list %s: %m",
>  		     ltrans_output_list);
> +     free (temp_filenames[i]);
>      }
> +  temp_filenames.release();
>  
>    lto_stats.num_output_files += n_sets;
>  
> Index: lto/lang.opt
> ===================================================================
> --- lto/lang.opt	(revision 201891)
> +++ lto/lang.opt	(working copy)
> @@ -32,6 +32,10 @@ fltrans-output-list=
>  LTO Joined Var(ltrans_output_list)
>  Specify a file to which a list of files output by LTRANS is written.
>  
> +fparallelism=
> +LTO Joined
> +Run the link-time optimizer in whole program analysis (WPA) mode.
> +
>  fwpa
>  LTO Driver Report Var(flag_wpa)
>  Run the link-time optimizer in whole program analysis (WPA) mode.
> Index: lto/lto.h
> ===================================================================
> --- lto/lto.h	(revision 201891)
> +++ lto/lto.h	(working copy)
> @@ -39,6 +39,7 @@ extern const char *resolution_file_name;
>  extern tree lto_eh_personality (void);
>  extern void lto_main (void);
>  extern void lto_read_all_file_options (void);
> +extern int lto_parallelism;
>  
>  /* In lto-elf.c or lto-coff.c  */
>  extern lto_file *lto_obj_file_open (const char *filename, bool writable);
> Index: lto/lto-lang.c
> ===================================================================
> --- lto/lto-lang.c	(revision 201891)
> +++ lto/lto-lang.c	(working copy)
> @@ -735,6 +735,19 @@ lto_handle_option (size_t scode, const c
>        warn_psabi = value;
>        break;
>  
> +    case OPT_fparallelism_:
> +      if (!arg)
> +	lto_parallelism = 1;
> +      else if (!strcmp (arg, "jobserver"))
> +	lto_parallelism = -1;
> +      else
> +	{
> +	  lto_parallelism = atoi (arg);
> +	  if (lto_parallelism <= 0)
> +	    lto_parallelism = 0;
> +	}
> +      break;
> +
>      default:
>        break;
>      }
> 
>
Jan Hubicka Nov. 21, 2013, 10:19 a.m. UTC | #2
> 
> Why do you need an additional -fparallelism?  Wouldn't
> -fwpa=... be a better match, matching -flto=...?  As we already
> pass down a -fwpa option to WPA this would make things easier, no?

My plan was to possibly use the same option later for parallelizing more parts
of the compiler, not only WPA streaming.  Streaming in may have some chance if
we get thread safety in GGC or move a sufficient amount of stuff out of GGC.
Also we can parallelize the inliner heuristics or IPA-PTA, if it ever works.
So it would make sense with -flto-partition=none and perhaps with local
optimization, too.

But I can definitely update the patch to use -fwpa=N and we can deal with this
once this becomes real (i.e. I have no clue how to parallelize the inliner
without making its decisions depend on the parallelism, and degrade as the
parallelism increases, nor do I have real plans for the stream-in procedure).
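
Whatever the option ends up being called, the argument handling could look
roughly like the sketch below.  It assumes the same encoding the patch uses for
lto_parallelism (-1 for the jobserver, 0 for no parallelism, N otherwise);
parse_parallelism_arg is a hypothetical helper, not an existing GCC function,
and unlike the atoi call in the patch it rejects trailing garbage.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode the text after "=" in an option such as
   -fwpa=N.  Returns -1 for "jobserver", N for an integer N > 1, and 0
   (stream serially) for anything else, including 1 and malformed input.  */
static int
parse_parallelism_arg (const char *arg)
{
  if (!arg || !*arg)
    return 0;
  if (strcmp (arg, "jobserver") == 0)
    return -1;

  char *end;
  errno = 0;
  long n = strtol (arg, &end, 10);
  /* Reject overflow, trailing characters and values that mean "serial".  */
  if (errno != 0 || *end != '\0' || n <= 1 || n > INT_MAX)
    return 0;
  return (int) n;
}
```

Mapping 1 to 0 is a design choice of this sketch; the patch gets the same
effect by having stream_out treat both values as "do not fork".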

Honza
> 
> Thanks,
> Richard.
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
Richard Biener Nov. 21, 2013, 10:32 a.m. UTC | #3
On Thu, 21 Nov 2013, Jan Hubicka wrote:

> > 
> > Why do you need an additional -fparallelism?  Wouldn't
> > -fwpa=... be a better match, matching -flto=...?  As we already
> > pass down a -fwpa option to WPA this would make things easier, no?
> 
> My plan was to possibly use the same option later for parallelizing more parts
> of the compiler, not only WPA streaming.  Streaming in may have some chance if
> we get thread safety in GGC or move a sufficient amount of stuff out of GGC.
> Also we can parallelize the inliner heuristics or IPA-PTA, if it ever works.
> So it would make sense with -flto-partition=none and perhaps with local
> optimization, too.

I'd like to drop -flto-partition=none eventually.  It's just one more
path through the compiler to support ...

> But I can definitely update the patch to use -fwpa=N and we can deal with this
> once this becomes real (i.e. I have no clue how to parallelize the inliner
> without making its decisions depend on the parallelism, and degrade as the
> parallelism increases, nor do I have real plans for the stream-in procedure).

Please.

Richard.

> Honza
> > 
> > Thanks,
> > Richard.
> > 
> > > Honza
> > > 
> > > 	* lto-cgraph.c (asm_nodes_output): Make global.
> > > 	* lto-streamer.h (asm_nodes_output): Declare.
> > > 	* lto-wrapper.c (parallel, jobserver): Make global.
> > > 	(run_gcc): Pass down -fparallelism
> > > 
> > > 	* lto.c (lto_parallelism): New variable.
> > > 	(do_stream_out): New function.
> > > 	(stream_out): New function.
> > > 	(lto_wpa_write_files): Use it.
> > > 	* lang.opt (fparallelism): New.
> > > 	* lto.h (lto_parallelism): Declare.
> > > 	* lto-lang.c (lto_handle_option): Add fparallelism.
> > > 
> > > Index: lto-cgraph.c
> > > ===================================================================
> > > --- lto-cgraph.c	(revision 201891)
> > > +++ lto-cgraph.c	(working copy)
> > > @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.
> > >  #include "context.h"
> > >  #include "pass_manager.h"
> > >  
> > > +/* True when asm nodes have been output.  */
> > > +bool asm_nodes_output = false;
> > > +
> > >  static void output_cgraph_opt_summary (void);
> > >  static void input_cgraph_opt_summary (vec<symtab_node>  nodes);
> > >  
> > > @@ -852,7 +855,6 @@ output_symtab (void)
> > >    lto_symtab_encoder_iterator lsei;
> > >    int i, n_nodes;
> > >    lto_symtab_encoder_t encoder;
> > > -  static bool asm_nodes_output = false;
> > >  
> > >    if (flag_wpa)
> > >      output_cgraph_opt_summary ();
> > > Index: lto-streamer.h
> > > ===================================================================
> > > --- lto-streamer.h	(revision 201891)
> > > +++ lto-streamer.h	(working copy)
> > > @@ -870,6 +870,7 @@ void lto_output_location (struct output_
> > >  
> > >  
> > >  /* In lto-cgraph.c  */
> > > +extern bool asm_nodes_output;
> > >  lto_symtab_encoder_t lto_symtab_encoder_new (bool);
> > >  int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node);
> > >  void lto_symtab_encoder_delete (lto_symtab_encoder_t);
> > > Index: lto-wrapper.c
> > > ===================================================================
> > > --- lto-wrapper.c	(revision 201891)
> > > +++ lto-wrapper.c	(working copy)
> > > @@ -56,6 +56,9 @@ along with GCC; see the file COPYING3.
> > >  
> > >  int debug;				/* true if -save-temps.  */
> > >  int verbose;				/* true if -v.  */
> > > +int parallel = 0;			/* number of parallel builds specified
> > > +					   by -flto=N  */
> > > +int jobserver = 0;			/* true if -flto=jobserver was used.  */
> > >  
> > >  enum lto_mode_d {
> > >    LTO_MODE_NONE,			/* Not doing LTO.  */
> > > @@ -445,8 +448,6 @@ run_gcc (unsigned argc, char *argv[])
> > >    char *list_option_full = NULL;
> > >    const char *linker_output = NULL;
> > >    const char *collect_gcc, *collect_gcc_options;
> > > -  int parallel = 0;
> > > -  int jobserver = 0;
> > >    bool no_partition = false;
> > >    struct cl_decoded_option *fdecoded_options = NULL;
> > >    unsigned int fdecoded_options_count = 0;
> > > @@ -630,6 +631,16 @@ run_gcc (unsigned argc, char *argv[])
> > >  	      if (parallel <= 1)
> > >  		parallel = 0;
> > >  	    }
> > > +	  if (jobserver)
> > > +	    {
> > > +	      obstack_ptr_grow (&argv_obstack, xstrdup ("-fparallelism=jobserver"));
> > > +	    }
> > > +	  else if (parallel > 1)
> > > +	    {
> > > +	      char buf[256];
> > > +	      sprintf (buf, "-fparallelism=%i", parallel);
> > > +	      obstack_ptr_grow (&argv_obstack, xstrdup (buf));
> > > +	    }
> > >  	  /* Fallthru.  */
> > >  
> > >  	case OPT_flto:
> > > Index: lto/lto.c
> > > ===================================================================
> > > --- lto/lto.c	(revision 201891)
> > > +++ lto/lto.c	(working copy)
> > > @@ -49,6 +49,9 @@ along with GCC; see the file COPYING3.
> > >  #include "context.h"
> > >  #include "pass_manager.h"
> > >  
> > > +/* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
> > > +int lto_parallelism;
> > > +
> > >  static GTY(()) tree first_personality_decl;
> > >  
> > >  /* Returns a hash code for P.  */
> > > @@ -3002,6 +3005,98 @@ cmp_partitions_order (const void *a, con
> > >    return orderb - ordera;
> > >  }
> > >  
> > > +/* Actually stream out ENCODER into TEMP_FILENAME.  */
> > > +
> > > +void
> > > +do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
> > > +{
> > > +  lto_file *file = lto_obj_file_open (temp_filename, true);
> > > +  if (!file)
> > > +    fatal_error ("lto_obj_file_open() failed");
> > > +  lto_set_current_out_file (file);
> > > +
> > > +  ipa_write_optimization_summaries (encoder);
> > > +
> > > +  lto_set_current_out_file (NULL);
> > > +  lto_obj_file_close (file);
> > > +  free (file);
> > > +}
> > > +
> > > +/* Wait for forked process and signal errors.  */
> > > +#ifdef HAVE_WORKING_FORK
> > > +void
> > > +wait_for_child ()
> > > +{
> > > +  int status;
> > > +  do
> > > +    {
> > > +      int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
> > > +      if (w == -1)
> > > +	fatal_error ("waitpid failed");
> > > +
> > > +      if (WIFEXITED (status) && WEXITSTATUS (status))
> > > +	fatal_error ("streaming subprocess failed");
> > > +      else if (WIFSIGNALED (status))
> > > +	fatal_error ("streaming subprocess was killed by signal");
> > > +    }
> > > +  while (!WIFEXITED(status) && !WIFSIGNALED(status));
> > > +}
> > > +#endif
> > > +
> > > +/* Stream out ENCODER into TEMP_FILENAME
> > > +   Fork if that seems to help.  */
> > > +
> > > +void
> > > +stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
> > > +{
> > > +#ifdef HAVE_WORKING_FORK
> > > +  static int nruns;
> > > +
> > > +  if (!lto_parallelism || lto_parallelism == 1)
> > > +    {
> > > +      do_stream_out (temp_filename, encoder);
> > > +      return;
> > > +    }
> > > +
> > > +  /* Do not run more than LTO_PARALLELISM streamings
> > > +     FIXME: we ignore limits on jobserver.  */
> > > +  if (lto_parallelism > 0 && nruns >= lto_parallelism)
> > > +    {
> > > +      wait_for_child ();
> > > +      nruns --;
> > > +    }
> > > +  /* If this is not the last parallel partition, execute new
> > > +     streaming process.  */
> > > +  if (!last)
> > > +    {
> > > +      pid_t cpid = fork ();
> > > +
> > > +      if (!cpid)
> > > +	{
> > > +	  setproctitle ("lto1-wpa-streaming");
> > > +	  do_stream_out (temp_filename, encoder);
> > > +	  exit (0);
> > > +	}
> > > +      /* Fork failed; let's do the job ourselves.  */
> > > +      else if (cpid == -1)
> > > +        do_stream_out (temp_filename, encoder);
> > > +      else
> > > +	nruns++;
> > > +    }
> > > +  /* Last partition; stream it and wait for all children to die.  */
> > > +  else
> > > +    {
> > > +      int i;
> > > +      do_stream_out (temp_filename, encoder);
> > > +      for (i = 0; i < nruns; i++)
> > > +	wait_for_child ();
> > > +    }
> > > +  asm_nodes_output = true;
> > > +#else
> > > +  do_stream_out (temp_filename, encoder);
> > > +#endif
> > > +}
> > > +
> > >  /* Write all output files in WPA mode and the file with the list of
> > >     LTRANS units.  */
> > >  
> > > @@ -3009,18 +3104,15 @@ static void
> > >  lto_wpa_write_files (void)
> > >  {
> > >    unsigned i, n_sets;
> > > -  lto_file *file;
> > >    ltrans_partition part;
> > >    FILE *ltrans_output_list_stream;
> > >    char *temp_filename;
> > > +  vec <char *>temp_filenames = vNULL;
> > >    size_t blen;
> > >  
> > >    /* Open the LTRANS output list.  */
> > >    if (!ltrans_output_list)
> > >      fatal_error ("no LTRANS output list filename provided");
> > > -  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
> > > -  if (ltrans_output_list_stream == NULL)
> > > -    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
> > >  
> > >    timevar_push (TV_WHOPR_WPA);
> > >  
> > > @@ -3056,14 +3148,10 @@ lto_wpa_write_files (void)
> > >  			   : cmp_partitions_order);
> > >    for (i = 0; i < n_sets; i++)
> > >      {
> > > -      size_t len;
> > >        ltrans_partition part = ltrans_partitions[i];
> > >  
> > >        /* Write all the nodes in SET.  */
> > >        sprintf (temp_filename + blen, "%u.o", i);
> > > -      file = lto_obj_file_open (temp_filename, true);
> > > -      if (!file)
> > > -	fatal_error ("lto_obj_file_open() failed");
> > >  
> > >        if (!quiet_flag)
> > >  	fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
> > > @@ -3105,21 +3193,25 @@ lto_wpa_write_files (void)
> > >  	}
> > >        gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
> > >  
> > > -      lto_set_current_out_file (file);
> > > -
> > > -      ipa_write_optimization_summaries (part->encoder);
> > > +      stream_out (temp_filename, part->encoder, i == n_sets - 1);
> > >  
> > > -      lto_set_current_out_file (NULL);
> > > -      lto_obj_file_close (file);
> > > -      free (file);
> > >        part->encoder = NULL;
> > >  
> > > -      len = strlen (temp_filename);
> > > -      if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
> > > +      temp_filenames.safe_push (xstrdup (temp_filename));
> > > +    }
> > > +  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
> > > +  if (ltrans_output_list_stream == NULL)
> > > +    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
> > > +  for (i = 0; i < n_sets; i++)
> > > +    {
> > > +      unsigned int len = strlen (temp_filenames[i]);
> > > +      if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
> > >  	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
> > >  	fatal_error ("writing to LTRANS output list %s: %m",
> > >  		     ltrans_output_list);
> > > +     free (temp_filenames[i]);
> > >      }
> > > +  temp_filenames.release();
> > >  
> > >    lto_stats.num_output_files += n_sets;
> > >  
> > > Index: lto/lang.opt
> > > ===================================================================
> > > --- lto/lang.opt	(revision 201891)
> > > +++ lto/lang.opt	(working copy)
> > > @@ -32,6 +32,10 @@ fltrans-output-list=
> > >  LTO Joined Var(ltrans_output_list)
> > >  Specify a file to which a list of files output by LTRANS is written.
> > >  
> > > +fparallelism=
> > > +LTO Joined
> > > +Run the link-time optimizer in whole program analysis (WPA) mode.
> > > +
> > >  fwpa
> > >  LTO Driver Report Var(flag_wpa)
> > >  Run the link-time optimizer in whole program analysis (WPA) mode.
> > > Index: lto/lto.h
> > > ===================================================================
> > > --- lto/lto.h	(revision 201891)
> > > +++ lto/lto.h	(working copy)
> > > @@ -39,6 +39,7 @@ extern const char *resolution_file_name;
> > >  extern tree lto_eh_personality (void);
> > >  extern void lto_main (void);
> > >  extern void lto_read_all_file_options (void);
> > > +extern int lto_parallelism;
> > >  
> > >  /* In lto-elf.c or lto-coff.c  */
> > >  extern lto_file *lto_obj_file_open (const char *filename, bool writable);
> > > Index: lto/lto-lang.c
> > > ===================================================================
> > > --- lto/lto-lang.c	(revision 201891)
> > > +++ lto/lto-lang.c	(working copy)
> > > @@ -735,6 +735,19 @@ lto_handle_option (size_t scode, const c
> > >        warn_psabi = value;
> > >        break;
> > >  
> > > +    case OPT_fparallelism_:
> > > +      if (!arg)
> > > +	lto_parallelism = 1;
> > > +      else if (!strcmp (arg, "jobserver"))
> > > +	lto_parallelism = -1;
> > > +      else
> > > +	{
> > > +	  lto_parallelism = atoi (arg);
> > > +	  if (lto_parallelism <= 0)
> > > +	    lto_parallelism = 0;
> > > +	}
> > > +      break;
> > > +
> > >      default:
> > >        break;
> > >      }
> > > 
> > > 
> > 
> > -- 
> > Richard Biener <rguenther@suse.de>
> > SUSE / SUSE Labs
> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
> 
>
Jan Hubicka Dec. 5, 2013, 11:54 p.m. UTC | #4
> On Thu, 21 Nov 2013, Jan Hubicka wrote:
> 
> > > 
> > > Why do you need an additional -fparallelism?  Wouldn't
> > > -fwpa=... be a better match, matching -flto=...?  As we already
> > > pass down a -fwpa option to WPA this would make things easier, no?
> > 
> > My plan was to possibly use the same option later for parallelizing more parts of
> > the compiler, not only WPA streaming. Streaming in may have some chance if we get
> > into thread safety of GGC or move a sufficient amount of stuff out of GGC.  Also
> > we can parallelize the inliner heuristic or IPA-PTA if it ever works. So it
> > would make sense with -flto-partition=none and perhaps with local optimization,
> > too.
> 
> I'd like to drop -flto-partition=none eventually.  It's just one more
> path through the compiler to support ...
> 
> > But I can definitely update the patch to use -fwpa=N and we can deal with this
> > once it becomes real (i.e. I have no clue how to parallelize the inliner without
> > making its decisions depend on the parallelism and degrade as parallelism
> > increases, nor do I have real plans for the stream-in procedure).
> 
> Please.
> 

Hi,
here is the updated patch. Sorry for the delay, I should have more time for hacking again
now...

Honza
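
For readers following the option change: the patch below accepts either a
job count or the word "jobserver" after -fwpa= and turns it into an integer.
The following is only a minimal standalone sketch of that mapping, not code
from the patch.  The name parse_wpa_jobs is hypothetical, and it uses strtol
where the patch uses atoi, so trailing garbage such as "4x" is rejected
here instead of being read as 4.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: map the text after "-fwpa="
   to a job count.  Returns -1 for "jobserver", N for a positive decimal
   integer N, and 0 (meaning "do not fork") for anything else.  */
static int
parse_wpa_jobs (const char *arg)
{
  char *end;
  long n;

  /* Plain -fwpa carries no argument: stream sequentially.  */
  if (arg == NULL || *arg == '\0')
    return 0;
  if (strcmp (arg, "jobserver") == 0)
    return -1;

  /* Accept only a complete, positive decimal number.  */
  n = strtol (arg, &end, 10);
  if (*end != '\0' || n <= 0 || n > INT_MAX)
    return 0;
  return (int) n;
}
```

With this sketch, "-fwpa=4" would give 4, "-fwpa=jobserver" would give -1,
and plain "-fwpa" or a malformed value would fall back to 0.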

	* lto-cgraph.c (asm_nodes_output): Make global.
	* lto-wrapper.c (run_gcc): Pass down parallelism to WPA.
	* lto.c (lto_parallelism): New static var.
	(do_stream_out, wait_for_child, stream_out): New static functions.
	(lto_wpa_write_files): Add support for parallel streaming.
	(do_whole_program_analysis): Set parallelism.
	* lang.opt (fwpa): Add parameter.
	* lto-lang.c (lto_handle_option): Handle flag_wpa.
	(lto_init): Update use of flag_wpa.
	* lto-streamer.h (asm_nodes_output): Declare.
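
The fork trick used by stream_out in the patch below can be illustrated in
isolation.  This is a minimal sketch under stated assumptions, not the patch
itself: run_jobs, reap_one and demo_job are hypothetical names, demo_job
stands in for do_stream_out, failures are counted rather than reported with
fatal_error, the child leaves through _exit, and any limit <= 1 (including
the -1 jobserver value) is treated as "no forking".

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for do_stream_out: pretend to write partition I.  */
static void
demo_job (int i)
{
  char name[32];
  snprintf (name, sizeof name, "part%d.o", i);
}

/* Reap one child.  Return 0 if it exited with status 0, 1 otherwise.  */
static int
reap_one (void)
{
  int status;
  pid_t w;

  do
    w = waitpid (-1, &status, 0);
  while (w == -1 && errno == EINTR);

  return !(w != -1 && WIFEXITED (status) && WEXITSTATUS (status) == 0);
}

/* Run JOB for indices 0..N-1 with at most LIMIT forked children alive at
   once; the last index always runs in the parent.  LIMIT <= 1 means no
   forking.  Return the number of children that failed.  */
static int
run_jobs (int n, int limit, void (*job) (int))
{
  int running = 0, failures = 0;

  for (int i = 0; i < n; i++)
    {
      if (limit <= 1)
        {
          job (i);
          continue;
        }
      /* Throttle: wait for a free slot before starting more work.  */
      if (running >= limit)
        {
          failures += reap_one ();
          running--;
        }
      if (i < n - 1)
        {
          pid_t pid;
          fflush (NULL);  /* Avoid duplicated stdio buffers in the child.  */
          pid = fork ();
          if (pid == 0)
            {
              job (i);
              _exit (0);
            }
          else if (pid == -1)
            job (i);      /* Fork failed: do the work in the parent.  */
          else
            running++;
        }
      else
        {
          /* Last job: run it here, then wait for all remaining children.  */
          job (i);
          while (running > 0)
            {
              failures += reap_one ();
              running--;
            }
        }
    }
  return failures;
}
```

Calling run_jobs (5, 2, demo_job) forks at most two children at a time and
runs the fifth job in the parent.  Because each child gets a copy-on-write
snapshot of the parent, nothing needs to be shared or locked, which is why
the approach works without making the streamer thread safe.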
Index: lto-cgraph.c
===================================================================
*** lto-cgraph.c	(revision 205646)
--- lto-cgraph.c	(working copy)
*************** along with GCC; see the file COPYING3.
*** 53,58 ****
--- 53,61 ----
  #include "pass_manager.h"
  #include "ipa-utils.h"
  
+ /* True when asm nodes have been output.  */
+ bool asm_nodes_output = false;
+ 
  static void output_cgraph_opt_summary (void);
  static void input_cgraph_opt_summary (vec<symtab_node *>  nodes);
  
*************** output_symtab (void)
*** 889,895 ****
    lto_symtab_encoder_iterator lsei;
    int i, n_nodes;
    lto_symtab_encoder_t encoder;
-   static bool asm_nodes_output = false;
  
    if (flag_wpa)
      output_cgraph_opt_summary ();
--- 892,897 ----
Index: lto-wrapper.c
===================================================================
*** lto-wrapper.c	(revision 205646)
--- lto-wrapper.c	(working copy)
*************** run_gcc (unsigned argc, char *argv[])
*** 745,751 ****
        tmp += list_option_len;
        strcpy (tmp, ltrans_output_file);
  
!       obstack_ptr_grow (&argv_obstack, "-fwpa");
      }
  
    /* Append the input objects and possible preceding arguments.  */
--- 746,761 ----
        tmp += list_option_len;
        strcpy (tmp, ltrans_output_file);
  
!       if (jobserver)
! 	obstack_ptr_grow (&argv_obstack, xstrdup ("-fwpa=jobserver"));
!       else if (parallel > 1)
! 	{
! 	  char buf[256];
! 	  sprintf (buf, "-fwpa=%i", parallel);
! 	  obstack_ptr_grow (&argv_obstack, xstrdup (buf));
! 	}
!       else
!         obstack_ptr_grow (&argv_obstack, "-fwpa");
      }
  
    /* Append the input objects and possible preceding arguments.  */
Index: lto/lto.c
===================================================================
*** lto/lto.c	(revision 205646)
--- lto/lto.c	(working copy)
*************** along with GCC; see the file COPYING3.
*** 53,58 ****
--- 53,61 ----
  /* Vector to keep track of external variables we've seen so far.  */
  vec<tree, va_gc> *lto_global_var_decls;
  
+ /* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
+ static int lto_parallelism;
+ 
  static GTY(()) tree first_personality_decl;
  
  /* Returns a hash code for P.  */
*************** cmp_partitions_order (const void *a, con
*** 2454,2459 ****
--- 2457,2554 ----
    return orderb - ordera;
  }
  
+ /* Actually stream out ENCODER into TEMP_FILENAME.  */
+ 
+ static void
+ do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
+ {
+   lto_file *file = lto_obj_file_open (temp_filename, true);
+   if (!file)
+     fatal_error ("lto_obj_file_open() failed");
+   lto_set_current_out_file (file);
+ 
+   ipa_write_optimization_summaries (encoder);
+ 
+   lto_set_current_out_file (NULL);
+   lto_obj_file_close (file);
+   free (file);
+ }
+ 
+ /* Wait for forked process and signal errors.  */
+ #ifdef HAVE_WORKING_FORK
+ static void
+ wait_for_child ()
+ {
+   int status;
+   do
+     {
+       int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
+       if (w == -1)
+ 	fatal_error ("waitpid failed");
+ 
+       if (WIFEXITED (status) && WEXITSTATUS (status))
+ 	fatal_error ("streaming subprocess failed");
+       else if (WIFSIGNALED (status))
+ 	fatal_error ("streaming subprocess was killed by signal");
+     }
+   while (!WIFEXITED(status) && !WIFSIGNALED(status));
+ }
+ #endif
+ 
+ /* Stream out ENCODER into TEMP_FILENAME
+    Fork if that seems to help.  */
+ 
+ static void
+ stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
+ {
+ #ifdef HAVE_WORKING_FORK
+   static int nruns;
+ 
+   if (!lto_parallelism || lto_parallelism == 1)
+     {
+       do_stream_out (temp_filename, encoder);
+       return;
+     }
+ 
+   /* Do not run more than LTO_PARALLELISM streamings
+      FIXME: we ignore limits on jobserver.  */
+   if (lto_parallelism > 0 && nruns >= lto_parallelism)
+     {
+       wait_for_child ();
+       nruns --;
+     }
+   /* If this is not the last parallel partition, execute new
+      streaming process.  */
+   if (!last)
+     {
+       pid_t cpid = fork ();
+ 
+       if (!cpid)
+ 	{
+ 	  setproctitle ("lto1-wpa-streaming");
+ 	  do_stream_out (temp_filename, encoder);
+ 	  exit (0);
+ 	}
+       /* Fork failed; let's do the job ourselves.  */
+       else if (cpid == -1)
+         do_stream_out (temp_filename, encoder);
+       else
+ 	nruns++;
+     }
+   /* Last partition; stream it and wait for all children to die.  */
+   else
+     {
+       int i;
+       do_stream_out (temp_filename, encoder);
+       for (i = 0; i < nruns; i++)
+ 	wait_for_child ();
+     }
+   asm_nodes_output = true;
+ #else
+   do_stream_out (temp_filename, encoder);
+ #endif
+ }
+ 
  /* Write all output files in WPA mode and the file with the list of
     LTRANS units.  */
  
*************** static void
*** 2461,2478 ****
  lto_wpa_write_files (void)
  {
    unsigned i, n_sets;
-   lto_file *file;
    ltrans_partition part;
    FILE *ltrans_output_list_stream;
    char *temp_filename;
    size_t blen;
  
    /* Open the LTRANS output list.  */
    if (!ltrans_output_list)
      fatal_error ("no LTRANS output list filename provided");
-   ltrans_output_list_stream = fopen (ltrans_output_list, "w");
-   if (ltrans_output_list_stream == NULL)
-     fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
  
    timevar_push (TV_WHOPR_WPA);
  
--- 2556,2570 ----
  lto_wpa_write_files (void)
  {
    unsigned i, n_sets;
    ltrans_partition part;
    FILE *ltrans_output_list_stream;
    char *temp_filename;
+   vec <char *>temp_filenames = vNULL;
    size_t blen;
  
    /* Open the LTRANS output list.  */
    if (!ltrans_output_list)
      fatal_error ("no LTRANS output list filename provided");
  
    timevar_push (TV_WHOPR_WPA);
  
*************** lto_wpa_write_files (void)
*** 2508,2521 ****
  			   : cmp_partitions_order);
    for (i = 0; i < n_sets; i++)
      {
-       size_t len;
        ltrans_partition part = ltrans_partitions[i];
  
        /* Write all the nodes in SET.  */
        sprintf (temp_filename + blen, "%u.o", i);
-       file = lto_obj_file_open (temp_filename, true);
-       if (!file)
- 	fatal_error ("lto_obj_file_open() failed");
  
        if (!quiet_flag)
  	fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
--- 2600,2609 ----
*************** lto_wpa_write_files (void)
*** 2557,2577 ****
  	}
        gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
  
!       lto_set_current_out_file (file);
! 
!       ipa_write_optimization_summaries (part->encoder);
  
-       lto_set_current_out_file (NULL);
-       lto_obj_file_close (file);
-       free (file);
        part->encoder = NULL;
  
!       len = strlen (temp_filename);
!       if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
  	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
  	fatal_error ("writing to LTRANS output list %s: %m",
  		     ltrans_output_list);
      }
  
    lto_stats.num_output_files += n_sets;
  
--- 2645,2669 ----
  	}
        gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
  
!       stream_out (temp_filename, part->encoder, i == n_sets - 1);
  
        part->encoder = NULL;
  
!       temp_filenames.safe_push (xstrdup (temp_filename));
!     }
!   ltrans_output_list_stream = fopen (ltrans_output_list, "w");
!   if (ltrans_output_list_stream == NULL)
!     fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
!   for (i = 0; i < n_sets; i++)
!     {
!       unsigned int len = strlen (temp_filenames[i]);
!       if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
  	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
  	fatal_error ("writing to LTRANS output list %s: %m",
  		     ltrans_output_list);
+      free (temp_filenames[i]);
      }
+   temp_filenames.release();
  
    lto_stats.num_output_files += n_sets;
  
*************** do_whole_program_analysis (void)
*** 3126,3131 ****
--- 3218,3235 ----
  {
    symtab_node *node;
  
+   lto_parallelism = 1;
+ 
+   /* TODO: jobserver communication is not supported yet.  */
+   if (!strcmp (flag_wpa, "jobserver"))
+     lto_parallelism = -1;
+   else
+     {
+       lto_parallelism = atoi (flag_wpa);
+       if (lto_parallelism <= 0)
+ 	lto_parallelism = 0;
+     }
+ 
    timevar_start (TV_PHASE_OPT_GEN);
  
    /* Note that since we are in WPA mode, materialize_cgraph will not
Index: lto/lang.opt
===================================================================
*** lto/lang.opt	(revision 205646)
--- lto/lang.opt	(working copy)
*************** LTO Joined Var(ltrans_output_list)
*** 33,41 ****
  Specify a file to which a list of files output by LTRANS is written.
  
  fwpa
! LTO Driver Report Var(flag_wpa)
  Run the link-time optimizer in whole program analysis (WPA) mode.
  
  fresolution=
  LTO Joined
  The resolution file
--- 33,45 ----
  Specify a file to which a list of files output by LTRANS is written.
  
  fwpa
! LTO Driver Report
  Run the link-time optimizer in whole program analysis (WPA) mode.
  
+ fwpa=
+ LTO Driver RejectNegative Joined Var(flag_wpa)
+ Whole program analysis (WPA) mode with number of parallel jobs specified.
+ 
  fresolution=
  LTO Joined
  The resolution file
Index: lto/lto-lang.c
===================================================================
*** lto/lto-lang.c	(revision 205646)
--- lto/lto-lang.c	(working copy)
*************** lto_handle_option (size_t scode, const c
*** 749,754 ****
--- 749,758 ----
        warn_psabi = value;
        break;
  
+     case OPT_fwpa:
+       flag_wpa = value ? "" : NULL;
+       break;
+ 
      default:
        break;
      }
*************** static bool
*** 1148,1154 ****
  lto_init (void)
  {
    /* We need to generate LTO if running in WPA mode.  */
!   flag_generate_lto = flag_wpa;
  
    /* Create the basic integer types.  */
    build_common_tree_nodes (flag_signed_char, /*short_double=*/false);
--- 1152,1158 ----
  lto_init (void)
  {
    /* We need to generate LTO if running in WPA mode.  */
!   flag_generate_lto = (flag_wpa != NULL);
  
    /* Create the basic integer types.  */
    build_common_tree_nodes (flag_signed_char, /*short_double=*/false);
Index: lto-streamer.h
===================================================================
*** lto-streamer.h	(revision 205646)
--- lto-streamer.h	(working copy)
*************** void lto_output_location (struct output_
*** 873,878 ****
--- 873,879 ----
  
  
  /* In lto-cgraph.c  */
+ extern bool asm_nodes_output;
  lto_symtab_encoder_t lto_symtab_encoder_new (bool);
  int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node *);
  void lto_symtab_encoder_delete (lto_symtab_encoder_t);
Richard Biener Dec. 6, 2013, 9:43 a.m. UTC | #5
On Fri, 6 Dec 2013, Jan Hubicka wrote:

> > On Thu, 21 Nov 2013, Jan Hubicka wrote:
> > 
> > > > 
> > > > Why do you need an additional -fparallelism?  Wouldn't
> > > > -fwpa=... be a better match, matching -flto=...?  As we already
> > > > pass down a -fwpa option to WPA this would make things easier, no?
> > > 
> > > My plan was to possibly use the same option later for parallelizing more parts of
> > > the compiler, not only WPA streaming. Streaming in may have some chance if we get
> > > into thread safety of GGC or move a sufficient amount of stuff out of GGC.  Also
> > > we can parallelize the inliner heuristic or IPA-PTA if it ever works. So it
> > > would make sense with -flto-partition=none and perhaps with local optimization,
> > > too.
> > 
> > I'd like to drop -flto-partition=none eventually.  It's just one more
> > path through the compiler to support ...
> > 
> > > But I can definitely update the patch to use -fwpa=N and we can deal with this
> > > once it becomes real (i.e. I have no clue how to parallelize the inliner without
> > > making its decisions depend on the parallelism and degrade as parallelism
> > > increases, nor do I have real plans for the stream-in procedure).
> > 
> > Please.
> > 
> 
> Hi,
> here is the updated patch. Sorry for the delay, I should have more time for hacking again
> now...

Ok.

Thanks,
Richard.

> Honza
> 
> 	* lto-cgraph.c (asm_nodes_output): Make global.
> 	* lto-wrapper.c (run_gcc): Pass down parallelism to WPA.
> 	* lto.c (lto_parallelism): New static var.
> 	(do_stream_out, wait_for_child, stream_out): New static functions.
> 	(lto_wpa_write_files): Add support for parallel streaming.
> 	(do_whole_program_analysis): Set parallelism.
> 	* lang.opt (fwpa): Add parameter.
> 	* lto-lang.c (lto_handle_option): Handle flag_wpa.
> 	(lto_init): Update use of flag_wpa.
> 	* lto-streamer.h (asm_nodes_output): Declare.
> Index: lto-cgraph.c
> ===================================================================
> *** lto-cgraph.c	(revision 205646)
> --- lto-cgraph.c	(working copy)
> *************** along with GCC; see the file COPYING3.
> *** 53,58 ****
> --- 53,61 ----
>   #include "pass_manager.h"
>   #include "ipa-utils.h"
>   
> + /* True when asm nodes have been output.  */
> + bool asm_nodes_output = false;
> + 
>   static void output_cgraph_opt_summary (void);
>   static void input_cgraph_opt_summary (vec<symtab_node *>  nodes);
>   
> *************** output_symtab (void)
> *** 889,895 ****
>     lto_symtab_encoder_iterator lsei;
>     int i, n_nodes;
>     lto_symtab_encoder_t encoder;
> -   static bool asm_nodes_output = false;
>   
>     if (flag_wpa)
>       output_cgraph_opt_summary ();
> --- 892,897 ----
> Index: lto-wrapper.c
> ===================================================================
> *** lto-wrapper.c	(revision 205646)
> --- lto-wrapper.c	(working copy)
> *************** run_gcc (unsigned argc, char *argv[])
> *** 745,751 ****
>         tmp += list_option_len;
>         strcpy (tmp, ltrans_output_file);
>   
> !       obstack_ptr_grow (&argv_obstack, "-fwpa");
>       }
>   
>     /* Append the input objects and possible preceding arguments.  */
> --- 746,761 ----
>         tmp += list_option_len;
>         strcpy (tmp, ltrans_output_file);
>   
> !       if (jobserver)
> ! 	obstack_ptr_grow (&argv_obstack, xstrdup ("-fwpa=jobserver"));
> !       else if (parallel > 1)
> ! 	{
> ! 	  char buf[256];
> ! 	  sprintf (buf, "-fwpa=%i", parallel);
> ! 	  obstack_ptr_grow (&argv_obstack, xstrdup (buf));
> ! 	}
> !       else
> !         obstack_ptr_grow (&argv_obstack, "-fwpa");
>       }
>   
>     /* Append the input objects and possible preceding arguments.  */
> Index: lto/lto.c
> ===================================================================
> *** lto/lto.c	(revision 205646)
> --- lto/lto.c	(working copy)
> *************** along with GCC; see the file COPYING3.
> *** 53,58 ****
> --- 53,61 ----
>   /* Vector to keep track of external variables we've seen so far.  */
>   vec<tree, va_gc> *lto_global_var_decls;
>   
> + /* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
> + static int lto_parallelism;
> + 
>   static GTY(()) tree first_personality_decl;
>   
>   /* Returns a hash code for P.  */
> *************** cmp_partitions_order (const void *a, con
> *** 2454,2459 ****
> --- 2457,2554 ----
>     return orderb - ordera;
>   }
>   
> + /* Actually stream out ENCODER into TEMP_FILENAME.  */
> + 
> + static void
> + do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
> + {
> +   lto_file *file = lto_obj_file_open (temp_filename, true);
> +   if (!file)
> +     fatal_error ("lto_obj_file_open() failed");
> +   lto_set_current_out_file (file);
> + 
> +   ipa_write_optimization_summaries (encoder);
> + 
> +   lto_set_current_out_file (NULL);
> +   lto_obj_file_close (file);
> +   free (file);
> + }
> + 
> + /* Wait for forked process and signal errors.  */
> + #ifdef HAVE_WORKING_FORK
> + static void
> + wait_for_child ()
> + {
> +   int status;
> +   do
> +     {
> +       int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
> +       if (w == -1)
> + 	fatal_error ("waitpid failed");
> + 
> +       if (WIFEXITED (status) && WEXITSTATUS (status))
> + 	fatal_error ("streaming subprocess failed");
> +       else if (WIFSIGNALED (status))
> + 	fatal_error ("streaming subprocess was killed by signal");
> +     }
> +   while (!WIFEXITED(status) && !WIFSIGNALED(status));
> + }
> + #endif
> + 
> + /* Stream out ENCODER into TEMP_FILENAME
> +    Fork if that seems to help.  */
> + 
> + static void
> + stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
> + {
> + #ifdef HAVE_WORKING_FORK
> +   static int nruns;
> + 
> +   if (!lto_parallelism || lto_parallelism == 1)
> +     {
> +       do_stream_out (temp_filename, encoder);
> +       return;
> +     }
> + 
> +   /* Do not run more than LTO_PARALLELISM streamings
> +      FIXME: we ignore limits on jobserver.  */
> +   if (lto_parallelism > 0 && nruns >= lto_parallelism)
> +     {
> +       wait_for_child ();
> +       nruns --;
> +     }
> +   /* If this is not the last parallel partition, execute new
> +      streaming process.  */
> +   if (!last)
> +     {
> +       pid_t cpid = fork ();
> + 
> +       if (!cpid)
> + 	{
> + 	  setproctitle ("lto1-wpa-streaming");
> + 	  do_stream_out (temp_filename, encoder);
> + 	  exit (0);
> + 	}
> +       /* Fork failed; let's do the job ourselves.  */
> +       else if (cpid == -1)
> +         do_stream_out (temp_filename, encoder);
> +       else
> + 	nruns++;
> +     }
> +   /* Last partition; stream it and wait for all children to die.  */
> +   else
> +     {
> +       int i;
> +       do_stream_out (temp_filename, encoder);
> +       for (i = 0; i < nruns; i++)
> + 	wait_for_child ();
> +     }
> +   asm_nodes_output = true;
> + #else
> +   do_stream_out (temp_filename, encoder);
> + #endif
> + }
> + 
>   /* Write all output files in WPA mode and the file with the list of
>      LTRANS units.  */
>   
> *************** static void
> *** 2461,2478 ****
>   lto_wpa_write_files (void)
>   {
>     unsigned i, n_sets;
> -   lto_file *file;
>     ltrans_partition part;
>     FILE *ltrans_output_list_stream;
>     char *temp_filename;
>     size_t blen;
>   
>     /* Open the LTRANS output list.  */
>     if (!ltrans_output_list)
>       fatal_error ("no LTRANS output list filename provided");
> -   ltrans_output_list_stream = fopen (ltrans_output_list, "w");
> -   if (ltrans_output_list_stream == NULL)
> -     fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
>   
>     timevar_push (TV_WHOPR_WPA);
>   
> --- 2556,2570 ----
>   lto_wpa_write_files (void)
>   {
>     unsigned i, n_sets;
>     ltrans_partition part;
>     FILE *ltrans_output_list_stream;
>     char *temp_filename;
> +   vec <char *>temp_filenames = vNULL;
>     size_t blen;
>   
>     /* Open the LTRANS output list.  */
>     if (!ltrans_output_list)
>       fatal_error ("no LTRANS output list filename provided");
>   
>     timevar_push (TV_WHOPR_WPA);
>   
> *************** lto_wpa_write_files (void)
> *** 2508,2521 ****
>   			   : cmp_partitions_order);
>     for (i = 0; i < n_sets; i++)
>       {
> -       size_t len;
>         ltrans_partition part = ltrans_partitions[i];
>   
>         /* Write all the nodes in SET.  */
>         sprintf (temp_filename + blen, "%u.o", i);
> -       file = lto_obj_file_open (temp_filename, true);
> -       if (!file)
> - 	fatal_error ("lto_obj_file_open() failed");
>   
>         if (!quiet_flag)
>   	fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
> --- 2600,2609 ----
> *************** lto_wpa_write_files (void)
> *** 2557,2577 ****
>   	}
>         gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
>   
> !       lto_set_current_out_file (file);
> ! 
> !       ipa_write_optimization_summaries (part->encoder);
>   
> -       lto_set_current_out_file (NULL);
> -       lto_obj_file_close (file);
> -       free (file);
>         part->encoder = NULL;
>   
> !       len = strlen (temp_filename);
> !       if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
>   	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
>   	fatal_error ("writing to LTRANS output list %s: %m",
>   		     ltrans_output_list);
>       }
>   
>     lto_stats.num_output_files += n_sets;
>   
> --- 2645,2669 ----
>   	}
>         gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
>   
> !       stream_out (temp_filename, part->encoder, i == n_sets - 1);
>   
>         part->encoder = NULL;
>   
> !       temp_filenames.safe_push (xstrdup (temp_filename));
> !     }
> !   ltrans_output_list_stream = fopen (ltrans_output_list, "w");
> !   if (ltrans_output_list_stream == NULL)
> !     fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
> !   for (i = 0; i < n_sets; i++)
> !     {
> !       unsigned int len = strlen (temp_filenames[i]);
> !       if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
>   	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
>   	fatal_error ("writing to LTRANS output list %s: %m",
>   		     ltrans_output_list);
> +      free (temp_filenames[i]);
>       }
> +   temp_filenames.release();
>   
>     lto_stats.num_output_files += n_sets;
>   
> *************** do_whole_program_analysis (void)
> *** 3126,3131 ****
> --- 3218,3235 ----
>   {
>     symtab_node *node;
>   
> +   lto_parallelism = 1;
> + 
> +   /* TODO: jobserver communication is not supported yet.  */
> +   if (!strcmp (flag_wpa, "jobserver"))
> +     lto_parallelism = -1;
> +   else
> +     {
> +       lto_parallelism = atoi (flag_wpa);
> +       if (lto_parallelism <= 0)
> + 	lto_parallelism = 0;
> +     }
> + 
>     timevar_start (TV_PHASE_OPT_GEN);
>   
>     /* Note that since we are in WPA mode, materialize_cgraph will not
> Index: lto/lang.opt
> ===================================================================
> *** lto/lang.opt	(revision 205646)
> --- lto/lang.opt	(working copy)
> *************** LTO Joined Var(ltrans_output_list)
> *** 33,41 ****
>   Specify a file to which a list of files output by LTRANS is written.
>   
>   fwpa
> ! LTO Driver Report Var(flag_wpa)
>   Run the link-time optimizer in whole program analysis (WPA) mode.
>   
>   fresolution=
>   LTO Joined
>   The resolution file
> --- 33,45 ----
>   Specify a file to which a list of files output by LTRANS is written.
>   
>   fwpa
> ! LTO Driver Report
>   Run the link-time optimizer in whole program analysis (WPA) mode.
>   
> + fwpa=
> + LTO Driver RejectNegative Joined Var(flag_wpa)
> + Whole program analysis (WPA) mode with number of parallel jobs specified.
> + 
>   fresolution=
>   LTO Joined
>   The resolution file
> Index: lto/lto-lang.c
> ===================================================================
> *** lto/lto-lang.c	(revision 205646)
> --- lto/lto-lang.c	(working copy)
> *************** lto_handle_option (size_t scode, const c
> *** 749,754 ****
> --- 749,758 ----
>         warn_psabi = value;
>         break;
>   
> +     case OPT_fwpa:
> +       flag_wpa = value ? "" : NULL;
> +       break;
> + 
>       default:
>         break;
>       }
> *************** static bool
> *** 1148,1154 ****
>   lto_init (void)
>   {
>     /* We need to generate LTO if running in WPA mode.  */
> !   flag_generate_lto = flag_wpa;
>   
>     /* Create the basic integer types.  */
>     build_common_tree_nodes (flag_signed_char, /*short_double=*/false);
> --- 1152,1158 ----
>   lto_init (void)
>   {
>     /* We need to generate LTO if running in WPA mode.  */
> !   flag_generate_lto = (flag_wpa != NULL);
>   
>     /* Create the basic integer types.  */
>     build_common_tree_nodes (flag_signed_char, /*short_double=*/false);
> Index: lto-streamer.h
> ===================================================================
> *** lto-streamer.h	(revision 205646)
> --- lto-streamer.h	(working copy)
> *************** void lto_output_location (struct output_
> *** 873,878 ****
> --- 873,879 ----
>   
>   
>   /* In lto-cgraph.c  */
> + extern bool asm_nodes_output;
>   lto_symtab_encoder_t lto_symtab_encoder_new (bool);
>   int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node *);
>   void lto_symtab_encoder_delete (lto_symtab_encoder_t);
> 
>
Markus Trippelsdorf Dec. 13, 2013, 12:37 p.m. UTC | #6
On 2013.12.06 at 10:43 +0100, Richard Biener wrote:
> On Fri, 6 Dec 2013, Jan Hubicka wrote:
> 
> > > On Thu, 21 Nov 2013, Jan Hubicka wrote:
> > > 
> > > > > 
> > > > > Why do you need an additional -fparallelism?  Wouldn't
> > > > > -fwpa=... be a better match, matching -flto=...?  As we already
> > > > > pass down a -fwpa option to WPA this would make things easier, no?
> > > > 
> > > > My plan was to possibly use same option later for parallelizing more parts of
> > > > compiler, not only WPA streaming. Streaming in may have some chance if we get
> > > > into thread safety of GGC or move sufficient amount of stuff out of GGC.  Also
> > > > we can parallelize inliner heuristic or IPA-PTA if it will ever work. So it
> > > > would make sense with -flto-partition=none and perhaps with local optimization,
> > > > too.
> > > 
> > > I'd like to drop -flto-partition=none eventually.  It's just one more
> > > path through the compiler to support ...
> > > 
> > > > But I can definitely update the patch to use -fwpa=N and we can deal with this
> > > > once this becomes real. (i.e. I have no clue how to parallelize inliner without
> > > > making its decisions dependent on the parallelizm and declining with parallelizm
> > > > increased nor I have real plans for stream in procedure)
> > > 
> > > Please.
> > > 
> > 
> > Hi,
> > here is updated patch. Sorry for taking time, I should have more time for hacking again
> > now...
> 
> Ok.

Honza, it looks like you forgot to commit the patch.
(I see nice speedups with it and it would be unfortunate if it fell
through the cracks.)
Jan Hubicka Dec. 13, 2013, 1:06 p.m. UTC | #7
> On 2013.12.06 at 10:43 +0100, Richard Biener wrote:
> > On Fri, 6 Dec 2013, Jan Hubicka wrote:
> > 
> > > > On Thu, 21 Nov 2013, Jan Hubicka wrote:
> > > > 
> > > > > > 
> > > > > > Why do you need an additional -fparallelism?  Wouldn't
> > > > > > -fwpa=... be a better match, matching -flto=...?  As we already
> > > > > > pass down a -fwpa option to WPA this would make things easier, no?
> > > > > 
> > > > > My plan was to possibly use same option later for parallelizing more parts of
> > > > > compiler, not only WPA streaming. Streaming in may have some chance if we get
> > > > > into thread safety of GGC or move sufficient amount of stuff out of GGC.  Also
> > > > > we can parallelize inliner heuristic or IPA-PTA if it will ever work. So it
> > > > > would make sense with -flto-partition=none and perhaps with local optimization,
> > > > > too.
> > > > 
> > > > I'd like to drop -flto-partition=none eventually.  It's just one more
> > > > path through the compiler to support ...
> > > > 
> > > > > But I can definitely update the patch to use -fwpa=N and we can deal with this
> > > > > once this becomes real. (i.e. I have no clue how to parallelize inliner without
> > > > > making its decisions dependent on the parallelizm and declining with parallelizm
> > > > > increased nor I have real plans for stream in procedure)
> > > > 
> > > > Please.
> > > > 
> > > 
> > > Hi,
> > > here is updated patch. Sorry for taking time, I should have more time for hacking again
> > > now...
> > 
> > Ok.
> 
> Honza, it looks like you forgot to commit the patch.
> (I see nice speedups with it and it would be unfortunate if it fell
> through the cracks.)

I plan to commit it shortly (I am just slowly working through the
accumulated bug reports and TODOs).
Indeed, for bigger apps and the edit/relink cycle it is a life saver ;)

Honza
H.J. Lu Feb. 20, 2014, 10:58 p.m. UTC | #8
On Thu, Dec 5, 2013 at 3:54 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Thu, 21 Nov 2013, Jan Hubicka wrote:
>>
>> > >
>> > > Why do you need an additional -fparallelism?  Wouldn't
>> > > -fwpa=... be a better match, matching -flto=...?  As we already
>> > > pass down a -fwpa option to WPA this would make things easier, no?
>> >
>> > My plan was to possibly use same option later for parallelizing more parts of
>> > compiler, not only WPA streaming. Streaming in may have some chance if we get
>> > into thread safety of GGC or move sufficient amount of stuff out of GGC.  Also
>> > we can parallelize inliner heuristic or IPA-PTA if it will ever work. So it
>> > would make sense with -flto-partition=none and perhaps with local optimization,
>> > too.
>>
>> I'd like to drop -flto-partition=none eventually.  It's just one more
>> path through the compiler to support ...
>>
>> > But I can definitely update the patch to use -fwpa=N and we can deal with this
>> > once this becomes real. (i.e. I have no clue how to parallelize inliner without
>> > making its decisions dependent on the parallelizm and declining with parallelizm
>> > increased nor I have real plans for stream in procedure)
>>
>> Please.
>>
>
> Hi,
> here is updated patch. Sorry for taking time, I should have more time for hacking again
> now...
>
> Honza
>
>         * lto-cgraph.c (asm_nodes_output): Make global.
>         * lto-wrapper.c (run_gcc): Pass down parallelism to WPA.
>         * lto.c (lto_parallelism): New static var.
>         (do_stream_out, wait_for_child, stream_out): New static functions.
>         (lto_wpa_write_files): Add support for parallel streaming.
>         (do_whole_program_analysis): Set parallelism.
>         * lang.opt (fwpa): Add parameter.
>         * lto-lang.c (lto_handle_option): Handle flag_wpa.
>         (lto_init): Update use of flag_wpa.
>         * lto-streamer.h (asm_nodes_output): Declare.

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60295


H.J.
Andi Kleen Feb. 21, 2014, 12:43 a.m. UTC | #9
> I plan to commit it shortly (I am just slowly working through the
> accumulated bug reports and TODOs).
> Indeed, for bigger apps and the edit/relink cycle it is a life saver ;)

I haven't tested exactly around this commit, but I see kernel LTO build
time improve by ~10s (~5%) going from 4.9-20140209 to 4.9-20140220.

Also major faults went down somewhat.

gcc49
gcc version 4.9.0 20140209 (experimental) (GCC) 
real=178.91 user=1829.87 system=125.76 share=1093%% maxrss=1231580 ins=37848 outs=7852280 mfaults=49741854 waits=151528
gcc49
gcc version 4.9.0 20140220 (experimental) (GCC) 
real=168.90 user=1824.20 system=127.04 share=1155%% maxrss=1231996 ins=448 outs=7808584 mfaults=49257032 waits=136755

-Andi
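
As a rough check of the ~5% figure, the relative change can be recomputed
from the two "real=" values reported above. This is only a sketch;
percent_reduction is a hypothetical helper introduced for illustration and
is not part of the patch or of any GCC tooling.

```c
#include <assert.h>

/* Hypothetical helper: relative reduction from BEFORE to AFTER, in
   percent.  Positive means AFTER is smaller than BEFORE.  */
static double
percent_reduction (double before, double after)
{
  assert (before > 0.0);
  return (before - after) / before * 100.0;
}
```

With the numbers above, percent_reduction (178.91, 168.90) comes out between
5.5 and 5.7, and the same helper applied to the mfaults counters
(49741854 and 49257032) gives a reduction of a little under 1%.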
Jan Hubicka Feb. 21, 2014, 1:58 a.m. UTC | #10
> > I plan to commit it shortly (I am just slowly working through the
> > accumulated bug reports and TODOs).
> > Indeed, for bigger apps and the edit/relink cycle it is a life saver ;)
> 
> > I haven't tested exactly around this commit, but I see kernel LTO build
> > time improve by ~10s (~5%) going from 4.9-20140209 to 4.9-20140220.

Good to know! It also improves the Firefox build noticeably.
I added some comments to the PR itself. I do not see how it can create too many
WPA processes, given that it does not fork without an explicit -flto=N argument,
which the bootstrap-lto.mk config does not pass. So perhaps something is
wrong with parsing command line arguments and setting lto_parallelism?

Also, for all builds I have tested so far, memory use is now dominated not by
WPA streaming but by the subsequent LTRANS runs.  Things are different with
-fprofile-generate, which adds a lot of extra data structures to stream. Generally
I am trying to convince people to profile without LTO, as it is much faster.
I will try to reproduce the problem, but I run LTO bootstraps and
profiled LTO bootstraps regularly and have never seen too many WPA processes at
once.

Honza
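
For reference while looking at the argument parsing question, here is a
minimal standalone sketch of how the updated patch (the
do_whole_program_analysis and lto_handle_option hunks quoted earlier) maps
the flag_wpa string to a job count. wpa_jobs_from_flag is a hypothetical
name used only here; the mapping itself follows the quoted hunks, under the
assumption that plain -fwpa stores an empty string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the patch.  FLAG plays the role of
   flag_wpa: "" for plain -fwpa, "N" or "jobserver" for -fwpa=...
   Returns -1 for jobserver mode, N for a positive count, 0 otherwise.  */
static int
wpa_jobs_from_flag (const char *flag)
{
  /* In WPA mode the patch guarantees a non-NULL string.  */
  assert (flag != NULL);

  if (!strcmp (flag, "jobserver"))
    return -1;

  /* atoi yields 0 for "" and for non-numeric text, so both end up
     in the serial (no fork) case.  */
  int n = atoi (flag);
  return n <= 0 ? 0 : n;
}
```

Since stream_out treats both 0 and 1 as "do not fork", plain -fwpa and
-fwpa=1 should behave the same in this sketch; only a value above 1 (or
"jobserver") enables forking.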

Patch

Index: lto-cgraph.c
===================================================================
--- lto-cgraph.c	(revision 201891)
+++ lto-cgraph.c	(working copy)
@@ -50,6 +50,9 @@  along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* True when asm nodes have been output.  */
+bool asm_nodes_output = false;
+
 static void output_cgraph_opt_summary (void);
 static void input_cgraph_opt_summary (vec<symtab_node>  nodes);
 
@@ -852,7 +855,6 @@  output_symtab (void)
   lto_symtab_encoder_iterator lsei;
   int i, n_nodes;
   lto_symtab_encoder_t encoder;
-  static bool asm_nodes_output = false;
 
   if (flag_wpa)
     output_cgraph_opt_summary ();
Index: lto-streamer.h
===================================================================
--- lto-streamer.h	(revision 201891)
+++ lto-streamer.h	(working copy)
@@ -870,6 +870,7 @@  void lto_output_location (struct output_
 
 
 /* In lto-cgraph.c  */
+extern bool asm_nodes_output;
 lto_symtab_encoder_t lto_symtab_encoder_new (bool);
 int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node);
 void lto_symtab_encoder_delete (lto_symtab_encoder_t);
Index: lto-wrapper.c
===================================================================
--- lto-wrapper.c	(revision 201891)
+++ lto-wrapper.c	(working copy)
@@ -56,6 +56,9 @@  along with GCC; see the file COPYING3.
 
 int debug;				/* true if -save-temps.  */
 int verbose;				/* true if -v.  */
+int parallel = 0;			/* number of parallel builds specified
+					   by -flto=N  */
+int jobserver = 0;			/* true if -flto=jobserver was used.  */
 
 enum lto_mode_d {
   LTO_MODE_NONE,			/* Not doing LTO.  */
@@ -445,8 +448,6 @@  run_gcc (unsigned argc, char *argv[])
   char *list_option_full = NULL;
   const char *linker_output = NULL;
   const char *collect_gcc, *collect_gcc_options;
-  int parallel = 0;
-  int jobserver = 0;
   bool no_partition = false;
   struct cl_decoded_option *fdecoded_options = NULL;
   unsigned int fdecoded_options_count = 0;
@@ -630,6 +631,16 @@  run_gcc (unsigned argc, char *argv[])
 	      if (parallel <= 1)
 		parallel = 0;
 	    }
+	  if (jobserver)
+	    {
+	      obstack_ptr_grow (&argv_obstack, xstrdup ("-fparallelism=jobserver"));
+	    }
+	  else if (parallel > 1)
+	    {
+	      char buf[256];
+	      sprintf (buf, "-fparallelism=%i", parallel);
+	      obstack_ptr_grow (&argv_obstack, xstrdup (buf));
+	    }
 	  /* Fallthru.  */
 
 	case OPT_flto:
Index: lto/lto.c
===================================================================
--- lto/lto.c	(revision 201891)
+++ lto/lto.c	(working copy)
@@ -49,6 +49,9 @@  along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
+int lto_parallelism;
+
 static GTY(()) tree first_personality_decl;
 
 /* Returns a hash code for P.  */
@@ -3002,6 +3005,98 @@  cmp_partitions_order (const void *a, con
   return orderb - ordera;
 }
 
+/* Actually stream out ENCODER into TEMP_FILENAME.  */
+
+void
+do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
+{
+  lto_file *file = lto_obj_file_open (temp_filename, true);
+  if (!file)
+    fatal_error ("lto_obj_file_open() failed");
+  lto_set_current_out_file (file);
+
+  ipa_write_optimization_summaries (encoder);
+
+  lto_set_current_out_file (NULL);
+  lto_obj_file_close (file);
+  free (file);
+}
+
+/* Wait for forked process and signal errors.  */
+#ifdef HAVE_WORKING_FORK
+void
+wait_for_child ()
+{
+  int status;
+  do
+    {
+      int w = waitpid (0, &status, WUNTRACED | WCONTINUED);
+      if (w == -1)
+	fatal_error ("waitpid failed");
+
+      if (WIFEXITED (status) && WEXITSTATUS (status))
+	fatal_error ("streaming subprocess failed");
+      else if (WIFSIGNALED (status))
+	fatal_error ("streaming subprocess was killed by signal");
+    }
+  while (!WIFEXITED (status) && !WIFSIGNALED (status));
+}
+#endif
+
+/* Stream out ENCODER into TEMP_FILENAME.
+   Fork if that seems to help.  */
+
+void
+stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
+{
+#ifdef HAVE_WORKING_FORK
+  static int nruns;
+
+  if (!lto_parallelism || lto_parallelism == 1)
+    {
+      do_stream_out (temp_filename, encoder);
+      return;
+    }
+
+  /* Do not run more than LTO_PARALLELISM streaming processes at once.
+     FIXME: we ignore limits on jobserver.  */
+  if (lto_parallelism > 0 && nruns >= lto_parallelism)
+    {
+      wait_for_child ();
+      nruns --;
+    }
+  /* If this is not the last parallel partition, execute new
+     streaming process.  */
+  if (!last)
+    {
+      pid_t cpid = fork ();
+
+      if (!cpid)
+	{
+	  setproctitle ("lto1-wpa-streaming");
+	  do_stream_out (temp_filename, encoder);
+	  exit (0);
+	}
+      /* Fork failed; let's do the job ourselves.  */
+      else if (cpid == -1)
+        do_stream_out (temp_filename, encoder);
+      else
+	nruns++;
+    }
+  /* Last partition; stream it and wait for all children to die.  */
+  else
+    {
+      int i;
+      do_stream_out (temp_filename, encoder);
+      for (i = 0; i < nruns; i++)
+	wait_for_child ();
+    }
+  asm_nodes_output = true;
+#else
+  do_stream_out (temp_filename, encoder);
+#endif
+}
+
 /* Write all output files in WPA mode and the file with the list of
    LTRANS units.  */
 
@@ -3009,18 +3104,15 @@  static void
 lto_wpa_write_files (void)
 {
   unsigned i, n_sets;
-  lto_file *file;
   ltrans_partition part;
   FILE *ltrans_output_list_stream;
   char *temp_filename;
+  vec <char *>temp_filenames = vNULL;
   size_t blen;
 
   /* Open the LTRANS output list.  */
   if (!ltrans_output_list)
     fatal_error ("no LTRANS output list filename provided");
-  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
-  if (ltrans_output_list_stream == NULL)
-    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
 
   timevar_push (TV_WHOPR_WPA);
 
@@ -3056,14 +3148,10 @@  lto_wpa_write_files (void)
 			   : cmp_partitions_order);
   for (i = 0; i < n_sets; i++)
     {
-      size_t len;
       ltrans_partition part = ltrans_partitions[i];
 
       /* Write all the nodes in SET.  */
       sprintf (temp_filename + blen, "%u.o", i);
-      file = lto_obj_file_open (temp_filename, true);
-      if (!file)
-	fatal_error ("lto_obj_file_open() failed");
 
       if (!quiet_flag)
 	fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
@@ -3105,21 +3193,25 @@  lto_wpa_write_files (void)
 	}
       gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
 
-      lto_set_current_out_file (file);
-
-      ipa_write_optimization_summaries (part->encoder);
+      stream_out (temp_filename, part->encoder, i == n_sets - 1);
 
-      lto_set_current_out_file (NULL);
-      lto_obj_file_close (file);
-      free (file);
       part->encoder = NULL;
 
-      len = strlen (temp_filename);
-      if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
+      temp_filenames.safe_push (xstrdup (temp_filename));
+    }
+  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
+  if (ltrans_output_list_stream == NULL)
+    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
+  for (i = 0; i < n_sets; i++)
+    {
+      unsigned int len = strlen (temp_filenames[i]);
+      if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
 	  || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
 	fatal_error ("writing to LTRANS output list %s: %m",
 		     ltrans_output_list);
+      free (temp_filenames[i]);
     }
+  temp_filenames.release ();
 
   lto_stats.num_output_files += n_sets;
 
Index: lto/lang.opt
===================================================================
--- lto/lang.opt	(revision 201891)
+++ lto/lang.opt	(working copy)
@@ -32,6 +32,10 @@  fltrans-output-list=
 LTO Joined Var(ltrans_output_list)
 Specify a file to which a list of files output by LTRANS is written.
 
+fparallelism=
+LTO Joined
+Specify the number of parallel tasks to run in whole program analysis (WPA) mode.
+
 fwpa
 LTO Driver Report Var(flag_wpa)
 Run the link-time optimizer in whole program analysis (WPA) mode.
Index: lto/lto.h
===================================================================
--- lto/lto.h	(revision 201891)
+++ lto/lto.h	(working copy)
@@ -39,6 +39,7 @@  extern const char *resolution_file_name;
 extern tree lto_eh_personality (void);
 extern void lto_main (void);
 extern void lto_read_all_file_options (void);
+extern int lto_parallelism;
 
 /* In lto-elf.c or lto-coff.c  */
 extern lto_file *lto_obj_file_open (const char *filename, bool writable);
Index: lto/lto-lang.c
===================================================================
--- lto/lto-lang.c	(revision 201891)
+++ lto/lto-lang.c	(working copy)
@@ -735,6 +735,19 @@  lto_handle_option (size_t scode, const c
       warn_psabi = value;
       break;
 
+    case OPT_fparallelism_:
+      if (!arg)
+	lto_parallelism = 1;
+      else if (!strcmp (arg, "jobserver"))
+	lto_parallelism = -1;
+      else
+	{
+	  lto_parallelism = atoi (arg);
+	  if (lto_parallelism <= 0)
+	    lto_parallelism = 0;
+	}
+      break;
+
     default:
       break;
     }
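
The fork-and-throttle scheme used by stream_out above can be tried outside
of GCC with a small POSIX sketch. This is not the patch's code: reap_one,
run_bounded and the two sample jobs are hypothetical names, the sketch waits
with waitpid (-1, ...) rather than waitpid (0, ...), and it forks for every
job instead of running the last one in the parent. It only illustrates the
idea of keeping at most a fixed number of children alive and falling back to
in-process work when fork fails.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap one child; return 0 if it exited with status 0, -1 otherwise.  */
static int
reap_one (void)
{
  int status;
  pid_t w;

  do
    w = waitpid (-1, &status, 0);
  while (w == -1 && errno == EINTR);

  if (w == -1)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}

/* Run JOB (i) for i in [0, N) in child processes, keeping at most
   MAX_JOBS children alive.  Returns the number of failed jobs.  */
static int
run_bounded (int n, int max_jobs, int (*job) (int))
{
  int live = 0, failed = 0;

  assert (max_jobs > 0);
  for (int i = 0; i < n; i++)
    {
      /* Throttle: wait for a slot before starting another child.  */
      if (live >= max_jobs)
	{
	  failed += reap_one () != 0;
	  live--;
	}

      pid_t pid = fork ();
      if (pid == 0)
	/* Child: _exit avoids flushing stdio buffers inherited
	   from the parent a second time.  */
	_exit (job (i) == 0 ? 0 : 1);
      else if (pid == -1)
	/* Fork failed; do the work in this process instead.  */
	failed += job (i) != 0;
      else
	live++;
    }

  /* Wait for the remaining children.  */
  while (live-- > 0)
    failed += reap_one () != 0;
  return failed;
}

/* Sample jobs for illustration only.  */
static int job_ok (int i) { (void) i; return 0; }
static int job_fail_odd (int i) { return i % 2; }
```

For example, run_bounded (8, 3, job_ok) runs eight jobs with at most three
children alive and reports no failures, while
run_bounded (6, 2, job_fail_odd) reports the three odd-numbered jobs as
failed. Unlike the patch, which calls fatal_error on the first failing
child, this sketch counts failures and keeps going.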