
Add parameter to limit LTO streaming parallelism

Message ID 20190411114905.ivebkz234l4bflhb@kam.mff.cuni.cz
State New
Series Add parameter to limit LTO streaming parallelism

Commit Message

Jan Hubicka April 11, 2019, 11:49 a.m. UTC
Hi,
the LTO streaming forks for every partition. With the number of
partitions increased to 128 and the relatively large memory usage (around
5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
probably copying the page tables.

This patch makes the streamer fork only lto_parallelism times
and stream num_partitions/lto_parallelism partitions in each worker.
I have also added a parameter because currently -flto=jobserver leads
to unlimited parallelism.  This should be fixed by connecting to Make's
jobserver and building our own mini jobserver to distribute partitions
between worker threads, but that seems a bit too involved for a last-minute
change in stage4.  I plan to work on this and hopefully backport it to the
.2 release.
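
Roughly, the new scheme looks like the following simplified, self-contained
sketch (not the actual lto.c code; stream_out_one_partition stands in for
the real per-partition streamer):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void stream_out_one_partition (int p); /* hypothetical streamer */

static void
stream_out_range (int min, int max)
{
  for (int p = min; p < max; p++)
    stream_out_one_partition (p);
}

static void
stream_out_all (int n_partitions, int parallelism)
{
  if (parallelism > n_partitions)
    parallelism = n_partitions;
  /* Round up so every partition is covered.  */
  int per_worker = (n_partitions + parallelism - 1) / parallelism;
  int nruns = 0;

  for (int w = 0; w < parallelism; w++)
    {
      int min = w * per_worker;
      int max = (w + 1) * per_worker;
      if (max > n_partitions)
        max = n_partitions;

      if (w == parallelism - 1)
        {
          /* The last chunk is streamed by the main process, which then
             waits for all forked workers.  */
          stream_out_range (min, max);
          for (int i = 0; i < nruns; i++)
            wait (NULL);
        }
      else
        {
          pid_t cpid = fork ();
          if (cpid == 0)
            {
              stream_out_range (min, max);
              _exit (0);
            }
          else if (cpid == -1)
            /* Fork failed; do the chunk ourselves.  */
            stream_out_range (min, max);
          else
            nruns++;
        }
    }
}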

I have tested the performance on my 32-CPU, 64-thread box and got the best
wall time with parallelism of 32, which I therefore made the default.  I get

--param max-lto-streaming-parallelism=1
Time variable                                   usr           sys          wall               GGC
 phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
 TOTAL                              : 170.73         33.69        204.64        7459610 kB

--param max-lto-streaming-parallelism=4
 phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
 TOTAL                              : 130.26         19.68        150.46        7458844 kB

--param max-lto-streaming-parallelism=8
 phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
 TOTAL                              : 125.28         18.09        143.54        7458773 kB

--param max-lto-streaming-parallelism=16
 phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
 TOTAL                              : 122.60         17.21        140.56        7458725 kB

--param max-lto-streaming-parallelism=32
 phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
 TOTAL                              : 118.53         18.36        137.08        7458705 kB

--param max-lto-streaming-parallelism=64
 phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
 TOTAL                              : 122.17         28.66        151.00        7458702 kB

--param max-lto-streaming-parallelism=256
 phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
 TOTAL                              : 116.78         22.56        139.53        7458702 kB

Note that it is a bit odd that 64 leads to worse results than full
parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
times for streaming are not representative since they do not account for the
sys time of the forked workers.  I am not sure where the fork time is
accounted.
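
Presumably the children's usr/sys time becomes visible to the parent only
after they have been waited for, via getrusage (RUSAGE_CHILDREN); a minimal,
purely illustrative way to check that by hand (not GCC code) would be:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>

/* Reap all forked workers, then print the CPU time the kernel has
   accumulated for children that have terminated and been waited for.  */
static void
report_child_times (void)
{
  struct rusage ru;

  while (wait (NULL) > 0)
    ;
  if (getrusage (RUSAGE_CHILDREN, &ru) == 0)
    printf ("children: usr %ld.%06lds  sys %ld.%06lds\n",
            (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
            (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
}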

Generally it seems that the forking performance is not at all that
bad and scales reasonably, but I still think we should limit the default to
something less than the 128 we use now.  There are definitely diminishing
returns after increasing beyond 16 or 32, and memory use goes up
noticeably.  With current trunk, memory use also does not seem terribly
bad (less global-stream streaming makes the workers cheaper), and in all
memory traces I collected it is dominated by the compilation stage during
the full rebuild.

I did similar tests for the cc1 binary.  There the relative time spent in
streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
respectively).

Bootstrapped/regtested x86_64-linux, OK?

	* params.def (PARAM_MAX_LTO_STREAMING_PARALLELISM): New parameter.
	* lto.c (do_stream_out): Rename to ...
	(stream_out): ... this one; move original code to ...
	(stream_out_partitions_1, stream_out_partitions): ... these new
	functions.
	(lto_wpa_write_files): Honor lto_parallelism.

Comments

Richard Biener April 11, 2019, 12:11 p.m. UTC | #1
On Thu, 11 Apr 2019, Jan Hubicka wrote:

> Hi,
> the LTO streaming forks for every partition. With the number of
> partitions increased to 128 and the relatively large memory usage (around
> 5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
> probably copying the page tables.
> 
> This patch makes the streamer fork only lto_parallelism times
> and stream num_partitions/lto_parallelism partitions in each worker.
> I have also added a parameter because currently -flto=jobserver leads
> to unlimited parallelism.  This should be fixed by connecting to Make's
> jobserver and building our own mini jobserver to distribute partitions
> between worker threads, but that seems a bit too involved for a last-minute
> change in stage4.  I plan to work on this and hopefully backport it to the
> .2 release.
> 
> I have tested the performance on my 32-CPU, 64-thread box and got the best
> wall time with parallelism of 32, which I therefore made the default.  I get
> 
> --param max-lto-streaming-parallelism=1
> Time variable                                   usr           sys          wall               GGC
>  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
>  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> 
> --param max-lto-streaming-parallelism=4
>  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
>  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> 
> --param max-lto-streaming-parallelism=8
>  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
>  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> 
> --param max-lto-streaming-parallelism=16
>  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
>  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> 
> --param max-lto-streaming-parallelism=32
>  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
>  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> 
> --param max-lto-streaming-parallelism=64
>  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
>  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> 
> --param max-lto-streaming-parallelism=256
>  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
>  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> 
> Note that it is a bit odd that 64 leads to worse results than full
> parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
> times for streaming are not representative since they do not account for the
> sys time of the forked workers.  I am not sure where the fork time is
> accounted.
> 
> Generally it seems that the forking performance is not at all that
> bad and scales reasonably, but I still think we should limit the default to
> something less than the 128 we use now.  There are definitely diminishing
> returns after increasing beyond 16 or 32, and memory use goes up
> noticeably.  With current trunk, memory use also does not seem terribly
> bad (less global-stream streaming makes the workers cheaper), and in all
> memory traces I collected it is dominated by the compilation stage during
> the full rebuild.
> 
> I did similar tests for the cc1 binary.  There the relative time spent in
> streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
> respectively).
> 
> Bootstrapped/regtested x86_64-linux, OK?

Please document the new param in invoke.texi.  Otherwise looks good
to me.  Btw, do we actually allocate garbage at write-out time?
Thus, would using threads work as well?

Thanks,
Richard.

> 	* params.def (PARAM_MAX_LTO_STREAMING_PARALLELISM): New parameter.
> 	* lto.c (do_stream_out): Rename to ...
> 	(stream_out): ... this one; move original code to ...
> 	(stream_out_partitions_1, stream_out_partitions): ... these new
> 	functions.
> 	(lto_wpa_write_files): Honor lto_parallelism.
> Index: params.def
> ===================================================================
> --- params.def	(revision 270143)
> +++ params.def	(working copy)
> @@ -1146,6 +1146,11 @@ DEFPARAM (MAX_PARTITION_SIZE,
>  	  "Maximal size of a partition for LTO (in estimated instructions).",
>  	  1000000, 0, INT_MAX)
>  
> +DEFPARAM (PARAM_MAX_LTO_STREAMING_PARALLELISM,
> +	  "max-lto-streaming-parallelism",
> +	  "maximal number of LTO partitions streamed in parallel.",
> +	  32, 1, 0)
> +
>  /* Diagnostic parameters.  */
>  
>  DEFPARAM (CXX_MAX_NAMESPACES_FOR_DIAGNOSTIC_HELP,
> Index: lto/lto.c
> ===================================================================
> --- lto/lto.c	(revision 270143)
> +++ lto/lto.c	(working copy)
> @@ -2304,7 +2304,7 @@ static lto_file *current_lto_file;
>  /* Actually stream out ENCODER into TEMP_FILENAME.  */
>  
>  static void
> -do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
> +stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
>  {
>    lto_file *file = lto_obj_file_open (temp_filename, true);
>    if (!file)
> @@ -2352,19 +2352,31 @@ wait_for_child ()
>  }
>  #endif
>  
> +static void
> +stream_out_partitions_1 (char *temp_filename, int blen, int min, int max)
> +{
> +   /* Write all the nodes in SET.  */
> +   for (int p = min; p < max; p ++)
> +     {
> +       sprintf (temp_filename + blen, "%u.o", p);
> +       stream_out (temp_filename, ltrans_partitions[p]->encoder, p);
> +       ltrans_partitions[p]->encoder = NULL;
> +     }
> +}
> +
>  /* Stream out ENCODER into TEMP_FILENAME
>     Fork if that seems to help.  */
>  
>  static void
> -stream_out (char *temp_filename, lto_symtab_encoder_t encoder,
> -	    bool ARG_UNUSED (last), int part)
> +stream_out_partitions (char *temp_filename, int blen, int min, int max,
> +		       bool ARG_UNUSED (last))
>  {
>  #ifdef HAVE_WORKING_FORK
>    static int nruns;
>  
>    if (lto_parallelism <= 1)
>      {
> -      do_stream_out (temp_filename, encoder, part);
> +      stream_out_partitions_1 (temp_filename, blen, min, max);
>        return;
>      }
>  
> @@ -2384,12 +2396,12 @@ stream_out (char *temp_filename, lto_sym
>        if (!cpid)
>  	{
>  	  setproctitle ("lto1-wpa-streaming");
> -	  do_stream_out (temp_filename, encoder, part);
> +          stream_out_partitions_1 (temp_filename, blen, min, max);
>  	  exit (0);
>  	}
>        /* Fork failed; lets do the job ourseleves.  */
>        else if (cpid == -1)
> -        do_stream_out (temp_filename, encoder, part);
> +        stream_out_partitions_1 (temp_filename, blen, min, max);
>        else
>  	nruns++;
>      }
> @@ -2397,13 +2409,13 @@ stream_out (char *temp_filename, lto_sym
>    else
>      {
>        int i;
> -      do_stream_out (temp_filename, encoder, part);
> +      stream_out_partitions_1 (temp_filename, blen, min, max);
>        for (i = 0; i < nruns; i++)
>  	wait_for_child ();
>      }
>    asm_nodes_output = true;
>  #else
> -  do_stream_out (temp_filename, encoder, part);
> +  stream_out_partitions_1 (temp_filename, blen, min, max);
>  #endif
>  }
>  
> @@ -2445,6 +2457,13 @@ lto_wpa_write_files (void)
>    blen = strlen (temp_filename);
>  
>    n_sets = ltrans_partitions.length ();
> +  unsigned sets_per_worker = n_sets;
> +  if (lto_parallelism > 1)
> +    {
> +      if (lto_parallelism > (int)n_sets)
> +	lto_parallelism = n_sets;
> +      sets_per_worker = (n_sets + lto_parallelism - 1) / lto_parallelism;
> +    }
>  
>    for (i = 0; i < n_sets; i++)
>      {
> @@ -2493,13 +2512,17 @@ lto_wpa_write_files (void)
>  	}
>        gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
>  
> -      stream_out (temp_filename, part->encoder, i == n_sets - 1, i);
> -
> -      part->encoder = NULL;
> -
>        temp_priority.safe_push (part->insns);
>        temp_filenames.safe_push (xstrdup (temp_filename));
>      }
> +
> +  for (int set = 0; set < MAX (lto_parallelism, 1); set++)
> +    {
> +      stream_out_partitions (temp_filename, blen, set * sets_per_worker,
> +			     MIN ((set + 1) * sets_per_worker, n_sets),
> +			     set == MAX (lto_parallelism, 1) - 1);
> +    }
> +
>    ltrans_output_list_stream = fopen (ltrans_output_list, "w");
>    if (ltrans_output_list_stream == NULL)
>      fatal_error (input_location,
> @@ -3113,14 +3136,16 @@ do_whole_program_analysis (void)
>  
>    lto_parallelism = 1;
>  
> -  /* TODO: jobserver communicatoin is not supported, yet.  */
> +  /* TODO: jobserver communication is not supported, yet.  */
>    if (!strcmp (flag_wpa, "jobserver"))
> -    lto_parallelism = -1;
> +    lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
>    else
>      {
>        lto_parallelism = atoi (flag_wpa);
>        if (lto_parallelism <= 0)
>  	lto_parallelism = 0;
> +      if (lto_parallelism >= PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM))
> +	lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
>      }
>  
>    timevar_start (TV_PHASE_OPT_GEN);
>
Jan Hubicka April 11, 2019, 12:24 p.m. UTC | #2
> On Thu, 11 Apr 2019, Jan Hubicka wrote:
> 
> > Hi,
> > the LTO streaming forks for every partition. With the number of
> > partitions increased to 128 and the relatively large memory usage (around
> > 5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
> > probably copying the page tables.
> > 
> > This patch makes the streamer fork only lto_parallelism times
> > and stream num_partitions/lto_parallelism partitions in each worker.
> > I have also added a parameter because currently -flto=jobserver leads
> > to unlimited parallelism.  This should be fixed by connecting to Make's
> > jobserver and building our own mini jobserver to distribute partitions
> > between worker threads, but that seems a bit too involved for a last-minute
> > change in stage4.  I plan to work on this and hopefully backport it to the
> > .2 release.
> > 
> > I have tested the performance on my 32-CPU, 64-thread box and got the best
> > wall time with parallelism of 32, which I therefore made the default.  I get
> > 
> > --param max-lto-streaming-parallelism=1
> > Time variable                                   usr           sys          wall               GGC
> >  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> >  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> > 
> > --param max-lto-streaming-parallelism=4
> >  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> >  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> > 
> > --param max-lto-streaming-parallelism=8
> >  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> >  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> > 
> > --param max-lto-streaming-parallelism=16
> >  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> >  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> > 
> > --param max-lto-streaming-parallelism=32
> >  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> >  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> > 
> > --param max-lto-streaming-parallelism=64
> >  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> >  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> > 
> > --param max-lto-streaming-parallelism=256
> >  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> >  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> > 
> > Note that it is a bit odd that 64 leads to worse results than full
> > parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
> > times for streaming are not representative since they do not account for the
> > sys time of the forked workers.  I am not sure where the fork time is
> > accounted.
> > 
> > Generally it seems that the forking performance is not at all that
> > bad and scales reasonably, but I still think we should limit the default to
> > something less than the 128 we use now.  There are definitely diminishing
> > returns after increasing beyond 16 or 32, and memory use goes up
> > noticeably.  With current trunk, memory use also does not seem terribly
> > bad (less global-stream streaming makes the workers cheaper), and in all
> > memory traces I collected it is dominated by the compilation stage during
> > the full rebuild.
> > 
> > I did similar tests for the cc1 binary.  There the relative time spent in
> > streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
> > respectively).
> > 
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> Please document the new param in invoke.texi.  Otherwise looks good
> to me.  Btw, do we actually allocate garbage at write-out time?
> Thus, would using threads work as well?

It is on my TODO list to get this working.  Last time I checked, by adding
an abort into ggc_alloc, there were some occurrences, but I think those can
be cleaned up.

I wonder how much of a performance hit we would get from enabling pthreads
for the lto1 binary and thus building libbackend with it?
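
For reference, a thread-based variant of the same range splitting would look
roughly like the sketch below, assuming the write-out path could be made free
of GC allocation and otherwise thread-safe; stream_out_range is the same
hypothetical per-range streamer as in the fork sketch above:

#include <pthread.h>

extern void stream_out_range (int min, int max); /* hypothetical */

struct chunk { int min, max; };

static void *
stream_chunk (void *arg)
{
  struct chunk *c = (struct chunk *) arg;
  stream_out_range (c->min, c->max);
  return NULL;
}

static void
stream_out_all_threaded (int n_partitions, int parallelism)
{
  if (parallelism > n_partitions)
    parallelism = n_partitions;
  int per_worker = (n_partitions + parallelism - 1) / parallelism;

  pthread_t tid[parallelism];
  struct chunk chunk[parallelism];

  /* Error handling omitted for brevity.  */
  for (int w = 0; w < parallelism; w++)
    {
      chunk[w].min = w * per_worker;
      chunk[w].max = (w + 1) * per_worker < n_partitions
                     ? (w + 1) * per_worker : n_partitions;
      pthread_create (&tid[w], NULL, stream_chunk, &chunk[w]);
    }
  for (int w = 0; w < parallelism; w++)
    pthread_join (tid[w], NULL);
}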

Honza
Richard Biener April 11, 2019, 12:28 p.m. UTC | #3
On Thu, 11 Apr 2019, Jan Hubicka wrote:

> > On Thu, 11 Apr 2019, Jan Hubicka wrote:
> > 
> > > Hi,
> > > the LTO streaming forks for every partition. With the number of
> > > partitions increased to 128 and the relatively large memory usage (around
> > > 5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
> > > probably copying the page tables.
> > > 
> > > This patch makes the streamer fork only lto_parallelism times
> > > and stream num_partitions/lto_parallelism partitions in each worker.
> > > I have also added a parameter because currently -flto=jobserver leads
> > > to unlimited parallelism.  This should be fixed by connecting to Make's
> > > jobserver and building our own mini jobserver to distribute partitions
> > > between worker threads, but that seems a bit too involved for a last-minute
> > > change in stage4.  I plan to work on this and hopefully backport it to the
> > > .2 release.
> > > 
> > > I have tested the performance on my 32-CPU, 64-thread box and got the best
> > > wall time with parallelism of 32, which I therefore made the default.  I get
> > > 
> > > --param max-lto-streaming-parallelism=1
> > > Time variable                                   usr           sys          wall               GGC
> > >  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> > >  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> > > 
> > > --param max-lto-streaming-parallelism=4
> > >  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> > >  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> > > 
> > > --param max-lto-streaming-parallelism=8
> > >  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> > >  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> > > 
> > > --param max-lto-streaming-parallelism=16
> > >  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> > >  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> > > 
> > > --param max-lto-streaming-parallelism=32
> > >  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> > >  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> > > 
> > > --param max-lto-streaming-parallelism=64
> > >  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> > >  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> > > 
> > > --param max-lto-streaming-parallelism=256
> > >  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> > >  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> > > 
> > > Note that it is a bit odd that 64 leads to worse results than full
> > > parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
> > > times for streaming are not representative since they do not account for the
> > > sys time of the forked workers.  I am not sure where the fork time is
> > > accounted.
> > > 
> > > Generally it seems that the forking performance is not at all that
> > > bad and scales reasonably, but I still think we should limit the default to
> > > something less than the 128 we use now.  There are definitely diminishing
> > > returns after increasing beyond 16 or 32, and memory use goes up
> > > noticeably.  With current trunk, memory use also does not seem terribly
> > > bad (less global-stream streaming makes the workers cheaper), and in all
> > > memory traces I collected it is dominated by the compilation stage during
> > > the full rebuild.
> > > 
> > > I did similar tests for the cc1 binary.  There the relative time spent in
> > > streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
> > > respectively).
> > > 
> > > Bootstrapped/regtested x86_64-linux, OK?
> > 
> > Please document the new param in invoke.texi.  Otherwise looks good
> > to me.  Btw, do we actually allocate garbage at write-out time?
> > Thus, would using threads work as well?
> 
> It is on my TODO list to get this working.  Last time I checked, by adding
> an abort into ggc_alloc, there were some occurrences, but I think those can
> be cleaned up.
> 
> I wonder how much of a performance hit we would get from enabling pthreads
> for the lto1 binary and thus building libbackend with it?

Is there any performance impact before the first thread creation?
(besides possibly a few well-predicted if (threads_are_running) checks?)

Richard.

Patch

Index: params.def
===================================================================
--- params.def	(revision 270143)
+++ params.def	(working copy)
@@ -1146,6 +1146,11 @@  DEFPARAM (MAX_PARTITION_SIZE,
 	  "Maximal size of a partition for LTO (in estimated instructions).",
 	  1000000, 0, INT_MAX)
 
+DEFPARAM (PARAM_MAX_LTO_STREAMING_PARALLELISM,
+	  "max-lto-streaming-parallelism",
+	  "maximal number of LTO partitions streamed in parallel.",
+	  32, 1, 0)
+
 /* Diagnostic parameters.  */
 
 DEFPARAM (CXX_MAX_NAMESPACES_FOR_DIAGNOSTIC_HELP,
Index: lto/lto.c
===================================================================
--- lto/lto.c	(revision 270143)
+++ lto/lto.c	(working copy)
@@ -2304,7 +2304,7 @@  static lto_file *current_lto_file;
 /* Actually stream out ENCODER into TEMP_FILENAME.  */
 
 static void
-do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
+stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
 {
   lto_file *file = lto_obj_file_open (temp_filename, true);
   if (!file)
@@ -2352,19 +2352,31 @@  wait_for_child ()
 }
 #endif
 
+static void
+stream_out_partitions_1 (char *temp_filename, int blen, int min, int max)
+{
+   /* Write all the nodes in SET.  */
+   for (int p = min; p < max; p ++)
+     {
+       sprintf (temp_filename + blen, "%u.o", p);
+       stream_out (temp_filename, ltrans_partitions[p]->encoder, p);
+       ltrans_partitions[p]->encoder = NULL;
+     }
+}
+
 /* Stream out ENCODER into TEMP_FILENAME
    Fork if that seems to help.  */
 
 static void
-stream_out (char *temp_filename, lto_symtab_encoder_t encoder,
-	    bool ARG_UNUSED (last), int part)
+stream_out_partitions (char *temp_filename, int blen, int min, int max,
+		       bool ARG_UNUSED (last))
 {
 #ifdef HAVE_WORKING_FORK
   static int nruns;
 
   if (lto_parallelism <= 1)
     {
-      do_stream_out (temp_filename, encoder, part);
+      stream_out_partitions_1 (temp_filename, blen, min, max);
       return;
     }
 
@@ -2384,12 +2396,12 @@  stream_out (char *temp_filename, lto_sym
       if (!cpid)
 	{
 	  setproctitle ("lto1-wpa-streaming");
-	  do_stream_out (temp_filename, encoder, part);
+          stream_out_partitions_1 (temp_filename, blen, min, max);
 	  exit (0);
 	}
       /* Fork failed; lets do the job ourseleves.  */
       else if (cpid == -1)
-        do_stream_out (temp_filename, encoder, part);
+        stream_out_partitions_1 (temp_filename, blen, min, max);
       else
 	nruns++;
     }
@@ -2397,13 +2409,13 @@  stream_out (char *temp_filename, lto_sym
   else
     {
       int i;
-      do_stream_out (temp_filename, encoder, part);
+      stream_out_partitions_1 (temp_filename, blen, min, max);
       for (i = 0; i < nruns; i++)
 	wait_for_child ();
     }
   asm_nodes_output = true;
 #else
-  do_stream_out (temp_filename, encoder, part);
+  stream_out_partitions_1 (temp_filename, blen, min, max);
 #endif
 }
 
@@ -2445,6 +2457,13 @@  lto_wpa_write_files (void)
   blen = strlen (temp_filename);
 
   n_sets = ltrans_partitions.length ();
+  unsigned sets_per_worker = n_sets;
+  if (lto_parallelism > 1)
+    {
+      if (lto_parallelism > (int)n_sets)
+	lto_parallelism = n_sets;
+      sets_per_worker = (n_sets + lto_parallelism - 1) / lto_parallelism;
+    }
 
   for (i = 0; i < n_sets; i++)
     {
@@ -2493,13 +2512,17 @@  lto_wpa_write_files (void)
 	}
       gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
 
-      stream_out (temp_filename, part->encoder, i == n_sets - 1, i);
-
-      part->encoder = NULL;
-
       temp_priority.safe_push (part->insns);
       temp_filenames.safe_push (xstrdup (temp_filename));
     }
+
+  for (int set = 0; set < MAX (lto_parallelism, 1); set++)
+    {
+      stream_out_partitions (temp_filename, blen, set * sets_per_worker,
+			     MIN ((set + 1) * sets_per_worker, n_sets),
+			     set == MAX (lto_parallelism, 1) - 1);
+    }
+
   ltrans_output_list_stream = fopen (ltrans_output_list, "w");
   if (ltrans_output_list_stream == NULL)
     fatal_error (input_location,
@@ -3113,14 +3136,16 @@  do_whole_program_analysis (void)
 
   lto_parallelism = 1;
 
-  /* TODO: jobserver communicatoin is not supported, yet.  */
+  /* TODO: jobserver communication is not supported, yet.  */
   if (!strcmp (flag_wpa, "jobserver"))
-    lto_parallelism = -1;
+    lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
   else
     {
       lto_parallelism = atoi (flag_wpa);
       if (lto_parallelism <= 0)
 	lto_parallelism = 0;
+      if (lto_parallelism >= PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM))
+	lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
     }
 
   timevar_start (TV_PHASE_OPT_GEN);
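
For reference, the partition splitting in lto_wpa_write_files above works out
as follows: with n_sets = 128 and the default
--param max-lto-streaming-parallelism=32, sets_per_worker is
(128 + 32 - 1) / 32 = 4, so worker w streams partitions [4*w, 4*w + 4).
With n_sets = 127 the formula still gives 4, and the
MIN ((set + 1) * sets_per_worker, n_sets) clamp leaves the last worker with
only three partitions (or, for smaller n_sets, with an empty range).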