
Add parameter to limit LTO streaming parallelism

Message ID 20190411114905.ivebkz234l4bflhb@kam.mff.cuni.cz
State New
Series Add parameter to limit LTO streaming parallelism

Commit Message

Jan Hubicka April 11, 2019, 11:49 a.m. UTC
Hi,
the LTO streaming forks for every partition. With the number of
partitions increased to 128 and the relatively large memory usage (around
5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
probably copying the page tables.

This patch makes the streamer fork only lto_parallelism times
and stream num_partitions/lto_parallelism partitions in each worker.
I have also added a parameter because currently -flto=jobserver leads
to unlimited parallelism.  This should be fixed by connecting to Make's
jobserver and building our own mini jobserver to distribute partitions
between worker threads, but that seems a bit too involved for a last-minute
change in stage4.  I plan to work on this and hopefully backport it to the
.2 release.
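
Roughly, the new scheme looks like the following simplified, self-contained
sketch (not the actual lto.c code; stream_out_one_partition stands in for
the real per-partition streamer):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void stream_out_one_partition (int p); /* hypothetical streamer */

static void
stream_out_range (int min, int max)
{
  for (int p = min; p < max; p++)
    stream_out_one_partition (p);
}

static void
stream_out_all (int n_partitions, int parallelism)
{
  if (parallelism > n_partitions)
    parallelism = n_partitions;
  /* Round up so every partition is covered.  */
  int per_worker = (n_partitions + parallelism - 1) / parallelism;
  int nruns = 0;

  for (int w = 0; w < parallelism; w++)
    {
      int min = w * per_worker;
      int max = (w + 1) * per_worker;
      if (max > n_partitions)
        max = n_partitions;

      if (w == parallelism - 1)
        {
          /* The last chunk is streamed by the main process, which then
             waits for all forked workers.  */
          stream_out_range (min, max);
          for (int i = 0; i < nruns; i++)
            wait (NULL);
        }
      else
        {
          pid_t cpid = fork ();
          if (cpid == 0)
            {
              stream_out_range (min, max);
              _exit (0);
            }
          else if (cpid == -1)
            /* Fork failed; do the chunk ourselves.  */
            stream_out_range (min, max);
          else
            nruns++;
        }
    }
}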

I have tested the performance on my 32-CPU, 64-thread box and got the best
wall time with parallelism of 32, which I therefore made the default.  I get

--param max-lto-streaming-parallelism=1
Time variable                                   usr           sys          wall               GGC
 phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
 TOTAL                              : 170.73         33.69        204.64        7459610 kB

--param max-lto-streaming-parallelism=4
 phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
 TOTAL                              : 130.26         19.68        150.46        7458844 kB

--param max-lto-streaming-parallelism=8
 phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
 TOTAL                              : 125.28         18.09        143.54        7458773 kB

--param max-lto-streaming-parallelism=16
 phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
 TOTAL                              : 122.60         17.21        140.56        7458725 kB

--param max-lto-streaming-parallelism=32
 phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
 TOTAL                              : 118.53         18.36        137.08        7458705 kB

--param max-lto-streaming-parallelism=64
 phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
 TOTAL                              : 122.17         28.66        151.00        7458702 kB

--param max-lto-streaming-parallelism=256
 phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
 TOTAL                              : 116.78         22.56        139.53        7458702 kB

Note that it is a bit odd that 64 leads to worse results than full
parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
times for streaming are not representative since they do not account for the
sys time of the forked workers.  I am not sure where the fork time is
accounted.
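
Presumably the children's usr/sys time becomes visible to the parent only
after they have been waited for, via getrusage (RUSAGE_CHILDREN); a minimal,
purely illustrative way to check that by hand (not GCC code) would be:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>

/* Reap all forked workers, then print the CPU time the kernel has
   accumulated for children that have terminated and been waited for.  */
static void
report_child_times (void)
{
  struct rusage ru;

  while (wait (NULL) > 0)
    ;
  if (getrusage (RUSAGE_CHILDREN, &ru) == 0)
    printf ("children: usr %ld.%06lds  sys %ld.%06lds\n",
            (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
            (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
}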

Generally it seems that the forking performance is not at all that
bad and scales reasonably, but I still think we should limit the default to
something less than the 128 we use now.  There are definitely diminishing
returns after increasing beyond 16 or 32, and memory use goes up
noticeably.  With current trunk, memory use also does not seem terribly
bad (less global-stream streaming makes the workers cheaper), and in all
memory traces I collected it is dominated by the compilation stage during
the full rebuild.

I did similar tests for the cc1 binary.  There the relative time spent in
streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
respectively).

Bootstrapped/regtested x86_64-linux, OK?

	* params.def (PARAM_MAX_LTO_STREAMING_PARALLELISM): New parameter.
	* lto.c (do_stream_out): Rename to ...
	(stream_out): ... this one; move original code to ...
	(stream_out_partitions_1, stream_out_partitions): ... these new
	functions.
	(lto_wpa_write_files): Honor lto_parallelism.

Comments

Richard Biener April 11, 2019, 12:11 p.m. UTC | #1
On Thu, 11 Apr 2019, Jan Hubicka wrote:

> Hi,
> the LTO streaming forks for every partition. With the number of
> partitions increased to 128 and the relatively large memory usage (around
> 5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
> probably copying the page tables.
> 
> This patch makes the streamer fork only lto_parallelism times
> and stream num_partitions/lto_parallelism partitions in each worker.
> I have also added a parameter because currently -flto=jobserver leads
> to unlimited parallelism.  This should be fixed by connecting to Make's
> jobserver and building our own mini jobserver to distribute partitions
> between worker threads, but that seems a bit too involved for a last-minute
> change in stage4.  I plan to work on this and hopefully backport it to the
> .2 release.
> 
> I have tested the performance on my 32-CPU, 64-thread box and got the best
> wall time with parallelism of 32, which I therefore made the default.  I get
> 
> --param max-lto-streaming-parallelism=1
> Time variable                                   usr           sys          wall               GGC
>  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
>  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> 
> --param max-lto-streaming-parallelism=4
>  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
>  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> 
> --param max-lto-streaming-parallelism=8
>  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
>  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> 
> --param max-lto-streaming-parallelism=16
>  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
>  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> 
> --param max-lto-streaming-parallelism=32
>  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
>  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> 
> --param max-lto-streaming-parallelism=64
>  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
>  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> 
> --param max-lto-streaming-parallelism=256
>  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
>  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> 
> Note that it is a bit odd that 64 leads to worse results than full
> parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
> times for streaming are not representative since they do not account for the
> sys time of the forked workers.  I am not sure where the fork time is
> accounted.
> 
> Generally it seems that the forking performance is not at all that
> bad and scales reasonably, but I still think we should limit the default to
> something less than the 128 we use now.  There are definitely diminishing
> returns after increasing beyond 16 or 32, and memory use goes up
> noticeably.  With current trunk, memory use also does not seem terribly
> bad (less global-stream streaming makes the workers cheaper), and in all
> memory traces I collected it is dominated by the compilation stage during
> the full rebuild.
> 
> I did similar tests for the cc1 binary.  There the relative time spent in
> streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
> respectively).
> 
> Bootstrapped/regtested x86_64-linux, OK?

Please document the new param in invoke.texi.  Otherwise looks good
to me.  Btw, do we actually allocate garbage at write-out time?
Thus, would using threads work as well?

Thanks,
Richard.

> 	* params.def (PARAM_MAX_LTO_STREAMING_PARALLELISM): New parameter.
> 	* lto.c (do_stream_out): Rename to ...
> 	(stream_out): ... this one; move original code to ...
> 	(stream_out_partitions_1, stream_out_partitions): ... these new
> 	functions.
> 	(lto_wpa_write_files): Honor lto_parallelism.
> Index: params.def
> ===================================================================
> --- params.def	(revision 270143)
> +++ params.def	(working copy)
> @@ -1146,6 +1146,11 @@ DEFPARAM (MAX_PARTITION_SIZE,
>  	  "Maximal size of a partition for LTO (in estimated instructions).",
>  	  1000000, 0, INT_MAX)
>  
> +DEFPARAM (PARAM_MAX_LTO_STREAMING_PARALLELISM,
> +	  "max-lto-streaming-parallelism",
> +	  "maximal number of LTO partitions streamed in parallel.",
> +	  32, 1, 0)
> +
>  /* Diagnostic parameters.  */
>  
>  DEFPARAM (CXX_MAX_NAMESPACES_FOR_DIAGNOSTIC_HELP,
> Index: lto/lto.c
> ===================================================================
> --- lto/lto.c	(revision 270143)
> +++ lto/lto.c	(working copy)
> @@ -2304,7 +2304,7 @@ static lto_file *current_lto_file;
>  /* Actually stream out ENCODER into TEMP_FILENAME.  */
>  
>  static void
> -do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
> +stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
>  {
>    lto_file *file = lto_obj_file_open (temp_filename, true);
>    if (!file)
> @@ -2352,19 +2352,31 @@ wait_for_child ()
>  }
>  #endif
>  
> +static void
> +stream_out_partitions_1 (char *temp_filename, int blen, int min, int max)
> +{
> +   /* Write all the nodes in SET.  */
> +   for (int p = min; p < max; p ++)
> +     {
> +       sprintf (temp_filename + blen, "%u.o", p);
> +       stream_out (temp_filename, ltrans_partitions[p]->encoder, p);
> +       ltrans_partitions[p]->encoder = NULL;
> +     }
> +}
> +
>  /* Stream out ENCODER into TEMP_FILENAME
>     Fork if that seems to help.  */
>  
>  static void
> -stream_out (char *temp_filename, lto_symtab_encoder_t encoder,
> -	    bool ARG_UNUSED (last), int part)
> +stream_out_partitions (char *temp_filename, int blen, int min, int max,
> +		       bool ARG_UNUSED (last))
>  {
>  #ifdef HAVE_WORKING_FORK
>    static int nruns;
>  
>    if (lto_parallelism <= 1)
>      {
> -      do_stream_out (temp_filename, encoder, part);
> +      stream_out_partitions_1 (temp_filename, blen, min, max);
>        return;
>      }
>  
> @@ -2384,12 +2396,12 @@ stream_out (char *temp_filename, lto_sym
>        if (!cpid)
>  	{
>  	  setproctitle ("lto1-wpa-streaming");
> -	  do_stream_out (temp_filename, encoder, part);
> +          stream_out_partitions_1 (temp_filename, blen, min, max);
>  	  exit (0);
>  	}
>        /* Fork failed; lets do the job ourseleves.  */
>        else if (cpid == -1)
> -        do_stream_out (temp_filename, encoder, part);
> +        stream_out_partitions_1 (temp_filename, blen, min, max);
>        else
>  	nruns++;
>      }
> @@ -2397,13 +2409,13 @@ stream_out (char *temp_filename, lto_sym
>    else
>      {
>        int i;
> -      do_stream_out (temp_filename, encoder, part);
> +      stream_out_partitions_1 (temp_filename, blen, min, max);
>        for (i = 0; i < nruns; i++)
>  	wait_for_child ();
>      }
>    asm_nodes_output = true;
>  #else
> -  do_stream_out (temp_filename, encoder, part);
> +  stream_out_partitions_1 (temp_filename, blen, min, max);
>  #endif
>  }
>  
> @@ -2445,6 +2457,13 @@ lto_wpa_write_files (void)
>    blen = strlen (temp_filename);
>  
>    n_sets = ltrans_partitions.length ();
> +  unsigned sets_per_worker = n_sets;
> +  if (lto_parallelism > 1)
> +    {
> +      if (lto_parallelism > (int)n_sets)
> +	lto_parallelism = n_sets;
> +      sets_per_worker = (n_sets + lto_parallelism - 1) / lto_parallelism;
> +    }
>  
>    for (i = 0; i < n_sets; i++)
>      {
> @@ -2493,13 +2512,17 @@ lto_wpa_write_files (void)
>  	}
>        gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
>  
> -      stream_out (temp_filename, part->encoder, i == n_sets - 1, i);
> -
> -      part->encoder = NULL;
> -
>        temp_priority.safe_push (part->insns);
>        temp_filenames.safe_push (xstrdup (temp_filename));
>      }
> +
> +  for (int set = 0; set < MAX (lto_parallelism, 1); set++)
> +    {
> +      stream_out_partitions (temp_filename, blen, set * sets_per_worker,
> +			     MIN ((set + 1) * sets_per_worker, n_sets),
> +			     set == MAX (lto_parallelism, 1) - 1);
> +    }
> +
>    ltrans_output_list_stream = fopen (ltrans_output_list, "w");
>    if (ltrans_output_list_stream == NULL)
>      fatal_error (input_location,
> @@ -3113,14 +3136,16 @@ do_whole_program_analysis (void)
>  
>    lto_parallelism = 1;
>  
> -  /* TODO: jobserver communicatoin is not supported, yet.  */
> +  /* TODO: jobserver communication is not supported, yet.  */
>    if (!strcmp (flag_wpa, "jobserver"))
> -    lto_parallelism = -1;
> +    lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
>    else
>      {
>        lto_parallelism = atoi (flag_wpa);
>        if (lto_parallelism <= 0)
>  	lto_parallelism = 0;
> +      if (lto_parallelism >= PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM))
> +	lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
>      }
>  
>    timevar_start (TV_PHASE_OPT_GEN);
>
Jan Hubicka April 11, 2019, 12:24 p.m. UTC | #2
> On Thu, 11 Apr 2019, Jan Hubicka wrote:
> 
> > Hi,
> > the LTO streaming forks for every partition. With the number of
> > partitions increased to 128 and the relatively large memory usage (around
> > 5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
> > probably copying the page tables.
> > 
> > This patch makes the streamer fork only lto_parallelism times
> > and stream num_partitions/lto_parallelism partitions in each worker.
> > I have also added a parameter because currently -flto=jobserver leads
> > to unlimited parallelism.  This should be fixed by connecting to Make's
> > jobserver and building our own mini jobserver to distribute partitions
> > between worker threads, but that seems a bit too involved for a last-minute
> > change in stage4.  I plan to work on this and hopefully backport it to the
> > .2 release.
> > 
> > I have tested the performance on my 32-CPU, 64-thread box and got the best
> > wall time with parallelism of 32, which I therefore made the default.  I get
> > 
> > --param max-lto-streaming-parallelism=1
> > Time variable                                   usr           sys          wall               GGC
> >  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> >  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> > 
> > --param max-lto-streaming-parallelism=4
> >  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> >  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> > 
> > --param max-lto-streaming-parallelism=8
> >  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> >  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> > 
> > --param max-lto-streaming-parallelism=16
> >  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> >  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> > 
> > --param max-lto-streaming-parallelism=32
> >  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> >  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> > 
> > --param max-lto-streaming-parallelism=64
> >  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> >  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> > 
> > --param max-lto-streaming-parallelism=256
> >  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> >  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> > 
> > Note that it is a bit odd that 64 leads to worse results than full
> > parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
> > times for streaming are not representative since they do not account for the
> > sys time of the forked workers.  I am not sure where the fork time is
> > accounted.
> > 
> > Generally it seems that the forking performance is not at all that
> > bad and scales reasonably, but I still think we should limit the default to
> > something less than the 128 we use now.  There are definitely diminishing
> > returns after increasing beyond 16 or 32, and memory use goes up
> > noticeably.  With current trunk, memory use also does not seem terribly
> > bad (less global-stream streaming makes the workers cheaper), and in all
> > memory traces I collected it is dominated by the compilation stage during
> > the full rebuild.
> > 
> > I did similar tests for the cc1 binary.  There the relative time spent in
> > streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
> > respectively).
> > 
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> Please document the new param in invoke.texi.  Otherwise looks good
> to me.  Btw, do we actually allocate garbage at write-out time?
> Thus, would using threads work as well?

It is on my TODO list to get this working.  Last time I checked, by adding
an abort into ggc_alloc, there were some occurrences, but I think those can
be cleaned up.

I wonder how much of a performance hit we would get from enabling pthreads
for the lto1 binary and thus building libbackend with it?
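
For reference, a thread-based variant of the same range splitting would look
roughly like the sketch below, assuming the write-out path could be made free
of GC allocation and otherwise thread-safe; stream_out_range is the same
hypothetical per-range streamer as in the fork sketch above:

#include <pthread.h>

extern void stream_out_range (int min, int max); /* hypothetical */

struct chunk { int min, max; };

static void *
stream_chunk (void *arg)
{
  struct chunk *c = (struct chunk *) arg;
  stream_out_range (c->min, c->max);
  return NULL;
}

static void
stream_out_all_threaded (int n_partitions, int parallelism)
{
  if (parallelism > n_partitions)
    parallelism = n_partitions;
  int per_worker = (n_partitions + parallelism - 1) / parallelism;

  pthread_t tid[parallelism];
  struct chunk chunk[parallelism];

  /* Error handling omitted for brevity.  */
  for (int w = 0; w < parallelism; w++)
    {
      chunk[w].min = w * per_worker;
      chunk[w].max = (w + 1) * per_worker < n_partitions
                     ? (w + 1) * per_worker : n_partitions;
      pthread_create (&tid[w], NULL, stream_chunk, &chunk[w]);
    }
  for (int w = 0; w < parallelism; w++)
    pthread_join (tid[w], NULL);
}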

Honza
Richard Biener April 11, 2019, 12:28 p.m. UTC | #3
On Thu, 11 Apr 2019, Jan Hubicka wrote:

> > On Thu, 11 Apr 2019, Jan Hubicka wrote:
> > 
> > > Hi,
> > > the LTO streaming forks for every partition. With the number of
> > > partitions increased to 128 and the relatively large memory usage (around
> > > 5GB) needed to WPA firefox, this causes the kernel to spend a lot of time,
> > > probably copying the page tables.
> > > 
> > > This patch makes the streamer fork only lto_parallelism times
> > > and stream num_partitions/lto_parallelism partitions in each worker.
> > > I have also added a parameter because currently -flto=jobserver leads
> > > to unlimited parallelism.  This should be fixed by connecting to Make's
> > > jobserver and building our own mini jobserver to distribute partitions
> > > between worker threads, but that seems a bit too involved for a last-minute
> > > change in stage4.  I plan to work on this and hopefully backport it to the
> > > .2 release.
> > > 
> > > I have tested the performance on my 32-CPU, 64-thread box and got the best
> > > wall time with parallelism of 32, which I therefore made the default.  I get
> > > 
> > > --param max-lto-streaming-parallelism=1
> > > Time variable                                   usr           sys          wall               GGC
> > >  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> > >  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> > > 
> > > --param max-lto-streaming-parallelism=4
> > >  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> > >  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> > > 
> > > --param max-lto-streaming-parallelism=8
> > >  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> > >  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> > > 
> > > --param max-lto-streaming-parallelism=16
> > >  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> > >  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> > > 
> > > --param max-lto-streaming-parallelism=32
> > >  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> > >  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> > > 
> > > --param max-lto-streaming-parallelism=64
> > >  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> > >  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> > > 
> > > --param max-lto-streaming-parallelism=256
> > >  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> > >  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> > > 
> > > Note that it is a bit odd that 64 leads to worse results than full
> > > parallelism, but this seems to reproduce relatively well.  Also, the usr/sys
> > > times for streaming are not representative since they do not account for the
> > > sys time of the forked workers.  I am not sure where the fork time is
> > > accounted.
> > > 
> > > Generally it seems that the forking performance is not at all that
> > > bad and scales reasonably, but I still think we should limit the default to
> > > something less than the 128 we use now.  There are definitely diminishing
> > > returns after increasing beyond 16 or 32, and memory use goes up
> > > noticeably.  With current trunk, memory use also does not seem terribly
> > > bad (less global-stream streaming makes the workers cheaper), and in all
> > > memory traces I collected it is dominated by the compilation stage during
> > > the full rebuild.
> > > 
> > > I did similar tests for the cc1 binary.  There the relative time spent in
> > > streaming is lower, so it goes from 17% to 1% (for parallelism 1 and 32
> > > respectively).
> > > 
> > > Bootstrapped/regtested x86_64-linux, OK?
> > 
> > Please document the new param in invoke.texi.  Otherwise looks good
> > to me.  Btw, do we actually allocate garbage at write-out time?
> > Thus, would using threads work as well?
> 
> It is on my TODO list to get this working.  Last time I checked, by adding
> an abort into ggc_alloc, there were some occurrences, but I think those can
> be cleaned up.
> 
> I wonder how much of a performance hit we would get from enabling pthreads
> for the lto1 binary and thus building libbackend with it?

Is there any performance impact before the first thread creation?
(besides possibly a few well-predicted if (threads_are_running) checks?)

Richard.

Patch

Index: params.def
===================================================================
--- params.def	(revision 270143)
+++ params.def	(working copy)
@@ -1146,6 +1146,11 @@  DEFPARAM (MAX_PARTITION_SIZE,
 	  "Maximal size of a partition for LTO (in estimated instructions).",
 	  1000000, 0, INT_MAX)
 
+DEFPARAM (PARAM_MAX_LTO_STREAMING_PARALLELISM,
+	  "max-lto-streaming-parallelism",
+	  "maximal number of LTO partitions streamed in parallel.",
+	  32, 1, 0)
+
 /* Diagnostic parameters.  */
 
 DEFPARAM (CXX_MAX_NAMESPACES_FOR_DIAGNOSTIC_HELP,
Index: lto/lto.c
===================================================================
--- lto/lto.c	(revision 270143)
+++ lto/lto.c	(working copy)
@@ -2304,7 +2304,7 @@  static lto_file *current_lto_file;
 /* Actually stream out ENCODER into TEMP_FILENAME.  */
 
 static void
-do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
+stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
 {
   lto_file *file = lto_obj_file_open (temp_filename, true);
   if (!file)
@@ -2352,19 +2352,31 @@  wait_for_child ()
 }
 #endif
 
+static void
+stream_out_partitions_1 (char *temp_filename, int blen, int min, int max)
+{
+   /* Write all the nodes in SET.  */
+   for (int p = min; p < max; p ++)
+     {
+       sprintf (temp_filename + blen, "%u.o", p);
+       stream_out (temp_filename, ltrans_partitions[p]->encoder, p);
+       ltrans_partitions[p]->encoder = NULL;
+     }
+}
+
 /* Stream out ENCODER into TEMP_FILENAME
    Fork if that seems to help.  */
 
 static void
-stream_out (char *temp_filename, lto_symtab_encoder_t encoder,
-	    bool ARG_UNUSED (last), int part)
+stream_out_partitions (char *temp_filename, int blen, int min, int max,
+		       bool ARG_UNUSED (last))
 {
 #ifdef HAVE_WORKING_FORK
   static int nruns;
 
   if (lto_parallelism <= 1)
     {
-      do_stream_out (temp_filename, encoder, part);
+      stream_out_partitions_1 (temp_filename, blen, min, max);
       return;
     }
 
@@ -2384,12 +2396,12 @@  stream_out (char *temp_filename, lto_sym
       if (!cpid)
 	{
 	  setproctitle ("lto1-wpa-streaming");
-	  do_stream_out (temp_filename, encoder, part);
+          stream_out_partitions_1 (temp_filename, blen, min, max);
 	  exit (0);
 	}
       /* Fork failed; lets do the job ourseleves.  */
       else if (cpid == -1)
-        do_stream_out (temp_filename, encoder, part);
+        stream_out_partitions_1 (temp_filename, blen, min, max);
       else
 	nruns++;
     }
@@ -2397,13 +2409,13 @@  stream_out (char *temp_filename, lto_sym
   else
     {
       int i;
-      do_stream_out (temp_filename, encoder, part);
+      stream_out_partitions_1 (temp_filename, blen, min, max);
       for (i = 0; i < nruns; i++)
 	wait_for_child ();
     }
   asm_nodes_output = true;
 #else
-  do_stream_out (temp_filename, encoder, part);
+  stream_out_partitions_1 (temp_filename, blen, min, max);
 #endif
 }
 
@@ -2445,6 +2457,13 @@  lto_wpa_write_files (void)
   blen = strlen (temp_filename);
 
   n_sets = ltrans_partitions.length ();
+  unsigned sets_per_worker = n_sets;
+  if (lto_parallelism > 1)
+    {
+      if (lto_parallelism > (int)n_sets)
+	lto_parallelism = n_sets;
+      sets_per_worker = (n_sets + lto_parallelism - 1) / lto_parallelism;
+    }
 
   for (i = 0; i < n_sets; i++)
     {
@@ -2493,13 +2512,17 @@  lto_wpa_write_files (void)
 	}
       gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
 
-      stream_out (temp_filename, part->encoder, i == n_sets - 1, i);
-
-      part->encoder = NULL;
-
       temp_priority.safe_push (part->insns);
       temp_filenames.safe_push (xstrdup (temp_filename));
     }
+
+  for (int set = 0; set < MAX (lto_parallelism, 1); set++)
+    {
+      stream_out_partitions (temp_filename, blen, set * sets_per_worker,
+			     MIN ((set + 1) * sets_per_worker, n_sets),
+			     set == MAX (lto_parallelism, 1) - 1);
+    }
+
   ltrans_output_list_stream = fopen (ltrans_output_list, "w");
   if (ltrans_output_list_stream == NULL)
     fatal_error (input_location,
@@ -3113,14 +3136,16 @@  do_whole_program_analysis (void)
 
   lto_parallelism = 1;
 
-  /* TODO: jobserver communicatoin is not supported, yet.  */
+  /* TODO: jobserver communication is not supported, yet.  */
   if (!strcmp (flag_wpa, "jobserver"))
-    lto_parallelism = -1;
+    lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
   else
     {
       lto_parallelism = atoi (flag_wpa);
       if (lto_parallelism <= 0)
 	lto_parallelism = 0;
+      if (lto_parallelism >= PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM))
+	lto_parallelism = PARAM_VALUE (PARAM_MAX_LTO_STREAMING_PARALLELISM);
     }
 
   timevar_start (TV_PHASE_OPT_GEN);
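
For reference, the partition splitting in lto_wpa_write_files above works out
as follows: with n_sets = 128 and the default
--param max-lto-streaming-parallelism=32, sets_per_worker is
(128 + 32 - 1) / 32 = 4, so worker w streams partitions [4*w, 4*w + 4).
With n_sets = 127 the formula still gives 4, and the
MIN ((set + 1) * sets_per_worker, n_sets) clamp leaves the last worker with
only three partitions (or, for smaller n_sets, with an empty range).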