From: Jan Hubicka
Date: Wed, 21 Aug 2013 16:17:48 +0200
To: gcc-patches@gcc.gnu.org, ak@linux.intel.com, rguenther@suse.de, dnovillo@google.com, dmalcolm@redhat.com
Subject: [RFC] Old school parallelization of WPA streaming
Message-ID: <20130821141747.GD24782@kam.mff.cuni.cz>
X-Patchwork-Submitter: Jan Hubicka
X-Patchwork-Id: 268816

Hi,
this is my attempt to bring GCC into the wonderful era of multicore CPUs :)
It is a hack, but it seems to help quite a lot. About 50% of WPA time is
spent streaming out the individual ltrans .o files. This is easy to
parallelize with fork: we do nothing afterwards, we just exit and pass the
list to the linker. So until we are thread safe, perhaps this may be a
solution? (On unixish systems it could probably be the solution forever.)

I added logic that parses -flto=24 and runs the number of streaming
processes the user asked for. For -flto=jobserver I simply fork all 32
processes. It may not be a disaster, but perhaps we should figure out how to
communicate with the jobserver. At first glance at the documentation on how
it works, it seems easy to add. Perhaps we can even convince the GNU Make
folks to put simple helpers into libiberty?

We could also figure out the number of CPUs (is it available, e.g., from
libgomp?) and use it by default even if the user does not bother to pass the
number of processes.

Naturally these streaming forks should be cheap memory-wise. I hope Martin
will get me some actual numbers.

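To make the forking scheme concrete, here is a minimal standalone sketch. It is not the patch: write_partition, reap_one and write_all_parallel are hypothetical names, and the "partition" is just a one-line text file. Unlike the patch, which streams the last partition in the parent, this version forks for every file and reaps the children at the end.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for streaming one partition: writes a tiny file.
   Returns 0 on success, -1 on failure.  */
static int
write_partition (const char *path, int index)
{
  FILE *f = fopen (path, "w");
  if (!f)
    return -1;
  if (fprintf (f, "partition %d\n", index) < 0)
    {
      fclose (f);
      return -1;
    }
  return fclose (f) == 0 ? 0 : -1;
}

/* Reap one child.  Returns 0 if it exited with status 0, -1 otherwise.  */
static int
reap_one (void)
{
  int status;
  pid_t pid;

  do
    pid = waitpid (-1, &status, 0);
  while (pid == -1 && errno == EINTR);
  if (pid == -1)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}

/* Write N files named PREFIX<i>.o, keeping at most MAX_JOBS children alive.
   With MAX_JOBS <= 1 everything runs serially in the calling process.
   Returns 0 if every file was written, -1 otherwise.  */
int
write_all_parallel (const char *prefix, int n, int max_jobs)
{
  int running = 0, failed = 0;

  assert (prefix != NULL && n >= 0);
  for (int i = 0; i < n; i++)
    {
      char path[4096];
      snprintf (path, sizeof path, "%s%d.o", prefix, i);

      if (max_jobs <= 1)
        {
          failed |= write_partition (path, i) != 0;
          continue;
        }
      /* Throttle: wait for one child before starting another.  */
      if (running >= max_jobs)
        {
          failed |= reap_one () != 0;
          running--;
        }
      /* Flush stdio so the child does not inherit pending output.  */
      fflush (NULL);
      pid_t pid = fork ();
      if (pid == 0)
        _exit (write_partition (path, i) == 0 ? 0 : 1);
      else if (pid == -1)
        /* Fork failed: do the work in this process instead.  */
        failed |= write_partition (path, i) != 0;
      else
        running++;
    }
  /* Wait for the remaining children.  */
  while (running > 0)
    {
      failed |= reap_one () != 0;
      running--;
    }
  return failed ? -1 : 0;
}
```

Calling write_all_parallel ("/tmp/out_", 32, 8) would write /tmp/out_0.o through /tmp/out_31.o with at most eight writers at a time. The children leave through _exit, so they never run the parent's atexit handlers or flush its stdio buffers a second time.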
With the patch the WPA time of firefox goes down to 2 minutes (4.8 needs
about 30 minutes, and without the hack one needs about 5 minutes).

Before:

Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1398 kB ( 0%) ggc
 phase opt and generate  :  39.73 (17%) usr   0.49 ( 3%) sys  40.26 (16%) wall  347726 kB ( 5%) ggc
 phase stream in         :  82.43 (35%) usr   2.15 (14%) sys  84.62 (34%) wall 5970152 kB (94%) ggc
 phase stream out        : 114.05 (48%) usr  12.86 (83%) sys 127.26 (50%) wall    6868 kB ( 0%) ggc
 garbage collection      :   3.07 ( 1%) usr   0.00 ( 0%) sys   3.08 ( 1%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.34 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall      30 kB ( 0%) ggc
 ipa dead code removal   :   4.91 ( 2%) usr   0.11 ( 1%) sys   5.16 ( 2%) wall     113 kB ( 0%) ggc
 ipa inheritance graph   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall     927 kB ( 0%) ggc
 ipa virtual call target :   5.11 ( 2%) usr   0.05 ( 0%) sys   4.99 ( 2%) wall   55296 kB ( 1%) ggc
 ipa cp                  :   2.65 ( 1%) usr   0.17 ( 1%) sys   2.80 ( 1%) wall  188629 kB ( 3%) ggc
 ipa inlining heuristics :  18.49 ( 8%) usr   0.29 ( 2%) sys  18.79 ( 7%) wall  439981 kB ( 7%) ggc
 ipa lto gimple in       :   0.12 ( 0%) usr   0.01 ( 0%) sys   0.15 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  16.66 ( 7%) usr   1.26 ( 8%) sys  17.97 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  68.70 (29%) usr   1.50 (10%) sys  70.23 (28%) wall 5181795 kB (82%) ggc
 ipa lto decl out        :  93.09 (39%) usr   4.93 (32%) sys  98.07 (39%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.65 ( 1%) usr   0.27 ( 2%) sys   1.92 ( 1%) wall  428974 kB ( 7%) ggc
 ipa lto decl merge      :   3.66 ( 2%) usr   0.00 ( 0%) sys   3.65 ( 1%) wall    8288 kB ( 0%) ggc
 ipa lto cgraph merge    :   3.42 ( 1%) usr   0.00 ( 0%) sys   3.42 ( 1%) wall   13725 kB ( 0%) ggc
 whopr wpa               :   3.58 ( 2%) usr   0.02 ( 0%) sys   3.59 ( 1%) wall    6871 kB ( 0%) ggc
 whopr wpa I/O           :   0.99 ( 0%) usr   6.65 (43%) sys   7.92 ( 3%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   2.63 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall       0 kB ( 0%) ggc
 ipa reference           :   3.08 ( 1%) usr   0.08 ( 1%) sys   3.18 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.43 ( 0%) usr   0.05 ( 0%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.00 ( 1%) usr   0.06 ( 0%) sys   3.07 ( 1%) wall       0 kB ( 0%) ggc
 varconst                :   0.03 ( 0%) usr   0.04 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                   : 236.22            15.50           252.15            6326146 kB

After:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  35.49 (28%) usr   0.44 ( 6%) sys  35.95 (26%) wall  313971 kB ( 5%) ggc
 phase stream in         :  82.98 (64%) usr   2.10 (30%) sys  85.13 (61%) wall 5969191 kB (95%) ggc
 phase stream out        :  10.37 ( 8%) usr   4.49 (64%) sys  17.33 (13%) wall    5813 kB ( 0%) ggc
 garbage collection      :   3.00 ( 2%) usr   0.00 ( 0%) sys   2.99 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall      30 kB ( 0%) ggc
 ipa dead code removal   :   4.91 ( 4%) usr   0.10 ( 1%) sys   5.04 ( 4%) wall     114 kB ( 0%) ggc
 ipa inheritance graph   :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall     792 kB ( 0%) ggc
 ipa virtual call target :   2.14 ( 2%) usr   0.01 ( 0%) sys   2.15 ( 2%) wall   21661 kB ( 0%) ggc
 ipa cp                  :   2.34 ( 2%) usr   0.18 ( 3%) sys   2.52 ( 2%) wall  188629 kB ( 3%) ggc
 ipa inlining heuristics :  18.43 (14%) usr   0.26 ( 4%) sys  18.68 (13%) wall  439993 kB ( 7%) ggc
 ipa lto gimple in       :   0.05 ( 0%) usr   0.05 ( 1%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.44 ( 0%) usr   0.06 ( 1%) sys   0.50 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :  69.27 (54%) usr   1.52 (22%) sys  70.87 (51%) wall 5180837 kB (82%) ggc
 ipa lto decl out        :   7.77 ( 6%) usr   0.51 ( 7%) sys   8.28 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.71 ( 1%) usr   0.19 ( 3%) sys   1.90 ( 1%) wall  428974 kB ( 7%) ggc
 ipa lto decl merge      :   3.66 ( 3%) usr   0.00 ( 0%) sys   3.67 ( 3%) wall    8288 kB ( 0%) ggc
 ipa lto cgraph merge    :   3.40 ( 3%) usr   0.00 ( 0%) sys   3.39 ( 2%) wall   13725 kB ( 0%) ggc
 whopr wpa               :   3.19 ( 2%) usr   0.00 ( 0%) sys   3.19 ( 2%) wall    5816 kB ( 0%) ggc
 whopr wpa I/O           :   0.00 ( 0%) usr   3.92 (56%) sys   6.39 ( 5%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   1.44 ( 1%) usr   0.02 ( 0%) sys   1.45 ( 1%) wall       0 kB ( 0%) ggc
 ipa reference           :   2.64 ( 2%) usr   0.08 ( 1%) sys   2.74 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.46 ( 0%) usr   0.02 ( 0%) sys   0.47 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.02 ( 2%) usr   0.05 ( 1%) sys   3.08 ( 2%) wall       0 kB ( 0%) ggc
 varconst                :   0.06 ( 0%) usr   0.06 ( 1%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                   : 128.85             7.05           138.45            6290376 kB

real    7m42.126s
user    53m13.816s
sys     2m16.993s

It almost seems like we are at the level where WPA does something useful.
IPA inlining definitely has room for improvement. To significantly speed up
LTO decl in, I think we need to implement Richard's original idea of
comparing the SCCs in their pickled form and materializing only those that
are unique.

Does this look reasonable?

Honza

        * lto-cgraph.c (asm_nodes_output): Make global.
        * lto-streamer.h (asm_nodes_output): Declare.
        * lto-wrapper.c (parallel, jobserver): Make global.
        (run_gcc): Pass down -fparallelism.
        * lto.c (lto_parallelism): New variable.
        (do_stream_out): New function.
        (wait_for_child): New function.
        (stream_out): New function.
        (lto_wpa_write_files): Use it.
        * lang.opt (fparallelism): New.
        * lto.h (lto_parallelism): Declare.
        * lto-lang.c (lto_handle_option): Add fparallelism.

Index: lto-cgraph.c
===================================================================
--- lto-cgraph.c (revision 201891)
+++ lto-cgraph.c (working copy)
@@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* True when asm nodes has been output.  */
+bool asm_nodes_output = false;
+
 static void output_cgraph_opt_summary (void);
 static void input_cgraph_opt_summary (vec nodes);
 
@@ -852,7 +855,6 @@ output_symtab (void)
   lto_symtab_encoder_iterator lsei;
   int i, n_nodes;
   lto_symtab_encoder_t encoder;
-  static bool asm_nodes_output = false;
 
   if (flag_wpa)
     output_cgraph_opt_summary ();
Index: lto-streamer.h
===================================================================
--- lto-streamer.h (revision 201891)
+++ lto-streamer.h (working copy)
@@ -870,6 +870,7 @@ void lto_output_location (struct output_
 
 
 /* In lto-cgraph.c  */
+extern bool asm_nodes_output;
 lto_symtab_encoder_t lto_symtab_encoder_new (bool);
 int lto_symtab_encoder_encode (lto_symtab_encoder_t, symtab_node);
 void lto_symtab_encoder_delete (lto_symtab_encoder_t);
Index: lto-wrapper.c
===================================================================
--- lto-wrapper.c (revision 201891)
+++ lto-wrapper.c (working copy)
@@ -56,6 +56,9 @@ along with GCC; see the file COPYING3.
 
 int debug;                              /* true if -save-temps.  */
 int verbose;                            /* true if -v.  */
+int parallel = 0;                       /* number of parallel builds specified
+                                           by -flto=N  */
+int jobserver = 0;                      /* true if -flto=jobserver was used.  */
 
 enum lto_mode_d {
   LTO_MODE_NONE,                        /* Not doing LTO.  */
@@ -445,8 +448,6 @@ run_gcc (unsigned argc, char *argv[])
   char *list_option_full = NULL;
   const char *linker_output = NULL;
   const char *collect_gcc, *collect_gcc_options;
-  int parallel = 0;
-  int jobserver = 0;
   bool no_partition = false;
   struct cl_decoded_option *fdecoded_options = NULL;
   unsigned int fdecoded_options_count = 0;
@@ -630,6 +631,16 @@ run_gcc (unsigned argc, char *argv[])
               if (parallel <= 1)
                 parallel = 0;
             }
+          if (jobserver)
+            {
+              obstack_ptr_grow (&argv_obstack, xstrdup ("-fparallelism=jobserver"));
+            }
+          else if (parallel > 1)
+            {
+              char buf[256];
+              sprintf (buf, "-fparallelism=%i", parallel);
+              obstack_ptr_grow (&argv_obstack, xstrdup (buf));
+            }
           /* Fallthru.  */
 
         case OPT_flto:
Index: lto/lto.c
===================================================================
--- lto/lto.c (revision 201891)
+++ lto/lto.c (working copy)
@@ -49,6 +49,9 @@ along with GCC; see the file COPYING3.
 #include "context.h"
 #include "pass_manager.h"
 
+/* Number of parallel tasks to run, -1 if we want to use GNU Make jobserver.  */
+int lto_parallelism;
+
 static GTY(()) tree first_personality_decl;
 
 /* Returns a hash code for P.  */
@@ -3002,6 +3005,98 @@ cmp_partitions_order (const void *a, con
   return orderb - ordera;
 }
 
+/* Actually stream out ENCODER into TEMP_FILENAME.  */
+
+void
+do_stream_out (char *temp_filename, lto_symtab_encoder_t encoder)
+{
+  lto_file *file = lto_obj_file_open (temp_filename, true);
+  if (!file)
+    fatal_error ("lto_obj_file_open() failed");
+  lto_set_current_out_file (file);
+
+  ipa_write_optimization_summaries (encoder);
+
+  lto_set_current_out_file (NULL);
+  lto_obj_file_close (file);
+  free (file);
+}
+
+/* Wait for forked process and signal errors.  */
+#ifdef HAVE_WORKING_FORK
+void
+wait_for_child ()
+{
+  int status;
+  do
+    {
+      int w = waitpid(0, &status, WUNTRACED | WCONTINUED);
+      if (w == -1)
+        fatal_error ("waitpid failed");
+
+      if (WIFEXITED (status) && WEXITSTATUS (status))
+        fatal_error ("streaming subprocess failed");
+      else if (WIFSIGNALED (status))
+        fatal_error ("streaming subprocess was killed by signal");
+    }
+  while (!WIFEXITED(status) && !WIFSIGNALED(status));
+}
+#endif
+
+/* Stream out ENCODER into TEMP_FILENAME.
+   Fork if that seems to help.  */
+
+void
+stream_out (char *temp_filename, lto_symtab_encoder_t encoder, bool last)
+{
+#ifdef HAVE_WORKING_FORK
+  static int nruns;
+
+  if (!lto_parallelism || lto_parallelism == 1)
+    {
+      do_stream_out (temp_filename, encoder);
+      return;
+    }
+
+  /* Do not run more than LTO_PARALLELISM streamings
+     FIXME: we ignore limits on jobserver.  */
+  if (lto_parallelism > 0 && nruns >= lto_parallelism)
+    {
+      wait_for_child ();
+      nruns --;
+    }
+  /* If this is not the last parallel partition, execute new
+     streaming process.  */
+  if (!last)
+    {
+      pid_t cpid = fork ();
+
+      if (!cpid)
+        {
+          setproctitle ("lto1-wpa-streaming");
+          do_stream_out (temp_filename, encoder);
+          exit (0);
+        }
+      /* Fork failed; let's do the job ourselves.  */
+      else if (cpid == -1)
+        do_stream_out (temp_filename, encoder);
+      else
+        nruns++;
+    }
+  /* Last partition; stream it and wait for all children to die.  */
+  else
+    {
+      int i;
+      do_stream_out (temp_filename, encoder);
+      for (i = 0; i < nruns; i++)
+        wait_for_child ();
+    }
+  asm_nodes_output = true;
+#else
+  do_stream_out (temp_filename, encoder);
+#endif
+}
+
 /* Write all output files in WPA mode and the file with the list of
    LTRANS units.  */
 
@@ -3009,18 +3104,15 @@ static void
 lto_wpa_write_files (void)
 {
   unsigned i, n_sets;
-  lto_file *file;
   ltrans_partition part;
   FILE *ltrans_output_list_stream;
   char *temp_filename;
+  vec<char *> temp_filenames = vNULL;
   size_t blen;
 
   /* Open the LTRANS output list.  */
   if (!ltrans_output_list)
     fatal_error ("no LTRANS output list filename provided");
-  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
-  if (ltrans_output_list_stream == NULL)
-    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
 
   timevar_push (TV_WHOPR_WPA);
 
@@ -3056,14 +3148,10 @@ lto_wpa_write_files (void)
                       : cmp_partitions_order);
   for (i = 0; i < n_sets; i++)
     {
-      size_t len;
       ltrans_partition part = ltrans_partitions[i];
 
       /* Write all the nodes in SET.  */
       sprintf (temp_filename + blen, "%u.o", i);
-      file = lto_obj_file_open (temp_filename, true);
-      if (!file)
-        fatal_error ("lto_obj_file_open() failed");
 
       if (!quiet_flag)
         fprintf (stderr, " %s (%s %i insns)", temp_filename, part->name, part->insns);
@@ -3105,21 +3193,25 @@ lto_wpa_write_files (void)
         }
       gcc_checking_assert (lto_symtab_encoder_size (part->encoder) || !i);
 
-      lto_set_current_out_file (file);
-
-      ipa_write_optimization_summaries (part->encoder);
+      stream_out (temp_filename, part->encoder, i == n_sets - 1);
 
-      lto_set_current_out_file (NULL);
-      lto_obj_file_close (file);
-      free (file);
       part->encoder = NULL;
 
-      len = strlen (temp_filename);
-      if (fwrite (temp_filename, 1, len, ltrans_output_list_stream) < len
+      temp_filenames.safe_push (xstrdup (temp_filename));
+    }
+  ltrans_output_list_stream = fopen (ltrans_output_list, "w");
+  if (ltrans_output_list_stream == NULL)
+    fatal_error ("opening LTRANS output list %s: %m", ltrans_output_list);
+  for (i = 0; i < n_sets; i++)
+    {
+      unsigned int len = strlen (temp_filenames[i]);
+      if (fwrite (temp_filenames[i], 1, len, ltrans_output_list_stream) < len
           || fwrite ("\n", 1, 1, ltrans_output_list_stream) < 1)
         fatal_error ("writing to LTRANS output list %s: %m",
                      ltrans_output_list);
+      free (temp_filenames[i]);
     }
+  temp_filenames.release();
 
   lto_stats.num_output_files += n_sets;
 
Index: lto/lang.opt
===================================================================
--- lto/lang.opt (revision 201891)
+++ lto/lang.opt (working copy)
@@ -32,6 +32,10 @@ fltrans-output-list=
 LTO Joined Var(ltrans_output_list)
 Specify a file to which a list of files output by LTRANS is written.
 
+fparallelism=
+LTO Joined
+Run the link-time optimizer in whole program analysis (WPA) mode.
+
 fwpa
 LTO Driver Report Var(flag_wpa)
 Run the link-time optimizer in whole program analysis (WPA) mode.
Index: lto/lto.h
===================================================================
--- lto/lto.h (revision 201891)
+++ lto/lto.h (working copy)
@@ -39,6 +39,7 @@ extern const char *resolution_file_name;
 extern tree lto_eh_personality (void);
 extern void lto_main (void);
 extern void lto_read_all_file_options (void);
+extern int lto_parallelism;
 
 /* In lto-elf.c or lto-coff.c  */
 extern lto_file *lto_obj_file_open (const char *filename, bool writable);
Index: lto/lto-lang.c
===================================================================
--- lto/lto-lang.c (revision 201891)
+++ lto/lto-lang.c (working copy)
@@ -735,6 +735,19 @@ lto_handle_option (size_t scode, const c
       warn_psabi = value;
       break;
 
+    case OPT_fparallelism_:
+      if (!arg)
+        lto_parallelism = 1;
+      else if (!strcmp (arg, "jobserver"))
+        lto_parallelism = -1;
+      else
+        {
+          lto_parallelism = atoi (arg);
+          if (lto_parallelism <= 0)
+            lto_parallelism = 0;
+        }
+      break;
+
     default:
       break;
     }
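
For reference, the mapping that the lto_handle_option hunk applies to the -fparallelism= argument can be sketched as a standalone function. parse_parallelism and parse_parallelism_examples are hypothetical names, not part of the patch. This sketch also swaps atoi for strtol, so an argument with trailing garbage such as "12abc" maps to 0, where atoi in the patch would read it as 12.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Map the text after "-fparallelism=" to a job count, following the
   convention in the patch: 1 when no argument is given, -1 for
   "jobserver", N for a positive number, 0 for anything else.
   Assumption of this sketch: strtol replaces atoi, so arguments that
   are not entirely a decimal number are rejected.  */
int
parse_parallelism (const char *arg)
{
  char *end;
  long n;

  if (!arg)
    return 1;
  if (strcmp (arg, "jobserver") == 0)
    return -1;

  errno = 0;
  n = strtol (arg, &end, 10);
  if (errno != 0 || end == arg || *end != '\0' || n <= 0 || n > INT_MAX)
    return 0;
  return (int) n;
}

/* Usage examples, written as assertions.  */
void
parse_parallelism_examples (void)
{
  assert (parse_parallelism (NULL) == 1);
  assert (parse_parallelism ("jobserver") == -1);
  assert (parse_parallelism ("24") == 24);
  assert (parse_parallelism ("0") == 0);
  assert (parse_parallelism ("-3") == 0);
  assert (parse_parallelism ("12abc") == 0);
}
```

In stream_out a result of 0 or 1 means serial streaming, and -1 means no limit on the number of forked children, because the jobserver limit is currently ignored.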