[RFC] Getting LTO incremental linking to work

Hi,
PR 67548 is about LTO not supporting incremental linking.  I never really
considered our current incremental linking very useful, because it triggers
code generation at incremental link time, basically nullifying any
benefits of whole program optimization.  In fact I think it is harmful,
because it sort of works and, without any warning, produces poorly
optimized code.

Basically there are 3 schemes to make incremental linking work:
 1) Turn LTO objects into non-LTO ones, as we do now
 2) Concatenate LTO sections, as implemented by Andi and HJ
 3) Do actual linking of LTO sections

The problem with the current implementation of 1) is that GCC thinks the
resulting object file will not be used for static linking and thus assumes
that hidden symbols can be turned to static.

In the log of PR67548 HJ actually pointed out that we do have an API on the
linker plugin side which says what type of output is being produced.  This is
cool because we can also use it to drop -fpic when building a static binary.
This is common in Firefox, where some objects are built with -fpic and linked
into both binaries and libraries.

Moreover we do have all the infrastructure ready to implement 3).  Our tree
merging and symbol table handling is fully incremental, and I think I made a
patch to implement it today.  The scheme is easy:

 1) The linker plugin is modified to pass -flinker-output to the lto wrapper.
    linker-output is either dyn (.so), pie or exec;
    for incremental linking I added rel for 3) and noltorel for 1).

    Currently it uses rel, except that 3) (like 2) cannot be done when
    incremental linking is done on both LTO and non-LTO objects.  In this
    case the linker plugin outputs a warning about code quality loss and
    switches to noltorel.
 2) With -flinker-output the lto wrapper behaves the same way as with
    -flto-partition=none.
 3) The lto frontend parses -flinker-output and sets our internal flags
    accordingly.  I added a new flag_incremental_linking to inform the
    middle-end that the output is going to be statically linked again.  This
    disables the privatization of hidden symbols, and if set to 2 it also
    triggers the LTO IL streaming.
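
To make the scheme concrete, here is a rough sketch of the decision logic
in plain C.  It is not the patch: every name except flag_incremental_linking
and the option values (dyn, pie, exec, rel, noltorel) is hypothetical, and
mapping noltorel to the value 1 is my reading of the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types; only the option values come from the scheme above.  */
enum linker_output { LO_UNKNOWN, LO_DYN, LO_PIE, LO_EXEC, LO_REL, LO_NOLTOREL };

struct lto_flags
{
  int incremental_linking; /* 0 = final link, 1 = noltorel (assumed), 2 = rel */
  int privatize_hidden;    /* may hidden symbols be turned static?  */
  int pic;                 /* keep generating PIC code?  */
};

/* Plugin side: use rel unless the incremental link mixes LTO and non-LTO
   objects; in that case fall back to noltorel and ask for a warning.  */
static const char *
choose_incremental_mode (int n_lto_objects, int n_nonlto_objects, int *warn)
{
  assert (n_lto_objects >= 0 && n_nonlto_objects >= 0 && warn != NULL);
  *warn = (n_lto_objects > 0 && n_nonlto_objects > 0);
  return *warn ? "noltorel" : "rel";
}

/* Frontend side: parse the value passed with -flinker-output.  */
static enum linker_output
parse_linker_output (const char *arg)
{
  static const struct { const char *name; enum linker_output kind; } table[] = {
    { "dyn", LO_DYN }, { "pie", LO_PIE }, { "exec", LO_EXEC },
    { "rel", LO_REL }, { "noltorel", LO_NOLTOREL }
  };
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp (arg, table[i].name) == 0)
      return table[i].kind;
  return LO_UNKNOWN;
}

/* Frontend side: derive the internal flags from the output kind.  */
static struct lto_flags
flags_for_output (enum linker_output kind, int pic_requested)
{
  struct lto_flags f = { 0, 1, pic_requested };
  switch (kind)
    {
    case LO_REL:       /* Link the LTO sections and stream the IL out again.  */
      f.incremental_linking = 2;
      f.privatize_hidden = 0;
      break;
    case LO_NOLTOREL:  /* Generate code now, but expect another static link.  */
      f.incremental_linking = 1;
      f.privatize_hidden = 0;
      break;
    case LO_EXEC:      /* Static binary: -fpic can be dropped.  */
      f.pic = 0;
      break;
    default:
      break;
    }
  return f;
}
```

So flags_for_output (parse_linker_output ("rel"), 1) would keep hidden
symbols intact and request IL streaming, while "exec" would only clear pic.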

Incremental linking in rel mode now streams in all global streams,
merges trees, merges the symbol table, removes unreachable symbols (which
are a result of merging) and streams everything out to a .s file.

I only tested the patch on incrementally linking libbackend.o.  The linking
time is 46 seconds:

Execution times (seconds)
 phase opt and generate  :  35.75 (81%) usr   0.90 (76%) sys  36.63 (81%) wall    5008 kB ( 1%) ggc
 phase stream in         :   8.57 (19%) usr   0.28 (24%) sys   8.86 (19%) wall  700851 kB (99%) ggc
 callgraph optimization  :   0.08 ( 0%) usr   0.01 ( 1%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 ipa dead code removal   :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.36 ( 1%) usr   0.04 ( 3%) sys   0.41 ( 1%) wall   42862 kB ( 6%) ggc
 ipa inlining heuristics :   0.18 ( 0%) usr   0.02 ( 2%) sys   0.19 ( 0%) wall   26771 kB ( 4%) ggc
 lto stream inflate      :   3.57 ( 8%) usr   0.14 (12%) sys   3.70 ( 8%) wall       0 kB ( 0%) ggc
 lto stream deflate      :  20.13 (45%) usr   0.05 ( 4%) sys  19.42 (43%) wall       0 kB ( 0%) ggc
 lto stream output       :   9.70 (22%) usr   0.32 (27%) sys  10.50 (23%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.66 ( 1%) usr   0.24 (20%) sys   1.09 ( 2%) wall    4655 kB ( 1%) ggc
 ipa lto decl in         :   5.87 (13%) usr   0.11 ( 9%) sys   6.10 (13%) wall  552108 kB (78%) ggc
 ipa lto decl out        :   2.91 ( 7%) usr   0.16 (14%) sys   3.07 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto constructors in :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     108 kB ( 0%) ggc
 ipa lto constructors out:   0.12 ( 0%) usr   0.03 ( 3%) sys   0.13 ( 0%) wall     178 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.12 ( 0%) usr   0.02 ( 2%) sys   0.15 ( 0%) wall   70005 kB (10%) ggc
 ipa lto decl merge      :   0.31 ( 1%) usr   0.00 ( 0%) sys   0.30 ( 1%) wall    1023 kB ( 0%) ggc
 ipa lto cgraph merge    :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall    7972 kB ( 1%) ggc
 ipa profile             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa icf                 :   0.04 ( 0%) usr   0.01 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.02 ( 0%) usr   0.01 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :  44.32             1.18            45.49             707846 kB

There are a few low-hanging fruits.  First, streaming LTO files is slow
because of vprintf (every escaped byte goes through fprintf):
        case 1:
          /* TODO: Print in hex with fast function, important for -flto. */
          fprintf (f, "\\%03o", c);
          break;
This is a trivial bug to fix; I will send a separate patch for it.
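
For illustration only, one way to avoid the fprintf call quoted above is to
format the escape by hand.  This sketch keeps the octal form so the output
stays byte-for-byte the same (the TODO suggests hex instead); the function
names are made up and this is not the patch I intend to send.

```c
#include <assert.h>
#include <stdio.h>

/* Fill BUF with a backslash followed by exactly three octal digits,
   the same four bytes that fprintf (f, "\\%03o", c) would emit.  */
static void
format_octal_escape (char buf[4], unsigned char c)
{
  buf[0] = '\\';
  buf[1] = (char) ('0' + ((c >> 6) & 7));
  buf[2] = (char) ('0' + ((c >> 3) & 7));
  buf[3] = (char) ('0' + (c & 7));
}

/* Write the escape for C to F without going through printf formatting.  */
static void
put_octal_escape (FILE *f, unsigned char c)
{
  char buf[4];
  assert (f != NULL);
  format_octal_escape (buf, c);
  fwrite (buf, 1, sizeof buf, f);
}
```

In the case 1 branch the call would then be put_octal_escape (f, c) instead
of the fprintf.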

Second, most of the inflate/deflate time goes to compressing and
uncompressing sections that are just being copied.  This is also trivial to
fix and I will do that in a separate patch; it also affects WPA and /tmp
space usage.

The size of the library is cut to about a half:
-rw-r--r-- 1 hubicka _cvsadmin 211854942 Nov 25 09:18 libbackend.a
-rw-r--r-- 1 hubicka _cvsadmin 121986816 Nov 25 09:16 libbackend.o

and linking of the cc1 binary goes from 1m31s to 1m20s. Because we link
libbackend.a more than 4 times, it would actually pay back even in the GCC
setting, though I suppose the main utility would be in parallelizing builds
(like the kernel does).

WPA stage times are:
Execution times (seconds)                                                       
 phase opt and generate  :   3.76 (52%) usr   0.07 ( 6%) sys   3.83 (41%) wall   53777 kB (13%) ggc
 phase stream in         :   3.04 (42%) usr   0.33 (28%) sys   3.37 (36%) wall  346427 kB (86%) ggc
 phase stream out        :   0.40 ( 6%) usr   0.78 (66%) sys   2.18 (23%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      18 kB ( 0%) ggc
 ipa dead code removal   :   0.46 ( 6%) usr   0.00 ( 0%) sys   0.44 ( 5%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.40 ( 6%) usr   0.05 ( 4%) sys   0.47 ( 5%) wall   55439 kB (14%) ggc
 ipa inlining heuristics :   1.95 (27%) usr   0.02 ( 2%) sys   1.97 (21%) wall   65871 kB (16%) ggc
 lto stream inflate      :   0.60 ( 8%) usr   0.11 ( 9%) sys   0.67 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :   1.93 (27%) usr   0.18 (15%) sys   2.10 (22%) wall  205593 kB (51%) ggc
 ipa lto decl out        :   0.28 ( 4%) usr   0.02 ( 2%) sys   0.29 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.09 ( 1%) usr   0.02 ( 2%) sys   0.12 ( 1%) wall   62797 kB (16%) ggc
 ipa lto decl merge      :   0.20 ( 3%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    1023 kB ( 0%) ggc
 whopr partitioning      :   0.56 ( 8%) usr   0.00 ( 0%) sys   0.56 ( 6%) wall    1419 kB ( 0%) ggc
 ipa reference           :   0.17 ( 2%) usr   0.00 ( 0%) sys   0.17 ( 2%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.17 ( 2%) usr   0.00 ( 0%) sys   0.16 ( 2%) wall       0 kB ( 0%) ggc
 ipa icf                 :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.07 ( 1%) wall     485 kB ( 0%) ggc
 unaccounted todo        :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 :   7.20             1.18             9.39             402192 kB

Execution times (seconds)                                                       
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1986 kB ( 0%) ggc
 phase opt and generate  :   6.66 (39%) usr   0.38 (22%) sys   7.03 (36%) wall  199143 kB (21%) ggc
 phase stream in         :   9.33 (54%) usr   0.38 (22%) sys   9.71 (50%) wall  764698 kB (79%) ggc
 phase stream out        :   0.82 ( 5%) usr   0.97 (55%) sys   2.23 (11%) wall       2 kB ( 0%) ggc
 phase finalize          :   0.40 ( 2%) usr   0.03 ( 2%) sys   0.43 ( 2%) wall       0 kB ( 0%) ggc
 garbage collection      :   0.79 ( 5%) usr   0.01 ( 1%) sys   0.80 ( 4%) wall       0 kB ( 0%) ggc
 ipa dead code removal   :   0.41 ( 2%) usr   0.00 ( 0%) sys   0.45 ( 2%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.33 ( 2%) usr   0.05 ( 3%) sys   0.41 ( 2%) wall   56753 kB ( 6%) ggc
 ipa inlining heuristics :   1.74 (10%) usr   0.02 ( 1%) sys   1.80 ( 9%) wall   55600 kB ( 6%) ggc
 lto stream inflate      :   2.18 (13%) usr   0.12 ( 7%) sys   2.28 (12%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   0.62 ( 4%) usr   0.23 (13%) sys   0.96 ( 5%) wall  135317 kB (14%) ggc
 ipa lto decl in         :   6.63 (39%) usr   0.15 ( 9%) sys   6.70 (35%) wall  598144 kB (62%) ggc
 ipa lto decl out        :   0.55 ( 3%) usr   0.01 ( 1%) sys   0.57 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.14 ( 1%) usr   0.03 ( 2%) sys   0.15 ( 1%) wall   76843 kB ( 8%) ggc
 ipa lto decl merge      :   0.35 ( 2%) usr   0.00 ( 0%) sys   0.34 ( 2%) wall    1023 kB ( 0%) ggc
 ipa lto cgraph merge    :   0.13 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 1%) wall    9284 kB ( 1%) ggc
 whopr partitioning      :   0.51 ( 3%) usr   0.00 ( 0%) sys   0.50 ( 3%) wall    1496 kB ( 0%) ggc
 ipa reference           :   0.18 ( 1%) usr   0.00 ( 0%) sys   0.19 ( 1%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.20 ( 1%) usr   0.01 ( 1%) sys   0.20 ( 1%) wall       0 kB ( 0%) ggc
 ipa icf                 :   1.82 (11%) usr   0.05 ( 3%) sys   1.85 (10%) wall    2138 kB ( 0%) ggc
 tree operand scan       :   0.13 ( 1%) usr   0.06 ( 3%) sys   0.17 ( 1%) wall   21674 kB ( 2%) ggc
 TOTAL                 :  17.21             1.76            19.41             965830 kB

So this is a cut of more than 50% in memory use (402192 kB versus 965830 kB)
and a reasonable speedup (9.39s versus 19.41s wall).  I need to check what
happens with ICF.

The WPA stats are as follows:
WPA statistics
[WPA] read 891308 SCCs of average size 1.972195
[WPA] 1757833 tree bodies read in total
[WPA] tree SCC table: size 524287, 230881 elements, collision ratio: 1.107788
[WPA] tree SCC max chain length 39 (size 1)
[WPA] Compared 73318 SCCs, 81315 collisions (1.109073)
[WPA] Merged 52578 SCCs
[WPA] Merged 502850 tree bodies
[WPA] Merged 36730 types
[WPA] 205971 types prevailed (565069 associated trees)
[WPA] GIMPLE canonical type table: size 16381, 1251 elements, 28138 searches, 444 collisions (ratio: 0.015779)
[WPA] GIMPLE canonical type pointer-map: 1251 elements, 99917 searches
[WPA] # of input files: 125
[WPA] Compression: 23123694 input bytes, 79799028 uncompressed bytes (ratio: 3.450963)
[WPA] Size of mmap'd section decls: 23123694 bytes

compared to
WPA statistics
[WPA] read 3633234 SCCs of average size 2.539347
[WPA] 9226041 tree bodies read in total
[WPA] tree SCC table: size 524287, 257562 elements, collision ratio: 0.673833
[WPA] tree SCC max chain length 39 (size 1)
[WPA] Compared 500618 SCCs, 646007 collisions (1.290419)
[WPA] Merged 478513 SCCs
[WPA] Merged 5659960 tree bodies
[WPA] Merged 326141 types
[WPA] 207806 types prevailed (562649 associated trees)
[WPA] GIMPLE canonical type table: size 16381, 1246 elements, 27925 searches, 437 collisions (ratio: 0.015649)
[WPA] GIMPLE canonical type pointer-map: 1246 elements, 97858 searches
[WPA] # of input files: 461
[WPA] Compression: 95695388 input bytes, 303240971 uncompressed bytes (ratio: 3.168815)
[WPA] Size of mmap'd section decls: 95695388 bytes

So about a 5-fold improvement in the number of trees and decls read
(1757833 versus 9226041 tree bodies). By the end of WPA:

[WPA] 1757833 tree bodies read in total
[WPA] # of input files: 125
[WPA] # of input cgraph nodes: 36977
[WPA] # of function bodies: 651
[WPA] # of output files: 31
[WPA] # of output symtab nodes: 185336
[WPA] # of output tree pickle references: 629336
[WPA] # of output tree bodies: 129898
[WPA] # callgraph partitions: 31
[WPA] Compression: 30134544 input bytes, 100590102 uncompressed bytes (ratio: 3.338033)
[WPA] Size of mmap'd section decls: 23123694 bytes
[WPA] Size of mmap'd section function_body: 2641029 bytes
[WPA] Size of mmap'd section statics: 0 bytes
[WPA] Size of mmap'd section symtab: 0 bytes
[WPA] Size of mmap'd section refs: 408500 bytes
[WPA] Size of mmap'd section asm: 0 bytes
[WPA] Size of mmap'd section jmpfuncs: 1432063 bytes
[WPA] Size of mmap'd section pureconst: 80213 bytes
[WPA] Size of mmap'd section reference: 0 bytes
[WPA] Size of mmap'd section profile: 2439 bytes
[WPA] Size of mmap'd section symbol_nodes: 1413364 bytes
[WPA] Size of mmap'd section opts: 0 bytes
[WPA] Size of mmap'd section cgraphopt: 0 bytes
[WPA] Size of mmap'd section inline: 1005113 bytes
[WPA] Size of mmap'd section ipcp_trans: 0 bytes
[WPA] Size of mmap'd section icf: 28129 bytes
[WPA] Size of mmap'd section offload_table: 0 bytes
[WPA] Size of mmap'd section mode_table: 0 bytes

[WPA] 9226041 tree bodies read in total
[WPA] # of input files: 461
[WPA] # of input cgraph nodes: 36888
[WPA] # of function bodies: 7690
[WPA] # of output files: 31
[WPA] # of output symtab nodes: 191489
[WPA] # of output tree pickle references: 1444221
[WPA] # of output tree bodies: 261141
[WPA] # callgraph partitions: 31
[WPA] Compression: 112942159 input bytes, 347530231 uncompressed bytes (ratio: 3.077064)
[WPA] Size of mmap'd section decls: 95695388 bytes
[WPA] Size of mmap'd section function_body: 11747200 bytes
[WPA] Size of mmap'd section statics: 0 bytes
[WPA] Size of mmap'd section symtab: 0 bytes
[WPA] Size of mmap'd section refs: 395831 bytes
[WPA] Size of mmap'd section asm: 0 bytes
[WPA] Size of mmap'd section jmpfuncs: 1666954 bytes
[WPA] Size of mmap'd section pureconst: 94608 bytes
[WPA] Size of mmap'd section reference: 0 bytes
[WPA] Size of mmap'd section profile: 9259 bytes
[WPA] Size of mmap'd section symbol_nodes: 1769069 bytes
[WPA] Size of mmap'd section opts: 0 bytes
[WPA] Size of mmap'd section cgraphopt: 0 bytes
[WPA] Size of mmap'd section inline: 1266586 bytes
[WPA] Size of mmap'd section ipcp_trans: 0 bytes
[WPA] Size of mmap'd section icf: 297264 bytes
[WPA] Size of mmap'd section offload_table: 0 bytes
[WPA] Size of mmap'd section mode_table: 0 bytes

Does anyone see problems with this approach? I think this is easy enough
and fixes PR67548, so it may still get to mainline?
I need to do more testing, but in general I think the implementation is OK
as it is.  We need a way to force the noltorel mode for the testsuite, as
the new default will bypass codegen for all our -r -nostdlib testcases.

BTW ltrans now dies with -ftime-report. Any ideas why?

Honza