[PR,lto/41528] Add internal documentation in doc/lto.texi

Message ID 20101115062311.GA26274@google.com
State New
Commit Message

Diego Novillo Nov. 15, 2010, 6:23 a.m. UTC
This patch adds internal documentation for LTO.  Much of it comes
from Honza's GCC Summit paper, wiki pages and source comments.  I
also moved the internal flags from invoke.texi and added several
pointers to the source code.

It can still use more information, but this is a start.

Tested with make doc, make pdf and visual inspection.

OK for mainline?


Diego.

2010-11-14  Jan Hubicka  <jh@suse.cz>
	    Diego Novillo  <dnovillo@google.com>

	PR lto/41528
	* doc/lto.texi: Add.
	* doc/gccint.texi: Add reference to lto.texi.
	* doc/invoke.texi: Update user documentation for LTO.
	Move internal flags to lto.texi.

Comments

Xinliang David Li Nov. 15, 2010, 6:57 a.m. UTC | #1
Hi Diego, I have some random comments inlined below.

On Sun, Nov 14, 2010 at 10:23 PM, Diego Novillo <dnovillo@google.com> wrote:
> This patch adds internal documentation for LTO.  Much of it comes
> from Honza's GCC Summit paper, wiki pages and source comments.  I
> also moved the internal flags from invoke.texi and added several
> pointers to the source code.
>
> It can still use more information, but this is a start.
>
> Tested with make doc, make pdf and visual inspection.
>
> OK for mainline?
>
>
> Diego.
>
> 2010-11-14  Jan Hubicka  <jh@suse.cz>
>            Diego Novillo  <dnovillo@google.com>
>
>        PR lto/41528
>        * doc/lto.texi: Add.
>        * doc/gccint.texi: Add reference to lto.texi.
>        * doc/invoke.texi: Update user documentation for LTO.
>        Move internal flags to lto.texi.
>
> Index: doc/lto.texi
> ===================================================================
> --- doc/lto.texi        (revision 0)
> +++ doc/lto.texi        (revision 0)
> @@ -0,0 +1,568 @@
> +@c Copyright (c) 2010 Free Software Foundation, Inc.
> +@c Free Software Foundation, Inc.
> +@c This is part of the GCC manual.
> +@c For copying conditions, see the file gcc.texi.
> +@c Contributed by Jan Hubicka <jh@suse.cz> and
> +@c Diego Novillo <dnovillo@google.com>
> +
> +@node LTO
> +@chapter Link Time Optimization
> +@cindex lto
> +@cindex whopr
> +@cindex wpa
> +@cindex ltrans
> +
> +@section Design Overview
> +
> +Link time optimization is implemented as a GCC front end for a
> +bytecode representation of GIMPLE that is emitted in special sections
> +of @code{.o} files.  Currently, LTO support is enabled in most
> +ELF-based systems, as well as darwin, cygwin and mingw systems.
> +
> +Since GIMPLE bytecode is saved alongside final object code, object
> +files generated with LTO support are larger than regular object files.
> +This ``fat'' object format makes it easy to integrate LTO into
> +existing build systems, as one can, for instance, produce archives of
> +the files.  Additionally, one might be able to ship one set of fat
> +objects which could be used both for development and the production of
> +optimized builds.  A perhaps surprising side effect of this feature
> +is that any mistake in the toolchain leads to LTO information not
> +being used (e.g. an older @code{libtool} calling @code{ld} directly).
> +This is both an advantage, as the system is more robust, and a
> +disadvantage, as the user is not informed that the optimization has
> +been disabled.
> +
> +The current implementation only produces ``fat'' objects, effectively
> +doubling compilation time and increasing file sizes up to 5x the
> +original size.  This hides the problem that some tools, such as
> +@code{ar} and @code{nm}, need to understand symbol tables of LTO
> +sections.  These tools were extended to use the plugin infrastructure,
> +and with these problems solved, GCC will also support ``slim'' objects
> +consisting of the intermediate code alone.
> +
> +At the highest level, LTO splits the compiler in two.  The first half
> +(the ``writer'') produces a streaming representation of all the
> +internal data structures needed to optimize and generate code.  This
> +includes declarations, types, the callgraph and the GIMPLE representation
> +of function bodies.
> +
> +When @option{-flto} is given during compilation of a source file, the
> +pass manager executes all the passes in @code{all_lto_gen_passes}.
> +Currently, this phase is composed of two IPA passes:
> +
> +@itemize @bullet
> +@item @code{pass_ipa_lto_gimple_out}
> +This pass executes the function @code{lto_output} in
> +@file{lto-streamer-out.c}, which traverses the call graph encoding
> +every reachable declaration, type and function. This generates a
> +memory representation of all the file sections described below.
> +
> +@item @code{pass_ipa_lto_finish_out}
> +This pass executes the function @code{produce_asm_for_decls} in
> +@file{lto-streamer-out.c}, which takes the memory image built in the
> +previous pass and encodes it in the corresponding ELF file sections.
> +@end itemize
> +
> +The second half of LTO support is the ``reader''.  This is implemented
> +as the GCC front end @file{lto1} in @file{lto/lto.c}.  When
> +@file{collect2} detects a link set of @code{.o}/@code{.a} files with
> +LTO information and @option{-flto} is enabled, it invokes
> +@file{lto1} which reads the set of files and aggregates them into a
> +single translation unit for optimization.  The main entry point for
> +the reader is @file{lto/lto.c}:@code{lto_main}.

cc1, cc1plus, f951, and lto1: what does the '1' mean here? It always
gets me wondering where cc2/lto2/... are.


> +
> +@subsection LTO modes of operation
> +
> +One of the main goals of the GCC link-time infrastructure was to allow
> +effective compilation of large programs.  For this reason GCC implements two
> +link-time compilation modes.
> +
> +@enumerate
> +@item  @emph{LTO mode}, in which the whole program is read into the
> +compiler at link-time and optimized in a similar way as if it
> +were a single source-level compilation unit.
> +
> +@item  @emph{WHOPR or partitioned mode}, designed to utilize multiple
> +CPUs and/or a distributed compilation environment to quickly link
> +large applications.  WHOPR stands for WHOle Program optimizeR (not to
> +be confused with the semantics of @option{-fwhole-program}).

WHOPR --- the Whole part is not accurate -- it can be any part of the
whole program. The 'Whole Program' should be 'the set of IL units' to
be more exact.

>  It
> +partitions the aggregated callgraph from many different @code{.o}
> +files and distributes the compilation of the sub-graphs to different
> +CPUs.
> +
> +Note that distributed compilation is not implemented yet, but since
> +the parallelism is facilitated via generating a @code{Makefile}, it
> +would be easy to implement.
> +@end enumerate
> +
> +WHOPR splits LTO into three main stages:
> +@enumerate
> +@item Local generation (LGEN)
> +This stage executes in parallel. Every file in the program is compiled
> +into the intermediate language and packaged together with the local
> +call-graph and summary information.  This stage is the same for both
> +the LTO and WHOPR compilation modes.
> +
> +@item Whole Program Analysis (WPA)
> +WPA is performed sequentially. The global call-graph is generated, and
> +a global analysis procedure makes transformation decisions. The global
> +call-graph is partitioned to facilitate parallel optimization during
> +phase 3. The results of the WPA stage are stored into new object files
> +which contain the partitions of the program expressed in the intermediate
> +language and the optimization decisions.

Would it be better to mention 'do not confuse with -fwhole-program' here?


> +
> +@item Local transformations (LTRANS)
> +This stage executes in parallel. All the decisions made during phase 2
> +are implemented locally in each partitioned object file, and the final
> +object code is generated. Optimizations which cannot be decided
> +efficiently during phase 2 may be performed on the local
> +call-graph partitions.
> +@end enumerate
> +
> +WHOPR can be seen as an extension of the usual LTO mode of
> +compilation.  In LTO, WPA and LTRANS are executed within a single
> +execution of the compiler, after the whole program has been read into
> +memory.

The use of 'whole program' here and in other places does not seem to be
accurate -- it is essentially the set of object files with IL/IR in
them.


> +
> +When compiling in WHOPR mode the callgraph is partitioned during
> +the WPA stage.  The whole program is split into a given number of
> +partitions of roughly the same size.  The compiler tries to
> +minimize the number of references which cross partition boundaries.
> +The main advantage of WHOPR is to allow the parallel execution of
> +LTRANS stages, which are the most time-consuming part of the
> +compilation process.  Additionally, it avoids the need to load the
> +whole program into memory.
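
For illustration only, here is a minimal C sketch of the partitioning idea
described above: split nodes into partitions of roughly equal size, then
count the references that cross partition boundaries. All names
(toy_partition_balanced, toy_count_cross_edges) are hypothetical and the
in-order greedy split is an assumption made for brevity; it is not the
algorithm GCC uses.

```c
#include <assert.h>

/* Assign nodes, in the given order, to `nparts` partitions whose total
   sizes are roughly equal.  part_of[i] receives the partition of node i.
   Hypothetical helper, not part of GCC.  */
void toy_partition_balanced(const int *size, int n, int nparts, int *part_of)
{
  assert(n >= 0 && nparts > 0);
  long total = 0;
  for (int i = 0; i < n; i++)
    total += size[i];
  long target = (total + nparts - 1) / nparts;  /* ceiling of total/nparts */
  int part = 0;
  long filled = 0;
  for (int i = 0; i < n; i++) {
    /* Move on once the current partition reached its target size.  */
    if (filled >= target && part < nparts - 1) {
      part++;
      filled = 0;
    }
    part_of[i] = part;
    filled += size[i];
  }
}

/* Count edges (from[e] -> to[e]) whose endpoints sit in different
   partitions; this is the quantity a partitioner would try to minimize.  */
int toy_count_cross_edges(const int *from, const int *to, int nedges,
                          const int *part_of)
{
  int cross = 0;
  for (int e = 0; e < nedges; e++)
    if (part_of[from[e]] != part_of[to[e]])
      cross++;
  return cross;
}
```

A smarter partitioner would reorder nodes before splitting so that callers
and callees tend to land in the same partition.
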
> +
> +
> +@section LTO file sections
> +
> +LTO information is stored in several ELF sections inside object files.
> +Data structures and enum codes for sections are defined in
> +@file{lto-streamer.h}.
> +
> +These sections are emitted from @file{lto-streamer-out.c} and mapped
> +in all at once from @file{lto/lto.c}:@code{lto_file_read}.  The
> +individual functions dealing with the reading/writing of each section
> +are described below.
> +
> +@itemize @bullet
> +@item Command line options (@code{.gnu.lto_.opts})
> +
> +This section contains the command line options used to generate the
> +object files.  This is used at link-time to determine the optimization
> +level and other settings when they are not explicitly specified at the
> +linker command line.
> +
> +Currently, GCC does not support combining LTO object files compiled
> +with different sets of command line options into a single binary.
> +At link-time, the options given on the command line and the options
> +saved on all the files in a link-time set are applied globally.  No
> +attempt is made at validating the combination of flags (other than the
> +usual validation done by option processing).  This is implemented in
> +@file{lto/lto.c}:@code{lto_read_all_file_options}.

This can be a big limiting factor for the wide adoption of LTO. In
LIPO, incompatible options are detected and modules not safe to
include are banned.


Thanks,

David


> +
> +
> +@item Symbol table (@code{.gnu.lto_.symtab})
> +
> +This table replaces the ELF symbol table for functions and variables
> +represented in the LTO IL. Symbols used and exported by the optimized
> +assembly code of ``fat'' objects might not match the ones used and
> +exported by the intermediate code.  This table is necessary because
> +the intermediate code is less optimized and thus requires a separate
> +symbol table.
> +
> +Additionally, the binary code in the ``fat'' object will lack a call
> +to a function, since the call was optimized out at compilation time
> +after the intermediate language was streamed out.  In some special
> +cases, the same optimization may not happen during link-time
> +optimization.  This would lead to an undefined symbol if only one
> +symbol table was used.
> +
> +The symbol table is emitted in
> +@file{lto-streamer-out.c}:@code{produce_symtab}.
> +
> +
> +@item Global declarations and types (@code{.gnu.lto_.decls})
> +
> +This section contains an intermediate language dump of all
> +declarations and types required to represent the callgraph, static
> +variables and top-level debug info.
> +
> +The contents of this section are emitted in
> +@file{lto-streamer-out.c}:@code{produce_asm_for_decls}.  Types and
> +symbols are emitted in a topological order that preserves the sharing
> +of pointers when the file is read back in
> +(@file{lto.c}:@code{read_cgraph_and_symbols}).
> +
> +
> +@item The callgraph (@code{.gnu.lto_.cgraph})
> +
> +This section contains the basic data structure used by the GCC
> +inter-procedural optimization infrastructure. This section stores an
> +annotated multi-graph which represents the functions and call sites as
> +well as the variables, aliases and top-level @code{asm} statements.
> +
> +This section is emitted in
> +@file{lto-streamer-out.c}:@code{output_cgraph} and read in
> +@file{lto-cgraph.c}:@code{input_cgraph}.
> +
> +
> +@item IPA references (@code{.gnu.lto_.refs})
> +
> +This section contains references between functions and static
> +variables.  It is emitted by @file{lto-cgraph.c}:@code{output_refs}
> +and read by @file{lto-cgraph.c}:@code{input_refs}.
> +
> +
> +@item Function bodies (@code{.gnu.lto_.function_body.<name>})
> +
> +This section contains function bodies in the intermediate language
> +representation. Every function body is in a separate section to allow
> +copying of the section independently to different object files or
> +reading the function on demand.
> +
> +Functions are emitted in
> +@file{lto-streamer-out.c}:@code{output_function} and read in
> +@file{lto-streamer-in.c}:@code{input_function}.
> +
> +
> +@item Static variable initializers (@code{.gnu.lto_.vars})
> +
> +This section contains all the symbols in the global variable pool.  It
> +is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
> +@file{lto-cgraph.c}:@code{input_cgraph}.
> +
> +@item Summaries and optimization summaries used by IPA passes
> +(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
> +@code{pureconst} or @code{reference})
> +
> +These sections are used by IPA passes that need to emit summary
> +information during LTO generation to be read and aggregated at
> +link time.  Each pass is responsible for implementing two pass manager
> +hooks: one for writing the summary and another for reading it in.  The
> +format of these sections is entirely up to each individual pass.  The
> +only requirement is that the writer and reader hooks agree on the
> +format.
> +@end itemize
> +
> +
> +@section Using summary information in IPA passes
> +
> +Programs are represented internally as a @emph{callgraph} (a
> +multi-graph where nodes are functions and edges are call sites)
> +and a @emph{varpool} (a list of static and external variables in
> +the program).
> +
> +The inter-procedural optimization is organized as a sequence of
> +individual passes, which operate on the callgraph and the
> +varpool.  To make the implementation of WHOPR possible, every
> +inter-procedural optimization pass is split into several stages
> +that are executed at different times during WHOPR compilation:
> +
> +@itemize @bullet
> +@item LGEN time
> +@enumerate
> +@item @emph{Generate summary} (@code{generate_summary} in
> +@code{struct ipa_opt_pass_d}). This stage analyzes every function
> +body and variable initializer and stores relevant
> +information into a pass-specific data structure.
> +
> +@item @emph{Write summary} (@code{write_summary} in
> +@code{struct ipa_opt_pass_d}). This stage writes all the
> +pass-specific information generated by @code{generate_summary}.
> +Summaries go into their own @code{LTO_section_*} sections that
> +have to be declared in @file{lto-streamer.h}:@code{enum
> +lto_section_type}.  A new section is created by calling
> +@code{create_output_block} and data can be written using the
> +@code{lto_output_*} routines.
> +@end enumerate
> +
> +@item WPA time
> +@enumerate
> +@item @emph{Read summary} (@code{read_summary} in
> +@code{struct ipa_opt_pass_d}). This stage reads all the
> +pass-specific information in exactly the same order that it was
> +written by @code{write_summary}.
> +
> +@item @emph{Execute} (@code{execute} in @code{struct
> +opt_pass}).  This performs inter-procedural propagation.  This
> +must be done without actual access to the individual function
> +bodies or variable initializers.  Typically, this results in a
> +transitive closure operation over the summary information of all
> +the nodes in the callgraph.
> +
> +@item @emph{Write optimization summary}
> +(@code{write_optimization_summary} in @code{struct
> +ipa_opt_pass_d}).  This writes the result of the inter-procedural
> +propagation into the object file.  This can use the same data
> +structures and helper routines used in @code{write_summary}.
> +@end enumerate
> +
> +@item LTRANS time
> +@enumerate
> +@item @emph{Read optimization summary}
> +(@code{read_optimization_summary} in @code{struct
> +ipa_opt_pass_d}).  The counterpart to
> +@code{write_optimization_summary}.  This reads the interprocedural
> +optimization decisions in exactly the same format emitted by
> +@code{write_optimization_summary}.
> +
> +@item @emph{Transform} (@code{function_transform} and
> +@code{variable_transform} in @code{struct ipa_opt_pass_d}).
> +The actual function bodies and variable initializers are updated
> +based on the information passed down from the @emph{Execute} stage.
> +@end enumerate
> +@end itemize
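
As a rough illustration of the stage split listed above, here is a toy C
model of a "purity" analysis. It is a sketch under stated assumptions: the
toy_* names and the byte layout are invented for this example and are not
the hooks of struct ipa_opt_pass_d. The summary is generated from bodies at
LGEN time, serialized, read back at WPA time, and propagated over call
edges without touching any function body.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_MAX_FUNCS = 8 };

/* Toy per-unit summary: pure[i] is 1 if function i has no side effects.  */
struct toy_summary {
  int nfuncs;
  unsigned char pure[TOY_MAX_FUNCS];
};

/* LGEN, "generate summary": inspect the bodies (here just a flag array).  */
void toy_generate_summary(const int *body_has_side_effects, int nfuncs,
                          struct toy_summary *out)
{
  assert(nfuncs >= 0 && nfuncs <= TOY_MAX_FUNCS);
  out->nfuncs = nfuncs;
  for (int i = 0; i < nfuncs; i++)
    out->pure[i] = body_has_side_effects[i] ? 0 : 1;
}

/* "Write summary": serialize; returns bytes written, or 0 if buf is too small.  */
size_t toy_write_summary(const struct toy_summary *s,
                         unsigned char *buf, size_t cap)
{
  size_t need = 1 + (size_t)s->nfuncs;
  if (need > cap)
    return 0;
  buf[0] = (unsigned char)s->nfuncs;
  memcpy(buf + 1, s->pure, (size_t)s->nfuncs);
  return need;
}

/* "Read summary": must mirror the writer's layout.  Returns 0 on success.  */
int toy_read_summary(const unsigned char *buf, size_t len,
                     struct toy_summary *out)
{
  if (len < 1 || buf[0] > TOY_MAX_FUNCS || len < 1 + (size_t)buf[0])
    return -1;
  out->nfuncs = buf[0];
  memcpy(out->pure, buf + 1, (size_t)buf[0]);
  return 0;
}

/* WPA, "execute": propagate using summaries only.  calls is an
   nfuncs x nfuncs row-major matrix; calls[i*nfuncs+j] != 0 means i calls j.
   A function stops being pure as soon as it calls an impure one.  */
void toy_execute(struct toy_summary *s, const unsigned char *calls)
{
  int changed = 1;
  while (changed) {
    changed = 0;
    for (int i = 0; i < s->nfuncs; i++)
      for (int j = 0; j < s->nfuncs; j++)
        if (calls[i * s->nfuncs + j] && s->pure[i] && !s->pure[j]) {
          s->pure[i] = 0;
          changed = 1;
        }
  }
}
```

The same writer/reader pair could be reused for the optimization summary
handed to LTRANS, which is the point the text makes about
write_optimization_summary sharing helpers with write_summary.
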
> +
> +The implementation of the inter-procedural passes is shared
> +between LTO, WHOPR and classic non-LTO compilation.
> +
> +@itemize
> +@item During the traditional file-by-file mode every pass executes its
> +own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
> +stages within the single execution context of the compiler.
> +
> +@item In LTO compilation mode, every pass uses @emph{Generate
> +summary} and @emph{Write summary} stages at compilation time,
> +while the @emph{Read summary}, @emph{Execute}, and
> +@emph{Transform} stages are executed at link time.
> +
> +@item In WHOPR mode all stages are used.
> +@end itemize
> +
> +To simplify development, the GCC pass manager differentiates
> +between normal inter-procedural passes and small inter-procedural
> +passes.  A @emph{small inter-procedural pass}
> +(@code{SIMPLE_IPA_PASS}) is a pass that does
> +everything at once and thus it cannot be executed during WPA in
> +WHOPR mode. It defines only the @emph{Execute} stage and during
> +this stage it accesses and modifies the function bodies.  Such
> +passes are useful for optimization at LGEN or LTRANS time and are
> +used, for example, to implement early optimization before writing
> +object files.  The simple inter-procedural passes can also be used
> +for easier prototyping and development of a new inter-procedural
> +pass.
> +
> +
> +@subsection Virtual clones
> +
> +One of the main challenges of introducing the WHOPR compilation
> +mode was addressing the interactions between optimization passes.
> +In LTO compilation mode, the passes are executed in a sequence,
> +each of which consists of analysis (or @emph{Generate summary}),
> +propagation (or @emph{Execute}) and @emph{Transform} stages.
> +Once the work of one pass is finished, the next pass sees the
> +updated program representation and can execute.  This makes the
> +individual passes dependent on each other.
> +
> +In WHOPR mode all passes first execute their @emph{Generate
> +summary} stage.  Then summary writing marks the end of the LGEN
> +stage.  At WPA time,
> +the summaries are read back into memory and all passes run the
> +@emph{Execute} stage.  Optimization summaries are streamed and
> +sent to LTRANS, where all the passes execute the @emph{Transform}
> +stage.
> +
> +Most optimization passes split naturally into analysis,
> +propagation and transformation stages.  But some do not.  The
> +main problem arises when one pass performs changes and the
> +following pass gets confused by seeing different callgraphs
> +between the @emph{Transform} stage and the @emph{Generate summary}
> +or @emph{Execute} stage.  This means that the passes are required
> +to communicate their decisions with each other.
> +
> +To facilitate this communication, the GCC callgraph
> +infrastructure implements @emph{virtual clones}, a method of
> +representing the changes performed by the optimization passes in
> +the callgraph without needing to update function bodies.
> +
> +A @emph{virtual clone} in the callgraph is a function that has no
> +associated body, just a description of how to create its body based
> +on a different function (which itself may be a virtual clone).
> +
> +The description of function modifications includes adjustments to
> +the function's signature (which allows, for example, removing or
> +adding function arguments), substitutions to perform on the
> +function body, and, for inlined functions, a pointer to the
> +function that it will be inlined into.
> +
> +It is also possible to redirect any edge of the callgraph from a
> +function to its virtual clone.  This implies updating of the call
> +site to adjust for the new function signature.
> +
> +Most of the transformations performed by inter-procedural
> +optimizations can be represented via virtual clones.  For
> +instance, a constant propagation pass can produce a virtual clone
> +of the function which replaces one of its arguments by a
> +constant.  The inliner can represent its decisions by producing a
> +clone of a function whose body will be later integrated into
> +a given function.
> +
> +Using @emph{virtual clones}, the program can be easily updated
> +during the @emph{Execute} stage, solving most of the pass interaction
> +problems that would otherwise occur during @emph{Transform}.
> +
> +Virtual clones are later materialized in the LTRANS stage and
> +turned into real functions.  Passes executed after the virtual
> +clones were introduced also perform their @emph{Transform} stage
> +on new functions, so for a pass there is no significant
> +difference between operating on a real function or a virtual
> +clone introduced before its @emph{Execute} stage.
> +
> +Optimization passes then work on virtual clones introduced before
> +their @emph{Execute} stage as if they were real functions.  The
> +only difference is that clones are not visible during the
> +@emph{Generate summary} stage.
> +
> +To keep function summaries updated, the callgraph interface
> +allows an optimizer to register a callback that is called every
> +time a new clone is introduced as well as when the actual
> +function or variable is generated or when a function or variable
> +is removed.  These hooks are registered in the @emph{Generate
> +summary} stage and allow the pass to keep its information intact
> +until the @emph{Execute} stage.  The same hooks can also be
> +registered during the @emph{Execute} stage to keep the
> +optimization summaries updated for the @emph{Transform} stage.
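
To make the virtual clone idea more concrete, here is a small C model. It
is only a sketch: struct toy_node and its fields are hypothetical and far
simpler than GCC's callgraph nodes. A clone starts with no body, just a
pointer to its origin and a recorded decision ("replace parameter k by a
constant"); the body is produced later, which stands in for
materialization at LTRANS time.

```c
#include <assert.h>
#include <stddef.h>

/* Toy "function body": computes x * k + offset.  */
struct toy_body {
  int has_const_k;  /* 1 once parameter k was replaced by a constant */
  int k;            /* the constant, valid when has_const_k is set */
  int offset;
};

/* Hypothetical callgraph node, not GCC's data structure.  */
struct toy_node {
  const char *name;
  int materialized;           /* 0: virtual clone, no body yet */
  struct toy_body body;       /* valid only when materialized */
  struct toy_node *clone_of;  /* origin; may itself be a virtual clone */
  int subst_k;                /* recorded decision: constant replacing k */
};

/* WPA-time view: record the decision, do not touch any body.  */
struct toy_node toy_make_virtual_clone(struct toy_node *origin,
                                       const char *name, int k)
{
  struct toy_node c = { name, 0, {0, 0, 0}, origin, k };
  return c;
}

/* LTRANS-time view: build the body from the origin, materializing the
   origin first if needed.  Returns 0 on success, -1 if there is no origin.  */
int toy_materialize(struct toy_node *n)
{
  if (n->materialized)
    return 0;
  if (n->clone_of == NULL || toy_materialize(n->clone_of) != 0)
    return -1;
  n->body = n->clone_of->body;
  n->body.has_const_k = 1;
  n->body.k = n->subst_k;
  n->materialized = 1;
  return 0;
}

/* Evaluate a materialized node; the caller's k is ignored when the clone
   has it baked in.  */
int toy_call(const struct toy_node *n, int x, int k)
{
  assert(n->materialized);
  int kk = n->body.has_const_k ? n->body.k : k;
  return x * kk + n->body.offset;
}
```

Redirecting a call edge to the clone then corresponds to calling toy_call
on the clone node instead of the original.
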
> +
> +@subsection IPA references
> +
> +GCC represents IPA references in the callgraph.  For a function
> +or variable @code{A}, the @emph{IPA reference} is a list of all
> +locations where the address of @code{A} is taken and, when
> +@code{A} is a variable, a list of all direct stores and reads
> +to/from @code{A}. References represent an oriented multi-graph on
> +the union of nodes of the callgraph and the varpool.  See
> +@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
> +and
> +@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
> +for details.
> +
> +@subsection Jump functions
> +Suppose that an optimization pass sees a function @code{A} and it
> +knows the values of (some of) its arguments.  The @emph{jump
> +function} describes the value of a parameter of a given function
> +call in function @code{A} based on this knowledge.
> +
> +Jump functions are used by several optimizations, such as the
> +inter-procedural constant propagation pass and the
> +devirtualization pass.  The inliner also uses jump functions to
> +perform inlining of callbacks.
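
A simplified C sketch of a jump function may help here. The three kinds
below (unknown, constant, pass-through) and all toy_* names are
assumptions chosen for illustration; they are not claimed to match GCC's
internal representation. Consider a hypothetical caller
`void A(int x, int y) { B(x, 7, y + rand()); }`: the call to B would get a
pass-through of A's parameter 0, the constant 7, and an unknown value.

```c
#include <assert.h>

enum toy_jf_kind { TOY_JF_UNKNOWN, TOY_JF_CONST, TOY_JF_PASS_THROUGH };

/* Describes one actual argument of a call site inside function A.  */
struct toy_jump_func {
  enum toy_jf_kind kind;
  int value;         /* TOY_JF_CONST: the constant passed */
  int formal_index;  /* TOY_JF_PASS_THROUGH: which parameter of A is forwarded */
};

/* What is known about a value: either nothing, or an exact constant.  */
struct toy_lattice {
  int known;
  int value;
};

/* Given what is known about A's parameters, compute what is known about
   the argument described by jf.  */
struct toy_lattice toy_eval_jump_func(const struct toy_jump_func *jf,
                                      const struct toy_lattice *caller_params,
                                      int n_caller_params)
{
  struct toy_lattice unknown = {0, 0};
  switch (jf->kind) {
  case TOY_JF_CONST: {
    struct toy_lattice r = {1, jf->value};
    return r;
  }
  case TOY_JF_PASS_THROUGH:
    assert(jf->formal_index >= 0 && jf->formal_index < n_caller_params);
    return caller_params[jf->formal_index];
  default:
    return unknown;
  }
}
```

A constant propagation pass could apply this repeatedly along call edges
until the known values stop changing.
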
> +
> +@section Whole program assumptions, linker plugin and symbol visibilities
> +
> +Link-time optimization gives relatively minor benefits when used
> +alone.  The problem is that propagation of inter-procedural
> +information does not work well across functions and variables
> +that are called or referenced by other compilation units (such as
> +from a dynamically linked library). We say that such functions
> +and variables are @emph{externally visible}.
> +
> +To make the situation even more difficult, many applications
> +organize themselves as a set of shared libraries, and the default
> +ELF visibility rules allow one to overwrite any externally
> +visible symbol with a different symbol at runtime.  This
> +basically disables any optimizations across such functions and
> +variables, because the compiler cannot be sure that the function
> +body it is seeing is the same function body that will be used at
> +runtime.  Any function or variable not declared @code{static} in
> +the sources degrades the quality of inter-procedural
> +optimization.
> +
> +To avoid this problem the compiler must assume that it sees the
> +whole program when doing link-time optimization.  Strictly
> +speaking, the whole program is rarely visible even at link-time.
> +Standard system libraries are usually linked dynamically or not
> +provided with the link-time information.  In GCC, the whole
> +program option (@option{-fwhole-program}) asserts that every
> +function and variable defined in the current compilation
> +unit is static, except for function @code{main} (note: at
> +link-time, the current unit is the union of all objects compiled
> +with LTO).  Since some functions and variables need to
> +be referenced externally, for example by another DSO or from an
> +assembler file, GCC also provides the function and variable
> +attribute @code{externally_visible} which can be used to disable
> +the effect of @option{-fwhole-program} on a specific symbol.
> +
> +The whole program mode assumptions are slightly more complex in
> +C++, where inline functions in headers are put into @emph{COMDAT}
> +sections. COMDAT functions and variables can be defined by
> +multiple object files and their bodies are unified at link-time
> +and dynamic link-time.  COMDAT functions are changed to local only
> +when their address is not taken and thus un-sharing them with a
> +library is not harmful.  COMDAT variables always remain externally
> +visible; however, for read-only variables it is assumed that their
> +initializers cannot be overwritten by a different value.
> +
> +GCC provides the function and variable attribute
> +@code{visibility} that can be used to specify the visibility of
> +externally visible symbols (or alternatively an
> +@option{-fvisibility} command line option).  ELF defines
> +the @code{default}, @code{protected}, @code{hidden} and
> +@code{internal} visibilities.
> +
> +The most commonly used visibility is @code{hidden}. It
> +specifies that the symbol cannot be referenced from outside of
> +the current shared library. Unfortunately, this information
> +cannot be used directly by the link-time optimization in the
> +compiler since the whole shared library also might contain
> +non-LTO objects and those are not visible to the compiler.
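
For readers unfamiliar with the attribute, a short C example of how
`visibility` is typically spelled in source. The function names are made
up for this illustration; the attribute syntax itself is the documented
GCC one. When built into a shared library, toy_helper is not exported
while toy_api is.

```c
#include <assert.h>

/* Hidden: callable from other objects in the same shared library,
   but not exported from it.  */
__attribute__((visibility("hidden")))
int toy_helper(int x)
{
  return x * 2;
}

/* Default visibility: part of the library's exported interface.  */
__attribute__((visibility("default")))
int toy_api(int x)
{
  assert(x >= 0);
  return toy_helper(x) + 1;
}
```

The attribute only changes how the symbol is exported; calling the
functions from within the same link unit behaves as usual.
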
> +
> +GCC solves this problem using linker plugins.  A @emph{linker
> +plugin} is an interface to the linker that allows an external
> +program to claim the ownership of a given object file.  The linker
> +then performs the linking procedure by querying the plugin about
> +the symbol table of the claimed objects and once the linking
> +decisions are complete, the plugin is allowed to provide the
> +final object file before the actual linking is made.  The linker
> +plugin obtains the symbol resolution information which specifies
> +which symbols provided by the claimed objects are bound from the
> +rest of a binary being linked.
> +
> +Currently, the linker plugin works only in combination
> +with the Gold linker, but a GNU ld implementation is under
> +development.
> +
> +GCC is designed to be independent of the rest of the toolchain
> +and aims to support linkers without plugin support.  For this
> +reason it does not use the linker plugin by default.  Instead,
> +the object files are examined by @command{collect2} before being
> +passed to the linker and objects found to have LTO sections are
> +passed to @command{lto1} first.  This mode does not work for
> +library archives. The decision on what object files from the
> +archive are needed depends on the actual linking and thus GCC
> +would have to implement the linker itself.  The resolution
> +information is missing too and thus GCC needs to make an educated
> +guess based on @option{-fwhole-program}.  Without the linker
> +plugin GCC also assumes that symbols are declared @code{hidden}
> +and not referred to by non-LTO code by default.
> +
> +@section Internal flags controlling @code{lto1}
> +
> +The following flags are passed into @command{lto1} and are not
> +meant to be used directly from the command line.
> +
> +@itemize
> +@item -fwpa
> +@opindex fwpa
> +This option runs the serial part of the link-time optimizer
> +performing the inter-procedural propagation (WPA mode).  The
> +compiler reads in summary information from all inputs and
> +performs an analysis based on summary information only.  It
> +generates object files for subsequent runs of the link-time
> +optimizer where individual object files are optimized using both
> +summary information from the WPA mode and the actual function
> +bodies.  It then drives the LTRANS phase.
> +
> +@item -fltrans
> +@opindex fltrans
> +This option runs the link-time optimizer in the
> +local-transformation (LTRANS) mode, which reads in output from a
> +previous run of the LTO in WPA mode. In the LTRANS mode, LTO
> +optimizes an object and produces the final assembly.
> +
> +@item -fltrans-output-list=@var{file}
> +@opindex fltrans-output-list
> +This option specifies a file to which the names of LTRANS output
> +files are written.  This option is only meaningful in conjunction
> +with @option{-fwpa}.
> +@end itemize
> Index: doc/gccint.texi
> ===================================================================
> --- doc/gccint.texi     (revision 166733)
> +++ doc/gccint.texi     (working copy)
> @@ -123,6 +123,7 @@ Additional tutorial information is linke
>  * Header Dirs::     Understanding the standard header file directories.
>  * Type Information:: GCC's memory management; generating type information.
>  * Plugins::         Extending the compiler with plugins.
> +* LTO::             Using Link-Time Optimization.
>
>  * Funding::         How to help assure funding for free software.
>  * GNU Project::     The GNU Project and GNU/Linux.
> @@ -158,6 +159,7 @@ Additional tutorial information is linke
>  @include headerdirs.texi
>  @include gty.texi
>  @include plugins.texi
> +@include lto.texi
>
>  @include funding.texi
>  @include gnu.texi
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi     (revision 166733)
> +++ doc/invoke.texi     (working copy)
> @@ -356,8 +356,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
>  -fivopts -fkeep-inline-functions -fkeep-static-consts @gol
>  -floop-block -floop-flatten -floop-interchange -floop-strip-mine @gol
> --floop-parallelize-all -flto -flto-compression-level -flto-partition=@var{alg} @gol
> --flto-report -fltrans -fltrans-output-list -fmerge-all-constants @gol
> +-floop-parallelize-all -flto -flto-compression-level
> +-flto-partition=@var{alg} -flto-report -fmerge-all-constants @gol
>  -fmerge-constants -fmodulo-sched -fmodulo-sched-allow-regmoves @gol
>  -fmove-loop-invariants fmudflap -fmudflapir -fmudflapth -fno-branch-count-reg @gol
>  -fno-default-inline @gol
> @@ -399,7 +399,7 @@ Objective-C and Objective-C++ Dialects}.
>  -funit-at-a-time -funroll-all-loops -funroll-loops @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
> --fwhole-program -fwhopr[=@var{n}] -fwpa -fuse-linker-plugin @gol
> +-fwhole-program -fwpa -fuse-linker-plugin @gol
>  --param @var{name}=@var{value}
>  -O  -O0  -O1  -O2  -O3  -Os -Ofast}
>
> @@ -7489,6 +7489,16 @@ The only important thing to keep in mind
>  optimizations the @option{-flto} flag needs to be passed to both the
>  compile and the link commands.
>
> +To make whole program optimization effective, it is necessary to make
> +certain whole program assumptions.  The compiler needs to know
> +what functions and variables can be accessed by libraries and runtime
> +outside of the link time optimized unit.  When supported by the linker,
> +the linker plugin (see @option{-fuse-linker-plugin}) passes to the
> +compiler information about used and externally visible symbols.  When
> +the linker plugin is not available, @option{-fwhole-program} should be
> +used to allow the compiler to make these assumptions, which will lead
> +to more aggressive optimization decisions.
> +
>  Note that when a file is compiled with @option{-flto}, the generated
>  object file will be larger than a regular object file because it will
>  contain GIMPLE bytecodes and the usual final code.  This means that
> @@ -7601,16 +7611,18 @@ GCC will not work with an older/newer ve
>
>  Link time optimization does not play well with generating debugging
>  information.  Combining @option{-flto} with
> -@option{-g} is experimental.
> +@option{-g} is currently experimental and expected to produce wrong
> +results.
>
> -If you specify the optional @var{n} the link stage is executed in
> -parallel using @var{n} parallel jobs by utilizing an installed
> -@command{make} program.  The environment variable @env{MAKE} may be
> -used to override the program used.
> +If you specify the optional @var{n}, the optimization and code
> +generation done at link time is executed in parallel using @var{n}
> +parallel jobs by utilizing an installed @command{make} program.  The
> +environment variable @env{MAKE} may be used to override the program
> +used.  The default value for @var{n} is 1.
>
> -You can also specify @option{-fwhopr=jobserver} to use GNU make's
> +You can also specify @option{-flto=jobserver} to use GNU make's
>  job server mode to determine the number of parallel jobs. This
> -is useful when the Makefile calling GCC is already parallel.
> +is useful when the Makefile calling GCC is already executing in parallel.
>  The parent Makefile will need a @samp{+} prepended to the command recipe
>  for this to work. This will likely only work if @env{MAKE} is
>  GNU make.
> @@ -7619,53 +7631,17 @@ This option is disabled by default.
>
>  @item -flto-partition=@var{alg}
>  @opindex flto-partition
> -Specify partitioning algorithm used by @option{-fwhopr} mode.  The value is
> -either @code{1to1} to specify partitioning corresponding to source files
> -or @code{balanced} to specify partitioning into, if possible, equally sized
> -chunks.  Specifying @code{none} as an algorithm disables partitioning
> -and streaming completely.
> -The default value is @code{balanced}.
> -
> -@item -fwpa
> -@opindex fwpa
> -This is an internal option used by GCC when compiling with
> -@option{-fwhopr}.  You should never need to use it.
> -
> -This option runs the link-time optimizer in the whole-program-analysis
> -(WPA) mode, which reads in summary information from all inputs and
> -performs a whole-program analysis based on summary information only.
> -It generates object files for subsequent runs of the link-time
> -optimizer where individual object files are optimized using both
> -summary information from the WPA mode and the actual function bodies.
> -It then drives the LTRANS phase.
> -
> -Disabled by default.
> -
> -@item -fltrans
> -@opindex fltrans
> -This is an internal option used by GCC when compiling with
> -@option{-fwhopr}.  You should never need to use it.
> -
> -This option runs the link-time optimizer in the local-transformation (LTRANS)
> -mode, which reads in output from a previous run of the LTO in WPA mode.
> -In the LTRANS mode, LTO optimizes an object and produces the final assembly.
> -
> -Disabled by default.
> -
> -@item -fltrans-output-list=@var{file}
> -@opindex fltrans-output-list
> -This is an internal option used by GCC when compiling with
> -@option{-fwhopr}.  You should never need to use it.
> -
> -This option specifies a file to which the names of LTRANS output files are
> -written.  This option is only meaningful in conjunction with @option{-fwpa}.
> -
> -Disabled by default.
> +Specify the partitioning algorithm used by the link time optimizer.
> +The value is either @code{1to1} to specify a partitioning mirroring
> +the original source files or @code{balanced} to specify partitioning
> +into equally sized chunks (whenever possible).  Specifying @code{none}
> +as an algorithm disables partitioning and streaming completely. The
> +default value is @code{balanced}.
>
>  @item -flto-compression-level=@var{n}
>  This option specifies the level of compression used for intermediate
>  language written to LTO object files, and is only meaningful in
> -conjunction with LTO mode (@option{-fwhopr}, @option{-flto}).  Valid
> +conjunction with LTO mode (@option{-flto}).  Valid
>  values are 0 (no compression) to 9 (maximum compression).  Values
>  outside this range are clamped to either 0 or 9.  If the option is not
>  given, a default balanced compression setting is used.
> @@ -7674,7 +7650,7 @@ given, a default balanced compression se
>  Prints a report with internal details on the workings of the link-time
>  optimizer.  The contents of this report vary from version to version,
>  it is meant to be useful to GCC developers when processing object
> -files in LTO mode (via @option{-fwhopr} or @option{-flto}).
> +files in LTO mode (via @option{-flto}).
>
>  Disabled by default.
>
>
>
Dave Korn Nov. 15, 2010, 7:28 a.m. UTC | #2
On 15/11/2010 06:23, Diego Novillo wrote:

  Hi Diego, here's a handful of minor grammar/phrasing/naming nits I've
spotted so far:

> +optimized builds.  A, perhaps surprising, side effect of this feature
> +is that any mistake in the toolchain that leads to LTO information not
> +being used (e.g. an older @code{libtool} calling @code{ld} directly).

  Sentence truncated?  Or perhaps the second "that" should not be present?

> +This is both an advantage, as the system is more robust, and a
> +disadvantage, as the user is not informed that the optimization has
> +been disabled.
> +
> +The current implementation only produces ``fat'' objects, effectively
> +doubling compilation time and increasing file sizes up to 5x the
> +original size.  This hides the problem that some tools, such as
> +@code{ar} and @code{nm}, need to understand symbol tables of LTO
> +sections.

  The fact that the files are bigger "hides" the problem?  Surely not!
Perhaps it is the fact that there are ordinary object code sections in those
files and everything falls back to using them when it doesn't understand LTO
that is what hides the problem that some of the tools didn't understand LTO?

> +@item @code{pass_ipa_lto_finish_out}
> +This pass executes the function @code{produce_asm_for_decls} in
> +@file{lto-streamer-out.c}, which takes the memory image built in the
> +previous pass and encodes it in the corresponding ELF file sections.

  Not necessarily "ELF ..." any more, just general object file sections.

> +@section LTO file sections
> +
> +LTO information is stored in several ELF sections inside object files.

  Again, just sections now, not necessarily ELF ones.

> +This table replaces the ELF symbol table for functions and variables

  Not necessarily ELF symbol tables any more ...

> +Using @emph{virtual clones}, the program can be easily updated
> +during the @emph{Execute} stage, solving most of pass interactions
> +problems that would otherwise occur during @emph{Transform}.

  Most of /the/ pass interaction (no s) problems?

> +
> +@section Whole program assumptions, linker plugin and symbol visibilities

  I only got up to about this point, then stopped because I've been up all
night and suddenly realised I'm crashing.  Will give the remainder a look over
after a few hours' sleep.

    cheers,
      DaveK
Richard Biener Nov. 15, 2010, 10:45 a.m. UTC | #3
On Mon, 15 Nov 2010, Diego Novillo wrote:

> This patch adds internal documentation for LTO.  Much of it comes
> from Honza's GCC Summit paper, wiki pages and source comments.  I
> also moved the internal flags from invoke.texi and added several
> pointers to the source code.
> 
> It can still use more information, but this is a start.
> 
> Tested with make doc, make pdf and visual inspection.
> 
> OK for mainline?

Wow, thanks for writing this up.  The patch is ok with or without
the various suggestions already made - incremental improvements
can always be done as followup.

Thanks,
Richard.

> 
> Diego.
> 
> 2010-11-14  Jan Hubicka  <jh@suse.cz>
> 	    Diego Novillo  <dnovillo@google.com>
> 
> 	PR lto/41528
> 	* doc/lto.texi: Add.
> 	* doc/gccint.texi: Add reference to lto.texi.
> 	* doc/invoke.texi: Update user documentation for LTO.
> 	Move internal flags to lto.texi
> 
> Index: doc/lto.texi
> ===================================================================
> --- doc/lto.texi	(revision 0)
> +++ doc/lto.texi	(revision 0)
> @@ -0,0 +1,568 @@
> +@c Copyright (c) 2010 Free Software Foundation, Inc.
> +@c Free Software Foundation, Inc.
> +@c This is part of the GCC manual.
> +@c For copying conditions, see the file gcc.texi.
> +@c Contributed by Jan Hubicka <jh@suse.cz> and
> +@c Diego Novillo <dnovillo@google.com>
> +
> +@node LTO
> +@chapter Link Time Optimization
> +@cindex lto
> +@cindex whopr
> +@cindex wpa
> +@cindex ltrans
> +
> +@section Design Overview
> +
> +Link time optimization is implemented as a GCC front end for a
> +bytecode representation of GIMPLE that is emitted in special sections
> +of @code{.o} files.  Currently, LTO support is enabled in most
> +ELF-based systems, as well as darwin, cygwin and mingw systems.
> +
> +Since GIMPLE bytecode is saved alongside final object code, object
> +files generated with LTO support are larger than regular object files.
> +This ``fat'' object format makes it easy to integrate LTO into
> +existing build systems, as one can, for instance, produce archives of
> +the files.  Additionally, one might be able to ship one set of fat
> +objects which could be used both for development and the production of
> +optimized builds.  A, perhaps surprising, side effect of this feature
> +is that any mistake in the toolchain that leads to LTO information not
> +being used (e.g. an older @code{libtool} calling @code{ld} directly).
> +This is both an advantage, as the system is more robust, and a
> +disadvantage, as the user is not informed that the optimization has
> +been disabled.
> +
> +The current implementation only produces ``fat'' objects, effectively
> +doubling compilation time and increasing file sizes up to 5x the
> +original size.  This hides the problem that some tools, such as
> +@code{ar} and @code{nm}, need to understand symbol tables of LTO
> +sections.  These tools were extended to use the plugin infrastructure,
> +and with these problems solved, GCC will also support ``slim'' objects
> +consisting of the intermediate code alone.
> +
> +At the highest level, LTO splits the compiler in two.  The first half
> +(the ``writer'') produces a streaming representation of all the
> +internal data structures needed to optimize and generate code.  This
> +includes declarations, types, the callgraph and the GIMPLE representation
> +of function bodies.
> +
> +When @option{-flto} is given during compilation of a source file, the
> +pass manager executes all the passes in @code{all_lto_gen_passes}.
> +Currently, this phase is composed of two IPA passes:
> +
> +@itemize @bullet
> +@item @code{pass_ipa_lto_gimple_out}
> +This pass executes the function @code{lto_output} in
> +@file{lto-streamer-out.c}, which traverses the call graph encoding
> +every reachable declaration, type and function. This generates a
> +memory representation of all the file sections described below.
> +
> +@item @code{pass_ipa_lto_finish_out}
> +This pass executes the function @code{produce_asm_for_decls} in
> +@file{lto-streamer-out.c}, which takes the memory image built in the
> +previous pass and encodes it in the corresponding ELF file sections.
> +@end itemize
> +
> +The second half of LTO support is the ``reader''.  This is implemented
> +as the GCC front end @file{lto1} in @file{lto/lto.c}.  When
> +@file{collect2} detects a link set of @code{.o}/@code{.a} files with
> +LTO information and the @option{-flto} is enabled, it invokes
> +@file{lto1} which reads the set of files and aggregates them into a
> +single translation unit for optimization.  The main entry point for
> +the reader is @file{lto/lto.c}:@code{lto_main}.
> +
> +@subsection LTO modes of operation
> +
> +One of the main goals of the GCC link-time infrastructure was to allow
> +effective compilation of large programs.  For this reason GCC implements two
> +link-time compilation modes.
> +
> +@enumerate
> +@item	@emph{LTO mode}, in which the whole program is read into the
> +compiler at link-time and optimized in a similar way as if it
> +were a single source-level compilation unit.
> +
> +@item	@emph{WHOPR or partitioned mode}, designed to utilize multiple
> +CPUs and/or a distributed compilation environment to quickly link
> +large applications.  WHOPR stands for WHOle Program optimizeR (not to
> +be confused with the semantics of @option{-fwhole-program}).  It
> +partitions the aggregated callgraph from many different @code{.o}
> +files and distributes the compilation of the sub-graphs to different
> +CPUs.
> +
> +Note that distributed compilation is not implemented yet, but since
> +the parallelism is facilitated via generating a @code{Makefile}, it
> +would be easy to implement.
> +@end enumerate
> +
> +WHOPR splits LTO into three main stages:
> +@enumerate
> +@item Local generation (LGEN)
> +This stage executes in parallel. Every file in the program is compiled
> +into the intermediate language and packaged together with the local
> +call-graph and summary information.  This stage is the same for both
> +the LTO and WHOPR compilation modes.
> +
> +@item Whole Program Analysis (WPA)
> +WPA is performed sequentially. The global call-graph is generated, and
> +a global analysis procedure makes transformation decisions. The global
> +call-graph is partitioned to facilitate parallel optimization during
> +phase 3. The results of the WPA stage are stored into new object files
> +which contain the partitions of the program expressed in the intermediate
> +language and the optimization decisions.
> +
> +@item Local transformations (LTRANS)
> +This stage executes in parallel. All the decisions made during phase 2
> +are implemented locally in each partitioned object file, and the final
> +object code is generated. Optimizations which cannot be decided
> +efficiently during phase 2 may be performed on the local
> +call-graph partitions.
> +@end enumerate
> +
> +WHOPR can be seen as an extension of the usual LTO mode of
> +compilation.  In LTO, WPA and LTRANS are executed within a single
> +execution of the compiler, after the whole program has been read into
> +memory.
> +
> +When compiling in WHOPR mode the callgraph is partitioned during
> +the WPA stage.  The whole program is split into a given number of
> +partitions of roughly the same size.  The compiler tries to
> +minimize the number of references which cross partition boundaries.
> +The main advantage of WHOPR is to allow the parallel execution of
> +LTRANS stages, which are the most time-consuming part of the
> +compilation process.  Additionally, it avoids the need to load the
> +whole program into memory.
> +
> +
> +@section LTO file sections
> +
> +LTO information is stored in several ELF sections inside object files.
> +Data structures and enum codes for sections are defined in
> +@file{lto-streamer.h}.
> +
> +These sections are emitted from @file{lto-streamer-out.c} and mapped
> +in all at once from @file{lto/lto.c}:@code{lto_file_read}.  The
> +individual functions dealing with the reading/writing of each section
> +are described below.
> +
> +@itemize @bullet
> +@item Command line options (@code{.gnu.lto_.opts})
> +
> +This section contains the command line options used to generate the
> +object files.  This is used at link-time to determine the optimization
> +level and other settings when they are not explicitly specified at the
> +linker command line.
> +
> +Currently, GCC does not support combining LTO object files compiled
> +with different sets of command line options into a single binary.
> +At link-time, the options given on the command line and the options
> +saved on all the files in a link-time set are applied globally.  No
> +attempt is made at validating the combination of flags (other than the
> +usual validation done by option processing).  This is implemented in
> +@file{lto/lto.c}:@code{lto_read_all_file_options}.
> +
> +
> +@item Symbol table (@code{.gnu.lto_.symtab})
> +
> +This table replaces the ELF symbol table for functions and variables
> +represented in the LTO IL. Symbols used and exported by the optimized
> +assembly code of ``fat'' objects might not match the ones used and
> +exported by the intermediate code.  This table is necessary because
> +the intermediate code is less optimized and thus requires a separate
> +symbol table.
> +
> +Additionally, the binary code in the ``fat'' object may lack a call
> +to a function if the call was optimized out at compilation time
> +after the intermediate language was streamed out.  In some special
> +cases, the same optimization may not happen during link-time
> +optimization.  This would lead to an undefined symbol if only one
> +symbol table was used.
> +
> +The symbol table is emitted in
> +@file{lto-streamer-out.c}:@code{produce_symtab}.
> +
> +
> +@item Global declarations and types (@code{.gnu.lto_.decls})
> +
> +This section contains an intermediate language dump of all
> +declarations and types required to represent the callgraph, static
> +variables and top-level debug info.
> +
> +The contents of this section are emitted in
> +@file{lto-streamer-out.c}:@code{produce_asm_for_decls}.  Types and
> +symbols are emitted in a topological order that preserves the sharing
> +of pointers when the file is read back in
> +(@file{lto.c}:@code{read_cgraph_and_symbols}).
> +
> +
> +@item The callgraph (@code{.gnu.lto_.cgraph})
> +
> +This section contains the basic data structure used by the GCC
> +inter-procedural optimization infrastructure. This section stores an
> +annotated multi-graph which represents the functions and call sites as
> +well as the variables, aliases and top-level @code{asm} statements.
> +
> +This section is emitted in
> +@file{lto-streamer-out.c}:@code{output_cgraph} and read in
> +@file{lto-cgraph.c}:@code{input_cgraph}.
> +
> +
> +@item IPA references (@code{.gnu.lto_.refs})
> +
> +This section contains references between functions and static
> +variables.  It is emitted by @file{lto-cgraph.c}:@code{output_refs}
> +and read by @file{lto-cgraph.c}:@code{input_refs}.
> +
> +
> +@item Function bodies (@code{.gnu.lto_.function_body.<name>})
> +
> +This section contains function bodies in the intermediate language
> +representation. Every function body is in a separate section to allow
> +copying of the section independently to different object files or
> +reading the function on demand.
> +
> +Functions are emitted in
> +@file{lto-streamer-out.c}:@code{output_function} and read in
> +@file{lto-streamer-in.c}:@code{input_function}.
> +
> +
> +@item Static variable initializers (@code{.gnu.lto_.vars})
> +
> +This section contains all the symbols in the global variable pool.  It
> +is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
> +@file{lto-cgraph.c}:@code{input_cgraph}.
> +
> +@item Summaries and optimization summaries used by IPA passes
> +(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
> +@code{pureconst} or @code{reference})
> +
> +These sections are used by IPA passes that need to emit summary
> +information during LTO generation to be read and aggregated at
> +link time.  Each pass is responsible for implementing two pass manager
> +hooks: one for writing the summary and another for reading it in.  The
> +format of these sections is entirely up to each individual pass.  The
> +only requirement is that the writer and reader hooks agree on the
> +format.
> +@end itemize
> +
> +
> +@section Using summary information in IPA passes
> +
> +Programs are represented internally as a @emph{callgraph} (a
> +multi-graph where nodes are functions and edges are call sites)
> +and a @emph{varpool} (a list of static and external variables in
> +the program).
> +
> +The inter-procedural optimization is organized as a sequence of
> +individual passes, which operate on the callgraph and the
> +varpool.  To make the implementation of WHOPR possible, every
> +inter-procedural optimization pass is split into several stages
> +that are executed at different times during WHOPR compilation:
> +
> +@itemize @bullet
> +@item LGEN time
> +@enumerate
> +@item @emph{Generate summary} (@code{generate_summary} in
> +@code{struct ipa_opt_pass_d}). This stage analyzes every function
> +body and variable initializer and stores relevant
> +information into a pass-specific data structure.
> +
> +@item @emph{Write summary} (@code{write_summary} in
> +@code{struct ipa_opt_pass_d}).  This stage writes all the
> +pass-specific information generated by @code{generate_summary}.
> +Summaries go into their own @code{LTO_section_*} sections that
> +have to be declared in @file{lto-streamer.h}:@code{enum
> +lto_section_type}.  A new section is created by calling
> +@code{create_output_block} and data can be written using the
> +@code{lto_output_*} routines.
> +@end enumerate
> +
> +@item WPA time
> +@enumerate
> +@item @emph{Read summary} (@code{read_summary} in
> +@code{struct ipa_opt_pass_d}). This stage reads all the
> +pass-specific information in exactly the same order that it was
> +written by @code{write_summary}.
> +
> +@item @emph{Execute} (@code{execute} in @code{struct
> +opt_pass}).  This performs inter-procedural propagation.  This
> +must be done without actual access to the individual function
> +bodies or variable initializers.  Typically, this results in a
> +transitive closure operation over the summary information of all
> +the nodes in the callgraph.
> +
> +@item @emph{Write optimization summary}
> +(@code{write_optimization_summary} in @code{struct
> +ipa_opt_pass_d}).  This writes the result of the inter-procedural
> +propagation into the object file.  This can use the same data
> +structures and helper routines used in @code{write_summary}.
> +@end enumerate
> +
> +@item LTRANS time
> +@enumerate
> +@item @emph{Read optimization summary}
> +(@code{read_optimization_summary} in @code{struct
> +ipa_opt_pass_d}).  The counterpart to
> +@code{write_optimization_summary}.  This reads the interprocedural
> +optimization decisions in exactly the same format emitted by
> +@code{write_optimization_summary}.
> +
> +@item @emph{Transform} (@code{function_transform} and
> +@code{variable_transform} in @code{struct ipa_opt_pass_d}).
> +The actual function bodies and variable initializers are updated
> +based on the information passed down from the @emph{Execute} stage.
> +@end enumerate
> +@end itemize
> +
> +The implementation of the inter-procedural passes is shared
> +between LTO, WHOPR and classic non-LTO compilation.
> +
> +@itemize
> +@item During the traditional file-by-file mode every pass executes its
> +own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
> +stages within the single execution context of the compiler.
> +
> +@item In LTO compilation mode, every pass uses @emph{Generate
> +summary} and @emph{Write summary} stages at compilation time,
> +while the @emph{Read summary}, @emph{Execute}, and
> +@emph{Transform} stages are executed at link time.
> +
> +@item In WHOPR mode all stages are used.
> +@end itemize
> +
> +To simplify development, the GCC pass manager differentiates
> +between normal inter-procedural passes and small inter-procedural
> +passes.  A @emph{small inter-procedural pass}
> +(@code{SIMPLE_IPA_PASS}) is a pass that does
> +everything at once and thus it cannot be executed during WPA in
> +WHOPR mode. It defines only the @emph{Execute} stage and during
> +this stage it accesses and modifies the function bodies.  Such
> +passes are useful for optimization at LGEN or LTRANS time and are
> +used, for example, to implement early optimization before writing
> +object files.  The simple inter-procedural passes can also be used
> +for easier prototyping and development of a new inter-procedural
> +pass.
> +
> +
> +@subsection Virtual clones
> +
> +One of the main challenges of introducing the WHOPR compilation
> +mode was addressing the interactions between optimization passes.
> +In LTO compilation mode, the passes are executed in a sequence,
> +each of which consists of analysis (or @emph{Generate summary}),
> +propagation (or @emph{Execute}) and @emph{Transform} stages.
> +Once the work of one pass is finished, the next pass sees the
> +updated program representation and can execute.  This makes the
> +individual passes dependent on each other.
> +
> +In WHOPR mode all passes first execute their @emph{Generate
> +summary} stage.  Then summary writing marks the end of the LGEN
> +stage.  At WPA time,
> +the summaries are read back into memory and all passes run the
> +@emph{Execute} stage.  Optimization summaries are streamed and
> +sent to LTRANS, where all the passes execute the @emph{Transform}
> +stage.
> +
> +Most optimization passes split naturally into analysis,
> +propagation and transformation stages.  But some do not.  The
> +main problem arises when one pass performs changes and the
> +following pass gets confused by seeing different callgraphs
> +between the @emph{Transform} stage and the @emph{Generate summary}
> +or @emph{Execute} stage.  This means that the passes are required
> +to communicate their decisions with each other.
> +
> +To facilitate this communication, the GCC callgraph
> +infrastructure implements @emph{virtual clones}, a method of
> +representing the changes performed by the optimization passes in
> +the callgraph without needing to update function bodies.
> +
> +A @emph{virtual clone} in the callgraph is a function that has no
> +associated body, just a description of how to create its body based
> +on a different function (which itself may be a virtual clone).
> +
> +The description of function modifications includes adjustments to
> +the function's signature (which allows, for example, removing or
> +adding function arguments), substitutions to perform on the
> +function body, and, for inlined functions, a pointer to the
> +function that it will be inlined into.
> +
> +It is also possible to redirect any edge of the callgraph from a
> +function to its virtual clone.  This implies updating of the call
> +site to adjust for the new function signature.
> +
> +Most of the transformations performed by inter-procedural
> +optimizations can be represented via virtual clones.  For
> +instance, a constant propagation pass can produce a virtual clone
> +of the function which replaces one of its arguments by a
> +constant.  The inliner can represent its decisions by producing a
> +clone of a function whose body will be later integrated into
> +a given function.
> +
> +Using @emph{virtual clones}, the program can be easily updated
> +during the @emph{Execute} stage, solving most of pass interactions
> +problems that would otherwise occur during @emph{Transform}.
> +
> +Virtual clones are later materialized in the LTRANS stage and
> +turned into real functions.  Passes executed after the virtual
> +clones were introduced also perform their @emph{Transform} stage
> +on new functions, so for a pass there is no significant
> +difference between operating on a real function or a virtual
> +clone introduced before its @emph{Execute} stage.
> +
> +Optimization passes then work on virtual clones introduced before
> +their @emph{Execute} stage as if they were real functions.  The
> +only difference is that clones are not visible during the
> +@emph{Generate summary} stage.
> +
> +To keep function summaries updated, the callgraph interface
> +allows an optimizer to register a callback that is called every
> +time a new clone is introduced as well as when the actual
> +function or variable is generated or when a function or variable
> +is removed.  These hooks are registered in the @emph{Generate
> +summary} stage and allow the pass to keep its information intact
> +until the @emph{Execute} stage.  The same hooks can also be
> +registered during the @emph{Execute} stage to keep the
> +optimization summaries updated for the @emph{Transform} stage.
> +
> +@subsection IPA references
> +
> +GCC represents IPA references in the callgraph.  For a function
> +or variable @code{A}, the @emph{IPA reference} is a list of all
> +locations where the address of @code{A} is taken and, when
> +@code{A} is a variable, a list of all direct stores and reads
> +to/from @code{A}. References represent an oriented multi-graph on
> +the union of nodes of the callgraph and the varpool.  See
> +@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
> +and
> +@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
> +for details.
> +
> +@subsection Jump functions
> +Suppose that an optimization pass sees a function @code{A} and it
> +knows the values of (some of) its arguments.  The @emph{jump
> +function} describes the value of a parameter of a given function
> +call in function @code{A} based on this knowledge.
> +
> +Jump functions are used by several optimizations, such as the
> +inter-procedural constant propagation pass and the
> +devirtualization pass.  The inliner also uses jump functions to
> +perform inlining of callbacks.
> +
> +@section Whole program assumptions, linker plugin and symbol visibilities
> +
> +Link-time optimization gives relatively minor benefits when used
> +alone.  The problem is that propagation of inter-procedural
> +information does not work well across functions and variables
> +that are called or referenced by other compilation units (such as
> +from a dynamically linked library). We say that such functions
> +and variables are @emph{externally visible}.
> +
> +To make the situation even more difficult, many applications
> +organize themselves as a set of shared libraries, and the default
> +ELF visibility rules allow one to overwrite any externally
> +visible symbol with a different symbol at runtime.  This
> +basically disables any optimizations across such functions and
> +variables, because the compiler cannot be sure that the function
> +body it is seeing is the same function body that will be used at
> +runtime.  Any function or variable not declared @code{static} in
> +the sources degrades the quality of inter-procedural
> +optimization.
> +
> +To avoid this problem the compiler must assume that it sees the
> +whole program when doing link-time optimization.  Strictly
> +speaking, the whole program is rarely visible even at link-time.
> +Standard system libraries are usually linked dynamically or not
> +provided with the link-time information.  In GCC, the whole
> +program option (@option{-fwhole-program}) asserts that every
> +function and variable defined in the current compilation
> +unit is static, except for function @code{main} (note: at
> +link-time, the current unit is the union of all objects compiled
> +with LTO).  Since some functions and variables need to
> +be referenced externally, for example by another DSO or from an
> +assembler file, GCC also provides the function and variable
> +attribute @code{externally_visible} which can be used to disable
> +the effect of @option{-fwhole-program} on a specific symbol.
> +
> +The whole program mode assumptions are slightly more complex in
> +C++, where inline functions in headers are put into @emph{COMDAT}
> +sections.  COMDAT functions and variables can be defined by
> +multiple object files and their bodies are unified at link-time
> +and dynamic link-time.  COMDAT functions are changed to local only
> +when their address is not taken and thus un-sharing them with a
> +library is not harmful.  COMDAT variables always remain externally
> +visible; however, for read-only variables it is assumed that their
> +initializers cannot be overwritten by a different value.
> +
> +GCC provides the function and variable attribute
> +@code{visibility} that can be used to specify the visibility of
> +externally visible symbols (or alternatively an
> +@option{-fvisibility} command line option).  ELF defines
> +the @code{default}, @code{protected}, @code{hidden} and
> +@code{internal} visibilities.
> +
> +The most commonly used visibility is @code{hidden}.  It
> +specifies that the symbol cannot be referenced from outside of
> +the current shared library. Unfortunately, this information
> +cannot be used directly by the link-time optimization in the
> +compiler since the whole shared library also might contain
> +non-LTO objects and those are not visible to the compiler.
> +
> +GCC solves this problem using linker plugins.  A @emph{linker
> +plugin} is an interface to the linker that allows an external
> +program to claim the ownership of a given object file.  The linker
> +then performs the linking procedure by querying the plugin about
> +the symbol table of the claimed objects and once the linking
> +decisions are complete, the plugin is allowed to provide the
> +final object file before the actual linking is made.  The linker
> +plugin obtains the symbol resolution information which specifies
> +which symbols provided by the claimed objects are bound from the
> +rest of a binary being linked.
> +
> +Currently, the linker plugin works only in combination
> +with the Gold linker, but a GNU ld implementation is under
> +development.
> +
> +GCC is designed to be independent of the rest of the toolchain
> +and aims to support linkers without plugin support.  For this
> +reason it does not use the linker plugin by default.  Instead,
> +the object files are examined by @command{collect2} before being
> +passed to the linker and objects found to have LTO sections are
> +passed to @command{lto1} first.  This mode does not work for
> +library archives. The decision on what object files from the
> +archive are needed depends on the actual linking and thus GCC
> +would have to implement the linker itself.  The resolution
> +information is missing too and thus GCC needs to make an educated
> +guess based on @option{-fwhole-program}.  Without the linker
> +plugin GCC also assumes that symbols are declared @code{hidden}
> +and not referred to by non-LTO code by default.
> +
> +@section Internal flags controlling @code{lto1}
> +
> +The following flags are passed into @command{lto1} and are not
> +meant to be used directly from the command line.
> +
> +@itemize
> +@item -fwpa
> +@opindex fwpa
> +This option runs the serial part of the link-time optimizer
> +performing the inter-procedural propagation (WPA mode).  The
> +compiler reads in summary information from all inputs and
> +performs an analysis based on summary information only.  It
> +generates object files for subsequent runs of the link-time
> +optimizer where individual object files are optimized using both
> +summary information from the WPA mode and the actual function
> +bodies.  It then drives the LTRANS phase.
> +
> +@item -fltrans
> +@opindex fltrans
> +This option runs the link-time optimizer in the
> +local-transformation (LTRANS) mode, which reads in output from a
> +previous run of the LTO in WPA mode. In the LTRANS mode, LTO
> +optimizes an object and produces the final assembly.
> +
> +@item -fltrans-output-list=@var{file}
> +@opindex fltrans-output-list
> +This option specifies a file to which the names of LTRANS output
> +files are written.  This option is only meaningful in conjunction
> +with @option{-fwpa}.
> +@end itemize
> Index: doc/gccint.texi
> ===================================================================
> --- doc/gccint.texi	(revision 166733)
> +++ doc/gccint.texi	(working copy)
> @@ -123,6 +123,7 @@ Additional tutorial information is linke
>  * Header Dirs::     Understanding the standard header file directories.
>  * Type Information:: GCC's memory management; generating type information.
>  * Plugins::         Extending the compiler with plugins.
> +* LTO::             Using Link-Time Optimization.
>  
>  * Funding::         How to help assure funding for free software.
>  * GNU Project::     The GNU Project and GNU/Linux.
> @@ -158,6 +159,7 @@ Additional tutorial information is linke
>  @include headerdirs.texi
>  @include gty.texi
>  @include plugins.texi
> +@include lto.texi
>  
>  @include funding.texi
>  @include gnu.texi
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 166733)
> +++ doc/invoke.texi	(working copy)
> @@ -356,8 +356,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
>  -fivopts -fkeep-inline-functions -fkeep-static-consts @gol
>  -floop-block -floop-flatten -floop-interchange -floop-strip-mine @gol
> --floop-parallelize-all -flto -flto-compression-level -flto-partition=@var{alg} @gol
> --flto-report -fltrans -fltrans-output-list -fmerge-all-constants @gol
> +-floop-parallelize-all -flto -flto-compression-level @gol
> +-flto-partition=@var{alg} -flto-report -fmerge-all-constants @gol
>  -fmerge-constants -fmodulo-sched -fmodulo-sched-allow-regmoves @gol
>  -fmove-loop-invariants fmudflap -fmudflapir -fmudflapth -fno-branch-count-reg @gol
>  -fno-default-inline @gol
> @@ -399,7 +399,7 @@ Objective-C and Objective-C++ Dialects}.
>  -funit-at-a-time -funroll-all-loops -funroll-loops @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
> --fwhole-program -fwhopr[=@var{n}] -fwpa -fuse-linker-plugin @gol
> +-fwhole-program -fwpa -fuse-linker-plugin @gol
>  --param @var{name}=@var{value}
>  -O  -O0  -O1  -O2  -O3  -Os -Ofast}
>  
> @@ -7489,6 +7489,16 @@ The only important thing to keep in mind
>  optimizations the @option{-flto} flag needs to be passed to both the
>  compile and the link commands.
>  
> +To make whole program optimization effective, it is necessary to make
> +certain whole program assumptions.  The compiler needs to know
> +what functions and variables can be accessed by libraries and runtime
> +outside of the link time optimized unit.  When supported by the linker,
> +the linker plugin (see @option{-fuse-linker-plugin}) passes to the
> +compiler information about used and externally visible symbols.  When
> +the linker plugin is not available, @option{-fwhole-program} should be
> +used to allow the compiler to make these assumptions, which will lead
> +to more aggressive optimization decisions.
> +
>  Note that when a file is compiled with @option{-flto}, the generated
>  object file will be larger than a regular object file because it will
>  contain GIMPLE bytecodes and the usual final code.  This means that
> @@ -7601,16 +7611,18 @@ GCC will not work with an older/newer ve
>  
>  Link time optimization does not play well with generating debugging
>  information.  Combining @option{-flto} with
> -@option{-g} is experimental.
> +@option{-g} is currently experimental and expected to produce wrong
> +results.
>  
> -If you specify the optional @var{n} the link stage is executed in
> -parallel using @var{n} parallel jobs by utilizing an installed
> -@command{make} program.  The environment variable @env{MAKE} may be
> -used to override the program used.
> +If you specify the optional @var{n}, the optimization and code
> +generation done at link time is executed in parallel using @var{n}
> +parallel jobs by utilizing an installed @command{make} program.  The
> +environment variable @env{MAKE} may be used to override the program
> +used.  The default value for @var{n} is 1.
>  
> -You can also specify @option{-fwhopr=jobserver} to use GNU make's 
> +You can also specify @option{-flto=jobserver} to use GNU make's 
>  job server mode to determine the number of parallel jobs. This 
> -is useful when the Makefile calling GCC is already parallel.
> +is useful when the Makefile calling GCC is already executing in parallel.
>  The parent Makefile will need a @samp{+} prepended to the command recipe
>  for this to work. This will likely only work if @env{MAKE} is 
>  GNU make.
> @@ -7619,53 +7631,17 @@ This option is disabled by default.
>  
>  @item -flto-partition=@var{alg}
>  @opindex flto-partition
> -Specify partitioning algorithm used by @option{-fwhopr} mode.  The value is
> -either @code{1to1} to specify partitioning corresponding to source files
> -or @code{balanced} to specify partitioning into, if possible, equally sized
> -chunks.  Specifying @code{none} as an algorithm disables partitioning
> -and streaming completely.
> -The default value is @code{balanced}.
> -
> -@item -fwpa
> -@opindex fwpa
> -This is an internal option used by GCC when compiling with
> -@option{-fwhopr}.  You should never need to use it.
> -
> -This option runs the link-time optimizer in the whole-program-analysis
> -(WPA) mode, which reads in summary information from all inputs and
> -performs a whole-program analysis based on summary information only.
> -It generates object files for subsequent runs of the link-time
> -optimizer where individual object files are optimized using both
> -summary information from the WPA mode and the actual function bodies.
> -It then drives the LTRANS phase.
> -
> -Disabled by default.
> -
> -@item -fltrans
> -@opindex fltrans
> -This is an internal option used by GCC when compiling with
> -@option{-fwhopr}.  You should never need to use it.
> -
> -This option runs the link-time optimizer in the local-transformation (LTRANS)
> -mode, which reads in output from a previous run of the LTO in WPA mode.
> -In the LTRANS mode, LTO optimizes an object and produces the final assembly.
> -
> -Disabled by default.
> -
> -@item -fltrans-output-list=@var{file}
> -@opindex fltrans-output-list
> -This is an internal option used by GCC when compiling with
> -@option{-fwhopr}.  You should never need to use it.
> -
> -This option specifies a file to which the names of LTRANS output files are
> -written.  This option is only meaningful in conjunction with @option{-fwpa}.
> -
> -Disabled by default.
> +Specify the partitioning algorithm used by the link time optimizer.
> +The value is either @code{1to1} to specify a partitioning mirroring
> +the original source files or @code{balanced} to specify partitioning
> +into equally sized chunks (whenever possible).  Specifying @code{none}
> +as an algorithm disables partitioning and streaming completely. The
> +default value is @code{balanced}.
>  
>  @item -flto-compression-level=@var{n}
>  This option specifies the level of compression used for intermediate
>  language written to LTO object files, and is only meaningful in
> -conjunction with LTO mode (@option{-fwhopr}, @option{-flto}).  Valid
> +conjunction with LTO mode (@option{-flto}).  Valid
>  values are 0 (no compression) to 9 (maximum compression).  Values
>  outside this range are clamped to either 0 or 9.  If the option is not
>  given, a default balanced compression setting is used.
> @@ -7674,7 +7650,7 @@ given, a default balanced compression se
>  Prints a report with internal details on the workings of the link-time
>  optimizer.  The contents of this report vary from version to version,
>  it is meant to be useful to GCC developers when processing object
> -files in LTO mode (via @option{-fwhopr} or @option{-flto}).
> +files in LTO mode (via @option{-flto}).
>  
>  Disabled by default.
>  
> 
>
Diego Novillo Nov. 15, 2010, 4:47 p.m. UTC | #4
On 10-11-14 22:57 , Xinliang David Li wrote:

> cc1, cc1plus, f951, and lto1 -- what does the '1' here mean? It always
> got me thinking where cc2/lto2/... is.

Lost in the mists of tradition.  IIRC, the '1' used to distinguish it 
from 'cc', the driver (others may remember the true reason, I'm too 
young ;).  We are creatures of habit.

> WHOPR --- the Whole part is not accurate -- it can be any part of the
> whole program. The 'Whole Program' should be 'the set of IL units' to
> be more exact.

Naturally.  This is the usual assumption.  Whole refers to the whole set 
of files presented to the compiler.  If the user is withholding files, 
there's little the compiler can do about that, short of guessing.  I'll 
add a note on the semantics of 'whole' at the start of the document.

>> +Currently, GCC does not support combining LTO object files compiled
>> +with different sets of command line options into a single binary.
>> +At link-time, the options given on the command line and the options
>> +saved on all the files in a link-time set are applied globally.  No
>> +attempt is made at validating the combination of flags (other than the
>> +usual validation done by option processing).  This is implemented in
>> +@file{lto/lto.c}:@code{lto_read_all_file_options}.
>
> This can be a big limiting factor for the wide adoption of LTO. In
> LIPO, incompatible options are detected and modules not safe to
> include are banned.

It's another bug in a long stream of bugs.  We have discussed ways to 
address it (http://gcc.gnu.org/wiki/whopr), but so far it has not 
percolated to the top of anyone's todo list.


Diego.
Diego Novillo Nov. 15, 2010, 4:48 p.m. UTC | #5
On 10-11-14 23:28 , Dave Korn wrote:

>    I only got up to about this point, then stopped because I've been up all
> night and suddenly realised I'm crashing.  Will give the remainder a look over
> after a few hours' sleep.

Thanks for the feedback.  I'll make the changes in a followup patch. 
This one is pretty big already.


Diego.
Diego Novillo Nov. 15, 2010, 4:48 p.m. UTC | #6
On 10-11-15 02:45 , Richard Guenther wrote:

> Wow, thanks for writing this up.  The patch is ok with or without
> the various suggestions already made - incremental improvements
> can always be done as followup.

Thanks.  I'll commit this one and incorporate the other feedback in 
followup patches.


Diego.
Richard Biener Nov. 15, 2010, 9:55 p.m. UTC | #7
On Mon, Nov 15, 2010 at 5:47 PM, Diego Novillo <dnovillo@google.com> wrote:
> On 10-11-14 22:57 , Xinliang David Li wrote:
>
>> cc1, cc1plus, f951, and lto1 -- what does the '1' here mean? It always
>> got me thinking where cc2/lto2/... is.
>
> Lost in the mists of tradition.  IIRC, the '1' used to distinguish it from
> 'cc', the driver (others may remember the true reason, I'm too young ;).  We
> are creatures of habit.
>
>> WHOPR --- the Whole part is not accurate -- it can be any part of the
>> whole program. The 'Whole Program' should be 'the set of IL units' to
>> be more exact.
>
> Naturally.  This is the usual assumption.  Whole refers to the whole set of
> files presented to the compiler.  If the user is withholding files, there's
> little the compiler can do about that, short of guessing.  I'll add a note
> on the semantics of 'whole' at the start of the document.
>
>>> +Currently, GCC does not support combining LTO object files compiled
>>> +with different sets of command line options into a single binary.
>>> +At link-time, the options given on the command line and the options
>>> +saved on all the files in a link-time set are applied globally.  No
>>> +attempt is made at validating the combination of flags (other than the
>>> +usual validation done by option processing).  This is implemented in
>>> +@file{lto/lto.c}:@code{lto_read_all_file_options}.
>>
>> This can be a big limiting factor for the wide adoption of LTO. In
>> LIPO, incompatible options are detected and modules not safe to
>> include are banned.

If you have one file compiled with -fstrict-aliasing and one with
-fno-strict-aliasing - which one is "banned"?  One idea of solving
the problem with LTO was to do partitioning according to the set
of command-line options.

> It's another bug in a long stream of bugs.  We have discussed ways to
> address it (http://gcc.gnu.org/wiki/whopr), but so far it has not percolated
> to the top of anyone's todo list.

And the paragraph is not true anyway.  We only save and combine
(read: apply in random order) all target specific flags and a selected
set of generic flags.  In general we use the optimization option set
provided at link time (which would be a more correct description
of what happens).

Richard.

>
> Diego.
>
Xinliang David Li Nov. 15, 2010, 10:14 p.m. UTC | #8
On Mon, Nov 15, 2010 at 1:55 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Mon, Nov 15, 2010 at 5:47 PM, Diego Novillo <dnovillo@google.com> wrote:
>> On 10-11-14 22:57 , Xinliang David Li wrote:
>>
>>> cc1, cc1plus, f951, and lto1 -- what does the '1' here mean? It always
>>> got me thinking where cc2/lto2/... is.
>>
>> Lost in the mists of tradition.  IIRC, the '1' used to distinguish it from
>> 'cc', the driver (others may remember the true reason, I'm too young ;).  We
>> are creatures of habit.
>>
>>> WHOPR --- the Whole part is not accurate -- it can be any part of the
>>> whole program. The 'Whole Program' should be 'the set of IL units' to
>>> be more exact.
>>
>> Naturally.  This is the usual assumption.  Whole refers to the whole set of
>> files presented to the compiler.  If the user is withholding files, there's
>> little the compiler can do about that, short of guessing.  I'll add a note
>> on the semantics of 'whole' at the start of the document.
>>
>>>> +Currently, GCC does not support combining LTO object files compiled
>>>> +with different sets of command line options into a single binary.
>>>> +At link-time, the options given on the command line and the options
>>>> +saved on all the files in a link-time set are applied globally.  No
>>>> +attempt is made at validating the combination of flags (other than the
>>>> +usual validation done by option processing).  This is implemented in
>>>> +@file{lto/lto.c}:@code{lto_read_all_file_options}.
>>>
>>> This can be a big limiting factor for the wide adoption of LTO. In
>>> LIPO, incompatible options are detected and modules not safe to
>>> include are banned.
>
> If you have one file compiled with -fstrict-aliasing and one with
> -fno-strict-aliasing - which one is "banned"?  One idea of solving
> the problem with LTO was to do partitioning according to the set
> of command-line options.

For correctness, it should take the most conservative one, for
performance,  it is the other way. For LIPO, the rule is simple -- the
aux module option has to be compatible with the primary module's
option -- in other words, the primary module's option will take
precedence (i.e., used for aux modules in its group) regardless,
however if such option setting may make the aux module mis-compile,
the aux module is excluded.  Currently, only mechanical comparison is
done, but it should allow the following: aux module with
-fstrict-aliasing should be allowed to be included in the group with
-fno-strict-aliasing as the primary option -- but there is a slight
chance it may actually hurt performance.

thanks,

David


>
>> It's another bug in a long stream of bugs.  We have discussed ways to
>> address it (http://gcc.gnu.org/wiki/whopr), but so far it has not percolated
>> to the top of anyone's todo list.
>
> And the paragraph is not true anyway.  We only save and combine
> (read: apply in random order) all target specific flags and a selected
> set of generic flags.  In general we use the optimization option set
> provided at link time (which would be a more correct description
> of what happens).
>
> Richard.
>
>>
>> Diego.
>>
>
Richard Biener Nov. 15, 2010, 10:24 p.m. UTC | #9
On Mon, Nov 15, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Mon, Nov 15, 2010 at 1:55 PM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Mon, Nov 15, 2010 at 5:47 PM, Diego Novillo <dnovillo@google.com> wrote:
>>> On 10-11-14 22:57 , Xinliang David Li wrote:
>>>
>>>> cc1, cc1plus, f951, and lto1 -- what does the '1' here mean? It always
>>>> got me thinking where cc2/lto2/... is.
>>>
>>> Lost in the mists of tradition.  IIRC, the '1' used to distinguish it from
>>> 'cc', the driver (others may remember the true reason, I'm too young ;).  We
>>> are creatures of habit.
>>>
>>>> WHOPR --- the Whole part is not accurate -- it can be any part of the
>>>> whole program. The 'Whole Program' should be 'the set of IL units' to
>>>> be more exact.
>>>
>>> Naturally.  This is the usual assumption.  Whole refers to the whole set of
>>> files presented to the compiler.  If the user is withholding files, there's
>>> little the compiler can do about that, short of guessing.  I'll add a note
>>> on the semantics of 'whole' at the start of the document.
>>>
>>>>> +Currently, GCC does not support combining LTO object files compiled
>>>>> +with different sets of command line options into a single binary.
>>>>> +At link-time, the options given on the command line and the options
>>>>> +saved on all the files in a link-time set are applied globally.  No
>>>>> +attempt is made at validating the combination of flags (other than the
>>>>> +usual validation done by option processing).  This is implemented in
>>>>> +@file{lto/lto.c}:@code{lto_read_all_file_options}.
>>>>
>>>> This can be a big limiting factor for the wide adoption of LTO. In
>>>> LIPO, incompatible options are detected and modules not safe to
>>>> include are banned.
>>
>> If you have one file compiled with -fstrict-aliasing and one with
>> -fno-strict-aliasing - which one is "banned"?  One idea of solving
>> the problem with LTO was to do partitioning according to the set
>> of command-line options.
>
> For correctness, it should take the most conservative one, for
> performance,  it is the other way. For LIPO, the rule is simple -- the
> aux module option has to be compatible with the primary module's
> option -- in other words, the primary module's option will take
> precedence (i.e., used for aux modules in its group) regardless,
> however if such option setting may make the aux module mis-compile,
> the aux module is excluded.  Currently, only mechanical comparison is
> done, but it should allow the following: aux module with
> -fstrict-aliasing should be allowed to be included in the group with
> -fno-strict-aliasing as the primary option -- but there is a slight
> chance it may actually hurt performance.

So, do you have an extensive list of options and which one, the -f or
the -fno variant takes precedence?  I was thinking of simply having
a white list of compatible options and treat everything else as
incompatible (and error) for LTO.
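
The white list idea could be sketched roughly as below.  This is only an
illustration, not code from GCC: the function names and the contents of
any white list passed in are hypothetical.  Two option lists are treated
as compatible when every option that is not on the white list appears in
both of them; otherwise the first offending option is returned so that
it can be reported in an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if OPT occurs in LIST[0..N-1].  */
static bool
contains (const char *const *list, size_t n, const char *opt)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (list[i], opt) == 0)
      return true;
  return false;
}

/* Return the first option of A that is neither white-listed nor
   present in B, or NULL if there is none.  */
static const char *
first_unmatched (const char *const *a, size_t na,
                 const char *const *b, size_t nb,
                 const char *const *wl, size_t nw)
{
  for (size_t i = 0; i < na; i++)
    if (!contains (wl, nw, a[i]) && !contains (b, nb, a[i]))
      return a[i];
  return NULL;
}

/* Hypothetical helper: return NULL when the option lists A and B
   differ only in white-listed options, otherwise the first option
   that makes them incompatible.  */
const char *
lto_option_conflict (const char *const *a, size_t na,
                     const char *const *b, size_t nb,
                     const char *const *wl, size_t nw)
{
  const char *bad = first_unmatched (a, na, b, nb, wl, nw);
  return bad ? bad : first_unmatched (b, nb, a, na, wl, nw);
}
```

The sketch compares options as plain strings, so it does not know that
-fstrict-aliasing and -fno-strict-aliasing are two settings of the same
flag; it simply reports whichever one is missing from the other list.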

Richard.

> thanks,
>
> David
>
>
>>
>>> It's another bug in a long stream of bugs.  We have discussed ways to
>>> address it (http://gcc.gnu.org/wiki/whopr), but so far it has not percolated
>>> to the top of anyone's todo list.
>>
>> And the paragraph is not true anyway.  We only save and combine
>> (read: apply in random order) all target specific flags and a selected
>> set of generic flags.  In general we use the optimization option set
>> provided at link time (which would be a more correct description
>> of what happens).
>>
>> Richard.
>>
>>>
>>> Diego.
>>>
>>
>
Xinliang David Li Nov. 15, 2010, 10:30 p.m. UTC | #10
Currently as I mentioned, only very mechanical comparison is done --
it simply removes all -Wxxx options (and some others I forgot) and
compare the option list.
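
A minimal sketch of such a mechanical comparison, assuming the options
are available as arrays of strings.  This is not the LIPO code; the
function names are made up for illustration, and the "some others" that
are also removed are not modelled.  Note that the prefix test below also
drops pass-through options such as -Wl, which a real implementation
would probably want to treat separately.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if OPT is a -Wxxx option, which the comparison ignores.  */
static bool
is_warning_option (const char *opt)
{
  return strncmp (opt, "-W", 2) == 0;
}

/* Advance *I past any warning options in OPTS[0..N-1].  */
static void
skip_warnings (const char *const *opts, size_t n, size_t *i)
{
  while (*i < n && is_warning_option (opts[*i]))
    (*i)++;
}

/* Hypothetical helper: return true if the option lists A and B are
   identical, in the same order, once all -W options are removed.  */
bool
options_compatible_p (const char *const *a, size_t na,
                      const char *const *b, size_t nb)
{
  size_t i = 0, j = 0;

  for (;;)
    {
      skip_warnings (a, na, &i);
      skip_warnings (b, nb, &j);
      /* Compatible only if both lists run out at the same time.  */
      if (i == na || j == nb)
        return i == na && j == nb;
      if (strcmp (a[i], b[j]) != 0)
        return false;
      i++;
      j++;
    }
}
```

The comparison is order-sensitive, which matches the "mechanical"
description: "-O2 -fstrict-aliasing" and "-fstrict-aliasing -O2" would
be reported as different.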

David

On Mon, Nov 15, 2010 at 2:24 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Mon, Nov 15, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
>> On Mon, Nov 15, 2010 at 1:55 PM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> On Mon, Nov 15, 2010 at 5:47 PM, Diego Novillo <dnovillo@google.com> wrote:
>>>> On 10-11-14 22:57 , Xinliang David Li wrote:
>>>>
>>>>> cc1, cc1plus, f951, and lto1 -- what does the '1' here mean? It always
>>>>> got me thinking where cc2/lto2/... is.
>>>>
>>>> Lost in the mists of tradition.  IIRC, the '1' used to distinguish it from
>>>> 'cc', the driver (others may remember the true reason, I'm too young ;).  We
>>>> are creatures of habit.
>>>>
>>>>> WHOPR --- the Whole part is not accurate -- it can be any part of the
>>>>> whole program. The 'Whole Program' should be 'the set of IL units' to
>>>>> be more exact.
>>>>
>>>> Naturally.  This is the usual assumption.  Whole refers to the whole set of
>>>> files presented to the compiler.  If the user is withholding files, there's
>>>> little the compiler can do about that, short of guessing.  I'll add a note
>>>> on the semantics of 'whole' at the start of the document.
>>>>
>>>>>> +Currently, GCC does not support combining LTO object files compiled
>>>>>> +with different sets of command line options into a single binary.
>>>>>> +At link-time, the options given on the command line and the options
>>>>>> +saved on all the files in a link-time set are applied globally.  No
>>>>>> +attempt is made at validating the combination of flags (other than the
>>>>>> +usual validation done by option processing).  This is implemented in
>>>>>> +@file{lto/lto.c}:@code{lto_read_all_file_options}.
>>>>>
>>>>> This can be a big limiting factor for the wide adoption of LTO. In
>>>>> LIPO, incompatible options are detected and modules not safe to
>>>>> include are banned.
>>>
>>> If you have one file compiled with -fstrict-aliasing and one with
>>> -fno-strict-aliasing - which one is "banned"?  One idea of solving
>>> the problem with LTO was to do partitioning according to the set
>>> of command-line options.
>>
>> For correctness, it should take the most conservative one, for
>> performance,  it is the other way. For LIPO, the rule is simple -- the
>> aux module option has to be compatible with the primary module's
>> option -- in other words, the primary module's option will take
>> precedence (i.e., used for aux modules in its group) regardless,
>> however if such option setting may make the aux module mis-compile,
>> the aux module is excluded.  Currently, only mechanical comparison is
>> done, but it should allow the following: aux module with
>> -fstrict-aliasing should be allowed to be included in the group with
>> -fno-strict-aliasing as the primary option -- but there is a slight
>> chance it may actually hurt performance.
>
> So, do you have an extensive list of options and which one, the -f or
> the -fno variant takes precedence?  I was thinking of simply having
> a white list of compatible options and treat everything else as
> incompatible (and error) for LTO.
>
> Richard.
>
>> thanks,
>>
>> David
>>
>>
>>>
>>>> It's another bug in a long stream of bugs.  We have discussed ways to
>>>> address it (http://gcc.gnu.org/wiki/whopr), but so far it has not percolated
>>>> to the top of anyone's todo list.
>>>
>>> And the paragraph is not true anyway.  We only save and combine
>>> (read: apply in random order) all target specific flags and a selected
>>> set of generic flags.  In general we use the optimization option set
>>> provided at link time (which would be a more correct description
>>> of what happens).
>>>
>>> Richard.
>>>
>>>>
>>>> Diego.
>>>>
>>>
>>
>
Michael Matz Nov. 16, 2010, 1:56 p.m. UTC | #11
Hi,

On Mon, 15 Nov 2010, Diego Novillo wrote:

> > cc1, cc1plus, f951, and lto1 -- what does the '1' here mean? It always
> > got me thinking where cc2/lto2/... is.
> 
> Lost in the mists of tradition.  IIRC, the '1' used to distinguish it 
> from 'cc', the driver (others may remember the true reason, I'm too 
> young ;).  We are creatures of habit.

The early C compilers, including the Portable C Compiler were multi-pass 
compilers, consisting of executables 'pass1' and 'pass2' (plus the 
preprocessor), communicating via text files.  Later versions had the 
option of producing just one binary, when the machines had more memory.  I 
guess RMS left open the option of also having a second super pass that 
he'd then have called cc2.

Perhaps we should call lto1 cc2?  ;-)


Ciao,
Michael.
P.S: The old document "A Tour Through the Portable C Compiler" is a nice 
read on these topics: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.3512
Ralf Wildenhues Nov. 16, 2010, 6:34 p.m. UTC | #12
Hello,

* Diego Novillo wrote on Mon, Nov 15, 2010 at 07:23:15AM CET:
> +optimized builds.  A, perhaps surprising, side effect of this feature
> +is that any mistake in the toolchain leads to LTO information not
> +being used (e.g. an older @code{libtool} calling @code{ld} directly).
> +This is both an advantage, as the system is more robust, and a
> +disadvantage, as the user is not informed that the optimization has
> +been disabled.

Well, such a disadvantage could be ameliorated with a warning, no?

Rainer just mentioned other instances where LTO would silently not do
TRT (nm in PATH with different format, etc), so it would seem generally
useful to at least optionally warn.

Thanks,
Ralf
Richard Biener Nov. 17, 2010, 9:59 a.m. UTC | #13
On Tue, 16 Nov 2010, Ralf Wildenhues wrote:

> Hello,
> 
> * Diego Novillo wrote on Mon, Nov 15, 2010 at 07:23:15AM CET:
> > +optimized builds.  A, perhaps surprising, side effect of this feature
> > +is that any mistake in the toolchain leads to LTO information not
> > +being used (e.g. an older @code{libtool} calling @code{ld} directly).
> > +This is both an advantage, as the system is more robust, and a
> > +disadvantage, as the user is not informed that the optimization has
> > +been disabled.
> 
> Well, such a disadvantage could be ameliorated with a warning, no?
> 
> Rainer just mentioned other instances where LTO would silently not do
> TRT (nm in PATH with different format, etc), so it would seem generally
> useful to at least optionally warn.

I can't see how this is possible, or if it is, then how it is possible
to detect the legitimate case of using the fat binary w/o link time
optimization.

Richard.
Ralf Wildenhues Nov. 17, 2010, 7:34 p.m. UTC | #14
* Richard Guenther wrote on Wed, Nov 17, 2010 at 10:59:04AM CET:
> On Tue, 16 Nov 2010, Ralf Wildenhues wrote:
> > * Diego Novillo wrote on Mon, Nov 15, 2010 at 07:23:15AM CET:
> > > +optimized builds.  A, perhaps surprising, side effect of this feature
> > > +is that any mistake in the toolchain leads to LTO information not
> > > +being used (e.g. an older @code{libtool} calling @code{ld} directly).
> > > +This is both an advantage, as the system is more robust, and a
> > > +disadvantage, as the user is not informed that the optimization has
> > > +been disabled.
> > 
> > Well, such a disadvantage could be ameliorated with a warning, no?
> > 
> > Rainer just mentioned other instances where LTO would silently not do
> > TRT (nm in PATH with different format, etc), so it would seem generally
> > useful to at least optionally warn.
> 
> I can't see how this is possible, or if it is, then how it is possible
> to detect the legitimate case of using the fat binary w/o link time
> optimization.

Well, when -flto (or similar) is passed to the compiler driver at link
time, that surely is a sign that LTO is desired, no?  I'm not asking
about a fatal error, but a helpful warning, ideally telling the user
also why LTO was not enabled, would seem prudent in that case.

I'm arguing from a pure user-side perspective here; if that is not
possible technically, I'd be delighted to hear about why.

Thanks,
Ralf
Richard Biener Nov. 17, 2010, 8:04 p.m. UTC | #15
On Wed, 17 Nov 2010, Ralf Wildenhues wrote:

> * Richard Guenther wrote on Wed, Nov 17, 2010 at 10:59:04AM CET:
> > On Tue, 16 Nov 2010, Ralf Wildenhues wrote:
> > > * Diego Novillo wrote on Mon, Nov 15, 2010 at 07:23:15AM CET:
> > > > +optimized builds.  A, perhaps surprising, side effect of this feature
> > > > +is that any mistake in the toolchain that leads to LTO information not
> > > > +being used (e.g. an older @code{libtool} calling @code{ld} directly).
> > > > +This is both an advantage, as the system is more robust, and a
> > > > +disadvantage, as the user is not informed that the optimization has
> > > > +been disabled.
> > > 
> > > Well, such a disadvantage could be ameliorated with a warning, no?
> > > 
> > > Rainer just mentioned other instances where LTO would silently not do
> > > TRT (nm in PATH with different format, etc), so it would seem generally
> > > useful to at least optionally warn.
> > 
> > I can't see how this is possible, or if it is, then how it is possible
> > to detect the legitimate case of using the fat binary w/o link time
> > optimization.
> 
> Well, when -flto (or similar) is passed to the compiler driver at link
> time, that surely is a sign that LTO is desired, no?  I'm not asking
> about a fatal error, but a helpful warning, ideally telling the user
> also why LTO was not enabled, would seem prudent in that case.

Oh, we can indeed warn about the case where -flto is present at link
time but no object file contains LTO bytecode.  But I think what
is more common is that -flto is present at compile time but not
at link time - that case is not easy to detect, and it might be
on purpose.

> I'm arguing from a pure user-side perspective here; if that is not
> possible technically, I'd be delighted to hear about why.

If the linker is directly invoked (like libtool does) then we
don't have a chance to see a missing -flto flag.

Richard.

Patch

Index: doc/lto.texi
===================================================================
--- doc/lto.texi	(revision 0)
+++ doc/lto.texi	(revision 0)
@@ -0,0 +1,568 @@ 
+@c Copyright (c) 2010 Free Software Foundation, Inc.
+@c Free Software Foundation, Inc.
+@c This is part of the GCC manual.
+@c For copying conditions, see the file gcc.texi.
+@c Contributed by Jan Hubicka <jh@suse.cz> and
+@c Diego Novillo <dnovillo@google.com>
+
+@node LTO
+@chapter Link Time Optimization
+@cindex lto
+@cindex whopr
+@cindex wpa
+@cindex ltrans
+
+@section Design Overview
+
+Link time optimization is implemented as a GCC front end for a
+bytecode representation of GIMPLE that is emitted in special sections
+of @code{.o} files.  Currently, LTO support is enabled in most
+ELF-based systems, as well as darwin, cygwin and mingw systems.
+
+Since GIMPLE bytecode is saved alongside final object code, object
+files generated with LTO support are larger than regular object files.
+This ``fat'' object format makes it easy to integrate LTO into
+existing build systems, as one can, for instance, produce archives of
+the files.  Additionally, one might be able to ship one set of fat
+objects which could be used both for development and the production of
+optimized builds.  A perhaps surprising side effect of this feature
+is that any mistake in the toolchain leads to LTO information not
+being used (e.g.@: an older @code{libtool} calling @code{ld} directly).
+This is both an advantage, as the system is more robust, and a
+disadvantage, as the user is not informed that the optimization has
+been disabled.
+
+The current implementation only produces ``fat'' objects, effectively
+doubling compilation time and increasing file sizes up to 5x the
+original size.  This hides the problem that some tools, such as
+@code{ar} and @code{nm}, need to understand symbol tables of LTO
+sections.  These tools were extended to use the plugin infrastructure,
+and with these problems solved, GCC will also support ``slim'' objects
+consisting of the intermediate code alone.
+
+At the highest level, LTO splits the compiler in two.  The first half
+(the ``writer'') produces a streaming representation of all the
+internal data structures needed to optimize and generate code.  This
+includes declarations, types, the callgraph and the GIMPLE representation
+of function bodies.
+
+When @option{-flto} is given during compilation of a source file, the
+pass manager executes all the passes in @code{all_lto_gen_passes}.
+Currently, this phase is composed of two IPA passes:
+
+@itemize @bullet
+@item @code{pass_ipa_lto_gimple_out}
+This pass executes the function @code{lto_output} in
+@file{lto-streamer-out.c}, which traverses the call graph encoding
+every reachable declaration, type and function. This generates a
+memory representation of all the file sections described below.
+
+@item @code{pass_ipa_lto_finish_out}
+This pass executes the function @code{produce_asm_for_decls} in
+@file{lto-streamer-out.c}, which takes the memory image built in the
+previous pass and encodes it in the corresponding ELF file sections.
+@end itemize
+
+The second half of LTO support is the ``reader''.  This is implemented
+as the GCC front end @file{lto1} in @file{lto/lto.c}.  When
+@file{collect2} detects a link set of @code{.o}/@code{.a} files with
+LTO information and @option{-flto} is enabled, it invokes
+@file{lto1} which reads the set of files and aggregates them into a
+single translation unit for optimization.  The main entry point for
+the reader is @file{lto/lto.c}:@code{lto_main}.
+
+@subsection LTO modes of operation
+
+One of the main goals of the GCC link-time infrastructure was to allow
+effective compilation of large programs.  For this reason GCC implements two
+link-time compilation modes.
+
+@enumerate
+@item	@emph{LTO mode}, in which the whole program is read into the
+compiler at link-time and optimized in a similar way as if it
+were a single source-level compilation unit.
+
+@item	@emph{WHOPR or partitioned mode}, designed to utilize multiple
+CPUs and/or a distributed compilation environment to quickly link
+large applications.  WHOPR stands for WHOle Program optimizeR (not to
+be confused with the semantics of @option{-fwhole-program}).  It
+partitions the aggregated callgraph from many different @code{.o}
+files and distributes the compilation of the sub-graphs to different
+CPUs.
+
+Note that distributed compilation is not implemented yet, but since
+the parallelism is facilitated via generating a @code{Makefile}, it
+would be easy to implement.
+@end enumerate
+
+WHOPR splits LTO into three main stages:
+@enumerate
+@item Local generation (LGEN)
+This stage executes in parallel. Every file in the program is compiled
+into the intermediate language and packaged together with the local
+call-graph and summary information.  This stage is the same for both
+the LTO and WHOPR compilation modes.
+
+@item Whole Program Analysis (WPA)
+WPA is performed sequentially. The global call-graph is generated, and
+a global analysis procedure makes transformation decisions. The global
+call-graph is partitioned to facilitate parallel optimization during
+phase 3. The results of the WPA stage are stored into new object files
+which contain the partitions of the program expressed in the intermediate
+language and the optimization decisions.
+
+@item Local transformations (LTRANS)
+This stage executes in parallel. All the decisions made during phase 2
+are implemented locally in each partitioned object file, and the final
+object code is generated. Optimizations which cannot be decided
+efficiently during phase 2 may be performed on the local
+call-graph partitions.
+@end enumerate
+
+WHOPR can be seen as an extension of the usual LTO mode of
+compilation.  In LTO, WPA and LTRANS are executed within a single
+execution of the compiler, after the whole program has been read into
+memory.
+
+When compiling in WHOPR mode the callgraph is partitioned during
+the WPA stage.  The whole program is split into a given number of
+partitions of roughly the same size.  The compiler tries to
+minimize the number of references which cross partition boundaries.
+The main advantage of WHOPR is to allow the parallel execution of
+LTRANS stages, which are the most time-consuming part of the
+compilation process.  Additionally, it avoids the need to load the
+whole program into memory.
+
+
+@section LTO file sections
+
+LTO information is stored in several ELF sections inside object files.
+Data structures and enum codes for sections are defined in
+@file{lto-streamer.h}.
+
+These sections are emitted from @file{lto-streamer-out.c} and mapped
+in all at once from @file{lto/lto.c}:@code{lto_file_read}.  The
+individual functions dealing with the reading/writing of each section
+are described below.
+
+@itemize @bullet
+@item Command line options (@code{.gnu.lto_.opts})
+
+This section contains the command line options used to generate the
+object files.  This is used at link-time to determine the optimization
+level and other settings when they are not explicitly specified at the
+linker command line.
+
+Currently, GCC does not support combining LTO object files compiled
+with different sets of command line options into a single binary.
+At link-time, the options given on the command line and the options
+saved on all the files in a link-time set are applied globally.  No
+attempt is made at validating the combination of flags (other than the
+usual validation done by option processing).  This is implemented in
+@file{lto/lto.c}:@code{lto_read_all_file_options}.
+
+
+@item Symbol table (@code{.gnu.lto_.symtab})
+
+This table replaces the ELF symbol table for functions and variables
+represented in the LTO IL. Symbols used and exported by the optimized
+assembly code of ``fat'' objects might not match the ones used and
+exported by the intermediate code.  This table is necessary because
+the intermediate code is less optimized and thus requires a separate
+symbol table.
+
+Additionally, the binary code in the ``fat'' object may lack a call
+to a function if the call was optimized out at compilation time
+after the intermediate language was streamed out.  In some special
+cases, the same optimization may not happen during link-time
+optimization.  This would lead to an undefined symbol if only one
+symbol table were used.
+
+The symbol table is emitted in
+@file{lto-streamer-out.c}:@code{produce_symtab}.
+
+
+@item Global declarations and types (@code{.gnu.lto_.decls})
+
+This section contains an intermediate language dump of all
+declarations and types required to represent the callgraph, static
+variables and top-level debug info.
+
+The contents of this section are emitted in
+@file{lto-streamer-out.c}:@code{produce_asm_for_decls}.  Types and
+symbols are emitted in a topological order that preserves the sharing
+of pointers when the file is read back in
+(@file{lto.c}:@code{read_cgraph_and_symbols}).
+
+
+@item The callgraph (@code{.gnu.lto_.cgraph})
+
+This section contains the basic data structure used by the GCC
+inter-procedural optimization infrastructure. This section stores an
+annotated multi-graph which represents the functions and call sites as
+well as the variables, aliases and top-level @code{asm} statements.
+
+This section is emitted in
+@file{lto-streamer-out.c}:@code{output_cgraph} and read in
+@file{lto-cgraph.c}:@code{input_cgraph}.
+
+
+@item IPA references (@code{.gnu.lto_.refs})
+
+This section contains references between functions and static
+variables.  It is emitted by @file{lto-cgraph.c}:@code{output_refs}
+and read by @file{lto-cgraph.c}:@code{input_refs}.
+
+
+@item Function bodies (@code{.gnu.lto_.function_body.<name>})
+
+This section contains function bodies in the intermediate language
+representation. Every function body is in a separate section to allow
+copying of the section independently to different object files or
+reading the function on demand.
+
+Functions are emitted in
+@file{lto-streamer-out.c}:@code{output_function} and read in
+@file{lto-streamer-in.c}:@code{input_function}.
+
+
+@item Static variable initializers (@code{.gnu.lto_.vars})
+
+This section contains all the symbols in the global variable pool.  It
+is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
+@file{lto-cgraph.c}:@code{input_cgraph}.
+
+@item Summaries and optimization summaries used by IPA passes
+(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
+@code{pureconst} or @code{reference})
+
+These sections are used by IPA passes that need to emit summary
+information during LTO generation to be read and aggregated at
+link time.  Each pass is responsible for implementing two pass manager
+hooks: one for writing the summary and another for reading it in.  The
+format of these sections is entirely up to each individual pass.  The
+only requirement is that the writer and reader hooks agree on the
+format.
+@end itemize
+
+
+@section Using summary information in IPA passes
+
+Programs are represented internally as a @emph{callgraph} (a
+multi-graph where nodes are functions and edges are call sites)
+and a @emph{varpool} (a list of static and external variables in
+the program).
+
+The inter-procedural optimization is organized as a sequence of
+individual passes, which operate on the callgraph and the
+varpool.  To make the implementation of WHOPR possible, every
+inter-procedural optimization pass is split into several stages
+that are executed at different times during WHOPR compilation:
+
+@itemize @bullet
+@item LGEN time
+@enumerate
+@item @emph{Generate summary} (@code{generate_summary} in
+@code{struct ipa_opt_pass_d}).  This stage analyzes every function
+body and variable initializer and stores the relevant
+information in a pass-specific data structure.
+
+@item @emph{Write summary} (@code{write_summary} in
+@code{struct ipa_opt_pass_d}).  This stage writes all the
+pass-specific information generated by @code{generate_summary}.
+Summaries go into their own @code{LTO_section_*} sections that
+have to be declared in @file{lto-streamer.h}:@code{enum
+lto_section_type}.  A new section is created by calling
+@code{create_output_block} and data can be written using the
+@code{lto_output_*} routines.
+@end enumerate
+
+@item WPA time
+@enumerate
+@item @emph{Read summary} (@code{read_summary} in
+@code{struct ipa_opt_pass_d}). This stage reads all the
+pass-specific information in exactly the same order that it was
+written by @code{write_summary}.
+
+@item @emph{Execute} (@code{execute} in @code{struct
+opt_pass}).  This performs inter-procedural propagation.  This
+must be done without actual access to the individual function
+bodies or variable initializers.  Typically, this results in a
+transitive closure operation over the summary information of all
+the nodes in the callgraph.
+
+@item @emph{Write optimization summary}
+(@code{write_optimization_summary} in @code{struct
+ipa_opt_pass_d}).  This writes the result of the inter-procedural
+propagation into the object file.  This can use the same data
+structures and helper routines used in @code{write_summary}.
+@end enumerate
+
+@item LTRANS time
+@enumerate
+@item @emph{Read optimization summary}
+(@code{read_optimization_summary} in @code{struct
+ipa_opt_pass_d}).  The counterpart to
+@code{write_optimization_summary}.  This reads the interprocedural
+optimization decisions in exactly the same format emitted by
+@code{write_optimization_summary}.
+
+@item @emph{Transform} (@code{function_transform} and
+@code{variable_transform} in @code{struct ipa_opt_pass_d}).
+The actual function bodies and variable initializers are updated
+based on the information passed down from the @emph{Execute} stage.
+@end enumerate
+@end itemize
+
+The implementation of the inter-procedural passes is shared
+between LTO, WHOPR and classic non-LTO compilation.
+
+@itemize
+@item During the traditional file-by-file mode every pass executes its
+own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
+stages within the single execution context of the compiler.
+
+@item In LTO compilation mode, every pass uses @emph{Generate
+summary} and @emph{Write summary} stages at compilation time,
+while the @emph{Read summary}, @emph{Execute}, and
+@emph{Transform} stages are executed at link time.
+
+@item In WHOPR mode all stages are used.
+@end itemize
+
+To simplify development, the GCC pass manager differentiates
+between normal inter-procedural passes and small inter-procedural
+passes.  A @emph{small inter-procedural pass}
+(@code{SIMPLE_IPA_PASS}) is a pass that does
+everything at once and thus it cannot be executed during WPA in
+WHOPR mode. It defines only the @emph{Execute} stage and during
+this stage it accesses and modifies the function bodies.  Such
+passes are useful for optimization at LGEN or LTRANS time and are
+used, for example, to implement early optimization before writing
+object files.  The simple inter-procedural passes can also be used
+for easier prototyping and development of a new inter-procedural
+pass.
+
+
+@subsection Virtual clones
+
+One of the main challenges of introducing the WHOPR compilation
+mode was addressing the interactions between optimization passes.
+In LTO compilation mode, the passes are executed in a sequence,
+each of which consists of analysis (or @emph{Generate summary}),
+propagation (or @emph{Execute}) and @emph{Transform} stages.
+Once the work of one pass is finished, the next pass sees the
+updated program representation and can execute.  This makes the
+individual passes dependent on each other.
+
+In WHOPR mode all passes first execute their @emph{Generate
+summary} stage.  Then summary writing marks the end of the LGEN
+stage.  At WPA time,
+the summaries are read back into memory and all passes run the
+@emph{Execute} stage.  Optimization summaries are streamed and
+sent to LTRANS, where all the passes execute the @emph{Transform}
+stage.
+
+Most optimization passes split naturally into analysis,
+propagation and transformation stages.  But some do not.  The
+main problem arises when one pass performs changes and the
+following pass gets confused by seeing different callgraphs
+between the @emph{Transform} stage and the @emph{Generate summary}
+or @emph{Execute} stage.  This means that the passes are required
+to communicate their decisions with each other.
+
+To facilitate this communication, the GCC callgraph
+infrastructure implements @emph{virtual clones}, a method of
+representing the changes performed by the optimization passes in
+the callgraph without needing to update function bodies.
+
+A @emph{virtual clone} in the callgraph is a function that has no
+associated body, just a description of how to create its body based
+on a different function (which itself may be a virtual clone).
+
+The description of function modifications includes adjustments to
+the function's signature (which allows, for example, removing or
+adding function arguments), substitutions to perform on the
+function body, and, for inlined functions, a pointer to the
+function that it will be inlined into.
+
+It is also possible to redirect any edge of the callgraph from a
+function to its virtual clone.  This implies updating of the call
+site to adjust for the new function signature.
+
+Most of the transformations performed by inter-procedural
+optimizations can be represented via virtual clones.  For
+instance, a constant propagation pass can produce a virtual clone
+of the function which replaces one of its arguments by a
+constant.  The inliner can represent its decisions by producing a
+clone of a function whose body will be later integrated into
+a given function.
+
+Using @emph{virtual clones}, the program can be easily updated
+during the @emph{Execute} stage, solving most of the pass interaction
+problems that would otherwise occur during @emph{Transform}.
+
+Virtual clones are later materialized in the LTRANS stage and
+turned into real functions.  Passes executed after the virtual
+clones were introduced also perform their @emph{Transform} stage
+on new functions, so for a pass there is no significant
+difference between operating on a real function or a virtual
+clone introduced before its @emph{Execute} stage.
+
+Optimization passes then work on virtual clones introduced before
+their @emph{Execute} stage as if they were real functions.  The
+only difference is that clones are not visible during the
+@emph{Generate summary} stage.
+
+To keep function summaries updated, the callgraph interface
+allows an optimizer to register a callback that is called every
+time a new clone is introduced as well as when the actual
+function or variable is generated or when a function or variable
+is removed.  These hooks are registered in the @emph{Generate
+summary} stage and allow the pass to keep its information intact
+until the @emph{Execute} stage.  The same hooks can also be
+registered during the @emph{Execute} stage to keep the
+optimization summaries updated for the @emph{Transform} stage.
+
+@subsection IPA references
+
+GCC represents IPA references in the callgraph.  For a function
+or variable @code{A}, the @emph{IPA reference} is a list of all
+locations where the address of @code{A} is taken and, when
+@code{A} is a variable, a list of all direct stores and reads
+to/from @code{A}. References represent an oriented multi-graph on
+the union of nodes of the callgraph and the varpool.  See
+@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
+and
+@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
+for details.
+
+@subsection Jump functions
+Suppose that an optimization pass sees a function @code{A} and it
+knows the values of (some of) its arguments.  The @emph{jump
+function} describes the value of a parameter of a given function
+call in function @code{A} based on this knowledge.
+
+Jump functions are used by several optimizations, such as the
+inter-procedural constant propagation pass and the
+devirtualization pass.  The inliner also uses jump functions to
+perform inlining of callbacks.
+
+@section Whole program assumptions, linker plugin and symbol visibilities
+
+Link-time optimization gives relatively minor benefits when used
+alone.  The problem is that propagation of inter-procedural
+information does not work well across functions and variables
+that are called or referenced by other compilation units (such as
+from a dynamically linked library). We say that such functions
+and variables are @emph{externally visible}.
+
+To make the situation even more difficult, many applications
+organize themselves as a set of shared libraries, and the default
+ELF visibility rules allow one to overwrite any externally
+visible symbol with a different symbol at runtime.  This
+basically disables any optimizations across such functions and
+variables, because the compiler cannot be sure that the function
+body it is seeing is the same function body that will be used at
+runtime.  Any function or variable not declared @code{static} in
+the sources degrades the quality of inter-procedural
+optimization.
+
+To avoid this problem the compiler must assume that it sees the
+whole program when doing link-time optimization.  Strictly
+speaking, the whole program is rarely visible even at link-time.
+Standard system libraries are usually linked dynamically or not
+provided with the link-time information.  In GCC, the whole
+program option (@option{-fwhole-program}) asserts that every
+function and variable defined in the current compilation
+unit is static, except for function @code{main} (note: at
+link-time, the current unit is the union of all objects compiled
+with LTO).  Since some functions and variables need to
+be referenced externally, for example by another DSO or from an
+assembler file, GCC also provides the function and variable
+attribute @code{externally_visible} which can be used to disable
+the effect of @option{-fwhole-program} on a specific symbol.
+
+The whole program mode assumptions are slightly more complex in
+C++, where inline functions in headers are put into @emph{COMDAT}
+sections.  COMDAT functions and variables can be defined by
+multiple object files and their bodies are unified at link-time
+and dynamic link-time.  COMDAT functions are changed to local only
+when their address is not taken and thus un-sharing them with a
+library is not harmful.  COMDAT variables always remain externally
+visible; however, for read-only variables it is assumed that their
+initializers cannot be overwritten by a different value.
+
+GCC provides the function and variable attribute
+@code{visibility} that can be used to specify the visibility of
+externally visible symbols (or alternatively the
+@option{-fvisibility} command line option).  ELF defines
+the @code{default}, @code{protected}, @code{hidden} and
+@code{internal} visibilities.
+
+The most commonly used visibility is @code{hidden}.  It
+specifies that the symbol cannot be referenced from outside of
+the current shared library. Unfortunately, this information
+cannot be used directly by the link-time optimization in the
+compiler since the whole shared library also might contain
+non-LTO objects and those are not visible to the compiler.
+
+GCC solves this problem using linker plugins.  A @emph{linker
+plugin} is an interface to the linker that allows an external
+program to claim the ownership of a given object file.  The linker
+then performs the linking procedure by querying the plugin about
+the symbol table of the claimed objects and once the linking
+decisions are complete, the plugin is allowed to provide the
+final object file before the actual linking is made.  The linker
+plugin obtains the symbol resolution information which specifies
+which symbols provided by the claimed objects are bound from the
+rest of a binary being linked.
+
+Currently, the linker plugin works only in combination
+with the Gold linker, but a GNU ld implementation is under
+development.
+
+GCC is designed to be independent of the rest of the toolchain
+and aims to support linkers without plugin support.  For this
+reason it does not use the linker plugin by default.  Instead,
+the object files are examined by @command{collect2} before being
+passed to the linker and objects found to have LTO sections are
+passed to @command{lto1} first.  This mode does not work for
+library archives. The decision on what object files from the
+archive are needed depends on the actual linking and thus GCC
+would have to implement the linker itself.  The resolution
+information is missing too and thus GCC needs to make an educated
+guess based on @option{-fwhole-program}.  Without the linker
+plugin GCC also assumes that symbols are declared @code{hidden}
+and not referred to by non-LTO code by default.
+
+@section Internal flags controlling @code{lto1}
+
+The following flags are passed into @command{lto1} and are not
+meant to be used directly from the command line.
+
+@itemize
+@item -fwpa
+@opindex fwpa
+This option runs the serial part of the link-time optimizer
+performing the inter-procedural propagation (WPA mode).  The
+compiler reads in summary information from all inputs and
+performs an analysis based on summary information only.  It
+generates object files for subsequent runs of the link-time
+optimizer where individual object files are optimized using both
+summary information from the WPA mode and the actual function
+bodies.  It then drives the LTRANS phase.
+
+@item -fltrans
+@opindex fltrans
+This option runs the link-time optimizer in the
+local-transformation (LTRANS) mode, which reads in output from a
+previous run of the LTO in WPA mode. In the LTRANS mode, LTO
+optimizes an object and produces the final assembly.
+
+@item -fltrans-output-list=@var{file}
+@opindex fltrans-output-list
+This option specifies a file to which the names of LTRANS output
+files are written.  This option is only meaningful in conjunction
+with @option{-fwpa}.
+@end itemize
Index: doc/gccint.texi
===================================================================
--- doc/gccint.texi	(revision 166733)
+++ doc/gccint.texi	(working copy)
@@ -123,6 +123,7 @@  Additional tutorial information is linke
 * Header Dirs::     Understanding the standard header file directories.
 * Type Information:: GCC's memory management; generating type information.
 * Plugins::         Extending the compiler with plugins.
+* LTO::             Using Link-Time Optimization.
 
 * Funding::         How to help assure funding for free software.
 * GNU Project::     The GNU Project and GNU/Linux.
@@ -158,6 +159,7 @@  Additional tutorial information is linke
 @include headerdirs.texi
 @include gty.texi
 @include plugins.texi
+@include lto.texi
 
 @include funding.texi
 @include gnu.texi
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 166733)
+++ doc/invoke.texi	(working copy)
@@ -356,8 +356,8 @@  Objective-C and Objective-C++ Dialects}.
 -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
 -fivopts -fkeep-inline-functions -fkeep-static-consts @gol
 -floop-block -floop-flatten -floop-interchange -floop-strip-mine @gol
--floop-parallelize-all -flto -flto-compression-level -flto-partition=@var{alg} @gol
--flto-report -fltrans -fltrans-output-list -fmerge-all-constants @gol
+-floop-parallelize-all -flto -flto-compression-level @gol
+-flto-partition=@var{alg} -flto-report -fmerge-all-constants @gol
 -fmerge-constants -fmodulo-sched -fmodulo-sched-allow-regmoves @gol
 -fmove-loop-invariants fmudflap -fmudflapir -fmudflapth -fno-branch-count-reg @gol
 -fno-default-inline @gol
@@ -399,7 +399,7 @@  Objective-C and Objective-C++ Dialects}.
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
--fwhole-program -fwhopr[=@var{n}] -fwpa -fuse-linker-plugin @gol
+-fwhole-program -fwpa -fuse-linker-plugin @gol
 --param @var{name}=@var{value}
 -O  -O0  -O1  -O2  -O3  -Os -Ofast}
 
@@ -7489,6 +7489,16 @@  The only important thing to keep in mind
 optimizations the @option{-flto} flag needs to be passed to both the
 compile and the link commands.
 
+To make whole program optimization effective, it is necessary to make
+certain whole program assumptions.  The compiler needs to know
+what functions and variables can be accessed by libraries and runtime
+outside of the link time optimized unit.  When supported by the linker,
+the linker plugin (see @option{-fuse-linker-plugin}) passes to the
+compiler information about used and externally visible symbols.  When
+the linker plugin is not available, @option{-fwhole-program} should be
+used to allow the compiler to make these assumptions, which will lead
+to more aggressive optimization decisions.
+
 Note that when a file is compiled with @option{-flto}, the generated
 object file will be larger than a regular object file because it will
 contain GIMPLE bytecodes and the usual final code.  This means that
@@ -7601,16 +7611,18 @@  GCC will not work with an older/newer ve
 
 Link time optimization does not play well with generating debugging
 information.  Combining @option{-flto} with
-@option{-g} is experimental.
+@option{-g} is currently experimental and expected to produce wrong
+results.
 
-If you specify the optional @var{n} the link stage is executed in
-parallel using @var{n} parallel jobs by utilizing an installed
-@command{make} program.  The environment variable @env{MAKE} may be
-used to override the program used.
+If you specify the optional @var{n}, the optimization and code
+generation done at link time is executed in parallel using @var{n}
+parallel jobs by utilizing an installed @command{make} program.  The
+environment variable @env{MAKE} may be used to override the program
+used.  The default value for @var{n} is 1.
 
-You can also specify @option{-fwhopr=jobserver} to use GNU make's 
+You can also specify @option{-flto=jobserver} to use GNU make's 
 job server mode to determine the number of parallel jobs. This 
-is useful when the Makefile calling GCC is already parallel.
+is useful when the Makefile calling GCC is already executing in parallel.
 The parent Makefile will need a @samp{+} prepended to the command recipe
 for this to work. This will likely only work if @env{MAKE} is 
 GNU make.
@@ -7619,53 +7631,17 @@  This option is disabled by default.
 
 @item -flto-partition=@var{alg}
 @opindex flto-partition
-Specify partitioning algorithm used by @option{-fwhopr} mode.  The value is
-either @code{1to1} to specify partitioning corresponding to source files
-or @code{balanced} to specify partitioning into, if possible, equally sized
-chunks.  Specifying @code{none} as an algorithm disables partitioning
-and streaming completely.
-The default value is @code{balanced}.
-
-@item -fwpa
-@opindex fwpa
-This is an internal option used by GCC when compiling with
-@option{-fwhopr}.  You should never need to use it.
-
-This option runs the link-time optimizer in the whole-program-analysis
-(WPA) mode, which reads in summary information from all inputs and
-performs a whole-program analysis based on summary information only.
-It generates object files for subsequent runs of the link-time
-optimizer where individual object files are optimized using both
-summary information from the WPA mode and the actual function bodies.
-It then drives the LTRANS phase.
-
-Disabled by default.
-
-@item -fltrans
-@opindex fltrans
-This is an internal option used by GCC when compiling with
-@option{-fwhopr}.  You should never need to use it.
-
-This option runs the link-time optimizer in the local-transformation (LTRANS)
-mode, which reads in output from a previous run of the LTO in WPA mode.
-In the LTRANS mode, LTO optimizes an object and produces the final assembly.
-
-Disabled by default.
-
-@item -fltrans-output-list=@var{file}
-@opindex fltrans-output-list
-This is an internal option used by GCC when compiling with
-@option{-fwhopr}.  You should never need to use it.
-
-This option specifies a file to which the names of LTRANS output files are
-written.  This option is only meaningful in conjunction with @option{-fwpa}.
-
-Disabled by default.
+Specify the partitioning algorithm used by the link time optimizer.
+The value is either @code{1to1} to specify a partitioning mirroring
+the original source files or @code{balanced} to specify partitioning
+into equally sized chunks (whenever possible).  Specifying @code{none}
+as an algorithm disables partitioning and streaming completely.  The
+default value is @code{balanced}.
 
 @item -flto-compression-level=@var{n}
 This option specifies the level of compression used for intermediate
 language written to LTO object files, and is only meaningful in
-conjunction with LTO mode (@option{-fwhopr}, @option{-flto}).  Valid
+conjunction with LTO mode (@option{-flto}).  Valid
 values are 0 (no compression) to 9 (maximum compression).  Values
 outside this range are clamped to either 0 or 9.  If the option is not
 given, a default balanced compression setting is used.
@@ -7674,7 +7650,7 @@  given, a default balanced compression se
 Prints a report with internal details on the workings of the link-time
 optimizer.  The contents of this report vary from version to version,
 it is meant to be useful to GCC developers when processing object
-files in LTO mode (via @option{-fwhopr} or @option{-flto}).
+files in LTO mode (via @option{-flto}).
 
 Disabled by default.