

24 Link Time Optimization

24.1 Design Overview

Link Time Optimization is implemented as a GCC front end for a bytecode representation of GIMPLE that is emitted in special sections of .o files. Currently, LTO support is enabled in most ELF-based systems, as well as on darwin, cygwin and mingw systems.

Since the GIMPLE bytecode is saved alongside the final object code, object files generated with LTO support are larger than regular object files. This "fat" object format makes it easy to integrate LTO into existing build systems, for example when building archives of object files.

The current implementation only produces "fat" object files. Although this effectively doubles compilation time and can increase file sizes up to five-fold, it hides the problem that some tools, such as ar and nm, need to understand the symbol tables of the LTO sections. These tools can be extended to use the plugin infrastructure, and, with these problems solved, GCC will also support "slim" objects consisting of the intermediate code alone.

At a high level, LTO splits the compiler in two. The first half (the "writer") produces a streamed representation of all the internal data structures needed to optimize and generate code. This includes declarations, types, the callgraph and the GIMPLE representation of function bodies.

When a source file is compiled with the option -flto, the pass manager executes all the passes in all_lto_gen_passes. Currently, this phase is composed of two IPA passes:

  1. pass_ipa_lto_gimple_out. This pass executes the function lto_output in lto-streamer-out.c, which traverses the call graph encoding every reachable declaration, type and function. This generates a memory representation of all the file sections described below.
  2. pass_ipa_lto_finish_out. This pass executes the function produce_asm_for_decls in lto-streamer-out.c, which takes the memory image built by the previous pass and encodes it in the corresponding sections of the object file.

The second half of LTO is the "reader". It is implemented as the GCC front end lto1 in lto/lto.c. When collect2 detects that the set of .o/.a files being linked contains LTO information and the option -flto is enabled, it invokes lto1, which reads the set of files and aggregates them into a single translation unit for optimization. The main entry point for the reader is lto/lto.c:lto_main.

24.1.1 LTO modes of operation

One of the main goals of the GCC link-time infrastructure is to allow effective compilation of large programs. For this reason, GCC implements two link-time compilation modes.

  1. LTO mode, in which the whole program is read into the compiler at link time and optimized as if it were a single source-level compilation unit.
  2. WHOPR or partitioned mode, designed to utilize multiple CPUs and/or a distributed compilation environment to link large applications. WHOPR stands for WHOle Program optimizeR (not to be confused with the semantics of -fwhole-program). It partitions the callgraph aggregated from the many different .o files and distributes the compilation of the partitioned sub-graphs across different CPUs.

    Note that distributed compilation is not implemented yet, but since the parallelism is facilitated by generating a Makefile, it would be easy to implement.

WHOPR splits LTO into three main stages:

  1. Local generation (LGEN). This stage is executed in parallel. Every file in the program is compiled into the intermediate language and packaged together with the local call-graph and summary information. This stage is the same for both the LTO and WHOPR compilation modes.
  2. Whole Program Analysis (WPA). WPA is performed sequentially. The global call-graph is generated, and a global analysis procedure makes transformation decisions. The global call-graph is partitioned to facilitate parallel optimization during stage 3. The results of the WPA stage are stored into new object files which contain the partitioned program expressed in the intermediate language together with the optimization decisions.
  3. Local transformation (LTRANS). This stage is executed in parallel. All the decisions made during stage 2 are applied locally to each partitioned object file, and the final object code is generated. Optimizations which cannot be decided efficiently during stage 2 may be performed on the local call-graph partitions.

WHOPR can be seen as an extension of the usual LTO mode of compilation. In LTO mode, WPA and LTRANS are executed within a single run of the compiler, after the whole program has been read into memory.

When compiling in WHOPR mode, the callgraph is partitioned during the WPA stage. The whole program is split into a given number of partitions of roughly equal size, and the compiler tries to minimize the number of references which cross partition boundaries. The main advantage of WHOPR is that it allows the parallel execution of the LTRANS stage, which is the most time-consuming part of the compilation process. In addition, it avoids loading the whole program into memory.
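
As a rough illustration of the balanced partitioning idea, the sketch below greedily fills partitions up to an equal share of the total size. It is only a sketch: the node names, sizes and ordering are invented for the example, and GCC's actual partitioner additionally tries to keep callers and callees in the same partition so that fewer references cross partition boundaries.

     #include <stdio.h>

     /* A hypothetical callgraph node: a name and an estimated size.  */
     struct node { const char *name; int size; };

     /* Greedily assign nodes to N_PARTS partitions of roughly equal
        total size.  */
     static void
     partition (struct node *nodes, int n_nodes, int n_parts)
     {
       int total = 0, i;
       for (i = 0; i < n_nodes; i++)
         total += nodes[i].size;

       int target = total / n_parts, part = 0, filled = 0;
       for (i = 0; i < n_nodes; i++)
         {
           printf ("partition %d: %s\n", part, nodes[i].name);
           filled += nodes[i].size;
           /* Start a new partition once the current one reaches its
              share of the total size.  */
           if (filled >= target && part < n_parts - 1)
             {
               part++;
               filled = 0;
             }
         }
     }

     int
     main (void)
     {
       struct node nodes[] = { {"main", 10}, {"parse", 40},
                               {"optimize", 35}, {"emit", 25},
                               {"report", 10} };
       partition (nodes, 5, 2);
       return 0;
     }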

24.2 LTO file sections

LTO information is stored in several sections of the object file. The data structures and enum codes for these sections are defined in lto-streamer.h.

These sections are emitted from lto-streamer-out.c and mapped in all at once from lto/lto.c:lto_file_read. The individual functions dealing with the reading and writing of each section are described below.

24.3 Using summary information in IPA passes

Programs are represented internally as a callgraph (a multi-graph where nodes are functions and edges are call sites) and a varpool (a list of static and external variables in the program).
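
As a concrete illustration, the comments in the following translation unit describe the callgraph nodes, call edges and varpool entries that correspond to it (the mapping is conceptual; it is not the output of any GCC dump):

     static int counter;          /* varpool: static variable 'counter'   */
     extern int limit;            /* varpool: external variable 'limit'   */

     static int helper (int x)    /* callgraph node 'helper'              */
     {
       return x + counter;
     }

     int work (int x)             /* callgraph node 'work'                */
     {
       counter++;                 /* direct store to 'counter'            */
       if (x < limit)             /* direct read of 'limit'               */
         return helper (x)        /* call edge work -> helper             */
                + helper (x + 1); /* a second, parallel edge
                                     work -> helper: the callgraph is
                                     a multi-graph                        */
       return x;
     }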

The inter-procedural optimization is organized as a sequence of individual passes, which operate on the callgraph and the varpool. To make the implementation of WHOPR possible, every inter-procedural optimization pass is split into several stages that are executed at different times during WHOPR compilation:

  1. Generate summary and Write summary, executed at LGEN time: each pass analyzes the function bodies and variable initializers and writes its pass-specific summary data into the object file.
  2. Read summary, Execute and Write optimization summary, executed at WPA time: the summaries are read back into memory, the inter-procedural propagation is performed on the callgraph, and the resulting optimization decisions are streamed out for the partitions.
  3. Read optimization summary and Transform, executed at LTRANS time: the decisions made during WPA are applied to the function bodies and the final code is generated.

The implementation of the inter-procedural passes is shared between LTO, WHOPR and classic non-LTO compilation.

To simplify development, the GCC pass manager differentiates between normal inter-procedural passes and small inter-procedural passes. A small inter-procedural pass (SIMPLE_IPA_PASS) is a pass that does everything at once and thus cannot be executed during WPA in WHOPR mode. It defines only the Execute stage, and during this stage it accesses and modifies the function bodies. Such passes are useful for optimization at LGEN or LTRANS time and are used, for example, to implement early optimization before writing object files. The simple inter-procedural passes can also be used for easier prototyping and development of a new inter-procedural pass.

24.3.1 Virtual clones

One of the main challenges of introducing the WHOPR compilation mode was addressing the interactions between optimization passes. In LTO compilation mode, the passes are executed in a sequence, each of which consists of analysis (or Generate summary), propagation (or Execute) and Transform stages. Once the work of one pass is finished, the next pass sees the updated program representation and can execute. This makes the individual passes dependent on each other.

In WHOPR mode all passes first execute their Generate summary stage. Then summary writing marks the end of the LGEN stage. At WPA time, the summaries are read back into memory and all passes run the Execute stage. Optimization summaries are streamed and sent to LTRANS, where all the passes execute the Transform stage.

Most optimization passes split naturally into analysis, propagation and transformation stages. But some do not. The main problem arises when one pass performs changes and the following pass gets confused by seeing different callgraphs between the Transform stage and the Generate summary or Execute stage. This means that the passes are required to communicate their decisions with each other.

To facilitate this communication, the GCC callgraph infrastructure implements virtual clones, a method of representing the changes performed by the optimization passes in the callgraph without needing to update function bodies.

A virtual clone in the callgraph is a function that has no associated body, just a description of how to create its body based on a different function (which itself may be a virtual clone).

The description of function modifications includes adjustments to the function's signature (which allows, for example, removing or adding function arguments), substitutions to perform on the function body, and, for inlined functions, a pointer to the function that it will be inlined into.

It is also possible to redirect any edge of the callgraph from a function to its virtual clone. This implies updating of the call site to adjust for the new function signature.

Most of the transformations performed by inter-procedural optimizations can be represented via virtual clones. For instance, a constant propagation pass can produce a virtual clone of the function which replaces one of its arguments by a constant. The inliner can represent its decisions by producing a clone of a function whose body will be later integrated into a given function.
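
A source-level way to picture such a constant-propagation clone is shown below. The clone is written out by hand purely for illustration (the function names are invented); in reality it exists only as a callgraph description until it is materialized during LTRANS:

     /* Original function.  */
     static int scale (int x, int factor)
     {
       return x * factor;
     }

     /* Every caller passes factor == 2, so constant propagation can
        create a virtual clone described as "scale with the second
        parameter removed and replaced by the constant 2".  Once
        materialized, it would look roughly like this (and the
        original, now unreachable, would be removed): */
     static int scale_constprop_2 (int x)
     {
       return x * 2;
     }

     int twice (int x)
     {
       /* The call edge is redirected to the clone and the call site is
          updated for the new signature: the constant argument is
          dropped.  */
       return scale_constprop_2 (x);
     }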

Using virtual clones, the program can be easily updated during the Execute stage, solving most of the pass interaction problems that would otherwise occur during Transform.

Virtual clones are later materialized in the LTRANS stage and turned into real functions. Passes executed after the virtual clone was introduced also perform their Transform stage on the new functions, so for a pass there is no significant difference between operating on a real function or on a virtual clone introduced before its Execute stage.

Optimization passes then work on virtual clones introduced before their Execute stage as if they were real functions. The only difference is that clones are not visible during the Generate Summary stage.

To keep function summaries updated, the callgraph interface allows an optimizer to register a callback that is called every time a new clone is introduced as well as when the actual function or variable is generated or when a function or variable is removed. These hooks are registered in the Generate summary stage and allow the pass to keep its information intact until the Execute stage. The same hooks can also be registered during the Execute stage to keep the optimization summaries updated for the Transform stage.
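
The sketch below shows the general shape of this mechanism using invented names (summary, on_clone, on_removal); it is not the real callgraph hook API, only an illustration of how a registered callback lets a pass keep its per-function summary consistent when clones appear or functions are removed:

     #include <stdio.h>

     /* Hypothetical pass-local summary, keyed by a function id.  */
     #define MAX_FUNCS 64
     static int summary[MAX_FUNCS];

     /* Invented hook types: called when a clone of ORIGINAL is
        introduced, or when a function is removed.  */
     typedef void (*clone_hook) (int original, int clone);
     typedef void (*removal_hook) (int fn);

     static clone_hook on_clone;
     static removal_hook on_removal;

     /* The pass registers these during Generate summary (or Execute)
        so that its data stays valid until the later stages.  */
     static void copy_summary_to_clone (int original, int clone)
     { summary[clone] = summary[original]; }

     static void forget_summary (int fn)
     { summary[fn] = 0; }

     int
     main (void)
     {
       on_clone = copy_summary_to_clone;
       on_removal = forget_summary;

       summary[1] = 42;   /* summary computed for function 1        */
       on_clone (1, 7);   /* the callgraph introduces clone 7 of 1  */
       on_removal (1);    /* the original is later removed          */
       printf ("summary of clone: %d\n", summary[7]);
       return 0;
     }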

24.3.2 IPA references

GCC represents IPA references in the callgraph. For a function or variable A, the IPA reference is a list of all locations where the address of A is taken and, when A is a variable, a list of all direct stores and reads to/from A. References represent an oriented multi-graph on the union of nodes of the callgraph and the varpool. See ipa-reference.c:ipa_reference_write_optimization_summary and ipa-reference.c:ipa_reference_read_optimization_summary for details.
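
The following example shows, in the comments, which IPA references would be recorded for the variable counter (the code is illustrative; the exact reference kinds and their internal names depend on the GCC version):

     static int counter;

     int read_counter (void)
     {
       return counter;     /* a direct read (load) of 'counter'      */
     }

     void bump_counter (void)
     {
       counter += 1;       /* a direct store to 'counter'            */
     }

     int *counter_address (void)
     {
       return &counter;    /* the address of 'counter' is taken;
                              subsequent indirect accesses through
                              the pointer are no longer tracked as
                              direct reads or stores                 */
     }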

24.3.3 Jump functions

Suppose that an optimization pass sees a function A and it knows the values of (some of) its arguments. The jump function describes the value of a parameter of a given function call in function A based on this knowledge.

Jump functions are used by several optimizations, such as the inter-procedural constant propagation pass and the devirtualization pass. The inliner also uses jump functions to perform inlining of callbacks.
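
The example below illustrates, in the comments, what the jump functions for two call sites describe. The constant and pass-through classifications follow the terminology used by GCC's inter-procedural constant propagation; the code itself is invented for illustration:

     static int add (int a, int b)
     {
       return a + b;
     }

     int caller (int x)
     {
       /* Jump functions for this call site in 'caller':
            argument 0: pass-through of caller's parameter 'x'
            argument 1: constant 16                                  */
       int t = add (x, 16);

       /* Jump functions for this call site:
            argument 0: constant 1
            argument 1: pass-through of 'x' with the operation '+ 1'
                        applied                                      */
       return add (1, x + 1) + t;
     }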

24.4 Whole program assumptions, linker plugin and symbol visibilities

Link-time optimization gives relatively minor benefits when used alone. The problem is that propagation of inter-procedural information does not work well across functions and variables that are called or referenced by other compilation units (such as from a dynamically linked library). We say that such functions and variables are externally visible.

To make the situation even more difficult, many applications organize themselves as a set of shared libraries, and the default ELF visibility rules allow one to override any externally visible symbol with a different symbol at runtime. This basically disables any optimizations across such functions and variables, because the compiler cannot be sure that the function body it is seeing is the same function body that will be used at runtime. Any function or variable not declared static in the sources degrades the quality of inter-procedural optimization.

To avoid this problem the compiler must assume that it sees the whole program when doing link-time optimization. Strictly speaking, the whole program is rarely visible even at link-time. Standard system libraries are usually linked dynamically or not provided with the link-time information. In GCC, the whole program option (-fwhole-program) asserts that every function and variable defined in the current compilation unit is static, except for function main (note: at link-time, the current unit is the union of all objects compiled with LTO). Since some functions and variables need to be referenced externally, for example by another DSO or from an assembler file, GCC also provides the function and variable attribute externally_visible which can be used to disable the effect of -fwhole-program on a specific symbol.
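
For example, assuming the whole set of objects is compiled with -flto -fwhole-program, a symbol that must remain visible to non-LTO code (say, a hand-written assembler file) can be annotated explicitly:

     /* With -fwhole-program, 'helper' is treated as if it were static,
        enabling aggressive inter-procedural optimization, even though
        it is not declared static in the source.  */
     int helper (int x)
     {
       return x * 3;
     }

     /* 'entry_point' is referenced from an assembler file, so the
        effect of -fwhole-program is disabled for this symbol.  */
     __attribute__ ((externally_visible))
     int entry_point (int x)
     {
       return helper (x) + 1;
     }

     int
     main (void)
     {
       return entry_point (2);
     }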

The whole program mode assumptions are slightly more complex in C++, where inline functions in headers are put into COMDAT sections. COMDAT functions and variables can be defined by multiple object files and their bodies are unified at link-time and dynamic link-time. COMDAT functions are changed to local only when their address is not taken, and thus un-sharing them with a library is not harmful. COMDAT variables always remain externally visible, however for read-only variables it is assumed that their initializers cannot be overwritten by a different value.

GCC provides the function and variable attribute visibility that can be used to specify the visibility of externally visible symbols (or alternatively the -fvisibility command-line option). ELF defines the default, protected, hidden and internal visibilities.

The most commonly used visibility is hidden. It specifies that the symbol cannot be referenced from outside of the current shared library. Unfortunately, this information cannot be used directly by link-time optimization in the compiler, since the whole shared library may also contain non-LTO objects and those are not visible to the compiler.
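
For example, a symbol internal to a shared library can be given hidden visibility while the public interface keeps the default visibility (the function names are invented for illustration):

     /* Not referenceable from outside the shared library that contains
        it; calls to it can be optimized freely within the DSO.  */
     __attribute__ ((visibility ("hidden")))
     int internal_helper (int x)
     {
       return x + 1;
     }

     /* Part of the library's public interface; with the default ELF
        rules this definition may still be interposed at runtime, which
        limits inter-procedural optimization across it.  */
     __attribute__ ((visibility ("default")))
     int public_api (int x)
     {
       return internal_helper (x);
     }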

GCC solves this problem using linker plugins. A linker plugin is an interface to the linker that allows an external program to claim the ownership of a given object file. The linker then performs the linking procedure by querying the plugin about the symbol table of the claimed objects and once the linking decisions are complete, the plugin is allowed to provide the final object file before the actual linking is made. The linker plugin obtains the symbol resolution information which specifies which symbols provided by the claimed objects are bound from the rest of a binary being linked.

Currently, the linker plugin works only in combination with the Gold linker, but a GNU ld implementation is under development.

GCC is designed to be independent of the rest of the toolchain and aims to support linkers without plugin support. For this reason it does not use the linker plugin by default. Instead, the object files are examined by collect2 before being passed to the linker and objects found to have LTO sections are passed to lto1 first. This mode does not work for library archives. The decision on what object files from the archive are needed depends on the actual linking and thus GCC would have to implement the linker itself. The resolution information is missing too and thus GCC needs to make an educated guess based on -fwhole-program. Without the linker plugin GCC also assumes that symbols are declared hidden and not referred to by non-LTO code by default.

24.5 Internal flags controlling lto1

The following flags are passed into lto1 and are not meant to be used directly from the command line.