

24 Link Time Optimization

24.1 Design Overview

Link Time Optimization is implemented as a GCC front end for a bytecode representation of GIMPLE that is emitted in special sections of .o files. Currently, LTO support is enabled in most ELF-based systems, as well as on darwin, cygwin and mingw systems.

Since the GIMPLE bytecode is saved alongside the final object code, object files generated with LTO support are larger than regular object files. This "fat" object format makes it easy to integrate LTO into existing build systems, for example when building archives of object files.

The current implementation only produces "fat" object files. Although this effectively doubles compilation time and can increase file sizes up to five-fold, it hides the problem that some tools, such as ar and nm, need to understand the symbol tables of the LTO sections. These tools can be extended to use the plugin infrastructure, and, with these problems solved, GCC will also support "slim" objects consisting of the intermediate code alone.

At a high level, LTO splits the compiler in two. The first half (the "writer") produces a streamed representation of all the internal data structures needed to optimize and generate code. This includes declarations, types, the callgraph and the GIMPLE representation of function bodies.

When a source file is compiled with the option -flto, the pass manager executes all the passes in all_lto_gen_passes. Currently, this phase is composed of two IPA passes:

  1. pass_ipa_lto_gimple_out. This pass executes the function lto_output in lto-streamer-out.c, which traverses the call graph encoding every reachable declaration, type and function. This generates a memory representation of all the file sections described below.
  2. pass_ipa_lto_finish_out. This pass executes the function produce_asm_for_decls in lto-streamer-out.c, which takes the memory image built by the previous pass and encodes it in the corresponding sections of the object file.

The second half of LTO is the "reader". It is implemented as the GCC front end lto1 in lto/lto.c. When collect2 detects that the set of .o/.a files being linked contains LTO information and the option -flto is enabled, it invokes lto1, which reads the set of files and aggregates them into a single translation unit for optimization. The main entry point for the reader is lto/lto.c:lto_main.

24.1.1 LTO modes of operation

One of the main goals of the GCC link-time infrastructure is to allow effective compilation of large programs. For this reason, GCC implements two link-time compilation modes.

  1. LTO mode, in which the whole program is read into the compiler at link time and optimized as if it were a single source-level compilation unit.
  2. WHOPR or partitioned mode, designed to utilize multiple CPUs and/or a distributed compilation environment to link large applications. WHOPR stands for WHOle Program optimizeR (not to be confused with the semantics of -fwhole-program). It partitions the callgraph aggregated from the many different .o files and distributes the compilation of the partitioned sub-graphs across different CPUs.

    Note that distributed compilation is not implemented yet, but since the parallelism is facilitated by generating a Makefile, it would be easy to implement.

WHOPR splits LTO into three main stages:

  1. Local generation (LGEN). This stage is executed in parallel. Every file in the program is compiled into the intermediate language and packaged together with the local call-graph and summary information. This stage is the same for both the LTO and WHOPR compilation modes.
  2. Whole Program Analysis (WPA). WPA is performed sequentially. The global call-graph is generated, and a global analysis procedure makes transformation decisions. The global call-graph is partitioned to facilitate parallel optimization during stage 3. The results of the WPA stage are stored into new object files which contain the partitioned program expressed in the intermediate language together with the optimization decisions.
  3. Local transformation (LTRANS). This stage is executed in parallel. All the decisions made during stage 2 are applied locally to each partitioned object file, and the final object code is generated. Optimizations which cannot be decided efficiently during stage 2 may be performed on the local call-graph partitions.

WHOPR can be seen as an extension of the usual LTO mode of compilation. In LTO mode, WPA and LTRANS are executed within a single run of the compiler, after the whole program has been read into memory.

When compiling in WHOPR mode, the callgraph is partitioned during the WPA stage. The whole program is split into a given number of partitions of roughly equal size, and the compiler tries to minimize the number of references which cross partition boundaries. The main advantage of WHOPR is that it allows the parallel execution of the LTRANS stage, which is the most time-consuming part of the compilation process. In addition, it avoids loading the whole program into memory.
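
As a rough illustration of the balanced partitioning idea, the sketch below greedily fills partitions up to an equal share of the total size. It is only a sketch: the node names, sizes and ordering are invented for the example, and GCC's actual partitioner additionally tries to keep callers and callees in the same partition so that fewer references cross partition boundaries.

     #include <stdio.h>

     /* A hypothetical callgraph node: a name and an estimated size.  */
     struct node { const char *name; int size; };

     /* Greedily assign nodes to N_PARTS partitions of roughly equal
        total size.  */
     static void
     partition (struct node *nodes, int n_nodes, int n_parts)
     {
       int total = 0, i;
       for (i = 0; i < n_nodes; i++)
         total += nodes[i].size;

       int target = total / n_parts, part = 0, filled = 0;
       for (i = 0; i < n_nodes; i++)
         {
           printf ("partition %d: %s\n", part, nodes[i].name);
           filled += nodes[i].size;
           /* Start a new partition once the current one reaches its
              share of the total size.  */
           if (filled >= target && part < n_parts - 1)
             {
               part++;
               filled = 0;
             }
         }
     }

     int
     main (void)
     {
       struct node nodes[] = { {"main", 10}, {"parse", 40},
                               {"optimize", 35}, {"emit", 25},
                               {"report", 10} };
       partition (nodes, 5, 2);
       return 0;
     }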

24.2 LTO file sections

LTO information is stored in several sections of the object file. The data structures and enum codes for these sections are defined in lto-streamer.h.

These sections are emitted from lto-streamer-out.c and mapped in all at once from lto/lto.c:lto_file_read. The individual functions dealing with the reading and writing of each section are described below.

24.3 Using summary information in IPA passes

Programs are represented internally as a callgraph (a multi-graph where nodes are functions and edges are call sites) and a varpool (a list of static and external variables in the program).
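
As a concrete illustration, the comments in the following translation unit describe the callgraph nodes, call edges and varpool entries that correspond to it (the mapping is conceptual; it is not the output of any GCC dump):

     static int counter;          /* varpool: static variable 'counter'   */
     extern int limit;            /* varpool: external variable 'limit'   */

     static int helper (int x)    /* callgraph node 'helper'              */
     {
       return x + counter;
     }

     int work (int x)             /* callgraph node 'work'                */
     {
       counter++;                 /* direct store to 'counter'            */
       if (x < limit)             /* direct read of 'limit'               */
         return helper (x)        /* call edge work -> helper             */
                + helper (x + 1); /* a second, parallel edge
                                     work -> helper: the callgraph is
                                     a multi-graph                        */
       return x;
     }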

The inter-procedural optimization is organized as a sequence of individual passes, which operate on the callgraph and the varpool. To make the implementation of WHOPR possible, every inter-procedural optimization pass is split into several stages that are executed at different times during WHOPR compilation:

  1. Generate summary and Write summary, executed at LGEN time: each pass analyzes the function bodies and variable initializers and writes its pass-specific summary data into the object file.
  2. Read summary, Execute and Write optimization summary, executed at WPA time: the summaries are read back into memory, the inter-procedural propagation is performed on the callgraph, and the resulting optimization decisions are streamed out for the partitions.
  3. Read optimization summary and Transform, executed at LTRANS time: the decisions made during WPA are applied to the function bodies and the final code is generated.

The implementation of the inter-procedural passes is shared between LTO, WHOPR and classic non-LTO compilation.

To simplify development, the GCC pass manager differentiates between normal inter-procedural passes and small inter-procedural passes. A small inter-procedural pass (SIMPLE_IPA_PASS) is a pass that does everything at once and thus cannot be executed during WPA in WHOPR mode. It defines only the Execute stage, and during this stage it accesses and modifies the function bodies. Such passes are useful for optimization at LGEN or LTRANS time and are used, for example, to implement early optimization before writing object files. The simple inter-procedural passes can also be used for easier prototyping and development of a new inter-procedural pass.

24.3.1 Virtual clones

One of the main challenges of introducing the WHOPR compilation mode was addressing the interactions between optimization passes. In LTO compilation mode, the passes are executed in a sequence, each of which consists of analysis (or Generate summary), propagation (or Execute) and Transform stages. Once the work of one pass is finished, the next pass sees the updated program representation and can execute. This makes the individual passes dependent on each other.

In WHOPR mode all passes first execute their Generate summary stage. Then summary writing marks the end of the LGEN stage. At WPA time, the summaries are read back into memory and all passes run the Execute stage. Optimization summaries are streamed and sent to LTRANS, where all the passes execute the Transform stage.

Most optimization passes split naturally into analysis, propagation and transformation stages. But some do not. The main problem arises when one pass performs changes and the following pass gets confused by seeing different callgraphs between the Transform stage and the Generate summary or Execute stage. This means that the passes are required to communicate their decisions with each other.

To facilitate this communication, the GCC callgraph infrastructure implements virtual clones, a method of representing the changes performed by the optimization passes in the callgraph without needing to update function bodies.

A virtual clone in the callgraph is a function that has no associated body, just a description of how to create its body based on a different function (which itself may be a virtual clone).

The description of function modifications includes adjustments to the function's signature (which allows, for example, removing or adding function arguments), substitutions to perform on the function body, and, for inlined functions, a pointer to the function that it will be inlined into.

It is also possible to redirect any edge of the callgraph from a function to its virtual clone. This implies updating of the call site to adjust for the new function signature.

Most of the transformations performed by inter-procedural optimizations can be represented via virtual clones. For instance, a constant propagation pass can produce a virtual clone of the function which replaces one of its arguments by a constant. The inliner can represent its decisions by producing a clone of a function whose body will be later integrated into a given function.
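
A source-level way to picture such a constant-propagation clone is shown below. The clone is written out by hand purely for illustration (the function names are invented); in reality it exists only as a callgraph description until it is materialized during LTRANS:

     /* Original function.  */
     static int scale (int x, int factor)
     {
       return x * factor;
     }

     /* Every caller passes factor == 2, so constant propagation can
        create a virtual clone described as "scale with the second
        parameter removed and replaced by the constant 2".  Once
        materialized, it would look roughly like this (and the
        original, now unreachable, would be removed): */
     static int scale_constprop_2 (int x)
     {
       return x * 2;
     }

     int twice (int x)
     {
       /* The call edge is redirected to the clone and the call site is
          updated for the new signature: the constant argument is
          dropped.  */
       return scale_constprop_2 (x);
     }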

Using virtual clones, the program can be easily updated during the Execute stage, solving most of the pass interaction problems that would otherwise occur during Transform.

Virtual clones are later materialized in the LTRANS stage and turned into real functions. Passes executed after the virtual clone was introduced also perform their Transform stage on the new functions, so for a pass there is no significant difference between operating on a real function or on a virtual clone introduced before its Execute stage.

Optimization passes then work on virtual clones introduced before their Execute stage as if they were real functions. The only difference is that clones are not visible during the Generate Summary stage.

To keep function summaries updated, the callgraph interface allows an optimizer to register a callback that is called every time a new clone is introduced as well as when the actual function or variable is generated or when a function or variable is removed. These hooks are registered in the Generate summary stage and allow the pass to keep its information intact until the Execute stage. The same hooks can also be registered during the Execute stage to keep the optimization summaries updated for the Transform stage.
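
The sketch below shows the general shape of this mechanism using invented names (summary, on_clone, on_removal); it is not the real callgraph hook API, only an illustration of how a registered callback lets a pass keep its per-function summary consistent when clones appear or functions are removed:

     #include <stdio.h>

     /* Hypothetical pass-local summary, keyed by a function id.  */
     #define MAX_FUNCS 64
     static int summary[MAX_FUNCS];

     /* Invented hook types: called when a clone of ORIGINAL is
        introduced, or when a function is removed.  */
     typedef void (*clone_hook) (int original, int clone);
     typedef void (*removal_hook) (int fn);

     static clone_hook on_clone;
     static removal_hook on_removal;

     /* The pass registers these during Generate summary (or Execute)
        so that its data stays valid until the later stages.  */
     static void copy_summary_to_clone (int original, int clone)
     { summary[clone] = summary[original]; }

     static void forget_summary (int fn)
     { summary[fn] = 0; }

     int
     main (void)
     {
       on_clone = copy_summary_to_clone;
       on_removal = forget_summary;

       summary[1] = 42;   /* summary computed for function 1        */
       on_clone (1, 7);   /* the callgraph introduces clone 7 of 1  */
       on_removal (1);    /* the original is later removed          */
       printf ("summary of clone: %d\n", summary[7]);
       return 0;
     }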

24.3.2 IPA references

GCC represents IPA references in the callgraph. For a function or variable A, the IPA reference is a list of all locations where the address of A is taken and, when A is a variable, a list of all direct stores and reads to/from A. References represent an oriented multi-graph on the union of nodes of the callgraph and the varpool. See ipa-reference.c:ipa_reference_write_optimization_summary and ipa-reference.c:ipa_reference_read_optimization_summary for details.
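
The following example shows, in the comments, which IPA references would be recorded for the variable counter (the code is illustrative; the exact reference kinds and their internal names depend on the GCC version):

     static int counter;

     int read_counter (void)
     {
       return counter;     /* a direct read (load) of 'counter'      */
     }

     void bump_counter (void)
     {
       counter += 1;       /* a direct store to 'counter'            */
     }

     int *counter_address (void)
     {
       return &counter;    /* the address of 'counter' is taken;
                              subsequent indirect accesses through
                              the pointer are no longer tracked as
                              direct reads or stores                 */
     }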

24.3.3 Jump functions

Suppose that an optimization pass sees a function A and it knows the values of (some of) its arguments. The jump function describes the value of a parameter of a given function call in function A based on this knowledge.

Jump functions are used by several optimizations, such as the inter-procedural constant propagation pass and the devirtualization pass. The inliner also uses jump functions to perform inlining of callbacks.
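
The example below illustrates, in the comments, what the jump functions for two call sites describe. The constant and pass-through classifications follow the terminology used by GCC's inter-procedural constant propagation; the code itself is invented for illustration:

     static int add (int a, int b)
     {
       return a + b;
     }

     int caller (int x)
     {
       /* Jump functions for this call site in 'caller':
            argument 0: pass-through of caller's parameter 'x'
            argument 1: constant 16                                  */
       int t = add (x, 16);

       /* Jump functions for this call site:
            argument 0: constant 1
            argument 1: pass-through of 'x' with the operation '+ 1'
                        applied                                      */
       return add (1, x + 1) + t;
     }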

24.4 Whole program assumptions, linker plugin and symbol visibilities

Link-time optimization gives relatively minor benefits when used alone. The problem is that propagation of inter-procedural information does not work well across functions and variables that are called or referenced by other compilation units (such as from a dynamically linked library). We say that such functions and variables are externally visible.

To make the situation even more difficult, many applications organize themselves as a set of shared libraries, and the default ELF visibility rules allow one to override any externally visible symbol with a different symbol at runtime. This basically disables any optimizations across such functions and variables, because the compiler cannot be sure that the function body it is seeing is the same function body that will be used at runtime. Any function or variable not declared static in the sources degrades the quality of inter-procedural optimization.

To avoid this problem the compiler must assume that it sees the whole program when doing link-time optimization. Strictly speaking, the whole program is rarely visible even at link-time. Standard system libraries are usually linked dynamically or not provided with the link-time information. In GCC, the whole program option (-fwhole-program) asserts that every function and variable defined in the current compilation unit is static, except for function main (note: at link-time, the current unit is the union of all objects compiled with LTO). Since some functions and variables need to be referenced externally, for example by another DSO or from an assembler file, GCC also provides the function and variable attribute externally_visible which can be used to disable the effect of -fwhole-program on a specific symbol.
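
For example, assuming the whole set of objects is compiled with -flto -fwhole-program, a symbol that must remain visible to non-LTO code (say, a hand-written assembler file) can be annotated explicitly:

     /* With -fwhole-program, 'helper' is treated as if it were static,
        enabling aggressive inter-procedural optimization, even though
        it is not declared static in the source.  */
     int helper (int x)
     {
       return x * 3;
     }

     /* 'entry_point' is referenced from an assembler file, so the
        effect of -fwhole-program is disabled for this symbol.  */
     __attribute__ ((externally_visible))
     int entry_point (int x)
     {
       return helper (x) + 1;
     }

     int
     main (void)
     {
       return entry_point (2);
     }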

The whole program mode assumptions are slightly more complex in C++, where inline functions in headers are put into COMDAT sections. COMDAT functions and variables can be defined by multiple object files and their bodies are unified at link-time and dynamic link-time. COMDAT functions are changed to local only when their address is not taken, and thus un-sharing them with a library is not harmful. COMDAT variables always remain externally visible, however for read-only variables it is assumed that their initializers cannot be overwritten by a different value.

GCC provides the function and variable attribute visibility that can be used to specify the visibility of externally visible symbols (or alternatively the -fvisibility command-line option). ELF defines the default, protected, hidden and internal visibilities.

The most commonly used visibility is hidden. It specifies that the symbol cannot be referenced from outside of the current shared library. Unfortunately, this information cannot be used directly by link-time optimization in the compiler, since the whole shared library may also contain non-LTO objects and those are not visible to the compiler.
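
For example, a symbol internal to a shared library can be given hidden visibility while the public interface keeps the default visibility (the function names are invented for illustration):

     /* Not referenceable from outside the shared library that contains
        it; calls to it can be optimized freely within the DSO.  */
     __attribute__ ((visibility ("hidden")))
     int internal_helper (int x)
     {
       return x + 1;
     }

     /* Part of the library's public interface; with the default ELF
        rules this definition may still be interposed at runtime, which
        limits inter-procedural optimization across it.  */
     __attribute__ ((visibility ("default")))
     int public_api (int x)
     {
       return internal_helper (x);
     }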

GCC solves this problem using linker plugins. A linker plugin is an interface to the linker that allows an external program to claim the ownership of a given object file. The linker then performs the linking procedure by querying the plugin about the symbol table of the claimed objects and once the linking decisions are complete, the plugin is allowed to provide the final object file before the actual linking is made. The linker plugin obtains the symbol resolution information which specifies which symbols provided by the claimed objects are bound from the rest of a binary being linked.

Currently, the linker plugin works only in combination with the Gold linker, but a GNU ld implementation is under development.

GCC is designed to be independent of the rest of the toolchain and aims to support linkers without plugin support. For this reason it does not use the linker plugin by default. Instead, the object files are examined by collect2 before being passed to the linker and objects found to have LTO sections are passed to lto1 first. This mode does not work for library archives. The decision on what object files from the archive are needed depends on the actual linking and thus GCC would have to implement the linker itself. The resolution information is missing too and thus GCC needs to make an educated guess based on -fwhole-program. Without the linker plugin GCC also assumes that symbols are declared hidden and not referred to by non-LTO code by default.

24.5 Internal flags controlling lto1

The following flags are passed into lto1 and are not meant to be used directly from the command line.