This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

LLVM/GCC Integration Proposal


Hi Everyone,

At the request of several members of the GCC community, I'm writing this email to describe some of my short-term plans with GCC and describe an alternative to the recent link-time optimization [1] and code generator rewrite [2] proposals.

For those who are not familiar with me, I'm one of the main developers working on the LLVM project (http://llvm.org/). One important way that LLVM is currently used is as a back-end for GCC. In this role, it provides a static optimizer, interprocedural link- time optimizer, JIT support, and several other features. Until recently, LLVM has only been loosely integrated with an old version of GCC (a 3.4 prerelease), which limited its effectiveness.

Recently, at Apple, I have been working on a new version of the llvm- gcc translation layer, built on GCC 4. This implementation links the LLVM optimizers and code generator directly into the GCC process, replacing the tree-ssa optimizers and the RTL code generator with the corresponding LLVM components when enabled. The end result is a compiler that is command line compatible with GCC: 'gcc -S t.c -o t.s' does exactly what you'd expect, and most standard command line options are supported (those that aren't are very specific to the design of the RTL backend, which we just ignore). I plan to have this work committed to the Apple branch within a month.

Though not yet implemented, we intend to support link-time optimization with many design constraints that match the recent proposal [1]. Because LLVM already currently supports link-time optimization and has an architecture that makes it straight-forward, this work mainly amounts to changes in the GCC compiler-driver. If you're interested in the link-time IPO architecture, there are documents that describe the high level ideas [10,11] with some (potentially out of date) implementation information.

In this email, I want to briefly talk about the strengths that LLVM brings to the table, some of the implementation details of my integration work, some of the important ongoing work that we are working on, and answer some of the hot questions that will inevitably come up. :)


==== Strengths of LLVM


LLVM is a modern and efficient compiler framework built out of libraries with well defined APIs. As I mentioned above, LLVM provides an optimizer and code generator. It also provides several other components I won't discuss here. If you are interested, please see LLVM's extensive documentation [9] for more information.

The LLVM mid-level and interprocedural optimizer work on a common representation (the LLVM IR), which is a three-address SSA-based representation that is somewhat similar to GIMPLE. It is fully specified [3], is easy to analyze and manipulate [4], and is very memory efficient (for example, it takes about 50M of memory on a 32- bit host to hold the IR for all of 176.gcc, which is about 230K LOC). The IR has a text form (suitable for compiler dumps and fine- grained regression testing) and a 100% equivalent compressed binary form (suitable for interchange between compile-time and link-time optimization steps), both of which correspond to the in-memory IR.

The IR supports several features that are useful to various communities, including true tail calls, accurate garbage collection, etc. The IR is continuously evolving to add new features, and we plan several extensions in the future.

The optimizer itself has a full suite of standard scalar optimizations and also includes a collection of interprocedural optimizations and interprocedural analyses, some of which are quite aggressive [5] (though this particular set is not enabled by default). The optimizer is fully modular, which allows us to have nice tools for working with the IR and optimizer, including an automated bug finding tool [6] which makes tracking down miscompilations and ICEs really easy.

The LLVM code generator is built on modern techniques. For example, it uses pattern-matching DAG-based instruction selection, maintains SSA form up until register allocation, represents code in a form quite similar to the "compressed RTL" proposal [7], and supports dynamically loaded targets. We currently have stable code generators for PowerPC and X86, with targets for Alpha, IA-64, and Sparc also available, but less stable. LLVM can also emit C code, which allows it to support systems for which we do not have a native code generator implemented.

The design of the register allocation components is very close to that proposed in Andrew MacLeod's recent proposal [2]. We have separate instruction selection, global coalescing, and register allocation stages, have a spiller that does simple local register allocation, etc. We currently support multiple pluggable register allocators, to (for example) support quick -O0 compiles, and to support targets who want to have completely custom allocators. The spiller is currently capable of folding spill code into instructions, but does not support rematerialization yet.


==== Implementation Details


My current work involves integrating the LLVM optimizer and code generator into GCC as mentioned above. Here is a basic sketch for the major pieces this entails:

1. The build system is taught about C++ code.
2. A configure option (--enable-llvm) has been added to GCC's configure
script. This enables an ENABLE_LLVM #define. Every LLVM-specific
piece of code is protected by this macro. Compiling without -- enable-llvm
results in no behavior change.
3. Main is compiled as a C++ function, if ENABLE_LLVM is set.
4. A new header file, llvm.h, defines a small number of C functions which
the extant GCC code calls into. For example, tree_rest_of_compilation
calls through this interface to compile a function body. The GCC C code
is all still compiled with a C compiler. These calls are not particularly
invasive, and are all protected with ENABLE_LLVM as described above.
5. The LLVM interface is implemented with 4 C++ files, which are currently
about 4000 lines of code. This code basically translates GIMPLE into LLVM.
6. The binary links in the standard LLVM libraries, including optimizations
and code generators taken from an LLVM build tree.


This design point makes the work a completely optional opt-in component with a low impact on the existing GCC source base. We clearly do not support all of the targets that GCC does, so those unsupported ones will not be able to use LLVM (currently, see below for more information).


==== Ongoing Work


While it has many virtues, LLVM is still not complete, and we are investing in adding the missing features. In particular, this system is missing two key features to be truly useful: debug info support and inline asm support. Apple is currently investing in adding both these features, as well as adding vector support.

In the future, I anticipate work to extend the LLVM to be capable of capturing higher-level information (e.g. that needed for dynamic dispatch optimizations and more alias information), as well as supporting new targets and the usual collection of miscellaneous improvements to the current system.

The project is in early stages of development, but code performance and compile times are currently comparable to an unmodified GCC compiler on PowerPC.


==== Answers to some hot questions


* What about licensing and copyright assignment issues?

The patch I'm working on is GPL licensed and copyright will be assigned to the FSF under the standard Apple copyright assignment. Initially, I intend to link the LLVM libraries in from the existing LLVM distribution, mainly to simplify my work. This code is licensed under a BSD-like license [8], and LLVM itself will not initially be assigned to the FSF. If people are seriously in favor of LLVM being a long-term part of GCC, I personally believe that the LLVM community would agree to assign the copyright of LLVM itself to the FSF and we can work through these details.

* LLVM is written in C++, eww yuck!

It is true. However, the C++ness only exists in the LLVM portions of the resultant compiler, no other parts have to be converted to C++ (well, except for main). Though there is a lot of potential to misuse C++ in horrible ways (it is a sharp tool after all) LLVM has been quite successful at using it for good, not evil. While I don't expect everyone to like use of C++, others will hopefully respect that it allows us to build good APIs, increase modularity, and get more done in less time.

* LLVM is missing support for representing high-level information!

First, this isn't true. LLVM does capture some important pieces of high-level information, and is continuing to add more. There is a fundamental design tension here between building a compiler that supports multiple languages (in a link-time context, that means it can link translation units from multiple languages together and optimize across the boundary) and building a compiler that supports only one.

Specifically, advocates of the recent link-time-optimization proposal [1] may claim that exposing all of the information that DWARF captures (e.g. about the type system) at link-time is a good thing: it makes it easier to do specific high-level optimizations, because all of the information is trivially available. However, they are ignoring the issue of handling multiple languages at the same time: an optimizer with this design has to be aware of (and be able to update) all possible source-language information and must be extended in the future as new languages are added (this is the problem with universal compilers, dating back to the UNCOL design). Also, the linker needs to know how to combine declarations defined in different languages, because an interprocedural optimizer only want to see one declaration for each logical variable (for example). This quickly becomes difficult and messy, which is presumably why the link-time proposal allows the linker to "give up" linking two translation units.

In contrast, LLVM has historically leaned the other way: it exposes only the *important* high-level information in ways that are useful to the optimizers. As we find that we need more information, we extend it to capture this information as needed. This allows the optimizers to be simple and forces us to think about the key aspects of the information we need to capture before we add it. LLVM will continue to evolve this way, being extended to support new information as it is needed.

* LLVM is missing X, Y, Z feature!

We are continuing to improve and extend LLVM. As I mention above, three of the most important ones for Apple (debug info, inline asm, and vector support) are on the short-term todo list. Others will be tackled in time (and help is always appreciated!).

* This will not support <some target>!

As describe above, we won't support every target that GCC currently does. Three options are possible:

1. We (the GCC community) could build an LLVM to GIMPLE translator. This would probably take about as much work as the GIMPLE -> LLVM translator (about 4000 LOC), which is not a huge project.
2. We could build an LLVM to RTL translator.
3. Target maintainers can build an LLVM port for their target.


In any case, it is certain that GCC will not drop its existing RTL backend any time in the near future.


==== Conclusion


As I mentioned above, this work is currently underway. I anticipate committing this patch to the Apple branch within a month, and am wondering if people are interested in having this functionality on the mainline GCC branch.

This work clearly overlaps with the two recent proposals [1,2]. I suggest that improving LLVM to address deficiencies in it would be easier and more man-hour-efficient than building a whole new interprocedural optimization component and rewriting large portions of the GCC backend. Further, this proposal differs from the other two in that it is already largely implemented and works. Finally, there is no reason that multiple different approaches cannot be developed in parallel, if people desire such an approach.

Thoughtful feedback appreciated,

-Chris

==== References

[1] http://gcc.gnu.org/ml/gcc/2005-11/msg00735.html
[2] http://gcc.gnu.org/ml/gcc/2005-11/msg00783.html
[3] http://llvm.org/docs/LangRef.html
[4] http://llvm.org/pubs/2004-09-22-LCPCLLVMTutorial.html
[5] http://llvm.org/pubs/2005-05-04-LattnerPHDThesis.html
[6] http://llvm.org/docs/Bugpoint.html
[7] http://gcc.gnu.org/wiki/Speedup%20areas
[8] http://llvm.org/releases/1.6/LICENSE.TXT
[9] http://llvm.org/docs/
[10] http://llvm.org/pubs/2004-01-30-CGO-LLVM.html
[11] http://llvm.org/pubs/2003-05-01-GCCSummit2003.html




Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]