GCC has grown too big and intertwined. This has made learning curves steeper, raised barriers to entry for new developers, and made the compiler harder to test, debug and understand.
We need to make the compiler more modular, introduce strong APIs, reduce global state, build and test components separately, add unit tests for components, and generally make the compiler a collection of distinct components with well-defined interfaces and behaviour.
Include files and Makefile dependencies
One problem with modularizing GCC is the spaghetti web of included files. Many C and header files include other header files to pick up a single declaration in some cases, or for no reason at all in others. The dependencies between the files are most easily identified by looking at the headers included by a .c implementation file. The makefile rules in Makefile.in and in the target make rules are also very helpful.
Very often, the reason for including files in a C source file can be traced back to the initial commit of a new file, indicating that the file was created as a copy of another file, implementing something else and perhaps needing the header. If the new file implements something for which the included header is not required, this is usually not cleaned up or even noticed.
An example of this is output.h, which used to be included in many front-end files. This is an unwanted dependency, because the functions made available by including output.h are used to write out assembler instructions, which front ends should not be doing anymore: A front end should do parsing, semantics, gimplification, and hand off one or more translation units to cgraphunit.c, one unit at a time.
Another example is intl.h, which is included in files like cgraphunit.c but is not actually used by that file.
In other cases, new features are implemented in a way that creates unnecessary dependencies. This happened, for example, with options.h, which was included in tree.h to import cl_target_option. This was resolved by moving cl_target_option to coretypes.h. Another example from input.h is location_t, which probably ought to be declared in coretypes.h as well.
More problematic for a modularization effort, but fortunately slightly less challenging to fix, is header files including other header files. Assume C source file 'A' includes header file 'B', which includes header file 'C'. All declarations imported from header file 'C' are also available in source file 'A'. There are even cases where no declarations from 'B' are used in 'A', but where removing the include of 'B' uncovers a hidden dependency of 'A' on 'C'. These hidden dependencies make it all too easy for developers to cross module boundaries, more or less accidentally, without anyone noticing in patch reviews.
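The A/B/C situation above can be detected mechanically once the direct includes are known. The following is a minimal sketch, not an existing GCC tool: it assumes the direct `#include` relationships have already been collected into a small boolean matrix, and all names (`include_graph`, `compute_reachable`, `is_hidden_dependency`) are hypothetical. It computes the transitive closure of the include graph and flags a dependency as hidden when a file can see a header only through some other header.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILES 8

/* direct[i][j] is true when file i contains a direct #include of file j.
   Hypothetical representation, chosen only to keep the sketch short.  */
typedef bool include_graph[MAX_FILES][MAX_FILES];

/* Fill REACH with the transitive closure of DIRECT (Warshall's algorithm),
   looking only at the first N files.  */
void
compute_reachable (include_graph direct, include_graph reach, size_t n)
{
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      reach[i][j] = direct[i][j];

  for (size_t k = 0; k < n; k++)
    for (size_t i = 0; i < n; i++)
      for (size_t j = 0; j < n; j++)
        if (reach[i][k] && reach[k][j])
          reach[i][j] = true;
}

/* A dependency of FROM on TO is "hidden" when TO is visible in FROM only
   through another header, not through a direct #include.  */
bool
is_hidden_dependency (include_graph direct, include_graph reach,
                      size_t from, size_t to)
{
  return reach[from][to] && !direct[from][to];
}
```

With file 0 as 'A', 1 as 'B' and 2 as 'C', setting `direct[0][1]` and `direct[1][2]` makes `is_hidden_dependency (direct, reach, 0, 2)` true after `compute_reachable`. Whether 'A' actually uses anything from 'C' still has to be checked separately, for instance by the trial-and-error bootstrap described below.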
The task of removing redundant include files is something that could be added to the list of projects for beginners. The following independent projects, one large and several small, would greatly improve the include file situation in GCC:
- Large project 1: Add accessor functions for all tree structures and use those instead of accessor macros. This would be the first step towards hiding the tree data structures from the compiler.
- Small project 1: Move line-map.h to its own directory/library, so you do not need to link with cpplib to use line-map.c
- Small project 2 (done): Find out how to make diagnostics.[ch] + pretty-printer.[ch] as independent of other parts of the compiler as possible. This will probably require factoring out the parts of the pretty-printers that implement front-end-specific and middle-end-specific functionality. pretty-print.c should not include tree.h. diagnostics.c should not include tree.h and possibly many other header files.
- Small project 3: Stop including header files in header files. This will for the most part be trial-and-error: Remove an include file, try to bootstrap, and fix breakage. Whenever a hidden dependency is uncovered, and the dependency inappropriately crosses a module boundary, add a FIXME note at the location of the breakage, and a FIXME note above the extra header that has to be included so that the dependency can easily be found and hopefully removed at a later stage.
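The accessor-function idea from large project 1 can be illustrated with a small sketch. This is not GCC's tree API: `toy_node` and its functions are hypothetical names used only to show the pattern. A macro such as `#define TOY_NODE_CODE(NODE) ((NODE)->code)` forces every user to see the full structure definition, whereas accessor functions let the public header expose only an opaque type.

```c
#include <stdlib.h>

/* --- What a public header could expose (hypothetical names) --- */

struct toy_node;                     /* Opaque: layout hidden from clients.  */

struct toy_node *toy_node_create (int code);
int toy_node_code (const struct toy_node *node);
void toy_node_set_code (struct toy_node *node, int code);
void toy_node_destroy (struct toy_node *node);

/* --- What would stay private to the implementation file --- */

struct toy_node
{
  int code;
};

/* Allocate a node; returns NULL if allocation fails.  */
struct toy_node *
toy_node_create (int code)
{
  struct toy_node *node = malloc (sizeof *node);
  if (node != NULL)
    node->code = code;
  return node;
}

/* Read access goes through a function instead of a field-access macro.  */
int
toy_node_code (const struct toy_node *node)
{
  return node->code;
}

/* Write access is likewise funnelled through one place.  */
void
toy_node_set_code (struct toy_node *node, int code)
{
  node->code = code;
}

void
toy_node_destroy (struct toy_node *node)
{
  free (node);
}
```

Because clients only ever hold a `struct toy_node *`, the layout can change without touching them, and the header that declares the accessors needs no further includes. The cost is a function call per access unless the accessors are inlined, which is a trade-off a real conversion would have to weigh.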
All this is a bit pointless unless it is focused on getting a particular independent module out of it. Any module will do: graphite, GIMPLE core, GIMPLE passes, RTL core, RTL passes, etc. (there is no obvious consensus yet about what the modules would be). Some people believe that physical separation of files and independent libraries are the only way to ensure dependencies are not added back. What is certain is that with the current setup it is too easy to add undesired dependencies.
Another problem with the GCC code base is that there are too many dependencies between individual files. To identify individual files, or groups of files, that form a module, such dependencies have to be identified, and either broken if they are unwanted or documented if they are supposed to exist. For instance, a GIMPLE pass naturally depends on gimple-fold.c but not on expr.h, a file that only deals with RTL stuff.
Before GCC can be modularized properly, it is necessary to make clear what interfaces and declarations are actually needed in each source file. This will, no doubt, be a huge job. It is unclear at the moment whether there are tools available that could help (Dehydra perhaps, or a dedicated plugin, or Ctags? Or turn this patch into a proper plugin? Maybe create a symbols database and identify dependencies to break?).
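As one possible building block for such a dependency database, the direct includes of a file can be extracted with a simple line scanner. The sketch below is an assumption about how a first prototype might look, not an existing tool; `parse_include` is a hypothetical name. It recognizes only plain `#include "x.h"` and `#include <x.h>` lines and deliberately ignores macro-expanded includes, comments and conditional compilation.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* If LINE is a plain #include directive, copy the header name (without
   quotes or angle brackets) into OUT and return true.  Return false for
   any other line, or when the name does not fit in OUT_SIZE bytes.  */
bool
parse_include (const char *line, char *out, size_t out_size)
{
  const char *p = line;

  while (isspace ((unsigned char) *p))
    p++;
  if (*p != '#')
    return false;
  p++;
  while (isspace ((unsigned char) *p))
    p++;
  if (strncmp (p, "include", 7) != 0)
    return false;
  p += 7;
  while (isspace ((unsigned char) *p))
    p++;

  /* Pick the closing delimiter that matches the opening one.  */
  char close;
  if (*p == '"')
    close = '"';
  else if (*p == '<')
    close = '>';
  else
    return false;
  p++;

  const char *end = strchr (p, close);
  if (end == NULL)
    return false;

  size_t len = (size_t) (end - p);
  if (len == 0 || len >= out_size)
    return false;

  memcpy (out, p, len);
  out[len] = '\0';
  return true;
}
```

Running this over every line of every source file gives the direct include edges. A compiler plugin would be more accurate, since it sees the result of preprocessing, but a scanner like this is enough to get a first picture of which files reach across intended module boundaries.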
- Random small tasks:
- Split a new header gimplify.h off from gimple.h, prototype just those functions that the front ends need to see, and remove gimple.h from the front-ends and some back-end files.
- Internalize java:force_evaluation_order
- Make darwin_register_frameworks & friends C-only target hooks.
- Split target.h into c-target.h and cxx-target.h and internalize all c_target_objs and cxx_target_objs in the front ends. Watch out for code shared with other front ends and code that actually belongs to the back end (e.g. attribs code).
Making a single FE an independent library is a major undertaking. All front ends live in their own subdirectory, but there are countless dependencies on other parts of the compiler (target-independent middle end as well as target-specific things) that have to be broken to make a front end a stand-alone module. But starting with one FE would set an example to follow, and help initiate an effort to define and document/implement the interface between the front ends and the rest of the compiler. The Go front end may be just that example. This is a new front end written intentionally to be independent of the GCC code generator.
Instead of having a single monolithic binary, it is proposed to separate the major components in libbackend.a into several libraries. These libraries would live in separate sub-directories under gcc/. Although initially they would be built together with the rest of the compiler, the intent is to evolve these libraries into independent modules that could be built separately from the compiler.
The gcc/ directory would be modularized using the different Intermediate Languages (IL) as the main separators. This means the creation of libgeneric, libgimple and librtl. However, there are other major pieces of functionality that should also be moved into their own modules. The various modules will live under a sub-directory of gcc/ (at least initially). The following is the proposed organization of the gcc/ directory (this does not include the existing directories):
- driver/ - The gcc driver.
- generic/ - Tree files.
- gimple/ - Gimple generation, analysis and optimization.
- graphite/ - Gimple represented as polyhedra, loop optimizations.
- ipa/ - Callgraph manager and IPA analysis/optimization.
- rtl/ - RTL code generation and optimization.
- diagnostic/ - Generic diagnostic routines.
- cfg/ - Control Flow Graph routines.
- openmp/ - OpenMP implementation.
Each directory will export a library and a set of include files that define its interface. Every other module that wants to use its services will only be able to talk to it via the published interface.
TODO: Propose which files should go in each directory and define the interfaces.
-- ManuelLópezIbáñez 2010-05-15 20:37:04 We should document the desired and the undesired [!] dependencies. The main hurdle for me to help make GCC more modular is autotools. I tried to move line-map out of libcpp. This should be trivial because line-map does not have any dependencies but it was impossible to get the autotools magic right. So I gave up.
-- RichardGuenther 2010-06-06 14:37:00 As several of the desired modules are tied together via dependencies on tree, separating them out to directories should not be a priority. I do agree with c/ and driver/. And I would add lib/ for stuff like sbitmap.
- c/ : Depends: ??? Dependants: ???
- driver/ : Depends: ??? Dependants: ???
- generic/ : Depends: ??? Dependants: ???
- gimple/ : Depends: ??? Dependants: ???
- graphite/ : Depends: ??? Dependants: ???
- graphite-poly.h - Main interface for the polyhedral representation.
- ipa/ : Depends: ??? Dependants: ???
- rtl/ : Depends: ??? Dependants: ???
- diagnostic/ : Depends: location, opts, langhooks!, tree!, plugins!. Dependants: Almost everything except libcpp? Work is ongoing to break most of these dependencies, see e.g. this patch to disentangle diagnostics from trees. More to follow.
- diagnostic.h: Interface.
- pretty-printer.h: Internal? Interface?
- cfg/ : Depends: ??? Dependants: ???
- openmp/ : Depends: ??? Dependants: ???
-- ManuelLópezIbáñez 2010-05-16 11:37:14 What does this contain? There is already libgomp. The other bits and pieces are FE-specific or gimple-specific, so they should be part of the corresponding module.
- source-location/ : Depends: libcpp! Dependants: libcpp, diagnostic, ???
- source-location.h: Interface.
- line-map.h: Internal.
- line-map.c: Internal.
- source-location.c: expand_location and other stuff that is distributed around.
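To make the proposed source-location interface a little more concrete, here is a toy sketch of what such a module could offer on its own. It is not the real line-map implementation: every `toy_` name is hypothetical, and the encoding is a simplifying assumption (one location value per source line, no columns). The point is only that a compact location handle can be expanded to a file and line using nothing outside this module.

```c
#include <stddef.h>

/* Compact handle stored in the IL instead of a file/line pair.  */
typedef unsigned int toy_location;

/* What a client gets back when it expands a handle.  */
struct toy_expanded_location
{
  const char *file;
  unsigned int line;
};

/* One map entry: location values from START onwards belong to FILE,
   and START itself corresponds to FIRST_LINE.  */
struct toy_line_map
{
  toy_location start;
  const char *file;
  unsigned int first_line;
};

/* Expand LOC using MAPS, which must be sorted by ascending START.
   Returns a result with a NULL file when no map covers LOC.  */
struct toy_expanded_location
toy_expand_location (const struct toy_line_map *maps, size_t n_maps,
                     toy_location loc)
{
  struct toy_expanded_location result = { NULL, 0 };

  /* Walk backwards to find the last map that starts at or before LOC.  */
  for (size_t i = n_maps; i-- > 0; )
    if (maps[i].start <= loc)
      {
        result.file = maps[i].file;
        result.line = maps[i].first_line + (loc - maps[i].start);
        break;
      }

  return result;
}
```

A real implementation would use a binary search and also track columns and include nesting. Even in this reduced form, the sketch depends on nothing but `<stddef.h>`, which is the property the source-location/ entry above is aiming for.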
Streamable Intermediate Representations
One important aspect of the modularization proposed in the previous section is going to be the ability to test the different modules independently. As such, I believe that all the intermediate languages in GCC should have a streaming representation. This should make it possible for the compiler pipeline to start and stop at almost any point during compilation.
The main benefit of this feature is in testing. It will be possible to write tests that check a specific analysis or transformation done on a representation, independently of what other transformations may occur before it in the regular pipeline. For instance, it should be possible to generate synthetic gimple, pass it through a single optimization pass and test the resulting gimple.
Additionally, having streamable representations means that the compiler can stop at any time, save its state to some database and resume from that database. This is a feature that is extremely useful for projects like the compiler server or LTO.
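The testing workflow described above (read a synthetic representation, run one pass, write the result) can be sketched with a toy IR. Nothing here is GIMPLE or an existing GCC interface: the one-statement text format `dest = a OP b` and all `toy_` names are assumptions made for illustration, and the single pass is a trivial constant folder.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* A toy statement: DEST = LHS OP RHS, or DEST = LHS when OP is '\0'.  */
struct toy_stmt
{
  char dest[8];
  int lhs;
  char op;
  int rhs;
};

/* Stream in: parse "x = 2 + 3" or "x = 5".  Tokens must be separated
   by spaces.  Returns false on malformed input.  */
bool
toy_read_stmt (const char *text, struct toy_stmt *s)
{
  int n = sscanf (text, "%7s = %d %c %d", s->dest, &s->lhs, &s->op, &s->rhs);
  if (n == 4)
    return true;
  if (n == 2)
    {
      /* Constant form: no operator and no right-hand operand.  */
      s->op = '\0';
      s->rhs = 0;
      return true;
    }
  return false;
}

/* The pass under test: fold '+' and '*' on constants.  Returns true
   when the statement was changed.  */
bool
toy_fold_constants (struct toy_stmt *s)
{
  switch (s->op)
    {
    case '+':
      s->lhs = s->lhs + s->rhs;
      break;
    case '*':
      s->lhs = s->lhs * s->rhs;
      break;
    default:
      return false;
    }
  s->op = '\0';
  s->rhs = 0;
  return true;
}

/* Stream out: write S in the same text format that toy_read_stmt accepts.  */
int
toy_write_stmt (const struct toy_stmt *s, char *buf, size_t size)
{
  if (s->op == '\0')
    return snprintf (buf, size, "%s = %d", s->dest, s->lhs);
  return snprintf (buf, size, "%s = %d %c %d",
                   s->dest, s->lhs, s->op, s->rhs);
}
```

A test for the pass then needs no front end and no earlier passes: read `x = 2 + 3`, call `toy_fold_constants`, write the statement back and compare the text with `x = 5`. Because the output format is also valid input, the written text can be read back in later, which is the stop-and-resume property in miniature.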