Incremental Compiler Project
The goal of this project is to turn GCC into an incremental compiler. GCC will run as a server and maintain a model of the user's program. When a translation unit is recompiled, GCC will recompile only the minimum necessary. Read the announcement and check out the branch at svn://gcc.gnu.org/svn/gcc/branches/incremental-compiler.
Goals
- Faster turnaround time for changes.
- Eventually, ability to ask server for information about program.
Tasks
- Change GCC to run as a server
- Change front ends (C and C++ are planned) to re-use results of parsing across translation units
- Change GCC to be able to generate a single object file per top-level definition
- Change GCC to be able to parse in multiple threads. This means removing globals from the front ends and teaching the GC about threads.
Status
- Initial server code is checked in. This shares parsed contents across translation units.
Initial results are that this uses 1/2 the time and 1/3 the memory of an ordinary (--combine) compilation. To use it, run gcc --server and then set the environment variable GCCSERVER; subsequent gcc invocations will contact the server.
Driver and UI
- It seems important to have the least impact on how users invoke gcc. The current implementation requires starting the server and setting an environment variable. (We could remove the environment variable at the expense of an attempt to open a socket per invocation.)
Currently gcc --server only starts cc1. Eventually we at least want it to be able to start cc1plus. Then gcc --kill-server will have to try to contact all available servers.
Another possibility is to turn each FE into a shared library and then dlopen them as needed.
- One important detail is interrupting the server -- the server should gracefully stop a job in response to a control-c. I think this can be reasonably handled by having the parser occasionally check for an interrupt state, and then have it stop parsing (perhaps this could be done deep in the parser with the use of longjmp). (This may require some cpp work to ensure files are closed at the proper points.) After code generation has started, an interrupt of the driver should simply kill the back-end process. (An initial implementation of this was checked in on 2008-01-18.)
The server code is largely isolated in server.c, with a few changes in toplev.c. The client code is also in server.c.
- The client connects to the server over a unix domain socket; this seemed like the simplest approach for good security.
The protocol is an ad hoc one; all protocol details are private to server.c. Eventually we should support a more robust protocol, for instance one that supports queries by IDEs.
- Some modules require re-initialization code in order to run in the server environment. This code is added on an as-needed basis; only modules that run in the server proper (see the section on code generation) need this treatment.
C Front End
- The C front end has been modified to lex the entire file at once. (There are a couple known buglets here, but nothing major.) All re-use happens after preprocessing has completed; the preprocessor is unchanged.
- While lexing, the C FE marks all the file change events in the token stream.
- When parsing, the parser divides the token stream into "hunks". A hunk is the unit of reuse and is identified by a signature (currently computed using MD5).
- Hunk boundaries are found in one of two ways.
- For "system" files, and for files not owned by the user (a distinction made because some files in /usr/include are unlikely to change but are not considered "system" files by gcc), a hunk starts at a file change boundary. These hunks end at the next possible file change boundary; this may span more than one file change event if a top-level definition or declaration crosses a boundary (e.g., tree.def, which is included in the middle of an enum definition). A multi-span hunk will only be reused if the same successors are seen in the token stream as were seen when the hunk was initially parsed. Note that file change boundaries are chosen primarily because they are relatively unchanging and simple to compute. Experimentation with other heuristics may prove valuable.
- For user-owned files, a hunk is a single top-level declaration or definition. The parser uses a heuristic to find the end of the hunk. Hunks are chosen this way in user code to maximize reuse; this in turn minimizes the amount of work done in the incremental case.
- A hunk has a set of prerequisites, which are checked before a hunk is reused. When parsing a hunk for the first time, the parser records the originating hunk of any names looked up by the current hunk. This set of names constitutes the prerequisites. A hunk can only be reused if all of its prerequisite hunks were seen earlier in the current compilation.
- Prerequisites also function as "anti-dependencies". If a re-declaration is seen, the new declaration can only be re-used if the identical previous declaration is seen in the new compilation. For example, if hunk B re-declares a name first declared in hunk A, then B can be reused only in a compilation where A's declaration, unchanged, has already been seen.
- Note that we can treat some dependencies semantically. That is, we can look at the prerequisite hunk's definitions. If the current compilation has different definitions for all the same objects, and these replacement definitions are ABI-compatible with the prerequisites, then we can still re-use the hunk. This case will require a change to introduce aliasing somewhere, perhaps in cgraph. This is not yet implemented.
- This would allow smarter re-compilation when making internal changes to functions -- it would capture more cases.
- This approach may work well for C++.
- A hunk's contents are just the mapping of names to declarations that were seen while parsing the hunk. When reusing a hunk, we re-bind the names accordingly.
- While parsing we also compute a "smash map". The trunk C front end does "decl smashing" if it sees multiple declarations of the same object -- that is, it will modify an existing object to update it with new properties. Because the incremental front end is re-using these objects, it cannot smash decls; instead we must make a copy and arrange for parts of the front end to look up the copy. Currently this is only done when dereferencing a pointer or looking for a field in a structure; this is done because these operations may be applied to apparently incomplete types.
- Incremental compilation is handled without a special case. Because a hunk in user code is a single definition, and because hunk dependencies are other hunks, the results from a change to a definition are limited to its users. In other words, the hunks form a fine-grained dependency model of the program. It may be possible to do better than this -- for instance, to avoid recompiling a piece of code when its dependencies have changed in compatible ways -- but that kind of analysis can be comfortably added later.
- After finishing a compilation, we rewrite the locations of objects. This lets us avoid a problem where a long-running server could run out of mapped locations. See the source for the details of this rewriting.
- There are two open projects for the C front end.
- The state of pragmas must be captured as part of a hunk's prerequisites. It would be nice to only add a pragma as a prerequisite if it actually affected compilation of the hunk; this might be tricky. Another issue is ensuring that pragmas (which are relatively rare) do not bloat the hunk data structure. One idea here would be to encapsulate the state in a special tree object, and then register an ordinary prerequisite whose identifier is not a valid C identifier (and thus unable to appear in the user's program).
- Warnings emitted when compiling a declaration should be re-emitted when the declaration is reused. This is tricky for a number of reasons. If the warning or error settings change, what is emitted must change. Locations of warnings must be rewritten the same way that decl locations are rewritten. And, it is hard to conceive of an efficient approach that works even if the user changes his LANG setting at runtime.
- Currently cpp is largely untouched. Sharing is only done on the output of cpp.
- Should cpp be a performance bottleneck, it can be made to function incrementally, and then be composed with the compiler's incremental parser.
For instance, the preprocessor could keep state around so that it could predict the next hunk's signature without actually lexing it. (This is possible if cpp has already seen a sequence of headers before and knows that nothing has changed.) Then cpp could call the parser to see if the next hunk is re-usable (this callback is needed for dependency checking). If this succeeds, rather than lex a header, cpp could skip it and emit a single CPP_HUNK token that carries with it the hunk's signature.
Perhaps a server-ified cpp could use inotify to more efficiently determine when header files have changed.
C++ Front End
- No work has started here, not even the design of the reuse approach. It may be possible to re-use the same approach that is used for C.
- One thing worth contemplating is that a change to the body of an inline definition of a method should not require recompiling all users of the containing class -- only places where that method has been inlined. This affects hunk signature computation, in particular it breaks the C-like approach. Perhaps a recursive approach would work (i.e., looking for hunks in places other than the parser's top-level loop).
Code Generation
Currently code generation is handled by forking the server (see toplev.c:compile_file) and doing the remaining work in the child process. This avoids re-initialization problems for most modules.
- Medium term it would be good to design a lang-hook interface here, to avoid the sort of hacks needed in the current implementation.
Long term the plan is to support incremental compilation by having a single definition per object file. The user will tell the server where to put these mini-objects; after all objects have been generated the server will invoke ld to create the object file that the user requested.
- There are some open design problems with incremental debuginfo generation. A hunk's signature purposely does not incorporate the location of a token. However, a change in token location may require debuginfo changes.
- We eventually want to do compilations in parallel, to make the best use of multi-processor machines.
One approach is to make the front end multi-threaded. This would entail removing most global variables, marking some with __thread, and wrapping a few with locks. (I've pretty much abandoned this idea. There are too many mutable tree fields, making this a difficult project. Also, threads do not interact well with fork, which is currently needed by the code generation approach.)
- For the C front end, sharing will take place at the hunk level. The parser will acquire a lock on a hunk's key before parsing (or re-using) the hunk. Once the hunk has been registered the lock will be released. This will allow reasonably fine-grained sharing without possibility of deadlock.
This sub-project will also require updates to the GC. The current plan is to have ggc_collect be a safe point; this works well with the C parser as it collects in each iteration of the main loop. The C++ parser does not do this, and will need to be modified. Additionally, the server main loop will either need not to hold GC-able data, or it will need a way to inform the GC of its activity. (Note that the GC work is completed on the branch. The C++ parser has not been modified to periodically collect, however.)
- Another idea is simpler: run multiple servers, and distribute jobs between them. This approach is attractive because it requires much less work on GCC. This is currently implemented on the branch.