SoC Project: Incremental Parsing (of C++)

Mike Stump
Tue Mar 20 23:40:00 GMT 2007

On Mar 20, 2007, at 1:07 AM, Simon Brenner wrote:
> I propose to implement incremental parsing in C++

Sounds like a multi-person, multi-year project.

We did something like this a while ago, called the compile server.
The idea was to be able to advance through unchanged portions of code
and replay the changes to compile state, so as to reduce compilation
time for code we've already seen before, either because we're
compiling mostly the same source or because we're using a header file
more than once.  It included fine-grained dependency tracking for
things like macros and declarations.  It could also replay state for
fragments across translation units.

You can view it at branches/compile-server-branch.

> Basically, a diff algorithm is run on the character or token stream,

We ran it at the character level.
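As a rough illustration (a Python sketch, not the compile server's
actual code), a character-level diff can report the changed spans
that would need to be recompiled:

```python
import difflib

def changed_regions(old: str, new: str):
    """Return (start, end) character ranges in `new` that differ
    from `old`; everything outside these ranges is unchanged and
    its compile state could in principle be replayed."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    regions = []
    for tag, _i1, _i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            regions.append((j1, j2))
    return regions
```

An unchanged file yields no regions at all, which is the fast path
the server was designed around.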

> producing a list of insertions, modifications, and deletions,

We only had two states: a region was either unchanged or changed.
An unchanged region replayed the saved state; a changed region was
compiled as normal.  The idea was that people usually edit only a
small number of regions.  A region was defined as the lines between
# line markers in the cpp output.
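For illustration only, here is a Python sketch of splitting cpp
output into such regions; it assumes the usual `# <line> "<file>"`
marker lines that the preprocessor emits:

```python
def split_regions(cpp_output: str):
    """Split preprocessor output into regions, each beginning at a
    '# <line> "<file>"' marker line."""
    regions, current = [], []
    for line in cpp_output.splitlines(keepends=True):
        if line.startswith("# "):      # cpp line marker: new region
            if current:
                regions.append("".join(current))
                current = []
        current.append(line)
    if current:
        regions.append("".join(current))
    return regions
```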

> In this project, I wouldn't even try to attack incremental code  
> generation, but
> just run the entire middle/back end on an updated tree.

We were able to do incremental code-gen as well.

> A future extension
> would be to use the information about which trees have changed to  
> perform
> minimal code generation. This approach obviously limits the  
> improvement in
> compile time, and an important initial part of the project would be  
> to measure
> the time spent in lexing, parsing and everything after parsing to  
> see what the
> potential gain would be.

We saw a 41% speed-up for SimpleText, and a 110x peak speedup for
<Carbon.h> and <cstdlib>.  A C++ Carbon hello world was 91x faster,
peak.  A C hello world was the same speed.  Peak speedups were 2x
for C, 142x for C++.

> My implementation would store the tree representation and the token  
> stream from
> a source file in the object file

We kept everything in core.

> For starters, I would only consider top-level constructs (i.e. if  
> any token in
> a function, type or global-scope variable has changed, the entire  
> declaration/
> definition would be reparsed),

We handled this case by gluing together regions until the start and
end of a region were at the top level.
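A sketch of that gluing step (Python, illustrative only): merge
adjacent regions until the brace depth at a region boundary returns
to zero.  The naive `count` here is an assumption; a real
implementation would have to tokenize so braces inside strings and
comments are skipped:

```python
def glue_to_toplevel(regions):
    """Merge adjacent regions so every merged region starts and
    ends at brace depth zero, i.e. at the top level."""
    glued, buf, depth = [], "", 0
    for text in regions:
        buf += text
        # naive depth count; a real tokenizer must skip
        # string literals and comments
        depth += text.count("{") - text.count("}")
        if depth == 0:
            glued.append(buf)
            buf = ""
    if buf:                 # unbalanced tail, keep it anyway
        glued.append(buf)
    return glued
```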

> A number of changes in source affect more than the tree of the  
> construct itself
> (inline functions, function default values, templates etc.. see  
> Problems below).
> In these cases, the initial implementation would just identify the  
> dangerousness
> of the changes, abort, and initiate a complete recompilation of the  
> file.

We validated state on a fine-grained basis.  Such checking can be
expensive, as odd as that sounds.

> Would it be feasible to construct a dependency map of the tree, to  
> handle these
> cases with minimal recompilation? How large would such a dependency  
> map have
> to be?

We never progressed the project far enough to reach the "push 1000
projects through the server" stage, so we never had to worry about
page faulting or running out of RAM.  There was, however, a concern
that in an already tight 32-bit VM space the server could run out of
RAM and/or VM.

To work well, one would have to check every unique identifier used to  
ensure that it was not #defined.  I wanted a language extension to  
make the guarantee for me so that I would not have to check them all.
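As an illustration of that check (a Python sketch with hypothetical
names, not the branch's code): replaying a cached region is only
safe if none of the identifiers it uses has since been #defined as
a macro:

```python
def safe_to_replay(identifiers, macro_table):
    """A cached region may be replayed only if none of the
    identifiers it uses now names a macro; otherwise the cached
    parse would no longer match what cpp would produce."""
    return not any(name in macro_table for name in identifiers)
```

A language guarantee that certain identifiers are never macros would
let the server skip this per-identifier lookup entirely.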

> For checking, the initial implementation should also provide for a  
> way to
> compare the result of the incremental reparse against a complete  
> recompilation.

We handled this case by comparing the .s files compiled with the
server against those compiled without it.

> Some of the information that is saved and updated in the aux file  
> or object file
> is probably the same as what is saved in a GCH file.

:-)  We selected an in-core database to avoid all the associated
issues.

> Would incremental update of GCH files be possible/interesting?

Nice question.  In software almost anything is possible.  Harder to  
know if it is useful.

> Should this all be integrated into the precompiled header framework?

Nice question.

> * Changing a declaration (function arguments, default values), also  
> affects all
> uses of the same declaration.

We did this with the fine-grained dependency tracking.
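One way to picture that tracking (a hypothetical Python sketch, not
the data structure on the branch): map each declaration to the
regions that used it, and invalidate exactly those regions when the
declaration changes:

```python
from collections import defaultdict

class DependencyMap:
    """Record which regions use which declarations, so that a
    changed declaration invalidates every region that referenced
    it, and nothing else."""
    def __init__(self):
        self.users = defaultdict(set)   # decl name -> region ids

    def record_use(self, decl: str, region: int):
        self.users[decl].add(region)

    def invalidate(self, decl: str):
        """Regions to recompile when `decl` changes."""
        return sorted(self.users[decl])
```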

> * Adding and removing a template specialization changes all uses of  
> the template
> after the declaration.

I don't think we handled this case.

> * If code inside an inlined function body is changed, all (inlined)  
> uses of the
> function also change.

We did this with the fine-grained dependency tracking.

> * What other cases like these have not yet been considered?

There are about 1000 more cases like that.  I can begin listing them
if you want.  :-)  Overload sets, RTX constant pools, debugging
information, exception tables, deprecation and NODE_POISONED, ...

> * How much space does a serialized token stream take?

gcc -E *.c | wc -c will give you an approximate number; you then
just multiply by a constant.  I wouldn't worry about it.

> * How much time is actually spent in the parser?

Seems like, if you do a good job, you should be able to get a
100-200x speedup, though you're going to have to do code-gen to get
it.

> * Future: Incremental code generation

I didn't find this part very hard to do.  Harder was getting
perfect fidelity across all the language features, and getting
enough people interested in the project to get it into mainline.

I'd be happy to kibitz.
