Pre-Parsed Headers - A cache for the C++ parser

Project Status

Last update: 2012-05-24

We have paused work on this project, pending the resolution of several issues we have found both in the language and in GCC. For details on our experience and current implementation status, please read this document.

Code Repository

The project is being implemented in the pph SVN branch. To get the latest version:

$ svn co svn://gcc.gnu.org/svn/gcc/branches/pph

The branch is maintained by Diego Novillo <dnovillo@google.com> and Lawrence Crowl <crowl@google.com>. The usual rules for contributing to branches apply to this branch:

  1. Messages and patches to the lists should have their subject prefixed with [pph].
  2. ChangeLog entries should be written to ChangeLog.pph.

Introduction

This project is yet another attempt at speeding up compilation by caching results from previous builds of the same/similar source files. It shares the same objectives as ccache, pre-compiled headers and incremental compilation.

The focus is different, however. Pre-parsed headers (PPH) exploit the same property that pre-compiled headers do: the same header files are processed over and over across builds, so the results of processing them can be cached and re-used.

The key difference between this new approach and pre-compiled headers is that PCH files are monolithic. A PCH image encodes the transitive closure of all the headers included by the translation unit. This makes PCH images unstable with respect to source code changes. If a single line of a single header file changes, the whole image needs to be re-built. For some programming environments this is not a problem. Code bases that are organized around a few master header files that rarely change can use PCH effectively.

The code base that we are working with, however, uses a different organization. There are no master/central header files that include everything an application may need. Every translation unit is responsible for including exactly the headers it needs. This is known as the ''include what you use'' principle.

We tried implementing PCH strategies with our code base, but the storage requirements were prohibitive (due to the transitive closure), and the rate of change of individual header files meant that PCH images were regenerated so frequently that turnaround times became slower than before.

Scope and limitations

This work only addresses the problem of C++ parsing speeds. It does not address the more general incremental compilation problem addressed by the compiler server.

Additionally, we will place restrictions on the types of header files that can be pre-parsed. In essence, the only headers that can be pre-parsed are those that produce the same result when they are compiled in isolation or as part of another translation unit. So, header files that are affected by pre-processor symbols defined before inclusion are not going to be considered (e.g., stddef.h).

A longer term goal of this project is to establish a path towards C++ Modules.

Pre-Tokenized Headers (PTH)

Pre-tokenization works by creating image files associated with each header file that contain the post-processed tokens from the original input text.

When the compiler starts processing a file, it reads the text representation and invokes the pre-processor to convert it into a token stream. Besides creating tokens, the pre-processor will also interpret directives, expand macros, etc. The result is that the parser sees a continuous token stream for the whole translation unit.

However, we need to separate that token stream into chunks corresponding to individual files within the translation unit. Additionally, the tokens generated by a single file may be spread out in the final token stream. For instance, consider the following file foo.cc:

int x;
#include "foo.h"
float y;

The tokens for int x; and float y; will be separated by the tokens read from file foo.h. Therefore, when preserving information for foo.cc it is important to describe this separation and sequencing. The image for foo.cc will contain two sets of tokens (token hunks): one for everything before #include "foo.h" and another for everything after it.

In general, each file in the translation unit will consist of a sequence of token hunks separated by file changes (i.e., #include directives act as separators).

The compilation process follows this logic:

  1. The front end starts getting tokens from the pre-processor.
    1. On encountering a new file F, the front end will:
      1. Check whether an image file F.pth exists.
      2. If F.pth exists, its header contains an md5 digest that corresponds to the version of F from which F.pth was created.
      3. If the md5 digest read from F.pth is the same as the md5 digest of F, then F has not changed and its image F.pth can, in principle, be re-used.
      4. Whether F.pth can be re-used depends on other factors as well. These factors have to do with the context in which F.pth is being instantiated. For instance, suppose that one of the token hunks in F.pth uses the pre-processor symbol N with a value of 100 (e.g., char buf[N]). The image F.pth will contain the token sequence char buf[100], but if N has a different value in the current compilation, then we cannot re-use F.pth. This means that F.pth not only needs to store the md5 digest for F, it also needs to store the value of every pre-processor symbol used inside F but not defined by F (i.e., all the symbols that F takes from the "outside").
      5. If the image F.pth can be used, the first token hunk in F.pth is loaded and the lexer advances until the next file change directive inside F.pth (or the end of the file).
      6. Otherwise, F is read and pre-processed as usual. During pre-processing, a new image for F is generated (see below).
      7. The process continues from step 2.

When a file F is pre-processed for the first time, a new image (F.pth) is created for it. The process is similar, with the difference that file-change events are handled by storing all the tokens read until that point into a token hunk buffer associated with the file that just finished.

At the end of lexical analysis, all the PTH image buffers are flushed out to disk. Each image is saved into its corresponding .pth file.

Implementation notes

The C++ parser in GCC tokenizes the whole translation unit before starting the actual parsing process. This is done in cp/parser.c:cp_lexer_new_main. The tokens are read by calling cp_lexer_get_preprocessor_token and stored in lexer->buffer.

PTH support relies on two main data structures:

  1. struct pth_cache_d. This describes the token cache for a single source file. It contains the pathname of the source file being cached, the MD5 digest for the file and a vector of token hunks. Each entry in this vector is a continuous stream of tokens belonging to this file. If the file contains N #include directives, this vector will contain N + 1 entries.

  2. struct pth_state_d. This contains global state for PTH support. Mainly, it contains a vector of pth_cache_d objects. Each entry will represent a single file in the translation unit. It also contains a pointer map to speed up lookups into the vector of token caches.

PTH support is enabled using the flag -fpth when compiling a file. The implementation is event-driven. On startup, the parser sets up a pre-processor callback for file change events. Every time the pre-processor starts reading a new file, the function cp/pth.c:pth_pp_file_change is called with the name of the new file that is about to be processed. This causes the front end to activate the token caching logic in pth_switch_to_file.

After lexing is complete, the front end calls pth_finish, which traverses the set of token caches and writes out to disk those that need saving.

Analysis of the PTH prototype

We analyzed the behaviour of the PTH implementation on two internal C++ translation units. In summary, PTH shows a 4% improvement on total compile time.

We looked at five different scenarios.

The following times are all wall times and represent the average of 4 runs of the compiler with warm filesystem caches. The compiler was bootstrapped.

These timings only include front end related timers. We have confirmed that all the other phases are sufficiently similar and that the code generated is the same.

No PTH

Total compile time: 14.40 secs

Phase                     Time (secs)   % of total compile time
preprocessing                    0.95                        7%
parser (global)                  1.91                       13%
parser struct body               1.98                       14%
parser enumerator list           0.03                        0%
parser function body             0.49                        3%
parser inl. meth. body           1.52                       11%
name lookup                      1.84                       13%
Total FE                         8.72                       60%

PTH generation

Total compile time: 15.25 secs (6% increase)

Phase                     Time (secs)   % of total compile time
PTH bookkeeping                  0.68                        4%
PTH MD5 signatures               0.18                        1%
PTH stream save                  0.25                        2%
preprocessing                    0.88                        6%
parser (global)                  1.99                       13%
parser struct body               1.93                       13%
parser enumerator list           0.02                        0%
parser function body             0.56                        4%
parser inl. meth. body           1.40                        9%
name lookup                      1.85                       12%
Total FE                         9.74                       64%

This represents pure overhead. It's the cost to generate the token image database from scratch. The various front end components are largely unaffected, but we add the PTH image generation overhead.

The biggest overhead (PTH bookkeeping) is mostly due to the way we store identifiers in the lookaside tables for each hunk. Macros are expanded into text strings and before/after values are always kept (whether or not they changed).

Additionally, we incur streaming overhead and the cost of the MD5 computations.

This overhead would not be the norm, given the amount of sharing that we expect to find across files in a TU (particularly seldom-modified ones).

PTH use with no changes

Total compile time: 13.80 secs (4% reduction)

Phase                     Time (secs)   % of total   Notes
PTH dependency checks            0.01           0%
PTH bookkeeping                  0.18           1%
PTH MD5 signatures               0.11           1%
PTH stream load                  0.49           4%
PTH initialization               0.12           1%
preprocessing                    0.07           1%   [ 93% reduction ]
parser (global)                  1.48          11%   [ 23% reduction ]
parser struct body               1.96          14%
parser enumerator list           0.01           0%
parser function body             0.50           4%
parser inl. meth. body           1.35          10%
name lookup                      2.08          15%
Total FE                         8.36          60%   [ 4% reduction ]

The main effect is on pre-processing. There is also a secondary effect on the global parser timer, because pre-processing is part of it. Some of the time saved by PTH comes from token interpretation work that would otherwise be done in the parser.

There is also unnecessary work that the prototype is doing now. It computes MD5 signatures for every file in the TU, and it keeps token images in files, which forces it to stream things in and out.

Additionally, notice that we are still doing some pre-processing. Although nothing in the TU changed from one compile to the other, there are some files for which we are currently not generating token images.

Some files (e.g., stddef.h) are not double-guarded, so multiple inclusions of those files potentially generate different sets of tokens and identifiers. It would be quite possible to handle this by supporting multiple flavours of token images, but the simpler approach is to just refuse to handle them.

This decision also helped with the implementation of tainted image support. When an image is found to be tainted, we need to switch the pre-processor back to the original text file and consume tokens from it. This switching between images and text files is needed to support incremental edits. It was somewhat tricky to implement, since it involves going behind the pre-processor's back quite a bit.

Pre-Parsed Headers (PPH)

This is currently under development. The implementation lives in the file cp/pph.c. It does not have any interesting functionality yet.

The current implementation instruments the parser to emit information about declarations found during parsing. It defines three flags:

For each exposed declaration, the compiler emits the following information:

Additionally, the compiler emits dependency information between declarations. When a declaration D, say a class definition, is edited, every declaration that uses D needs to be re-compiled. In general, a change to a single declaration D means that the transitive closure of D's dependencies needs to be re-compiled. Therefore, the key metrics of success for an incremental compiler are the degree of connectivity of the declaration dependency graph and the rate of change for declarations at the root of that graph.

As an example of the dependency information produced by the compiler, consider the following code snippet:

     1  struct X
     2  {
     3      int x;
     4      int y;
     5  };
     6  X array[10];

The compiler emits the following dependency data:

declaration type_decl 4138 'X' a.cc:2:1 htok 2,0
declaration type_decl 4139 'X' a.cc:2:1 btok 8,8 lazy
depend type_decl 4139 'X' a.cc:2:1 uses type_decl 4138 'X' a.cc:2:1

declaration var_decl 4160 'array' a.cc:6:11 htok 6,0
depend var_decl 4160 'array' a.cc:6:11 uses type_decl 4138 'X' a.cc:2:1
declaration var_decl 4161 'array' a.cc:6:11 btok 0,4 needed
depend var_decl 4161 'array' a.cc:6:11 uses var_decl 4160 'array' a.cc:6:11

Every exposed declaration in the program is emitted in a declaration record. When a declaration depends on another, the compiler emits a depend record. In this case, the first depend record reads depend type_decl 4139 'X' a.cc:2:1 uses type_decl 4138 'X' a.cc:2:1. This states that the body of struct X needs to be re-compiled if the head changes. Similarly, the declaration for array in line 6 also has a dependency on the head of struct X.

Notice that to distinguish declaration heads from declaration bodies, the compiler uses even UID numbers to identify declaration heads and odd UID numbers to identify declaration bodies.

More generally, the compiler implements the following logic:

  1. For each exposed declaration,
    1. Record the source file of the declaration.
    2. Record the index of its beginning token.
    3. Record the index of its ending token.
    4. Ensure heads and bodies have separate declarations.
    5. Record whether or not the declaration is primarily needed.
  2. For each symbol lookup,
    1. Record the exposed declaration containing the referrer.
    2. If the referent is inside a class, record the definition as the referent.
    3. If the reference requires only an incomplete class, record the declaration as the referent.
    4. If the reference requires a complete class, record the definition as the referent.
    5. If the referent is a function, record the declaration as the referent.
    6. If the referent is a const integral variable, record the definition as the referent.
    7. If the referent is an array variable used by sizeof, record the definition as the referent.
    8. For all other variables, record the declaration as the referent.
    9. Ignore self-to-self references.

None: pph (last edited 2012-05-24 22:55:17 by DiegoNovillo)