This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



compilercache C/C++/ObjC unifier and technical background


Hi. Below is the new, not yet released chapter 7 of the compilercache
README file.

Please comment on it, especially on the source code unifier part.


7. Technical insights and problems
----------------------------------

First, compilercache checks what kind of action it is being asked to
perform. Only if the compiler is called to actually compile a single C
source file does the script continue its work. Otherwise everything is
bypassed and the normal compiler is called instead. This happens, for
instance, if you call the compiler as a linker.
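
As a rough illustration of this first check, here is a minimal Python
sketch. It is not taken from the compilercache script: the function
name is_single_compile, the list of source suffixes, and the exact
rule ("-c is present and exactly one source file is named") are
assumptions made for the example.

```python
# Hypothetical helper, not part of compilercache itself.
# Assumed list of suffixes that count as source files.
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx", ".m")


def is_single_compile(args):
    """Return True if args look like 'compile exactly one source file'."""
    sources = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the file name that follows -o; never a source file.
            skip_next = False
        elif arg == "-o":
            skip_next = True
        elif not arg.startswith("-") and arg.endswith(SOURCE_SUFFIXES):
            sources.append(arg)
    return "-c" in args and len(sources) == 1
```

A link-only call such as ["foo.o", "bar.o", "-o", "prog"] is rejected
by this check and would be passed straight to the real compiler.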

Now assume compilercache is to compile a single source file. First it
derives two sets of command line arguments from the original command
line.

STRIPPEDARGS is the original set of command line arguments without -c,
without -o, and without the file names.

IDENTARGS is the set of command line arguments that are necessary to
uniquely identify the corresponding output file, i.e. STRIPPEDARGS
without include paths, macro definitions, and library paths. The rule
for designing the IDENTARGS set is to include as many options as
needed and as few as possible: every option left out produces more
cache hits, but the set must still be large enough that the correct
output file is always returned.
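
The two sets could be derived roughly as in the following Python
sketch. It is only an illustration under stated assumptions: the name
split_args is made up, -o is assumed to be followed by a separate file
name, and include paths, macro definitions and library paths are
assumed to be written in the attached form (-Ipath, -DNAME, -Lpath).
The real script may recognize more option spellings.

```python
def split_args(args):
    """Return (strippedargs, identargs) for a compiler command line.

    Hypothetical sketch: assumes '-o FILE' is written as two arguments
    and that -I, -D and -L carry their value in the same argument.
    """
    stripped = []
    skip_next = False
    for arg in args:
        if skip_next:
            skip_next = False      # drop the file name that follows -o
        elif arg == "-o":
            skip_next = True       # drop -o itself
        elif arg == "-c" or not arg.startswith("-"):
            pass                   # drop -c and all file names
        else:
            stripped.append(arg)
    # IDENTARGS: STRIPPEDARGS minus include paths, macro definitions
    # and library paths.
    ident = [a for a in stripped if not a.startswith(("-I", "-D", "-L"))]
    return stripped, ident
```

For the arguments -O2 -Iinclude -DNDEBUG -c foo.c -o foo.o this gives
STRIPPEDARGS = -O2 -Iinclude -DNDEBUG and IDENTARGS = -O2.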

You can see the values of both sets by uncommenting the various
debugging blocks inside the compilercache script.

Now the preprocessor is called, using the -E option, the STRIPPEDARGS,
and the source file name.

Next, the output of the preprocessor, the version of the compiler, and
the IDENTARGS are put into a file, and the md5sum of this file is
computed. This md5sum is the file name of the cache entry. compilercache
then checks whether such a file already exists in the cache directory.
If it does, this file is taken as the output of the compiler run (i.e.
the .o file) and compilercache is done. If not, the normal compiler is
run, and if it produced no warnings and no errors, the result is put
into the cache as well as into the output file, and compilercache has
finished its work.
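
The lookup can be sketched in Python as follows. This is not the
script's own code: the function names, the in-memory hashing (the
script writes the three parts to a file and runs md5sum on it), and
the NUL byte used to separate the parts are assumptions made for the
example.

```python
import hashlib
import os
import shutil


def cache_key(preprocessed, compiler_version, identargs):
    """MD5 over preprocessor output, compiler version and IDENTARGS.

    preprocessed is bytes; the other two are str and list of str.
    The NUL separators are an assumption of this sketch.
    """
    h = hashlib.md5()
    h.update(preprocessed)
    h.update(b"\0")
    h.update(compiler_version.encode())
    h.update(b"\0")
    h.update(" ".join(identargs).encode())
    return h.hexdigest()


def fetch(cache_dir, key, output_path):
    """Copy the cache entry to output_path; return True on a cache hit."""
    entry = os.path.join(cache_dir, key)
    if os.path.isfile(entry):
        shutil.copyfile(entry, output_path)
        return True
    return False


def store(cache_dir, key, output_path):
    """Save a freshly compiled output file under its key."""
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copyfile(output_path, os.path.join(cache_dir, key))
```

On a miss the caller would run the real compiler and call store() only
if the compiler reported no warnings and no errors.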

That's it. There are three design criteria in compilercache, listed in
descending order of priority:

1) compilercache may NEVER return an output file that is not bit for
bit identical to the one the original compiler would have produced.

2) compilercache shall do exactly the same thing as the original
compiler, except that it sometimes takes much less time.

3) compilercache shall use the files from the cache as often as
possible and not recompile needlessly.

It is absolutely top priority that 1) and 2) are ALWAYS met under ALL
circumstances.

To explain 3) further, consider a source file with some newlines added
at the end. Of course the .o output file will be exactly the same, even
though the preprocessor output is different, so ideally the cache
should still produce a hit.

Now that you know the basic operation, let's discuss the advanced
topic of increasing the number of cache hits. A compiler is a tool
that performs a mapping. You have an infinite set of input source
files, denoted S=(S1, S2, ...), and an infinite set of output object
files, denoted O=(O1, O2, ...). The compiler is nothing more than a
relation that connects elements of S to elements of O. Let's discuss
this with an example:

S1 -> O1

S2 -> O2

S3 --\
S4 --->--> O345
S5 --/

What is shown here is that multiple different (in fact infinitely
many) source files map to the same output file. compilercache shall
not only know that S1 -> O1, S2 -> O2 and S3 -> O345, but also that S4
and S5 produce the same output as S3. That way, once you have compiled
S3, the results for S4 and S5 come from the cache instead of a
recompilation.

But is this relevant in practice? Well, consider an extremely big
project like the Linux kernel. Maybe there is a single include file
that almost everybody includes. Let's say it defines some very
fundamental integer types. Now, if a developer fixes a typo in a
comment ( /* this is myy integer */ --> /* this is my integer */ ), a
complete recompilation of the whole project is needed, even though
absolutely nothing changed in the output files! Often this leads
programmers to not fix typos in comments, which is not a good thing.

Another example, also from the Linux kernel, is a central include file
(autoconf.h) containing macros for the kernel configuration. This
single file contains definitions for ALL drivers in the kernel. The
drivers themselves are separate C source files, and each of them
looks at only a few of the macros in the central include file. Now if
you change a definition in the central include file, or add and remove
whitespace (like the Linux kernel configuration tools do), in theory
only a few driver source files would need to be recompiled. In
practice, everything is recompiled. The cache should put a stop to
this and use the cached values wherever possible.

So, now that you see that many source files map to the same output
file, and you are also convinced that it would be very useful for the
cache to detect those situations and produce many more cache hits,
let's discuss the techniques used to reach that goal.

First, it should be clear that the output of the preprocessor plus the
IDENTARGS command line options plus the compiler version uniquely
specify the corresponding output file. (Now you understand why
IDENTARGS does not contain paths and macro definitions: none of this
information is needed anymore once the preprocessor has finished its
job.)

How can we condense the preprocessor output further, so that the
corresponding output file is still uniquely specified but more source
files match?

First we must make sure that the output file has no direct links back
into the source file. What I mean is that you cannot reformat the
source file if the output file contains debugging information that
refers directly to line numbers in the source file. Otherwise, if you
added newlines to the source file and recompiled, the cache would
return the same output as before, and when you then ran your debugger,
the line number information would be wrong. So the following technique
is only activated if debugging options are turned off.

The preprocessor output still contains lines starting with a '#' (line
markers and #pragma lines). This may seem surprising for code that has
already been preprocessed, but it is the case. The only # lines that
have an effect on the resulting output file are the ones that start
with #pragma.

The preprocessor output is fed through a program called "unifier", and
the md5sum is then computed over the output of the unifier. The unifier
works like a C/C++/ObjC lexer. It has a get_token() function which
returns a string corresponding to the next token in the source file.
get_token() ignores all comments and also strips the useless # lines
(but it returns the #pragma lines as tokens!). The main loop of the
unifier calls get_token(), prints the token, prints a newline, and
repeats until the end of the file.

For example, if the following is the input to the unifier:
--------
# 1 "/home/erik/kernel-source-2.4.2/include/linux/autoconf.h" 1


static inline int spin_trylock(spinlock_t *lock)
#pragma implementation
--------
then the following will be the output:
--------
static
inline
int
spin_trylock
(
spinlock_t
*
lock
)
#pragma implementation
--------
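
To make the idea concrete, here is a simplified Python sketch of such
a unifier. It is an illustration only, not the actual unifier program:
the regular expression is a rough approximation of a C/C++/ObjC lexer,
and the names TOKEN_RE and unify are made up for this example.

```python
import re

# Rough approximation of a C-family lexer (assumption of this sketch).
TOKEN_RE = re.compile(r"""
    (?P<comment> /\*.*?\*/ | //[^\n]* )
  | (?P<pragma>  ^[ \t]*\#[ \t]*pragma\b[^\n]* )
  | (?P<marker>  ^[ \t]*\#[^\n]* )
  | (?P<token>
        "(?:\\.|[^"\\\n])*"
      | '(?:\\.|[^'\\\n])*'
      | [A-Za-z_]\w*
      | \.?\d(?:[eEpP][+-]|[\w.])*
      | <<= | >>= | \.\.\. | ->\*? | \+\+ | -- | << | >> | :: | \.\*
      | [-+*/%&|^<>=!]=
      | && | \|\|
      | \S
    )
""", re.VERBOSE | re.DOTALL | re.MULTILINE)


def unify(preprocessed):
    """Return one token per line; drop comments and non-pragma # lines."""
    out = []
    for m in TOKEN_RE.finditer(preprocessed):
        if m.lastgroup == "pragma":
            out.append(m.group().strip())   # keep the whole #pragma line
        elif m.lastgroup == "token":
            out.append(m.group())
        # comments and other # lines are skipped
    return "".join(tok + "\n" for tok in out)
```

Because whitespace and comments never reach the output, two inputs
that differ only in layout or comments produce the same text here, and
therefore the same md5sum.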

The unifier is the real power of compilercache. That is what make
and all the other tools cannot do. Cache the Linux kernel and the
Mozilla project :-)



-- 
Name:  Erik Thiele                                        \\\\
Email: erikyyy@erikyyy.de                                 o `QQ'_
WWW:   http://www.erikyyy.de/                              /   __8
                                                           '  `

