PCH, and more generally C++ parser performance


We've been discussing the performance problems of the C++ front end on
an internal Red Hat mailing list.  I was asked to open the discussion
up to a wider audience.

We are looking at a specific test case, provided by Apple.  Here's a
quote from the original letter:

> First, get SPU 0.5, which is in the sources.redhat.com CVS archive,
> under src/utils/spu.  It's configured automatically as part of a
> GDB checkout's configure, but not built, so you have to cd to
> <objdir>/utils/spu and do the make manually.  It's a very simple
> program, just consists of one file, so "cc spu.c" would work too.
> 
> Then do
> 
>   ./spu --language c++ \
>     --files 1 --functions 20 --function-length 10 \
>     --lib-enums 2000 --enumerators 8 \             
>     --lib-structs 1000 --fields 12 \
>     --lib-classes 800 --methods 20 \
>     --lib-functions 10000           

This generates a bunch of files named file<whatever>.  The ones we're
interested in are file0.cc and filelib.h.  file0.cc has 20 functions
of its own, but it includes filelib.h, which has ten thousand or so
more.  This
is said to approximate the structure of some real-world,
class-library-heavy code.

Once you have all these files, you can compile them several different
ways, but the one I've been using is just

$ time ./cc1plus -lang-c++ -quiet file0.cc -o file0.s

This cc1plus has the preprocessor built into it.  If you don't want to
muck with the 7,000 line patch to get an integrated preprocessor, you
can run file0.cc through gcc -E and then invoke cc1plus on the .ii
file; the numbers will be roughly the same.
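
Concretely, that two-step route looks like this (the driver picks C++
preprocessing from the .cc suffix):

$ gcc -E file0.cc -o file0.ii
$ ./cc1plus -lang-c++ -quiet file0.ii -o file0.s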

With gcc 2.95.2 on my machine, it takes ten seconds to compile file0.
With current CVS, it takes sixteen seconds.  This is x86-linux; Stan
Shebs gets similar numbers on ppc-macosx with comparable hardware.

The top ten, from gprof:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 16.00      2.55     2.55   285012     0.01     0.01  get_identifier
 10.10      4.16     1.61 14360005     0.00     0.00  ggc_set_mark
  8.16      5.46     1.30       18    72.22   193.19  ggc_mark_trees
  4.96      6.25     0.79        1   790.00 15902.99  yyparse
  3.51      6.81     0.56     5652     0.10     0.10  lookup_tag
  3.32      7.34     0.53  1404134     0.00     0.00  memset
  3.01      7.82     0.48  1861769     0.00     0.00  ggc_alloc
  2.32      8.19     0.37  2011020     0.00     0.00  lang_mark_tree
  2.01      8.51     0.32  1222221     0.00     0.00  make_node
  1.44      8.74     0.23   201127     0.00     0.00  walk_tree

I conclude from this that we have three targets to wail on:
get_identifier, garbage collection, and the parser.

(1) get_identifier is an easy one.  

[9]     17.6    2.55    0.25  285012         get_identifier [9]
                0.10    0.00  285012/388269      strlen [104]
                0.02    0.06   89982/1222221     make_node [21]
                0.01    0.07   89982/138858      ggc_alloc_string [110]

It's been called 285,000 times.  It called make_node 90,000 times.
That means that there were 90,000 unique identifiers in the sample
code.  get_identifier's hash table has only 1000 slots; even with a
perfect hash function, which it does not have, we would get 90-element
chains.  (An ideal binary search tree would have depth 16.)

Compare the performance of cpplib's hash table lookup routine, which
is doing more or less the same thing:

[83]     1.4    0.20    0.03  179920         _cpp_lookup_with_hash [83]
                0.01    0.00   37035/322515      memcpy [105]
                0.01    0.00       4/4           expand_hash [339]
                0.00    0.00     306/395         _obstack_newchunk [538]
                0.00    0.00       4/153205      cfree [115]

This uses a specialized implementation of Vladimir Makarov's
expandable hash tables, and a sane hash function that's calculated in
parallel with scanning the identifier.  1.4% of runtime instead of
17.6%.  It isn't being called as many times, but if it scales linearly
(and it should) it'd only need 0.31 seconds to do what get_identifier
is doing.  I intend to make get_identifier use cpplib's hash table
implementation, and share the table itself with cpplib when using an
integrated preprocessor, which should wipe out this bottleneck.
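
For concreteness, here is a minimal sketch of that technique - not
cpplib's actual code; the table size and mixing constants are only
illustrative - with the hash accumulated by the same loop that scans
the identifier:

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct ht_entry
{
  const char *name;
  size_t len;
  struct ht_entry *next;
};

#define NSLOTS 65536            /* a power of two, not 1000 */
static struct ht_entry *slots[NSLOTS];

/* Look up (or enter) the identifier starting at *PP, advancing *PP
   past it.  The hash is accumulated by the same loop that finds the
   end of the identifier, so the string is scanned exactly once.  */
static struct ht_entry *
lookup (const char **pp)
{
  const char *start = *pp;
  const char *p = start;
  unsigned int hash = 0;
  size_t len;
  struct ht_entry *e;
  char *copy;

  while (isalnum ((unsigned char) *p) || *p == '_')
    hash = hash * 67 + ((unsigned char) *p++ - 113);

  len = p - start;
  *pp = p;
  hash &= NSLOTS - 1;

  for (e = slots[hash]; e; e = e->next)
    if (e->len == len && memcmp (e->name, start, len) == 0)
      return e;

  /* Not there; enter it.  A real version would also grow the table
     when chains get long, as cpplib's does.  */
  e = malloc (sizeof *e);
  copy = malloc (len + 1);
  memcpy (copy, start, len);
  copy[len] = '\0';
  e->name = copy;
  e->len = len;
  e->next = slots[hash];
  slots[hash] = e;
  return e;
}

The point is simply that the scan, the length computation, and the
hash all fall out of one pass over the string.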

(2) Garbage collection.  Here the relevant question is, how much
memory are we using?  I instrumented mmap() and brk() and I have
pretty graphs which you can have if you're interested.  But this test
case is pretty darn pathological for memory usage.  We free very
little memory over the course of the run, and at the end we're using
45 megabytes of mmapped anonymous memory, plus nine megabytes
allocated with brk.  When you count program text, libraries, daemons,
and kernel-space memory, this test case is going to need more than 64
megs of RAM.  I have 256, so it didn't hit swap space on my system,
but it will on many people's.

To a first approximation, all of the memory allocated with mmap is
GCed arena:

                0.00    0.00       1/719         init_ggc [890]
                0.00    0.00       1/719         _IO_file_doallocate [971]
                0.00    0.00       3/719         read_file [1517]
                0.00    0.00       9/719         chunk_alloc [102]
                0.00    0.00     705/719         ggc_alloc [46]
[1188]   0.0    0.00    0.00     719         mmap [1188]

And there is simply no fast way to mark and sweep 45 megabytes of
data.  We could bum cycles out of ggc_set_mark and ggc_mark_trees; I
did a bit of that already, but the improvements are negligible.  Any
real improvement's got to come from reducing the amount of memory GC
has to inspect.

There are three ways I can think of to do that.  First, we could
reduce the internal fragmentation in ggc-page's allocator.  It rounds
everything up to a power of two.  On 32-bit machines, RTXes are 4+4n
bytes, where n is rtx_length[] of the particular type.  n=2 is very
common, leading to size 12, which rounds up to 16: a word of wasted
space per node.  Trees are even worse; most members of tree_union are
~20 bytes (rounded to 32), and two are ~100 bytes (rounded to 128).
We could go to a slab-ish allocation
scheme with tight packing, or we could figure out how to make the
common tree nodes smaller.  This wouldn't reduce the absolute quantity
of work to be done by the mark phase, but it would reduce the total
memory in use and so improve the cache situation or at least the paging
situation.
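
To make the arithmetic concrete, here is a toy calculation of the
rounding loss (the sizes are the 32-bit figures quoted above, not
read out of the real headers):

#include <stdio.h>

/* Round N up to the next power of two, as ggc-page's size classes
   do.  */
static unsigned int
round_up_pow2 (unsigned int n)
{
  unsigned int r = 1;
  while (r < n)
    r <<= 1;
  return r;
}

int
main (void)
{
  /* 12 = RTX with n=2 (4+4n); 20 and 100 = common and large tree
     nodes.  */
  static const unsigned int sizes[] = { 12, 20, 100 };
  unsigned int i;

  for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
    printf ("%3u-byte object -> %3u-byte bucket, %2u bytes wasted\n",
            sizes[i], round_up_pow2 (sizes[i]),
            round_up_pow2 (sizes[i]) - sizes[i]);
  return 0;
}

Run it and you get 25%, 37%, and 22% waste respectively - a decent
chunk of that 45 megabytes.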

Second, we could change to some sort of incremental or generational
collector.  This usually requires evil system-dependent hacks,
because you have to write-protect memory and intercept the faults.  I
proposed a simple two-generation scheme a while ago, where we would
explicitly
assert to GC that (for example) the saved structure for an inline
function would never again be modified nor was it to be deallocated.
The mark phase could then ignore it.  Mark Mitchell thought that the
necessary invariant didn't happen often enough to be worth it, but
maybe we could change things so it did.
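
To show the shape of that proposal - ggc_freeze and everything else
below is hypothetical, none of these names exist in GCC - the
asserted invariant might look like:

#include <stdlib.h>

/* Registry of objects asserted immutable and immortal; the mark
   phase would skip everything recorded here instead of retracing
   those megabytes of trees on every collection.  */
struct frozen_set
{
  void **objs;
  size_t n, cap;
};

static struct frozen_set frozen;

/* Assert that OBJ (e.g. the saved tree of an inline function) will
   never again be modified or deallocated.  Everything reachable from
   OBJ must be frozen at the same time - a real implementation would
   trace OBJ's closure once, right here, and pin those pages - or the
   sweep could free memory the frozen object still points to.  */
void
ggc_freeze (void *obj)
{
  if (frozen.n == frozen.cap)
    {
      frozen.cap = frozen.cap ? frozen.cap * 2 : 64;
      frozen.objs = realloc (frozen.objs,
                             frozen.cap * sizeof *frozen.objs);
    }
  frozen.objs[frozen.n++] = obj;
}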

Third, we could not generate so much data in the first place.  Most of
that 45 megs is saved tree representations of inline functions.  Most
of those inline functions are never used.  If we could avoid parsing
them in the first place, wouldn't that be nice?  Here's where you're
all thinking precompiled headers.  That's the right direction, but a
naive PCH implementation - like the one Cygnus had that blindly dumped
the compiler's data space to disk - won't help much here.  We'd have
to reread and possibly unpickle all 45 megs on each compiler invocation,
and once it's in, the garbage collector would have to deal with it.

Apple inherited a remarkably clever PCH implementation from NeXT.  It
can figure out which functions the compiler actually needs to see, and
present it with only those.  Alas, it only works with C; but Stan
reports that for the C version of the above example, compiler runtime
with PCH is less than a second.  (Part of that might be the simpler C
front end, though.)  We can do something similar without necessarily
implementing PCH.  The C++ front end already has logic to delay
parsing inline functions until convenient.  It is used only to work
around some of the horrors of parsing C++ at the moment, but it could
be applied to every inline function, and they could be delayed until
actually required instead of just until the context for parsing them
is better.  The unparsed structure is somewhat smaller, and much
simpler, than the tree it becomes.

I like this idea a lot.  I may not have time or sufficient understanding
of the C++ front end to implement it - I'm going back to grad school
in three weeks - but I think it's worth examining in detail.

(3) The parser is also bloody slow; we're spending 0.79 seconds just in
yyparse.  This is mostly the fault of C++ the language.  Improvements
in this area will come from giving it less text to grovel through -
another advantage of the above idea about delaying parsing until
needed.  (The code that saves inline functions for later uses a
trivial brace-counting algorithm.)
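
That trick is simple enough to show in full.  Here is a standalone
sketch, operating on a NUL-terminated buffer rather than the real
token stream, and assuming the usual C/C++ literal and comment
syntax:

#include <stddef.h>

/* Skip from an opening '{' to its matching '}', counting nesting.
   Braces inside string/character literals and comments must not
   affect the count.  Returns a pointer just past the matching '}',
   or NULL if the buffer ends first.  */
static const char *
skip_function_body (const char *p)
{
  int depth = 0;

  if (*p != '{')
    return NULL;
  while (*p)
    {
      switch (*p)
        {
        case '{':
          depth++;
          break;
        case '}':
          if (--depth == 0)
            return p + 1;
          break;
        case '"':
        case '\'':
          {
            /* Skip a string or character literal, honoring
               backslash escapes.  */
            char quote = *p++;
            while (*p && *p != quote)
              p += (*p == '\\' && p[1]) ? 2 : 1;
            if (!*p)
              return NULL;
            break;
          }
        case '/':
          if (p[1] == '/')
            {
              while (*p && *p != '\n')
                p++;
              if (!*p)
                return NULL;
            }
          else if (p[1] == '*')
            {
              p += 2;
              while (*p && !(*p == '*' && p[1] == '/'))
                p++;
              if (!*p)
                return NULL;
              p++;              /* now at the closing '/' */
            }
          break;
        }
      p++;
    }
  return NULL;
}

The skipped span of text can then be stashed away and handed back to
the parser only if the function turns out to be needed.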

It's instructive to note that if you use byacc instead of bison, the
generated parser is significantly faster - by itself.  The difference
isn't enough to show up in the total wall clock time.

  4.96      6.25     0.79        1   790.00 15902.99  yyparse (bison)
  3.57      7.20     0.57        1   570.00 15916.08  yyparse (byacc)

If you're desperate, turning off YYDEBUG might also help.
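
Bison compiles its trace code in only when YYDEBUG is nonzero, so
turning it off is a single define - in the grammar's prologue or on
the compile line; the exact placement here is an assumption:

#define YYDEBUG 0   /* compile out bison's runtime trace hooks */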

zw
