RFC: identifier GC

Richard Guenther richard.guenther@gmail.com
Sat Mar 22 14:41:00 GMT 2008


On Fri, Mar 21, 2008 at 8:26 PM, Tom Tromey <tromey@redhat.com> wrote:
> This patch changes identifier nodes to be truly garbage collected.  I
>  need this for the incremental compiler, so that memory use in the
>  server does not grow without bound.
>
>  However, it is also fairly independent from the other changes and
>  seems worthwhile.  With this it is pretty common to see the GC collect
>  200-300k after parsing is complete.
>
>  The basic idea is to use the struct hack for storing the contents of
>  the identifier, like we do for STRING_CST.  After parsing is complete,
>  we turn the string pool hash from a root into a weak set, allowing
>  collection.

Good.  (please use diff -p for the patches, it makes it easier to spot the
place of the change)

>  There are a few oddities though.  I worry that these will be
>  off-putting.
>
>  * Identifier nodes have different sizes in different front ends.  So,
>   I added a new field to the langhook structure so that the generic
>   code knows how to compute the offset of the string pointer.  There's
>   a new requirement on front ends that the last field of
>   lang_identifier be a struct ht_identifier.
>
>   I considered doing something simpler here -- just ggc_alloc the
>   string data and let the GC do its thing, but that wastes a half word
>   per identifier on average (since currently strings are allocated on
>   an obstack with no padding, AIUI).  I'm happy to change this if the
>   rest of this is too gross for you...

It looks somewhat gross - did you consider not embedding ht_indentifier
but just storing a pointer to it?  That's basically the same as ggc_allocing
the string data but would make it possible to "move" the identifiers to
other structs than tree_identifier.  I wouldn't worry about the waste as
at the moment we waste a pointer to the string as well (and the alignment
might turn out as a performance advantage anyway).

>  * IDENTIFIER_POINTER now has to add in this offset from the langhook.
>   I.e., the offset to the string data varies by front end.
>
>   Anywhere you use IDENTIFIER_POINTER you must include langhooks.h.
>
>   Also, debugging is a little weird now... what I do is cast the tree
>   to a 'struct lang_identifier *', then grovel around for the string
>   data.

a little gross, as you say ;)

>  * There was some code assuming that the result of ggc_alloc_string (or
>   ggc_strdup) was uncollectable.  This is a bad practice to begin
>   with, so in a sense this change is ferreting out bugs :-)
>   Anyway, in the new regime, these things must be treated exactly like
>   any other GC-allocated memory.  I don't consider this a huge
>   problem, since we're all used to following those rules for other
>   objects already.

IMHO this should go in separately.

>  * The g++ get_identifier_nocopy had to go.  And btw, in the future,
>   please don't hide hacks like this away from the rest of the
>   functions associated with a data structure -- this, plus a rogue use
>   of ggc_strdup in g++, took me 3 painful days of debugging to find.

Likewise.  I also had fun with this part somewhen in the past.  Can you
measure the effect on  compile-time and memory usage of this change?
I guess building libstdc++ or something similar (tramp3d..).

>  * I didn't make the cpp hash table support deletion.  So, there's a
>   bit of churn at each GC where we re-allocate the backing array and
>   re-hash.  Few collections actually collect no identifiers (a mildly
>   surprising fact), though, so I left it as is.  This is simple to
>   change.

I didn't understand why we need to re-hash?  Can you explain that part?

>  One omission here is that I have not done any performance testing.
>  Maybe someone could say what the norm is here.  Do we have a standard
>  compile-time benchmark?  Bootstrap?

Bootstrap is ok, for this kind of change I'd measure -O0 compile-time
on cc1files.
Though most of the changes look like they should affect C++ the most, so -O0
libstdc++ build would also apply.

I don't like the idea of the langhook and the struct hack too much (but can't
approve anything here anyway).  IMHO maintainability (and code readability)
should have preference over saving an extra pointer per tree identifier.

Thanks,
Richard.



More information about the Gcc-patches mailing list