String pool

Zack Weinberg zackw@Stanford.EDU
Sun Oct 29 11:03:00 GMT 2000


On Sun, Oct 29, 2000 at 10:16:46AM -0800, per@bothner.com wrote:
> "Zack Weinberg" <zackw@stanford.edu> writes:
> 
> > The original goal of this patch was to save memory by not padding
> > strings to powers of two, or allocating multiple copies of the same
> > string.  In my tests, we don't actually save anything, because the
> > overhead of maintaining the hash table is substantial: 1.5 megs of
> > strings takes roughly 3 megs of hash table.  (Consider that each hash
> > entry takes three words on a 32-bit system, which is 12 bytes, which
> > is comparable to the length of many identifiers; and that there's
> > unavoidably many unused slots in that hash table.)
> 
> Have you considered replacing struct str_header by struct tree_identifier?
> For those strings that actually are identifiers, that saves you 3
> words.

This, or something similar, might be a logical thing to do as a future
improvement.  I haven't done it now because it has knock-on effects on
garbage collection - all the places that call ggc_set_mark on
IDENTIFIER_NODEs have to be changed to not do that.  This is harder
than it looks, because of e.g. the ridpointers array in the C++ front
end that has both identifier nodes and other strange stuff in it.

> Yes, you do waste some space for such strings that are not identifiers.
> However, my guess the overwhelming majority of strings are actually
> identifiers.

Your guess is correct.  However, a large number of those strings may
cease being identifiers - there's no particular reason not to have
DECL_ASSEMBLER_NAME point straight to the string, for instance.
Again, this is a future improvement.

> Furthermore, the wasted space for non-identifiers is modest:
> sizeof(struct tree_identifier) - sizeof(struct str_header), which I
> calculate to 5-3 = 2 words extras waste for non-identifiers.

That's the generic struct tree_identifier; it's enlarged by each
language, to 12 words in C and 11 in C++ (but one of those is a
pointer to an auxiliary structure, so really 14).  That'd be a much
larger overhead.

What I'd been thinking of doing, long term, is having the data field
of struct str_header point directly to the language-specific data
structure.  The chain and type fields of an IDENTIFIER_NODE are not
used, we know what the code is, and the flag bits can be moved into
the language-specific structure if necessary.  [We have to figure out
what to do with TREE_ADDRESSABLE and TREE_SYMBOL_REFERENCED.]

zw


More information about the Gcc-patches mailing list