Compiler identifier hashtable improvements (and ObjC cleanup)

Neil Booth neil@daikokuya.demon.co.uk
Wed May 16 00:05:00 GMT 2001


Hi Zack,

Zack Weinberg wrote:-

> You're still allocating these via make_node.  You might want to
> consider moving them into the obstack with the strings, since they can
> never be garbage collected anyway.  This would save some memory;
> struct lang_identifier in C is 48 bytes, in C++ 44.  They come out of
> the size-64 pool, so we're wasting 16-20 bytes per, and we can
> allocate thousands of them.

Yes I tried this; however it confuses garbage collection which expects
all trees to have been allocated from its pages.  I had it storing the
strings immediately after the tree structure too, until I realised GC
got confused.  Doing this saves half a pointer on average, since there
is no longer a need to store a pointer to the string, and you waste
half a pointer on average from alignment.  Identifier node to char *
needs to become a subroutine if you do this.  Since we're allocating
plain strings from the same obstack too, and so losing half a pointer
for them, storing them adjacently in the obstack might not be worth
it.  Maybe 2 obstacks, one for IDENTIFIER_NODEs and one for text is a
better plan.
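
For illustration only, here is a minimal sketch of the "text stored
immediately after the node" layout described above.  The struct, its
fields and the function names are hypothetical, not GCC's real
identifier node, and malloc merely stands in for the obstack.  It shows
why node-to-char * becomes a subroutine, and where the alignment
padding comes from.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an identifier node; not GCC's layout.  */
struct ident_node
{
  void *common[2];      /* placeholder for the shared tree fields */
  unsigned int length;  /* length of the text, excluding the NUL */
};

/* Round SIZE up to a multiple of the pointer size.  This rounding is
   the source of the alignment waste: between zero and one pointer
   minus a byte per allocation.  */
static size_t
round_up_to_pointer (size_t size)
{
  size_t align = sizeof (void *);
  return (size + align - 1) / align * align;
}

/* Allocate a node with its text stored directly after it, so no
   separate pointer to the string is needed.  malloc stands in for
   the obstack here (an assumption made for the sketch).  */
static struct ident_node *
alloc_ident (const char *text, size_t len)
{
  size_t size = round_up_to_pointer (sizeof (struct ident_node) + len + 1);
  struct ident_node *node = malloc (size);
  char *dest;

  if (node == NULL)
    return NULL;
  memset (node, 0, sizeof *node);
  node->length = (unsigned int) len;
  dest = (char *) (node + 1);
  memcpy (dest, text, len);
  dest[len] = '\0';
  return node;
}

/* With no stored pointer, getting at the text is a computation
   rather than a field access.  */
static const char *
ident_text (const struct ident_node *node)
{
  return (const char *) (node + 1);
}
```

With two obstacks instead, the node would go back to holding an
ordinary pointer into the text obstack and the accessor would reduce to
a field read.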

How much of struct identifier's common is really unused?  Can they
never be chained?  Is the type pointer never meaningful?  If so, I can
(for the C front ends at least) use these for cpplib's information.

We should be able to move the rid enum to use 1 byte of the common
structure; because of alignment that will save a whole pointer at
present.
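
A rough sketch of the alignment argument, using made-up structures
rather than the real tree_common and lang_identifier (all names below
are hypothetical, and exact sizes depend on the ABI): an enum stored
after a pointer member gets padded out to a full pointer, whereas 8
spare bits in the common flags word cost nothing.

```c
#include <stddef.h>

/* Hypothetical reserved-word codes; fewer than 256, so 8 bits do.  */
enum rid_example
{
  RID_NONE_EXAMPLE = 0,
  RID_STATIC_EXAMPLE,
  RID_CONST_EXAMPLE
};

/* Stand-in for the common part, with 8 bits of its flags word unused.  */
struct common_example
{
  void *chain;
  void *type;
  unsigned int code : 8;
  unsigned int flags : 16;
};

/* Layout A: the rid code is a separate member after a pointer, so the
   compiler pads the struct back up to pointer alignment.  */
struct ident_separate
{
  struct common_example common;
  const char *text;
  enum rid_example rid;
};

/* Layout B: the rid code lives in the 8 spare bits of the common part.  */
struct common_with_rid
{
  void *chain;
  void *type;
  unsigned int code : 8;
  unsigned int flags : 16;
  unsigned int rid : 8;
};

struct ident_packed
{
  struct common_with_rid common;
  const char *text;
};

/* Accessors, so users need not know where the code is stored.  */
static void
set_rid (struct ident_packed *id, enum rid_example rid)
{
  id->common.rid = (unsigned int) rid;
}

static enum rid_example
get_rid (const struct ident_packed *id)
{
  return (enum rid_example) id->common.rid;
}
```

On common 32-bit and 64-bit ABIs, sizeof (struct ident_packed) should
come out one pointer smaller than sizeof (struct ident_separate).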

> The garbage collector would then have to be adjusted so it never
> marked an identifier_node, and the code which marks from the
> stringpool would need to go straight to the things the identifier_node
> points at.

Ah, yes.  I hadn't considered doing that; one step at a time :-) It'd
probably be worth it; let's do this last, after we get CPP involved.

> Given that you changed ggc_alloc_string not to go through the hash
> table anymore, how do we get non-empty entries that haven't gone
> through get_identifier?

We don't, but we only store the string space (permanently - it is not
garbage collected) not the tree.  So something else would have a
reference to a used node.  Or am I missing something that could cause
problems?

> I understand that this works, but I'm not clear on why.  This sounds
> like the way it used to work, which was broken because these
> identifiers were used in the protocol context, stored in trees, then
> examined (by grokdeclarator) outside the protocol context.  At that
> point they'd stopped being magic.

I take it you mean were used in protocol context as reserved words,
not as identifiers?

Something similar was happening to me [about 6 testcases would fail,
right?] until I put the check in yylexname, and kept the identifiers
always flagged as RIDs.  The hash entries are still available for use
as identifiers and contain the identifier information; just that they
are not recognised as such within the parser at the appropriate point.
The parser just wants to see the correct YACC code returned.  When I
did this, the grokdeclarator issues and the regressions went away.  I
admit I don't fully understand the way the C front end handles types
and grokdeclarator to be certain it's 100% safe; but the lack of
regressions seemed to validate it to some extent.
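
To make the idea concrete, here is a simplified sketch; it is not the
actual yylexname code, and every name in it is invented for
illustration.  The hash entry stays flagged as a reserved word
permanently, and only the token code handed to the parser depends on
the current context.

```c
#include <stddef.h>

/* Hypothetical token codes standing in for the YACC codes.  */
enum token_code
{
  TOK_IDENTIFIER,
  TOK_KEYWORD
};

/* Hypothetical hash entry: the RID flag is set once and never
   cleared, so the entry itself does not change with context.  */
struct ident_entry
{
  const char *name;
  unsigned int is_rid : 1;        /* always flagged as a reserved word */
  unsigned int context_only : 1;  /* a keyword only in protocol context */
};

/* Set by the parser on entering and leaving a protocol context
   (an assumption made for this sketch).  */
static int in_protocol_context;

/* Decide which token code the parser sees.  The entry is left
   untouched; only the returned code varies.  */
static enum token_code
classify_ident (const struct ident_entry *id)
{
  if (id->is_rid && (!id->context_only || in_protocol_context))
    return TOK_KEYWORD;
  return TOK_IDENTIFIER;
}
```

Outside the protocol context such an entry is returned as a plain
identifier, yet anything that examines it later still sees the RID
flag.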

> Careful; some idioms can produce many copies of the same string.  For
> example, the old assert() macro generated the same string constant
> every time it was used.  The Linux kernel's BUG() macro has the same
> problem.
> 
> This does not mean we need to handle them with the identifier hash
> table; in fact it's probably best if we don't.  I do think some code
> should prevent duplicates.  We already have code in varasm.c to
> prevent _emitting_ the same string more than once per file, perhaps it
> can be persuaded to do this job too.

Hmm.  You may need to help me in that area.

> The patch does look nice and I look forward to the unified symbol
> handling between cpplib and front ends.

Me too.  Thanks,

Neil.


