Compiler identifier hashtable improvements (and ObjC cleanup)

Tue May 15 23:35:00 GMT 2001

On Tue, May 15, 2001 at 10:33:19PM +0100, Neil Booth wrote:
> Further to what Zack and I were discussing a couple of days ago, I bit
> the bullet and converted GCC's identifier hashtable so that an entry
> is now one of the following 3 things:
> 
> o an empty entry is a NULL_TREE
> o a non-empty entry that has undergone a get_identifier () call is
>   a pointer to an IDENTIFIER_NODE
> o a non-empty entry that has not undergone a get_identifier () call is
>   a pointer to an IDENTIFIER_NODE, but that temporarily has its "code"
>   set to ERROR_MARK instead of IDENTIFIER_NODE.
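
For reference, the three cases above can be written down as a small
classifier.  This is only a sketch: struct ident_node, classify_entry
and promote_entry are made-up names, the node is cut down to two
fields, and I am assuming the first get_identifier () call is what
flips the code from ERROR_MARK to IDENTIFIER_NODE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for GCC's tree codes; only the two needed here.  */
enum tree_code { ERROR_MARK, IDENTIFIER_NODE };

/* Hypothetical, heavily reduced node; the real one carries more fields.  */
struct ident_node {
  enum tree_code code;
  const char *name;
};

enum entry_state { ENTRY_EMPTY, ENTRY_PENDING, ENTRY_IDENTIFIER };

/* Classify one hash table slot according to the three cases listed above.  */
static enum entry_state
classify_entry (const struct ident_node *slot)
{
  if (slot == NULL)
    return ENTRY_EMPTY;          /* NULL_TREE: nothing stored here */
  if (slot->code == ERROR_MARK)
    return ENTRY_PENDING;        /* not yet seen by get_identifier () */
  assert (slot->code == IDENTIFIER_NODE);
  return ENTRY_IDENTIFIER;
}

/* Assumed model of the first get_identifier () call on a pending slot.  */
static struct ident_node *
promote_entry (struct ident_node *slot)
{
  assert (slot != NULL);
  slot->code = IDENTIFIER_NODE;
  return slot;
}
```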

You're still allocating these via make_node.  You might want to
consider moving them into the obstack with the strings, since they can
never be garbage collected anyway.  This would save some memory;
struct lang_identifier in C is 48 bytes, in C++ 44.  They come out of
the size-64 pool, so we're wasting 16-20 bytes per identifier, and we
can allocate thousands of them.
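
The arithmetic, spelled out as a sketch.  The sizes are the ones
quoted above; wasted_bytes is a made-up helper, and any identifier
count you feed it is your own assumption, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted above; 64 is the pool bucket these nodes land in.  */
enum { POOL_BUCKET = 64, C_IDENT_SIZE = 48, CXX_IDENT_SIZE = 44 };

/* Bytes lost to rounding when COUNT objects of OBJ_SIZE bytes each
   occupy one BUCKET-sized slot apiece.  */
static size_t
wasted_bytes (size_t obj_size, size_t bucket, size_t count)
{
  assert (obj_size <= bucket);
  return (bucket - obj_size) * count;
}
```

With an assumed 10,000 identifiers that comes to 160,000 bytes for C
and 200,000 for C++.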

The garbage collector would then have to be adjusted so it never
marked an identifier_node, and the code which marks from the
stringpool would need to go straight to the things the identifier_node
points at.

Given that you changed ggc_alloc_string not to go through the hash
table anymore, how do we get non-empty entries that haven't gone
through get_identifier?

> The Objective C front-end has 6 reserved words (protocol qualifiers)
> that are only reserved in certain contexts.  It handled this by
> switching the tree pointers of the string headers of those identifiers
> in the hashtable when entering and leaving those contexts.  This
> involved tricky tree handling and 2 new gc roots, and 6 hash table
> lookups on each entry and exit from those contexts.
> 
> This patch replaces that mechanism with something more efficient -
> we simply toggle a boolean flag when entering and leaving those
> contexts, leaving the identifiers flagged as reserved words.  If we
> notice a reserved word that is one of these identifiers, we only treat
> it as reserved if we're in the right context according to the boolean
> flag.

I understand that this works, but I'm not clear on why.  This sounds
like the way it used to work, which was broken because these
identifiers were used in the protocol context, stored in trees, then
examined (by grokdeclarator) outside the protocol context.  At that
point they'd stopped being magic.
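
As I read the description, the flag scheme amounts to something like
the sketch below.  This is not the patch's code: in_protocol_context
and treat_as_reserved are made-up names, and I use a string compare
where the real thing would test a flag on the identifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, toggled on entry to and exit from the contexts
   in which the protocol qualifiers are reserved.  */
static bool in_protocol_context = false;

/* The six Objective-C protocol qualifiers.  */
static const char *const protocol_qualifiers[] = {
  "in", "out", "inout", "bycopy", "byref", "oneway"
};

/* Stand-in for "this identifier is permanently flagged as one of the
   context-sensitive reserved words".  */
static bool
is_protocol_qualifier (const char *name)
{
  size_t i;
  size_t n = sizeof protocol_qualifiers / sizeof protocol_qualifiers[0];

  for (i = 0; i < n; i++)
    if (strcmp (name, protocol_qualifiers[i]) == 0)
      return true;
  return false;
}

/* True if NAME should be treated as a reserved word right now.  */
static bool
treat_as_reserved (const char *name)
{
  assert (name != NULL);
  return in_protocol_context && is_protocol_qualifier (name);
}
```

Note the answer depends on when the question is asked: the same name
gives a different result once the flag has been cleared, which is the
situation I am worried about with grokdeclarator.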

> Strings are now allocated from an obstack, and don't get hashed in the
> hashtable.

Careful; some idioms can produce many copies of the same string.  For
example, the old assert() macro generated the same string constant
every time it was used.  The Linux kernel's BUG() macro has the same
problem.

This does not mean we need to handle them with the identifier hash
table; in fact it's probably best if we don't.  I do think some code
should prevent duplicates.  We already have code in varasm.c to
prevent _emitting_ the same string more than once per file; perhaps it
can be persuaded to do this job too.
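
To illustrate the kind of duplicate prevention I mean, here is a
minimal sketch.  It is not the varasm.c code; seen_string and
intern_string are made-up names, and it uses malloc and a linear list
where a real version would presumably hash and allocate from the
obstack.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of a string constant already seen in this file.  */
struct seen_string {
  struct seen_string *next;
  size_t len;
  char text[];                  /* LEN bytes plus a trailing NUL */
};

static struct seen_string *seen_strings = NULL;

/* Return a shared copy of the LEN bytes at TEXT, allocating only the
   first time a given constant is seen.  Lengths are compared first so
   that strings with embedded NULs are handled.  */
static const char *
intern_string (const char *text, size_t len)
{
  struct seen_string *s;

  assert (text != NULL);
  for (s = seen_strings; s != NULL; s = s->next)
    if (s->len == len && memcmp (s->text, text, len) == 0)
      return s->text;           /* duplicate: reuse the earlier copy */

  s = malloc (sizeof *s + len + 1);
  if (s == NULL)
    abort ();
  s->len = len;
  memcpy (s->text, text, len);
  s->text[len] = '\0';
  s->next = seen_strings;
  seen_strings = s;
  return s->text;
}
```

With this, a thousand expansions of an assert-style macro that all
produce the same text would cost one allocation, not a thousand.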

The patch does look nice and I look forward to the unified symbol
handling between cpplib and front ends.

zw