This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Merge cpplib and front end hashtables, part 1
On Sun, May 13, 2001 at 09:25:21PM +0100, Neil Booth wrote:
> Oddly enough, what started me looking at this hashtable stuff was
> being distracted from looking at string handling in the front ends.
> We do a ridiculous amount of copying of the text of string literals in
> their journey from CPP to the front end tree structure; a minimum
> (assuming no reallocation during lexing) of 4 times for single
> strings, and 5 for concatenated strings.
That's awful, although string literals are unusual enough that it
probably isn't a performance issue. Here are some numbers: of 9,571,612
lines of source code (counted by wc -l) I have lying around (gcc, gdb,
binutils, glibc, XFree86, and Linux kernel), 536,722 lines (5.61%)
match grep -c '"'; 50,490 lines (0.53%) match grep -c "'.'". I didn't
just grep for "'" because that would get many false positives from
apostrophes in comments.
> I think it should be possible to cut out at least one copy, and maybe
> two. If we are going to handle arbitrary charsets on input, though,
> things may get more complicated. I'm not at all clear about how we
> intend to handle various charsets.
It's particularly nasty inside character constants, where the user may
well want a string written in encoding X to bloody well *remain* in
encoding X in the object file, but we have to do some sort of
conversion if only to find the close quote.
I put a brain dump on charset handling into the cpplib projects web
page. It remains a pretty good statement of what I think our end goal
should be in terms of user-visible behavior. It'd be reasonable to do
a subset of this stuff to begin with, then get better as things go on.
> I also want to move the job of combine_strings to c-lex.c - IMO that's
> the natural place for string concatenation.
*nod* Although, do watch out for L"" and @"". If I remember
correctly, they're contagious - "foo" L"bar" "baz" is valid and
equivalent to L"foobarbaz"...
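A small check of the "contagious" behaviour for L"", as a sketch rather than a statement about what any GCC version of the time accepted. It assumes a C99-or-later compiler: C99 defines mixed narrow/wide concatenation as producing a wide literal, whereas C90 left it undefined. The names mixed, wide and mixed_matches_wide are made up for illustration, and @"" (Objective-C) is not covered.

```c
#include <assert.h>
#include <wchar.h>

/* Adjacent string literals are joined during translation.  Under C99 and
   later, a single wide literal in the sequence makes the whole result wide,
   so this initializer is a wide string.  */
const wchar_t mixed[] = "foo" L"bar" "baz";
const wchar_t wide[]  = L"foobarbaz";

/* Returns nonzero when both spellings yield the same wide string,
   including the terminating null.  */
int mixed_matches_wide(void)
{
    return sizeof mixed == sizeof wide && wcscmp(mixed, wide) == 0;
}
```

Any code that takes over the job of combine_strings would have to scan the whole run of literals before deciding whether the result is narrow or wide, since the wide literal can appear anywhere in the sequence.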
> It has another benefit, too: the __func__ cannot-be-concatenated bug
> gets fixed transparently.
Didn't Nathan Sidwell fix that already?
> However, it has a complication in that it involves a token of
> lookahead - we'd need to keep track of the location of the prior
> token in case CPP doesn't hand us another string literal.
Have you been following the header-names discussion on comp.std.c? We
may need more than that :(
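To make the one-token-lookahead point concrete, here is a minimal sketch of concatenation done in the lexer. Everything in it is hypothetical (struct token, struct lexer, raw_next, lex_next); it is not cpplib's or c-lex.c's interface, it reads from a pre-lexed array instead of CPP, and it ignores L"" and @"". The only point is that the token that ended the run of strings has to be parked somewhere, location included.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum tok_kind { TOK_STRING, TOK_OTHER, TOK_EOF };

struct token {
    enum tok_kind kind;
    const char *text;   /* contents for TOK_STRING, spelling otherwise */
    int line;           /* source location, reduced to a line number */
};

struct lexer {
    const struct token *toks;  /* hypothetical input: array ending in TOK_EOF */
    size_t pos;
    struct token pending;      /* the one token of lookahead */
    int has_pending;
};

/* Fetch the next raw token; TOK_EOF is returned repeatedly at the end.  */
static struct token raw_next(struct lexer *lx)
{
    struct token t = lx->toks[lx->pos];
    if (t.kind != TOK_EOF)
        lx->pos++;
    return t;
}

/* Return the next token with adjacent string literals merged into one.
   For a string token, *joined receives a malloc'd buffer that the caller
   must free; otherwise it is set to NULL.  */
struct token lex_next(struct lexer *lx, char **joined)
{
    struct token t;
    size_t len;
    char *buf;

    *joined = NULL;
    if (lx->has_pending) {
        t = lx->pending;
        lx->has_pending = 0;
    } else {
        t = raw_next(lx);
    }
    if (t.kind != TOK_STRING)
        return t;

    len = strlen(t.text);
    buf = malloc(len + 1);
    if (buf == NULL)
        abort();
    memcpy(buf, t.text, len + 1);

    for (;;) {
        struct token peek = raw_next(lx);
        size_t add;

        if (peek.kind != TOK_STRING) {
            /* Not a string: park it, with its own location, for next time.  */
            lx->pending = peek;
            lx->has_pending = 1;
            break;
        }
        add = strlen(peek.text);
        buf = realloc(buf, len + add + 1);
        if (buf == NULL)
            abort();
        memcpy(buf + len, peek.text, add + 1);
        len += add;
    }

    t.text = buf;       /* merged token keeps the first literal's location */
    *joined = buf;
    return t;
}
```

The merged token reports the line of the first literal, while the parked token keeps its own line, so diagnostics for either still point at the right place.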
> > What I'd like to do is merge the data carried in the hash node with
> > the data carried in the tree node - both language-dependent and
> > language-independent. At that point the 'hash node' is a lot bigger,
> > and storing it directly in the table ceases to be a good idea.
> > Instead, I'd allocate it alongside the string itself, thus avoiding
> > the extra allocation you're worried about. Ideally we wouldn't need
> > the string pointer anymore; practically, the variable size of the
> > structure makes this unlikely to work out.
>
> Do you intend that your identifier nodes remain as tree structures?
Maybe, maybe not. An unboxed 'struct identifier' might work, too.
> I assume you mean making the hash node a struct tree_identifier
> instead? I don't think we need to worry about a variable-length
> struct; we could just have a per-frontend stringpool.c if we kept
> all the sizing logic hidden in there, which is the way I'm basically
> moving anyway.
I meant variable-sized in the sense that each front end's struct
lang_identifier is different, and middle-end code has to have hacks to
cope. See set_identifier_size etc.
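A rough sketch of "allocate it alongside the string itself" when the node size is only known at run time. The names here (struct ident_common, ident_set_size, ident_alloc, ident_name) are invented for illustration and are not GCC's; in particular ident_set_size is only a stand-in for whatever mechanism tells shared code how big a front end's identifier node is. The spelling is stored directly after the node, so no separate string pointer is kept, but its address has to be computed from the run-time size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical language-independent part of an identifier node.  A front
   end's own node is assumed to begin with these fields and add more.  */
struct ident_common {
    unsigned int hash;   /* cached hash of the spelling */
    size_t len;          /* length of the spelling, excluding the NUL */
};

/* Size of the front end's full node; must be set before any allocation.  */
static size_t ident_size = sizeof(struct ident_common);

void ident_set_size(size_t size)
{
    assert(size >= sizeof(struct ident_common));
    ident_size = size;
}

/* Allocate node and spelling as one zero-filled block: the node first,
   then the NUL-terminated string.  Returns NULL on allocation failure.  */
struct ident_common *ident_alloc(const char *str, size_t len)
{
    struct ident_common *n = calloc(1, ident_size + len + 1);
    if (n == NULL)
        return NULL;
    n->len = len;
    /* The terminating NUL is already there courtesy of calloc.  */
    memcpy((char *)n + ident_size, str, len);
    return n;
}

/* The spelling lives at a run-time offset, not at a fixed struct member.  */
const char *ident_name(const struct ident_common *n)
{
    return (const char *)n + ident_size;
}
```

With a node size fixed at compile time the string could simply be a trailing array member of the struct. Because the size differs per front end, every access goes through something like ident_name, which is the kind of complication the variable-size remark is about.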
[snip stuff discussed elsewhere]
--
zw The beginning of almost every story is actually a bone, something with
which to court the dog, which may bring you closer to the lady.
-- Amos Oz, _The Story Begins_