This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: Merge cpplib and front end hashtables, part 1

To: Neil Booth <neil at daikokuya dot demon dot co dot uk>
Subject: Re: Merge cpplib and front end hashtables, part 1
From: "Zack Weinberg" <zackw at Stanford dot EDU>
Date: Wed, 16 May 2001 23:15:58 -0700
Cc: gcc-patches at gcc dot gnu dot org
References: <20010512212945.A31175@daikokuya.demon.co.uk> <20010513175419.A20351@daikokuya.demon.co.uk> <20010513115202.C434@stanford.edu> <20010513212521.A28870@daikokuya.demon.co.uk>

On Sun, May 13, 2001 at 09:25:21PM +0100, Neil Booth wrote:
> Oddly enough, what started me looking at this hashtable stuff was
> being distracted from looking at string handling in the front ends.
> We do a ridiculous amount of copying of the text of string literals in
> their journey from CPP to the front end tree structure; a minimum
> (assuming no reallocation during lexing) of 4 times for single
> strings, and 5 for concatenated strings.

That's awful, although string literals are uncommon enough that it
probably isn't a performance issue.  Here are some numbers: of 9,571,612
lines of source code (counted by wc -l) I have lying around (gcc, gdb,
binutils, glibc, XFree86, and Linux kernel), 536,722 lines (5.61%)
match grep -c '"'; 50,490 lines (0.53%) match grep -c "'.'".  I didn't
just grep for "'" because that would pick up many false positives from
apostrophes in comments.

> I think it should be possible to cut out at least one copy, and maybe
> two.  If we are going to handle arbitrary charsets on input, though,
> things may get more complicated.  I'm not at all clear about how we
> intend to handle various charsets.

It's particularly nasty inside character constants, where the user may
well want a string written in encoding X to bloody well *remain* in
encoding X in the object file, but we have to do some sort of
conversion if only to find the close quote.
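
To make the close-quote problem concrete, here is a minimal sketch, not
cpplib code.  It assumes Shift-JIS as the source encoding, where the second
byte of a two-byte character may be 0x5C, the ASCII backslash.  The names
sjis_lead and find_close_quote are made up for illustration.

```c
#include <stddef.h>

/* Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
static int sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: s points just past the opening quote.  Return the
   index of the closing quote, or -1 if there is none.  When sjis is
   nonzero, two-byte characters are skipped as a unit; when it is zero,
   every byte is treated as a single character. */
long find_close_quote(const unsigned char *s, size_t len, char quote, int sjis)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];

        if (sjis && sjis_lead(c) && i + 1 < len) {
            i += 2;             /* skip lead byte and trail byte together */
            continue;
        }
        if (c == '\\' && i + 1 < len) {
            i += 2;             /* skip the escaped character */
            continue;
        }
        if (c == (unsigned char) quote)
            return (long) i;
        i++;
    }
    return -1;
}
```

With the four bytes 0x95 0x5C '"' ';' as input, the byte-wise scan (sjis
set to 0) takes 0x5C for a backslash, skips the quote and never finds the
end; the encoding-aware scan steps over 0x95 0x5C as one character and
stops at the quote.  The bytes themselves are never rewritten, so the
literal can still reach the object file in its original encoding.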

I put a brain dump on charset handling into the cpplib projects web
page.  It remains a pretty good statement of what I think our end goal
should be in terms of user-visible behavior.  It'd be reasonable to do
a subset of this stuff to begin with, then get better as things go on.

> I also want to move the job of combine_strings to c-lex.c - IMO that's
> the natural place for string concatenation.

*nod* Although, do watch out for L"" and @"".  If I remember
correctly, they're contagious - "foo" L"bar" "baz" is valid and
equivalent to L"foobarbaz"...
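
A small check of the L"" case, assuming a C99 (or later) compiler, where
a single wide token makes the whole concatenation wide.  It does not cover
@"", which is Objective-C.  The function name mixed_equals_wide is made up
for illustration.

```c
#include <wchar.h>

/* Adjacent string literal tokens are concatenated before the initializer
   is processed.  Under C99 rules, one wide token is enough to make the
   resulting literal wide. */
static const wchar_t mixed[] = "foo" L"bar" "baz";
static const wchar_t wide[]  = L"foobarbaz";

/* Return nonzero when both spellings yield the same wide string,
   including the same array size. */
int mixed_equals_wide(void)
{
    return wcscmp(mixed, wide) == 0 && sizeof mixed == sizeof wide;
}
```

For whoever takes over string concatenation, this means the width of the
result is known only after the last adjacent literal has been seen.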

> It has another benefit, too: the __func__ cannot-be-concatenated bug
> gets fixed transparently.

Didn't Nathan Sidwell fix that already?

> However, it has a complication in that it involves a token of
> lookahead - we'd need to keep track of the location of the prior
> token in case CPP doesn't hand us another string literal.
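
A rough sketch of the lookahead being described, under stated assumptions:
the token layout, the array-backed token source, and the names next_raw
and next_token are all hypothetical, not the c-lex.c or cpplib interfaces.
The concatenated string keeps the location of its first piece, and the
token that ended the run is held in a one-slot buffer for the next call.

```c
#include <stdlib.h>
#include <string.h>

enum tok_type { TOK_STRING, TOK_OTHER, TOK_EOF };

struct token {
    enum tok_type type;
    const char *text;          /* for TOK_STRING, contents without quotes */
    int line;                  /* source location */
};

/* Hypothetical token source standing in for the preprocessor. */
struct lexer {
    const struct token *toks;  /* array terminated by a TOK_EOF entry */
    size_t pos;
    struct token saved;        /* the one token of lookahead */
    int have_saved;
};

static struct token next_raw(struct lexer *lx)
{
    if (lx->have_saved) {
        lx->have_saved = 0;
        return lx->saved;
    }
    struct token t = lx->toks[lx->pos];
    if (t.type != TOK_EOF)
        lx->pos++;
    return t;
}

/* Return the next token, merging runs of adjacent strings.  For a string,
   *owned receives the malloc'd text (the caller frees it); otherwise it is
   set to NULL.  The returned line is that of the first string. */
struct token next_token(struct lexer *lx, char **owned)
{
    struct token t = next_raw(lx);

    *owned = NULL;
    if (t.type != TOK_STRING)
        return t;

    size_t len = strlen(t.text);
    char *buf = malloc(len + 1);
    if (buf == NULL)
        abort();
    memcpy(buf, t.text, len + 1);

    for (;;) {
        struct token peek = next_raw(lx);
        if (peek.type != TOK_STRING) {
            lx->saved = peek;  /* not a string: hand it out next time */
            lx->have_saved = 1;
            break;
        }
        size_t n = strlen(peek.text);
        char *grown = realloc(buf, len + n + 1);
        if (grown == NULL)
            abort();
        buf = grown;
        memcpy(buf + len, peek.text, n + 1);
        len += n;
    }

    t.text = buf;              /* t.line still names the first string */
    *owned = buf;
    return t;
}
```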

Have you been following the header-names discussion on comp.std.c? We
may need more than that :(

> > What I'd like to do is merge the data carried in the hash node with
> > the data carried in the tree node - both language-dependent and
> > language-independent.  At that point the 'hash node' is a lot bigger,
> > and storing it directly in the table ceases to be a good idea.
> > Instead, I'd allocate it alongside the string itself, thus avoiding
> > the extra allocation you're worried about.  Ideally we wouldn't need
> > the string pointer anymore; practically, the variable size of the
> > structure makes this unlikely to work out.
> 
> Do you intend that your identifier nodes remain as tree structures?

Maybe, maybe not.  An unboxed 'struct identifier' might work, too.

> I assume you mean making the hash node a struct tree_identifier
> instead?  I don't think we need to worry about a variable-length
> struct; we could just have a per-frontend stringpool.c if we kept
> all the sizing logic hidden in there, which is the way I'm basically
> moving anyway.

I meant variable-sized in that each front end's struct
lang_identifier is different, and middle-end code has to have hacks to
cope.  See set_identifier_size etc.
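
To illustrate the layout I have in mind, here is a sketch only: a common
header, then a language-specific area whose size the front end announces
once at startup, then the name, all in one allocation.  Every name in it
(struct ident, set_lang_extra_size, ident_alloc and the accessors) is
invented for the example and is not existing GCC code.  It assumes C11
for max_align_t.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Language-independent part, shared by every front end (hypothetical). */
struct ident {
    size_t length;             /* length of the name, excluding the NUL */
};

/* Size of the per-language data that follows the common header.  The
   front end sets it once, before any identifier is created. */
static size_t lang_extra_size;

void set_lang_extra_size(size_t n)
{
    lang_extra_size = n;
}

/* Round n up so that whatever is stored at that offset is suitably
   aligned for any object type. */
static size_t round_up(size_t n)
{
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

static size_t lang_offset(void) { return round_up(sizeof(struct ident)); }
static size_t name_offset(void) { return lang_offset() + lang_extra_size; }

/* One allocation holds header, language data and name; no separate
   string pointer is stored. */
struct ident *ident_alloc(const char *name, size_t len)
{
    struct ident *id = calloc(1, name_offset() + len + 1);
    if (id == NULL)
        return NULL;
    id->length = len;
    memcpy((char *) id + name_offset(), name, len);  /* NUL from calloc */
    return id;
}

void *ident_lang_data(struct ident *id)
{
    return (char *) id + lang_offset();
}

const char *ident_name(const struct ident *id)
{
    return (const char *) id + name_offset();
}
```

The cost shows up in name_offset(): because the language part has no fixed
size, the name's position has to be computed at run time, which is the
reason dropping the string pointer entirely is awkward in practice.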

[snip stuff discussed elsewhere]

-- 
zw   The beginning of almost every story is actually a bone, something with
     which to court the dog, which may bring you closer to the lady.
     	-- Amos Oz, _The Story Begins_
