This is the mail archive of the gcc@gcc.gnu.org
mailing list for the GCC project.
Re: gcc compile-time performance
- From: Neil Booth <neil at daikokuya dot demon dot co dot uk>
- To: Robert Dewar <dewar at gnat dot com>
- Cc: aoliva at redhat dot com, chip dot cuntz at earthling dot net, davem at redhat dot com, gcc at gcc dot gnu dot org, jh at suse dot cz
- Date: Sun, 19 May 2002 14:29:40 +0100
- Subject: Re: gcc compile-time performance
- References: <20020519131315.320D7F28D6@nile.gnat.com>
Robert Dewar wrote:-
> Again, I don't see that affecting basic lexical scanning, and why would
> one ever want to look ahead or backwards in the character stream for the
> lexical analyzer (I agree it makes life harder for general text processing,
> and indeed the peek functions in Ada.Text_IO (which supports all these
> encoding methods) are a pain, but I see no impact on the lexical analysis).
> I do not understand why trigraphs make life harder here, so I probably am
> missing some key point.
You need to look ahead many times; for example, on seeing '.' you need
two more chars to see if it's '...'. But those chars can be arbitrarily
far away, because the token could be written '.\n.\n.' with line
splices. If you're using the mb functions, what do you do with the
chars you've just read in if the 3rd one wasn't a dot? You can't simply
go back to just after the initial dot, because the mb functions have
state. So I imagine you have to buffer them elsewhere, and that means
maintaining a buffer that needs to be checked whenever you read a
character. It gets nasty.
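To illustrate the lookahead problem (this is only a minimal sketch, not
cpplib's actual code; the `reader`, `peek` and `is_ellipsis` names are
made up for the example), here is a reader that skips backslash-newline
splices transparently and peeks an arbitrary distance ahead into the raw
buffer, so nothing ever has to be "un-read" through a stateful decoder:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: a reader over the raw source that transparently
   skips '\' '\n' line splices, with arbitrary lookahead, so we never
   need to push characters back through a stateful mb decoder. */
struct reader {
    const char *src;   /* raw physical source, NUL-terminated */
    size_t pos;        /* current position */
};

/* Return the logical character `n` positions ahead of the current one,
   skipping any backslash-newline splices on the way; -1 at end. */
static int peek(struct reader *r, size_t n)
{
    size_t p = r->pos;
    for (;;) {
        while (r->src[p] == '\\' && r->src[p + 1] == '\n')
            p += 2;                  /* splice: join physical lines */
        if (r->src[p] == '\0')
            return -1;               /* end of input */
        if (n == 0)
            return (unsigned char)r->src[p];
        n--, p++;
    }
}

/* Is the next token '...'? This needs two characters of lookahead
   past the '.', which may be arbitrarily far away in the file. */
static int is_ellipsis(struct reader *r)
{
    return peek(r, 0) == '.' && peek(r, 1) == '.' && peek(r, 2) == '.';
}
```

With stateless access to a plain buffer the failed-lookahead case is
free: if the third character isn't a dot, the lexer's position simply
hasn't moved. That is exactly what the mb functions' hidden shift state
takes away.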
I think a better solution is to scan logical lines in before doing
any kind of tokenization, possibly converting them to UTF-8, since then
lookahead and look-back are not a problem. But that leads to other
issues, like knowing what line and column in the physical source file
any given character from the logical line corresponds to. [And how do
you get this info if ...]
I can't see a really clean solution to these issues. However, I'm
no expert on the mb stuff, so I could be missing something.
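The line/column mapping could in principle be carried alongside the
logical line. The following is a made-up sketch of that idea (the
`logical_line` structure and `read_logical_line` function are
hypothetical, not anything in GCC): while stripping splices into the
logical buffer, record for each logical character the physical line and
column it came from, so diagnostics can still point into the real file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXLINE 256

/* Hypothetical: one logical line plus a per-character source map. */
struct logical_line {
    char   buf[MAXLINE];   /* logical characters, splices removed */
    int    line[MAXLINE];  /* physical line of each character (1-based) */
    int    col[MAXLINE];   /* physical column of each character (1-based) */
    size_t len;
};

/* Read one logical line from `src`, removing backslash-newline
   splices; returns the number of physical lines consumed. */
static int read_logical_line(const char *src, struct logical_line *ll)
{
    int line = 1, col = 1;
    ll->len = 0;
    for (size_t i = 0; src[i] && ll->len < MAXLINE; i++) {
        if (src[i] == '\\' && src[i + 1] == '\n') {
            i++;                     /* skip the '\n' too (loop adds 1) */
            line++, col = 1;         /* next char is on the next line */
            continue;
        }
        if (src[i] == '\n')
            break;                   /* end of the logical line */
        ll->buf[ll->len]  = src[i];
        ll->line[ll->len] = line;
        ll->col[ll->len]  = col;
        ll->len++, col++;
    }
    return line;
}
```

The obvious cost is the two side arrays per line; a real implementation
would want something more compact, and handling trigraphs or a UTF-8
conversion in the same pass would complicate the column bookkeeping
further, which is the part I can't see a clean solution to.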