This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Very Fast: Directly Coded Lexical Analyzer
- From: "Joseph S. Myers" <joseph at codesourcery dot com>
- To: Andi Kleen <andi at firstfloor dot org>
- Cc: Frank Schaefer <frank_r_schaefer at gmx dot net>, gcc at gcc dot gnu dot org
- Date: Thu, 31 May 2007 16:31:39 +0000 (UTC)
- Subject: Re: Very Fast: Directly Coded Lexical Analyzer
- References: <20070531131945.56410@gmx.net.suse.lists.egcs> <p73abvlc9e0.fsf@bingen.suse.de>
On Thu, 31 May 2007, Andi Kleen wrote:
> "Frank Schaefer" <frank_r_schaefer@gmx.net> writes:
> >
> > Is there any interest in using such an engine in the GCC toolset?
>
> Right now gcc doesn't use flex so it would be probably non trivial
> to implement support. You would need to rewrite c-lex.c
All the relevant code is in cpplib, not c-lex.c.
Zack had some ideas a few years ago (I don't think they were ever posted
to a public list) about how to speed up _cpp_clean_line in particular, and
some or all of translation phases 1 to 3 in general. The idea is that you
have several Mealy machines (state machines where all the work happens in
transitions), where edges apply for a given set of input characters in a
given state and describe the actions to be taken in that case. Actions
include both passing output to another machine, and emitting diagnostics.
So you start with converting character sets to UTF-8, then strip trailing
whitespace and canonicalise newlines, then convert trigraphs, then remove
backslash-newline pairs, then strip comments, then split the file into
preprocessing tokens.
A state machine that does all these things at once is too complicated to
write and verify by hand; the idea involves descriptions of the state
machine for each subphase, which would be automatically combined into a
single machine and optimized. For example, inside comments the state
machine doesn't need to convert character sets for most well-behaved
character sets (those where an ASCII comment end is always one in the
character set and vice versa - for example all those where extended
characters are made up entirely of bytes with the high bit set and other
bytes have their ASCII values), and doesn't need to convert trigraphs or
backslash-newlines except in places where ??/ or \ followed by newline
might split a multi-line */.
Furthermore, the generation process should generate a separate state
machine for each mode in which the compiler can operate that affects
lexing - machines with trigraphs enabled should be separate from those
with trigraphs disabled, so there should be no checks of cpplib
configuration flags in inner loops; just the state and next character.
(If it turns out we have too many option combinations, we could have fast
machines for the most common combinations and slower ones that check flags
at runtime for the less common ones.)
--
Joseph S. Myers
joseph@codesourcery.com