This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Replacement lexer, token-at-a-time
- To: Neil Booth <NeilB at earthling dot net>
- Subject: Re: Replacement lexer, token-at-a-time
- From: Jeffrey A Law <law at cygnus dot com>
- Date: Mon, 18 Sep 2000 10:36:26 -0600
- cc: gcc-patches at gcc dot gnu dot org
- Reply-To: law at cygnus dot com
In message <20000917110805.A2659@daikokuya.demon.co.uk> you write:
> This patch provides a new lexer, which hopefully is future-proof.
>
> Initially I wrote this because I wanted to get away from some of the
> constraints of the current code; in particular line-at-a-time lexing
> and token lists. Zack was, I think, supportive of the general idea
> because of the benefits it will give, but didn't like the
> implementation since it took us further away from multi-byte character
> support.
>
> So I rehashed it, and this is the result. Unlike even the existing
> lexer, it does not look backwards in the character stream (which
> causes problems for stateful character encodings), it only scans
> forwards. Nor does it re-scan characters at all, except in 3 places
> where I think it is unavoidable. Further, these 3 places are all out
> of the fast path. They are:-
>
> o Trigraphs and escaped-newlines
> o Tokens beginning with '.' (e.g. '.' '.123' and '...').
> o The digraph paste operator '%:%:'.
>
> I would like to eliminate the second one here, but cannot see a way to
> at present. Note that the current lexer has even more places that
> require this, and they are not all removable either.
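[Illustration only, not code from the patch.] The rescan for tokens
beginning with '.' can be pictured as follows: once '.' has been read,
the lexer cannot tell '.' from '...' until it has looked at two more
characters, and if it finds only '..' it has to back up. A minimal
sketch, assuming a simplified buffer with just a cursor and a limit;
struct buf, enum dot_kind and lex_dot are hypothetical names, and
trigraphs and escaped newlines are ignored here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch; not cpplib's enum.  */
enum dot_kind { DOT, ELLIPSIS, NUMBER_START };

/* Hypothetical, simplified buffer: a cursor and a limit.  */
struct buf {
  const unsigned char *cur;
  const unsigned char *limit;
};

/* Classify a token whose first character '.' has already been consumed.
   On return b->cur points just past "..." for ELLIPSIS, and is left
   just past the initial '.' for DOT and NUMBER_START.  */
static enum dot_kind
lex_dot (struct buf *b)
{
  const unsigned char *saved = b->cur;   /* stand-in for SAVE_STATE */

  assert (b->cur <= b->limit);

  if (b->cur < b->limit && isdigit (*b->cur))
    return NUMBER_START;                 /* ".123": caller lexes the number */

  if (b->cur < b->limit && *b->cur == '.')
    {
      b->cur++;                          /* tentatively accept a second '.' */
      if (b->cur < b->limit && *b->cur == '.')
        {
          b->cur++;
          return ELLIPSIS;
        }
      b->cur = saved;                    /* only "..": stand-in for RESTORE_STATE */
    }
  return DOT;
}
```

With the input "..x" the second '.' is read twice: once speculatively
here, and once again when the next token is lexed. That is the
re-scan referred to above.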
>
> These three places are well flagged - they use macros SAVE_STATE and
> RESTORE_STATE. When we move to multibyte character support, these
> macros just need to be amended to save the mbstate_t object, or
> whatever other cookie is provided by the implementation. For glibc,
> this is an 8-byte structure, so does not add any significant overhead.
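[Illustration only, not code from the patch.] One way the amended
macros might look once the decoder's shift state travels with the
buffer position. This is a sketch under assumptions: struct mb_buffer,
struct saved_state and next_char are hypothetical, the macro arguments
are my own choice, and the standard C mbrtowc stands in for whatever
translation function is eventually used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical buffer for this sketch; cpplib's struct cpp_buffer differs.  */
struct mb_buffer {
  const char *cur;      /* next unread byte */
  const char *limit;    /* one past the last byte */
  mbstate_t mbs;        /* shift state of the decoder */
};

/* Everything needed to resume scanning from an earlier point.  */
struct saved_state {
  const char *cur;
  mbstate_t mbs;
};

#define SAVE_STATE(buf, s) \
  do { (s).cur = (buf)->cur; (s).mbs = (buf)->mbs; } while (0)
#define RESTORE_STATE(buf, s) \
  do { (buf)->cur = (s).cur; (buf)->mbs = (s).mbs; } while (0)

/* Decode the next character, scanning forwards only.  Returns WEOF at
   end of buffer or on an invalid or incomplete sequence.  */
static wint_t
next_char (struct mb_buffer *b)
{
  wchar_t wc;
  size_t n;

  assert (b->cur <= b->limit);
  if (b->cur == b->limit)
    return WEOF;

  n = mbrtowc (&wc, b->cur, (size_t) (b->limit - b->cur), &b->mbs);
  if (n == (size_t) -1 || n == (size_t) -2)
    return WEOF;
  b->cur += (n == 0 ? 1 : n);   /* mbrtowc returns 0 for an embedded NUL */
  return (wint_t) wc;
}
```

Because the saved cookie holds both the byte position and the
mbstate_t, restoring it and calling next_char again yields the same
character as before, even in a stateful encoding. A zero-filled
mbstate_t is the initial conversion state.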
>
> The benefits I see are significant:
>
> o Almost drop-in support for multi-byte character sets. All that is
> needed is to replace each occurrence of *buffer->cur++ with a function
> call to translate the next character sequence pointed to by
> buffer->cur, and to amend the macros SAVE_STATE and RESTORE_STATE.
>
> o All trigraph and escaped-newline logic is now in just one place.
> Currently it is in 4 or 5 places. For example, detection of
> backslash-space-newline is kludged in the current lexer; it just
> catches the case where the "\ \n" occurs between tokens. The new
> code catches all cases, including within string and character
> literals, numbers and identifiers, and handles and warns
> appropriately.
>
> o I think the code is a lot cleaner and more logical. It will also
> enable cleanups in various other parts of cpplib that cannot be done
> with the current lexer.
>
> o It should allow optimisations of other parts of cpplib that are not
> possible at present.
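[Illustration only, not code from the patch.] Keeping the
escaped-newline logic in one place means every token kind can call the
same helper whenever it meets a backslash, so "\ \n" is recognised
inside strings, numbers and identifiers as well as between tokens. A
minimal sketch of such a helper; enum esc_nl and check_escaped_newline
are hypothetical names, only space and tab count as whitespace, and
trigraphs and '\r' line endings are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch.  */
enum esc_nl {
  NOT_ESCAPED_NEWLINE,
  ESCAPED_NEWLINE,
  ESCAPED_NEWLINE_WITH_SPACE    /* "\ \n": caller may want to warn */
};

/* P points just past a backslash; LIMIT is one past the end of input.
   If the backslash is followed by optional spaces or tabs and then a
   newline, store the position after the newline in *END and report
   which form was seen.  Otherwise leave *END untouched.  */
static enum esc_nl
check_escaped_newline (const char *p, const char *limit, const char **end)
{
  const char *q = p;

  assert (p <= limit);
  while (q < limit && (*q == ' ' || *q == '\t'))
    q++;
  if (q == limit || *q != '\n')
    return NOT_ESCAPED_NEWLINE;

  *end = q + 1;
  return q == p ? ESCAPED_NEWLINE : ESCAPED_NEWLINE_WITH_SPACE;
}
```

A string or identifier scanner would call this on seeing a backslash
and, on either success code, continue scanning from *end as though the
backslash and newline had never been there.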
>
> The code is probably slightly slower, say up to 10%, than the current
> code. Some slow-down is inevitable when moving towards multibyte
> character support. However, with tuning I suspect this could be
> eliminated: the current implementation of lex-line is complex and
> probably poorly optimised by gcc. It is much cleaner here, and fewer
> if statements are needed.
>
> So I'd like to commit this. It passes a checking-enabled bootstrap of
> all front ends and the preprocessor tests, as a separate preprocessor.
> I'm about to test the integrated preprocessor now, but I see no reason
> why the results should differ. If the integrated stuff passes, would
> someone authorise this?
>
> Thanks,
>
> Neil.
>
> * cpphash.h (HASHSTEP): Take character rather than pointer
> to character.
> (_cpp_check_directive, _cpp_check_linemarker): Update prototypes.
>
> * cpphash.c (cpp_lookup): Update for new HASHSTEP.
>
> * cpplex.c (auto_expand_name_space, trigraph_replace,
> backslash_start, handle_newline, parse_name, INIT_TOKEN_STR,
> IMMED_TOKEN, PREV_TOKEN_TYPE, PUSH_TOKEN, REVISE_TOKEN,
> BACKUP_TOKEN, BACKUP_TRIGRAPH, MIGHT_BE_DIRECTIVE,
> KNOWN_DIRECTIVE): Delete.
>
> (handle_newline, check_long_token, skip_escaped_newlines,
> unterminated): New functions.
> (ACCEPT_CHAR, SAVE_STATE, RESTORE_STATE): New macros.
>
> (parse_identifier): Was parse_name, new implementation.
> (skip_line_comment, skip_block_comment, skip_whitespace,
> parse_number, parse_string, trigraph_ok, save_comment,
> adjust_column, _cpp_get_line): New implementations.
>
> (lex_token): New function. Lexes a token at a time, looking
> forwards. Contains most of the guts of the old lex_line.
> (lex_line): New implementation, using lex_token to obtain
> individual tokens.
> (cpp_scan_buffer): Use the token's line, not the list's line.
>
> * cpplib.c (_cpp_check_directive, _cpp_check_linemarker):
> New implementations.
> (do_assert): Don't bother setting the answer's list's line.
> (cpp_push_buffer): Initialise new pfile and read_ahead members
> of struct cpp_buffer.
>
> * cpplib.h (cppchar_t): New typedef.
> (struct cpp_buffer): read_ahead, pfile and col_adjust are
> new members.
> (struct lexer_state): New structure that determines the state
> and behaviour of the lexer.
> (IN_DIRECTIVE, KNOWN_DIRECTIVE): New macros.
> (struct cpp_reader): New member "state". Rename
> multiline_string_line and multiline_string_column. Delete
> col_adjust, in_lex_line members.
> (CPP_BUF_COLUMN): Update.
>
> * gcc.dg/cpp/cmdlne-C.c: Remove bogus warning test.
Assuming the integrated preprocessor works, you can install this change.
Thanks,
jeff