This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: Replacement lexer, token-at-a-time


  In message <20000917110805.A2659@daikokuya.demon.co.uk> you write:
  > This patch provides a new lexer, which hopefully is future-proof.
  > 
  > Initially I wrote this because I wanted to get away from some of the
  > constraints of the current code; in particular line-at-a-time lexing
  > and token lists.  Zack was, I think, supportive of the general idea
  > because of the benefits it will give, but didn't like the
  > implementation since it took us further away from multi-byte character
  > support.
  > 
  > So I rehashed it, and this is the result.  Unlike even the existing
  > lexer, it does not look backwards in the character stream (which
  > causes problems for stateful character encodings), it only scans
  > forwards.  Nor does it re-scan characters at all, except in 3 places
  > where I think it is unavoidable.  Further, these 3 places are all out
  > of the fast path.  They are:
  > 
  > 	o Trigraphs and escaped-newlines
  > 	o Tokens beginning with '.' (e.g. '.' '.123' and '...').
  > 	o The digraph paste operator '%:%:'.
  > 
  > I would like to eliminate the second one here, but cannot see a way to
  > do so at present.  Note that the current lexer has even more places
  > that require this, and they are not all removable either.
  > 
  > These three places are well flagged - they use macros SAVE_STATE and
  > RESTORE_STATE.  When we move to multibyte character support, these
  > macros just need to be amended to save the mbstate_t object, or
  > whatever other cookie is provided by the implementation.  For glibc,
  > this is an 8-byte structure, so does not add any significant overhead.
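
To make the SAVE_STATE / RESTORE_STATE idea concrete, here is a minimal sketch of how a '.' token might be disambiguated while only ever scanning forwards. It is not the code from the patch: struct sketch_buffer, struct sketch_saved, next_char, lex_after_dot and the token names are hypothetical, and the macro bodies are an assumption about what "save the mbstate_t object" could look like.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical, cut-down stand-in for cpplib's struct cpp_buffer. */
struct sketch_buffer {
    const unsigned char *cur;   /* next unread byte */
    const unsigned char *limit; /* one past the last byte */
    mbstate_t mbstate;          /* shift state of a stateful encoding */
};

/* Everything needed to resume scanning from an earlier point. */
struct sketch_saved {
    const unsigned char *cur;
    mbstate_t mbstate;
};

/* Assumed macro bodies: remember / reinstate position and shift state. */
#define SAVE_STATE(buf, saved) \
    do { (saved).cur = (buf)->cur; (saved).mbstate = (buf)->mbstate; } while (0)
#define RESTORE_STATE(buf, saved) \
    do { (buf)->cur = (saved).cur; (buf)->mbstate = (saved).mbstate; } while (0)

enum sketch_token { TOK_DOT, TOK_ELLIPSIS, TOK_NUMBER };

/* Read one character, moving forwards only; -1 at end of buffer. */
static int next_char(struct sketch_buffer *buf)
{
    return buf->cur < buf->limit ? *buf->cur++ : -1;
}

/* Called just after a '.' has been consumed.  Decides between '.',
   '...' and the start of a number such as '.123'.  On TOK_DOT and
   TOK_NUMBER the buffer is left exactly where it was on entry.  */
static enum sketch_token lex_after_dot(struct sketch_buffer *buf)
{
    struct sketch_saved saved;
    int c;

    assert(buf != NULL && buf->cur <= buf->limit);

    SAVE_STATE(buf, saved);
    c = next_char(buf);
    if (c >= '0' && c <= '9') {
        RESTORE_STATE(buf, saved);  /* let the number parser see the digit */
        return TOK_NUMBER;
    }
    if (c == '.' && next_char(buf) == '.')
        return TOK_ELLIPSIS;        /* both extra dots stay consumed */

    RESTORE_STATE(buf, saved);      /* ".." or ".x": only a single '.' */
    return TOK_DOT;
}
```

Because the saved cookie carries the mbstate_t along with the pointer, rewinding never needs to inspect earlier bytes, which is the property that matters for stateful encodings.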
  > 
  > The benefits I see are significant:
  > 
  > o Almost drop-in support for multi-byte character sets. All that is
  > needed is to replace each occurrence of *buffer->cur++ with a function
  > call to translate the next character sequence pointed to by
  > buffer->cur, and to amend the macros SAVE_STATE and RESTORE_STATE.
  > 
  > o All trigraph and escaped-newline logic is now in just one place.
  > Currently it is in 4 or 5 places.  For example, detection of
  > backslash-space-newline is kludged in the current lexer; it just
  > catches the case where the "\ \n" occurs between tokens.  The new
  > code catches all cases, including within string and character
  > literals, numbers and identifiers, and handles them and warns
  > appropriately.
  > 
  > o I think the code is a lot cleaner and more logical.  It will also
  > enable cleanups in various other parts of cpplib that cannot be done
  > with the current lexer.
  > 
  > o It should allow optimisations of other parts of cpplib that are not
  > possible at present.
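
As an illustration of the "one place" point about escaped newlines, the following is a minimal sketch of a single character-fetch routine that splices "\<newline>" and "\<spaces><newline>" wherever they occur, so identifier, number and string scanners need no special cases of their own. It is not the patch's skip_escaped_newlines: all names here are hypothetical, trigraphs are left out for brevity, and counting warnings in an int is an assumption standing in for real diagnostics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Result of examining the text that follows a backslash. */
enum sketch_escape {
    NOT_ESCAPED_NEWLINE,    /* the backslash is an ordinary character */
    ESCAPED_NEWLINE,        /* "\" then newline: splice silently */
    ESCAPED_NEWLINE_SPACE   /* "\", blanks, newline: splice and warn */
};

/* *cur points just past a backslash.  If an escaped newline follows,
   advance *cur past it; otherwise leave *cur untouched.  */
static enum sketch_escape
sketch_skip_escaped_newline(const char **cur, const char *limit)
{
    const char *p;
    bool saw_space = false;

    assert(cur != NULL && *cur != NULL && *cur <= limit);
    p = *cur;
    while (p < limit && (*p == ' ' || *p == '\t' || *p == '\f' || *p == '\v')) {
        saw_space = true;
        p++;
    }
    if (p == limit || *p != '\n')
        return NOT_ESCAPED_NEWLINE;
    *cur = p + 1;
    return saw_space ? ESCAPED_NEWLINE_SPACE : ESCAPED_NEWLINE;
}

/* Fetch the next logical character, with line splices removed.
   Returns -1 at end of buffer.  *warnings is incremented once for
   each backslash-space-newline encountered.  */
static int
sketch_next_char(const char **cur, const char *limit, int *warnings)
{
    while (*cur < limit) {
        char c = *(*cur)++;
        if (c != '\\')
            return (unsigned char) c;
        switch (sketch_skip_escaped_newline(cur, limit)) {
        case NOT_ESCAPED_NEWLINE:
            return '\\';
        case ESCAPED_NEWLINE_SPACE:
            ++*warnings;
            break;
        case ESCAPED_NEWLINE:
            break;
        }
    }
    return -1;
}
```

Every token scanner that pulls characters through a routine like this sees "a\ <newline>b" as the two characters 'a' and 'b', with the warning raised regardless of which kind of token is being lexed.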
  > 
  > The code is probably slightly slower, say up to 10%, than the current
  > code.  Some slow-down is inevitable when moving towards multibyte
  > character support.  However, with tuning I suspect this could be
  > eliminated: the current implementation of lex_line is complex and
  > probably poorly optimised by gcc.  It is much cleaner here, and fewer
  > if statements are needed.
  > 
  > So I'd like to commit this.  It passes a checking-enabled bootstrap of
  > all front ends and the preprocessor tests, as a separate preprocessor.
  > I'm about to test the integrated preprocessor now, but I see no reason
  > why the results should differ.  If the integrated stuff passes, would
  > someone authorise this?
  > 
  > Thanks,
  > 
  > Neil.
  > 
  > 	* cpphash.h (HASHSTEP): Take character rather than pointer
  > 	to character.
  > 	(_cpp_check_directive, _cpp_check_linemarker): Update prototypes.
  > 
  > 	* cpphash.c (cpp_lookup): Update for new HASHSTEP.
  > 
  > 	* cpplex.c (auto_expand_name_space, trigraph_replace,
  > 	backslash_start, handle_newline, parse_name, INIT_TOKEN_STR,
  > 	IMMED_TOKEN, PREV_TOKEN_TYPE, PUSH_TOKEN, REVISE_TOKEN,
  > 	BACKUP_TOKEN, BACKUP_TRIGRAPH, MIGHT_BE_DIRECTIVE,
  > 	KNOWN_DIRECTIVE): Delete.
  > 
  > 	(handle_newline, check_long_token, skip_escaped_newlines,
  > 	unterminated): New functions.
  > 	(ACCEPT_CHAR, SAVE_STATE, RESTORE_STATE): New macros.
  > 
  > 	(parse_identifier): Was parse_name, new implementation.
  > 	(skip_line_comment, skip_block_comment, skip_whitespace,
  > 	parse_number, parse_string, trigraph_ok, save_comment,
  > 	adjust_column, _cpp_get_line): New implementations.
  > 
  > 	(lex_token): New function.  Lexes a token at a time, looking
  > 	forwards.  Contains most of the guts of the old lex_line.
  > 	(lex_line): New implementation, using lex_token to obtain
  > 	individual tokens.
  > 	(cpp_scan_buffer): Use the token's line, not the list's line.
  > 
  > 	* cpplib.c (_cpp_check_directive, _cpp_check_linemarker):
  > 	 New implementations.
  > 	(do_assert): Don't bother setting the answer's list's line.
  > 	(cpp_push_buffer): Initialise new pfile and read_ahead members
  > 	of struct cpp_buffer.
  > 
  > 	* cpplib.h (cppchar_t): New typedef.
  > 	(struct cpp_buffer): read_ahead, pfile and col_adjust are
  > 	new members.
  > 	(struct lexer_state): New structure that determines the state
  > 	and behaviour of the lexer.
  > 	(IN_DIRECTIVE, KNOWN_DIRECTIVE): New macros.
  > 	(struct cpp_reader): New member "state". Rename
  > 	multiline_string_line and multiline_string_column. Delete
  > 	col_adjust, in_lex_line members.
  > 	(CPP_BUF_COLUMN): Update.
  > 
  > 	* gcc.dg/cpp/cmdlne-C.c: Remove bogus warning test.

Assuming the integrated preprocessor works, you can install this change.

Thanks,
jeff

