This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: New C parser [patch]


"Joseph S. Myers" <jsm@polyomino.org.uk> writes:

> This second slightly revised patch version fixes the lexer lookahead
> strategy not to look ahead one token until required, so improving the
> correspondence of diagnostic locations with the new parser to those
> with the old parser.

I'm generally quite happy with this patch.  Could you enumerate all
the places where 2-token lookahead is needed, and give an opinion as
to how many of those could be eliminated through future cleanups?

Some kibitzes follow...

> +int yydebug;

Where is this used?

> +void
> +c_parse_init (void)
> +{
> +  init_reswords ();
> +}

Seems a little silly to have two functions one of which does nothing
but call the other.

> +/* A lexer with up to two tokens of look-ahead; more are not needed
> +   for C.  */
> +typedef struct c_lexer GTY (())
> +{
> +  /* The look-ahead tokens.  */
> +  c_token tokens[2];
> +  /* How many look-ahead tokens are available (0, 1 or 2).  */
> +  int tokens_avail;
> +} c_lexer;

Unlike C++, the lexer abstraction seems superfluous.  By folding this
into the parser structure, you would avoid an extra pointer
dereference on every lexer operation.

If you want to keep the distinction between parser and lexer, suggest
embedding this struct within the parser struct.  Come to think, I
could do that to the C++ parser too.  (makes note)

This is especially a good idea since ...

> +/* A parser structure recording information about the state and
> +   context of parsing.  */
> +typedef struct c_parser GTY(())
> +{
> +  /* The lexer.  */
> +  c_lexer *lexer;
> +  /* True if a syntax error is being recovered from; false otherwise.
> +     c_parser_error sets this flag.  It should clear this flag when
> +     enough tokens have been consumed to recover from the error.  */
> +  BOOL_BITFIELD error : 1;
> +} c_parser;

the parser structure has so little in it, and you could save a word of
memory by using a short int for the tokens_avail field.  This only
helps if you don't keep the separate lexer structure, because of tail
padding.

> +/* Issue a diagnostic of the form
> +      FILE:LINE: MESSAGE before TOKEN
> +   where TOKEN is the next token in the input stream of PARSER.
> +   MESSAGE (specified by the caller) is usually of the form "expected
> +   OTHER-TOKEN".
> +
> +   Do not issue a diagnostic if still recovering from an error.
> +
> +   ??? This is taken from the C++ parser, but building up messages in
> +   this way is not i18n-friendly and some other approach should be
> +   used.  */

As a tangential note, the C++ parser does it this way because the C
parser does it this way, and the C parser did it this way because
Bison's error reporting logic more or less forced it to.  Once your
new parser goes in, it will be easy to clean all of this up.

> +/* If the next token is the indicated keyword, consume it.  Otherwise,
> +   issue the error MSGID.  Returns true if found, false otherwise.  */
> +
> +static bool
> +c_parser_require_keyword (c_parser *parser,
> +			  enum rid keyword,
> +			  const char *msgid)

Please put this next to c_parser_require.

> +static void
> +c_parser_translation_unit (c_parser *parser)
> +{
> +  if (c_lexer_next_token_is (parser->lexer, CPP_EOF))
> +    {
> +      if (pedantic)
> +	pedwarn ("ISO C forbids an empty source file");
> +    }
> +  else
> +    {
> +      while (c_lexer_next_token_is_not (parser->lexer, CPP_EOF))
> +	{
> +	  void *obstack_position = obstack_alloc (&parser_obstack, 0);
> +	  ggc_collect ();
> +	  c_parser_external_declaration (parser);
> +	  obstack_free (&parser_obstack, obstack_position);
> +	}
> +    }
> +}

If control reaches the else block, we know the first token is not
CPP_EOF, so the while can be turned into a do-while; I do not think
the compiler can do that itself.

The obstack_alloc call can and should be hoisted out of the loop,
since your goal is to flush the entire obstack between external
declarations.


> +      int ext;
> +      SAVE_EXT_FLAGS (ext);

It is not obvious from the name that SAVE_EXT_FLAGS disables
diagnostics for extensions.  Also, it would be nice if the macro
returned a value (make it into an inline function maybe?) so that
this could be written

  int old_ext_flags = disable_extension_diagnostics ();
  ...
  restore_extension_diagnostics (old_ext_flags);

Also, (due to the recursive call to c_parser_external_declaration) you
accept arbitrary numbers of __extension__ keywords; is this really
intended?  (I see the grammar as commented does allow it.)

> +static void
> +c_parser_asm_definition (c_parser *parser)
> +{
> +  tree asm_str = c_parser_simple_asm_expr (parser);
> +  if (asm_str)
> +    assemble_asm (asm_str);
> +  c_parser_skip_until_found (parser, CPP_SEMICOLON, "expected ';'");
> +}

This isn't your fault, but in the modern compiler, calling
assemble_anything directly from the parser is wrong.  The only way
asm() at top level can continue to make sense is if we pass it along
to cgraph, and somehow get cgraph to preserve the global order of
functions and asm()s.  [Mind you, I would not have any problem with
dropping this feature.]

> +	      c_parser_error (parser, "expected ',' or ';'");
> +	      break;
> +	    }
> +	}
> +      else
> +	{
> +	  c_parser_error (parser,
> +			  "expected ':', ',', ';' or '__attribute__'");
> +	  break;

These want %<%>ifying.  There are several other cases - I'm not going
to call them all out explicitly.

> +  /* ??? The old parser accepted wide string literals here, but do we
> +     want to?  */

The C++ parser does not accept wide string literals in any form
of asm() string, so I support not accepting them in C either.

I strongly recommend you grab the C++ parser's logic for
context-dependent string constant translation, by the way, so that the
horrible kludge used for C can go away and the C++ parser can stop
duplicating the get-a-sequence-of-strings-concatenate-and-translate-them 
routine.

> +   attrib:
> +     empty
> +     any-word
> +     any-word ( identifier )
> +     any-word ( identifier , nonempty-expr-list )
> +     any-word ( expr-list )
> +
> +   where the "identifier" must not be declared as a type, and
> +   "any-word" may be any identifier (including one declared as a
> +   type), a reserved word storage class specifier, type specifier or
> +   type qualifier.  ??? This still leaves out most reserved keywords
> +   (following the old parser), shouldn't we include them, and why not
> +   allow identifiers declared as types to start the arguments?  */

What's really going on is that there is a finite set of attribute
keywords which, in the "any-word" context, have that meaning and no
other.  This set really ought to be tagged explicitly by the lexer.
The grammar inside the parentheses has more constraints than you have
here, but it's also highly dependent on the specific attribute; maybe
we should have individual attribute-parser functions, as we do for
pragmas.

> +   ??? Allowing old-style [n] designators has as a side-effect (copied
> +   from the old parser) allowing ".member" without "=" as a
> +   designator, but perhaps as this has never been documented (and only
> +   worked from 2.95.x onwards) it could be freely removed.

I think I prefer allowing ".member" without "=" as a designator, until
and unless we decide to drop support for all of this grammar that's
not C99-ish.  In fact, I think I would prefer that the extension be
regularized down to (a) allowing the omission of the "=" from any
C99-style designator, and (b) allowing "identifier:" as meaning the
same thing as ".identifier =".  Unless that would introduce additional
grammar ambiguities, but I don't think it would.

It's not obvious to me from code or comment - we do allow 
"[ value ... value ] = initializer", yes?  (That might be
standardizable independently from the rest of the extensions in this
area.)

> +  /* Two cases cannot and do not have line numbers associated: If stmt
> +     is degenerate, such as "2;", then stmt is an INTEGER_CST, which
> +     cannot hold line numbers.  But that's OK because the statement
> +     will either be changed to a MODIFY_EXPR during gimplification of
> +     the statement expr, or discarded.  If stmt was compound, but
> +     without new variables, we will have skipped the creation of a
> +     BIND and will have a bare STATEMENT_LIST.  But that's OK because
> +     (recursively) all of the component statements should already have
> +     line numbers assigned.  */

Can we arrange to scrap all no-op statements very early - "2;" other
than as the last statement of a statement expr, and so on?

> +   ??? In accordance with the old parser, the declaration may be a
> +   nested function, which is then rejected in check_for_loop_decls,
> +   but does it make any sense for this to be included in the grammar?

First reaction: Barf.

Second reaction: It would be nice to pitch out invalid declarations
early.  check_for_loop_decls is potentially an expensive operation.

> +/* The actual parser and external interface.  */
> +
> +static GTY (()) c_parser *the_parser;

Is this global used for anything other than GGC, and does it actually
need to be allocated under the garbage collector?

zw


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]