Implementing Universal Character Names in identifiers

Thu Nov 7 00:09:00 GMT 2002

Martin v. Löwis wrote:-

> This patch implements UCNs in cpplib. It does so by converting the
> UCN to UTF-8, putting the UTF-8 bytes into the internal
> representation of the identifier.
> 
> The back-ends will transparently output the UTF-8 identifiers into the
> assembler file. If GNU as is used (or any other assembler supporting
> non-ASCII identifiers), these UTF-8 strings will be copied transparently
> into the object file. If the assembler does not support UTF-8, it
> will produce a diagnostic.
> 
> As a result of this strategy, UCNs are now allowed in all places
> mandated by the relevant standards, i.e. both in C99 and C++, and in
> all identifiers, including macro names.
> 
> Regards,
> Martin
> 
> 2002-10-27  Martin v. Löwis  <loewis@informatik.hu-berlin.de>
> 
> 	* c-lex.c (is_extended_char, utf8_extend_token): Remove.
> 	* cpplex.c (identifier_ucs_p, utf8_extend_token, 
> 	utf8_to_char): New functions.
> 	(parse_slow): Add utf8 parameter. Parse UCS names.
> 	(parse_identifier, parse_number): Adjust.
> 	(_cpp_lex_direct): Parse UCS names.
> 	(cpp_output_token): Print UCS names.
> 	* cpplib.h (NODE_UTF8): New flag.
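
For readers following along, the conversion described above (UCN code
point to UTF-8 bytes) can be sketched roughly as below.  This is not
the code from the patch: the name ucn_to_utf8 is hypothetical, and the
sketch assumes the modern four-byte limit (U+10FFFF) and rejects
surrogates, which the patch may or may not do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: encode the code
   point CP as UTF-8 into OUT (room for 4 bytes assumed).  Returns
   the number of bytes written, or 0 if CP is a surrogate or lies
   beyond U+10FFFF.  */
size_t
ucn_to_utf8 (unsigned long cp, unsigned char *out)
{
  assert (out != NULL);

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;			/* Surrogates are not characters.  */
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

With this sketch, \u00c1 in an identifier would be stored as the two
bytes 0xC3 0x81.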

It would be nice if you could handle escaped newlines inside a
UCN; I don't think your patch does that.  I think it's a bit
painful, and it is one of the reasons I haven't added support for
them yet.  It would be easier if there were a prescan of translation
phases 1 and 2 (a logical line at a time), which Zack and I keep
wondering whether to do or not.
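
To illustrate the escaped-newline case: the input

    int \u00\
    c1 = 1;

must lex as "int \u00c1 = 1;", because backslash-newline pairs are
deleted in phase 2, before tokenization.  A prescan of the kind
mentioned above might look roughly like the sketch below.  The name
splice_lines is hypothetical, nothing like it exists in cpplib as far
as this thread says, and the sketch works on a whole buffer and
ignores trigraphs (??/) and CR-LF line endings for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical phase 2 prescan: copy IN to OUT, dropping every
   backslash that is immediately followed by a newline, together with
   that newline.  OUT must have room for strlen (IN) + 1 bytes.
   Returns the length of the spliced text.  */
size_t
splice_lines (const char *in, char *out)
{
  size_t n = 0;

  assert (in != NULL && out != NULL);

  while (*in != '\0')
    {
      if (in[0] == '\\' && in[1] == '\n')
	in += 2;		/* Splice: skip both characters.  */
      else
	out[n++] = *in++;
    }
  out[n] = '\0';
  return n;
}
```

After such a pass the lexer would see the UCN as one contiguous
sequence and would not need to check for a line splice at each of
its characters.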

Also, as a QOI issue I'd like token pasting to work for UCNs,
though the standard does not require it.  Does your patch handle
that?
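
A possible test case for the pasting question is sketched below.  It
covers only the case where the UCN is already a complete token on the
right-hand side of ##.  Forming a UCN by pasting (for example \ ##
u00e9) is a different matter: C99 5.1.1.2 makes that undefined, which
is presumably the part the standard does not require.  The macro and
function names are made up for illustration, and the example assumes
a compiler that accepts UCNs in identifiers.

```c
/* Paste an ordinary identifier with one consisting of a single UCN.
   The result should be the identifier caf\u00e9.  */
#define PASTE(a, b) a##b

int PASTE (caf, \u00e9) = 1;

/* Refer to the variable by its directly spelled name; this only
   links if pasting produced the same identifier.  */
int
read_pasted (void)
{
  return caf\u00e9;
}
```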

Thanks,

Neil.