This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

[Bug c/67224] UTF-8 support for identifier names in GCC

From: "joseph at codesourcery dot com" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Thu, 20 Aug 2015 22:41:19 +0000
Subject: [Bug c/67224] UTF-8 support for identifier names in GCC
Auto-submitted: auto-generated
References: <bug-67224-4 at http dot gcc dot gnu dot org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #21 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
_cpp_interpret_identifier converts UCNs to UTF-8, which is the canonical 
internal form for identifiers.  For UTF-8 in identifiers, you just need to 
pass it straight through unmodified there.  (cpplib takes care to store 
the original spelling of the identifier as well for purposes for which 
that matters, but that's simply a matter of lex_identifier calling 
cpp_lookup on the original spelling as well as using 
_cpp_interpret_identifier to get the canonical version.)
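
To illustrate the canonical-form point, here is a minimal standalone sketch 
(not cpplib's actual code): UCN spellings are rewritten to UTF-8, while 
every other byte, including existing UTF-8, is copied unchanged.  The names 
utf8_encode, hex_value and canonical_spelling are hypothetical and exist 
only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point c (assumed <= 0x10FFFF) as UTF-8 into out, which must
   have room for 4 bytes.  Returns the number of bytes written. */
static size_t utf8_encode(uint32_t c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

/* Value of a hexadecimal digit, or -1 if ch is not one. */
static int hex_value(unsigned char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Hypothetical helper: write the canonical (UTF-8) form of an identifier
   spelling to out.  \uXXXX and \UXXXXXXXX become UTF-8; all other bytes,
   including UTF-8 already present, pass through unmodified.  Returns the
   output length, or (size_t)-1 on a malformed UCN or lack of space. */
static size_t canonical_spelling(const unsigned char *in, size_t len,
                                 unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;
    while (i < len) {
        if (in[i] == '\\' && i + 1 < len
            && (in[i + 1] == 'u' || in[i + 1] == 'U')) {
            size_t digits = (in[i + 1] == 'u') ? 4 : 8;
            uint32_t c = 0;
            if (len - i - 2 < digits)
                return (size_t)-1;
            for (size_t k = 0; k < digits; k++) {
                int h = hex_value(in[i + 2 + k]);
                if (h < 0)
                    return (size_t)-1;
                c = (c << 4) | (uint32_t)h;
            }
            /* Reject out-of-range values and surrogates; conservatively
               require room for the longest (4-byte) encoding. */
            if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF) || cap - o < 4)
                return (size_t)-1;
            o += utf8_encode(c, out + o);
            i += 2 + digits;
        } else {
            if (o >= cap)
                return (size_t)-1;
            out[o++] = in[i++];
        }
    }
    return o;
}
```

With this sketch, the spellings caf\u00E9 and café (written directly in 
UTF-8) both yield the same five-byte canonical form, which is why no 
UTF-8-to-UCN conversion is needed on input.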

So you never need to convert UTF-8 to UCNs in order to handle UTF-8 in 
identifiers (cpplib has logic to do so when needed for output, but you 
don't need to add anything new in that regard).  You do need to decode 
UTF-8 into character values for the code that checks normalization, which 
characters are allowed at the start of identifiers, etc., just as the 
existing code decodes UCNs into such values.  (But as I noted, a UCN not 
allowed in identifiers is lexed as part of an identifier, which is then 
considered invalid, whereas a UTF-8 character not allowed in identifiers 
should be lexed as a separate pp-token.  However, UTF-8 for a character 
allowed in identifiers but not at the start of an identifier should, I 
think, be lexed as an identifier character even at the start of an 
identifier, and then give an error for an invalid identifier if it appears 
at the start of an identifier.  That's my reading of the syntax 
productions in the C standard.)
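
The lexing rules described above can be sketched as follows.  This is an 
illustration only, not code from cpplib: utf8_decode, classify and 
scan_identifier are hypothetical names, and the range table is a tiny 
excerpt of what I believe the C11 Annex D ranges to be, so check it against 
the standard before relying on it.  Treating a leading ASCII digit as "not 
an identifier" stands in for pp-number handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..len).  Stores the code point in *cp
   and returns the number of bytes consumed, or 0 for malformed input
   (truncated, overlong, surrogate, or beyond U+10FFFF). */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n;
    uint32_t c, min;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;
    }
    if (len < n)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}

enum id_class { ID_NONE, ID_CONTINUE_ONLY, ID_START };

/* Classify a code point.  The non-ASCII ranges are an assumed excerpt,
   not the complete table. */
static enum id_class classify(uint32_t c)
{
    if (c == '_' || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return ID_START;
    if (c >= '0' && c <= '9')
        return ID_CONTINUE_ONLY;
    if (c >= 0x0300 && c <= 0x036F)   /* combining marks: not initial */
        return ID_CONTINUE_ONLY;
    if ((c >= 0x00C0 && c <= 0x00D6) || (c >= 0x00D8 && c <= 0x00F6)
        || (c >= 0x00F8 && c <= 0x00FF))
        return ID_START;
    return ID_NONE;
}

/* Return how many bytes at s lex as one identifier.  A character not
   allowed in identifiers ends the identifier (it becomes a separate
   pp-token).  A character allowed only in non-initial position is still
   lexed as part of the identifier, but *valid is cleared if it comes
   first. */
static size_t scan_identifier(const unsigned char *s, size_t len, int *valid)
{
    size_t pos = 0;

    *valid = 1;
    if (len > 0 && s[0] >= '0' && s[0] <= '9')
        return 0;                     /* would be a pp-number instead */
    while (pos < len) {
        uint32_t c;
        size_t n = utf8_decode(s + pos, len - pos, &c);
        enum id_class k;

        if (n == 0)
            break;
        k = classify(c);
        if (k == ID_NONE)
            break;
        if (pos == 0 && k == ID_CONTINUE_ONLY)
            *valid = 0;
        pos += n;
    }
    return pos;
}
```

For example, a UTF-8 U+0301 followed by x lexes as a three-byte identifier 
that is then flagged invalid, whereas in a×b the U+00D7 stops the 
identifier after a and is left to be lexed as its own pp-token.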

You can ignore anything claiming to handle UTF-EBCDIC.

References:
- [Bug c/67224] New: UTF-8 support for identifier names in GCC
  - From: ejolson at unr dot edu
