This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: UCNs-in-IDs patch


"Joseph S. Myers" <joseph@codesourcery.com> writes:

> 1. The following, preprocessed with -std=c99, should yield a diagnostic 
> for the duplicate macro definitions with different expansion.
> 
> #define foo \u00c1
> #define foo \u00C1

Could you quote a part of the standard which says that \u00c1 and
\u00C1 count as a "different expansion" (or, in standardese, that they
have "different spelling")?  I couldn't find any definition of the
word 'spelling' at all, but maybe I missed it.

Google's dictionary says that "spelling" means "the forming of words
with letters in an accepted order".  I would not consider \ to be a
letter, but "\u00c1" is a (string containing a) letter.


Alternatively, the C rationale says:

  ... there was still one problem, how to specify UCNs in the Standard.
  Both the C and C++ Committees studied this situation and the available
  solutions, and drafted three models:

  A.  Convert everything to UCNs in basic source characters as soon as
  possible, that is, in translation phase 1.

  B.  Use native encodings where possible, UCNs otherwise.

  C.  Convert everything to wide characters as soon as possible using an
  internal encoding that encompasses the entire source character set and
  all UCNs.

  Furthermore, in any place where a program could tell which model was
  being used, the standard should try to label those corner cases as
  undefined behavior.

  ...

  In any case, translation phase 1 begins with an implementation-defined
  mapping; and such mapping can choose to implement model A or C (but
  the implementation must specify it).

Since users can tell the difference between the three models only in
obscure corner cases, which the standard tried to make undefined
anyway, I think it's fine to say that we're doing model C.
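
As a minimal sketch of what model C implies for the first example, assume the internal encoding is UTF-8 and that only four-digit \uXXXX names matter. The helper names below (parse_ucn4, encode_utf8_bmp, ucn_same_internal) are hypothetical and are not taken from the patch. Once both spellings are converted, they can no longer be told apart:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: parse the first six characters of s as "\uXXXX".
   Returns 0 and stores the code point on success, -1 otherwise.
   Anything after the four hex digits is ignored. */
static int parse_ucn4(const char *s, unsigned long *cp)
{
    assert(s != NULL && cp != NULL);
    if (s[0] != '\\' || s[1] != 'u')
        return -1;
    unsigned long value = 0;
    for (int i = 2; i < 6; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isxdigit(c))          /* also stops at an early NUL */
            return -1;
        /* Assumes 'a'..'f' are contiguous, as in ASCII. */
        int digit = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        value = value * 16 + (unsigned long)digit;
    }
    *cp = value;
    return 0;
}

/* Encode a code point in the range 0..0xFFFF as UTF-8.
   Returns the number of bytes written (1 to 3).
   Surrogate code points are not rejected in this sketch. */
static size_t encode_utf8_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Returns 1 if both UCN spellings map to the same internal bytes. */
static int ucn_same_internal(const char *a, const char *b)
{
    unsigned long ca, cb;
    unsigned char ua[3], ub[3];
    if (parse_ucn4(a, &ca) != 0 || parse_ucn4(b, &cb) != 0)
        return 0;
    size_t na = encode_utf8_bmp(ca, ua);
    size_t nb = encode_utf8_bmp(cb, ub);
    return na == nb && memcmp(ua, ub, na) == 0;
}
```

With this, ucn_same_internal("\\u00c1", "\\u00C1") returns 1 even though strcmp on the two source spellings reports a difference: both become the bytes C3 81. A macro-redefinition check that compares internal spellings would therefore see the two #defines as identical.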

> 3. The following, compiled as C++, should execute successfully instead of 
> aborting.
> 
> #include <stdlib.h>
> #include <string.h>
> #define h(s) #s
> #define str(s) h(s)
> int
> main()
> {
>   if (strcmp(str(str(\u00c1)), "\"\\u00c1\"")) abort ();
>   if (strcmp(str(str(\u00C1)), "\"\\u00C1\"")) abort ();
> }

[lex.phases] paragraph 1 says:

  An implementation may use any internal encoding, so long as an actual
  extended character encountered in the source file, and the same
  extended character expressed in the source file as a
  universal-character-name (i.e. using the \uXXXX notation), are handled
  equivalently.

I believe this is specifically intended to allow implementations to
use UTF-8 (or some other encoding) as an internal encoding for
identifiers, and so when [cpp.stringize] says "the original spelling"
it means the spelling in the internal encoding, not as the user wrote it.
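
To make that reading concrete, here is a rough model of the # operator working on tokens that are already stored in the internal encoding. It is only a sketch: the function name stringize, the UTF-8 storage, and the buffer interface are all assumptions made for illustration, not a description of what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the # operator: wrap the token's stored spelling
   in double quotes.  Inside string and character literals, " and \ get a
   backslash in front; other tokens are copied byte for byte.
   Returns the length written (excluding the NUL), or 0 if out is too
   small. */
static size_t stringize(const char *token, char *out, size_t cap)
{
    assert(token != NULL && out != NULL);
    int is_literal = token[0] == '"' || token[0] == '\'';
    size_t n = 0;
    if (cap < 3)                    /* need two quotes and a NUL at least */
        return 0;
    out[n++] = '"';
    for (const char *p = token; *p; p++) {
        int escape = is_literal && (*p == '"' || *p == '\\');
        /* Room for this byte (plus escape), the closing quote, the NUL. */
        if (n + (escape ? 2 : 1) + 2 > cap)
            return 0;
        if (escape)
            out[n++] = '\\';
        out[n++] = *p;
    }
    out[n++] = '"';
    out[n] = '\0';
    return n;
}
```

If the identifier written as \u00c1 or \u00C1 is stored as the UTF-8 bytes C3 81, then stringize applied twice produces the same literal spelling for both, containing those two bytes and no trace of the hex-digit case the user typed. Under this reading the quoted test program would see identical strings for its two str(str(...)) calls.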


Alternatively, phase 1 starts with the same mapping as for C, and so the
comment from the C rationale applies for C++ too.

