This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: UCNs-in-IDs patch
"Joseph S. Myers" <joseph@codesourcery.com> writes:
> 1. The following, preprocessed with -std=c99, should yield a diagnostic
> for the duplicate macro definitions with different expansion.
>
> #define foo \u00c1
> #define foo \u00C1
Could you quote a part of the standard which says that \u00c1 and
\u00C1 count as a "different expansion" (or, in standardese, that they
have "different spelling")? I couldn't find any definition of the
word 'spelling' at all, but maybe I missed it.
Google's dictionary says that "spelling" means "the forming of words
with letters in an accepted order". I would not consider \ to be a
letter, but "\u00c1" is a (string containing a) letter.
Alternatively, the C rationale says:
... there was still one problem, how to specify UCNs in the Standard.
Both the C and C++ Committees studied this situation and the available
solutions, and drafted three models:
A. Convert everything to UCNs in basic source characters as soon as
possible, that is, in translation phase 1.
B. Use native encodings where possible, UCNs otherwise.
C. Convert everything to wide characters as soon as possible using an
internal encoding that encompasses the entire source character set and
all UCNs.
Furthermore, in any place where a program could tell which model was
being used, the standard should try to label those corner cases as
undefined behavior.
...
In any case, translation phase 1 begins with an implementation-defined
mapping, and such a mapping can be chosen so as to implement model A or
model C (though the implementation must document which).
Since users can tell the difference between the three models only in
obscure corner cases, which the standard tried to make undefined
anyway, I think it's fine to say that we're doing model C.
> 3. The following, compiled as C++, should execute successfully instead of
> aborting.
>
> #include <stdlib.h>
> #include <string.h>
> #define h(s) #s
> #define str(s) h(s)
> int
> main()
> {
> if (strcmp(str(str(\u00c1)), "\"\\u00c1\"")) abort ();
> if (strcmp(str(str(\u00C1)), "\"\\u00C1\"")) abort ();
> }
[lex.phases] paragraph 1 says:
An implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same
extended character expressed in the source file as a
universal-character-name (i.e. using the \uXXXX notation), are handled
equivalently.
I believe this is specifically intended to allow implementations to
use UTF-8 (or other encoding) as an internal encoding for identifiers,
and so when [cpp.stringize] says "the original spelling" it means in
the internal encoding, not as the user wrote it.
Alternatively, phase 1 starts with the same mapping as for C, and so the
comment from the C rationale applies for C++ too.