[Bug preprocessor/9449] UCNs not recognized in identifiers (c++/c99)

Thu Dec 16 23:05:00 GMT 2004

------- Additional Comments From joseph at codesourcery dot com  2004-12-16 23:04 -------
Subject: Re: UCNs not recognized in identifiers (c++/c99)

The following example illustrates the problems caused by the lack of 
normalisation.  (I still expect WG14 and WG21 to consider the lack of 
normalisation to be both the current meaning of the standards and their 
correct meaning in context, though future revisions might change the exact 
lists of characters.  Even so, this is an appropriate example to present 
to them, and it shows why diagnostics would be needed for various cases.)

\u05e9\u05bc\u05c1
\u05e9\u05c1\u05bc
are valid identifiers in C99 but not in C++, while
\ufb2c
is a valid identifier in C++ but not in C99.

In Unicode, the three are canonically equivalent, the first being both NFC 
and NFD.

05BC HEBREW POINT DAGESH OR MAPIQ (combining class 21)
05C1 HEBREW POINT SHIN DOT (combining class 24)
05E9 HEBREW LETTER SHIN (combining class 0)
FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT (combining class 0)

(U+FB2C is excluded from the compositions allowed in NFC, hence the 
decomposed form being NFC.)
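
The equivalence claims above can be checked against the Unicode data 
shipped with Python.  This is only a sketch for verification: the helper 
names (code_points, same_identifier, canonically_equivalent) are made up 
for illustration, and the results depend on the Unicode version of the 
Python build in use.  It does not model what any compiler actually does 
with these identifiers.

```python
import unicodedata

# The three spellings from the example above.
first = "\u05e9\u05bc\u05c1"   # SHIN, DAGESH, SHIN DOT
second = "\u05e9\u05c1\u05bc"  # SHIN, SHIN DOT, DAGESH
precomposed = "\ufb2c"         # SHIN WITH DAGESH AND SHIN DOT


def code_points(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def same_identifier(a, b):
    """Hypothetical non-normalising comparison: code point by code point."""
    return a == b


def canonically_equivalent(a, b):
    """Compare after canonical decomposition and reordering (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Names and combining classes of the characters involved.
    for ch in "\u05bc\u05c1\u05e9\ufb2c":
        print(f"{ord(ch):04X} {unicodedata.name(ch)} "
              f"(combining class {unicodedata.combining(ch)})")
    # Each spelling with its NFC and NFD forms.
    for s in (first, second, precomposed):
        print(code_points(s),
              "| NFC:", code_points(unicodedata.normalize("NFC", s)),
              "| NFD:", code_points(unicodedata.normalize("NFD", s)))
    print("first == second without normalisation:",
          same_identifier(first, second))
    print("first ~ second canonically:",
          canonically_equivalent(first, second))
```

All three spellings normalise to the first one under both NFC and NFD, 
while a plain code point comparison treats them as three distinct names.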

So with the current C and C++ standards, users cannot portably link some 
pointed Hebrew identifiers between the two languages; it would be 
advisable for them to avoid such identifiers.  A warning for any use of 
the characters permitted by C++ but not by C seems appropriate, in the 
expectation that such characters will cease to be permitted in future, 
regardless of any other changes there may be.  Making the C++ 
extern "C" \ufb2c into something else would seem to me to be the road to 
madness, though we could see how other implementations of the C++ ABI 
interpret it with regard to identifiers with UCNs.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449