Universal Character Names, v2
Martin v. Löwis
martin@v.loewis.de
Thu Nov 28 15:43:00 GMT 2002
This is the second version of my UCN patch. It incorporates all
comments from the previous patch (AFAIR).
Specifically, the changes relative to the previous patch are:
- Update character sets for C99, and C++ DR 131.
- Support escaped newlines in the middle of a UCN. This is done
through the addition of the maybe_read_ucs_reader function, which
uses get_effective_char internally.
- Support UCNs in numbers. In the internal representation, such
a number still has the UCN in it, i.e. no conversion to UTF-8
takes place. Such numbers will only be valid if they are pasted
with an identifier.
- Support pasting of names that have UCNs in them. For that,
cpp_spell_token had to be updated.
- Check for assembler UTF-8 support, and reject UCNs if no such
support is available. As a side effect, gcj will automatically
use UTF-8 mangling where g++ supports UCNs.
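To make the two internal spellings concrete, here is a minimal standalone sketch (not code from the patch; the sketch_* names are made up for illustration). It encodes a code point as UTF-8, which is what identifiers use internally, and spells it back as a UCN, which is what numbers keep and what is written out again when a token is spelled. It assumes the code point is at most 0x10FFFF.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: store the UTF-8 encoding of code point C in
   OUT (at least 4 bytes) and return the number of bytes written,
   or 0 if C is above 0x10FFFF.  */
static size_t
sketch_utf8_encode (unsigned long c, unsigned char *out)
{
  if (c <= 0x7f)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c <= 0x7ff)
    {
      out[0] = (unsigned char) (0xc0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3f));
      return 2;
    }
  if (c <= 0xffff)
    {
      out[0] = (unsigned char) (0xe0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (c & 0x3f));
      return 3;
    }
  if (c <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (c & 0x3f));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: spell code point C as a UCN, using \uXXXX
   below 0x10000 and \UXXXXXXXX otherwise.  Returns the snprintf
   result, i.e. the length of the spelling.  */
static int
sketch_ucn_spell (unsigned long c, char *out, size_t size)
{
  if (c < 0x10000)
    return snprintf (out, size, "\\u%04lx", c);
  return snprintf (out, size, "\\U%08lx", c);
}
```

For example, U+00C0 becomes the two bytes 0xC3 0x80 (the same bytes the configure test feeds to the assembler) and is spelled back as \u00c0.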
I have considered the following comments, but chose to take a
different approach:
- I have not put the test function for characters in libiberty.
It is quite specific to C and C++, and only ever used in the
preprocessor.
- I have not decided to deviate from the C and C++ standards for
character tests. Reviewers commented that they dislike the approach
taken by the standards committees, and that the relevant Unicode
specification should be taken into account instead. I disagree, as I
consider the approach of giving explicit lists quite reasonable.
More importantly, I think that standards conformance should be
valued quite highly unless specific user demands require ignoring
or extending the standards; that is not the case for this
specific issue.
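For readers who want to see what "explicit lists" amounts to, here is a small sketch of the same idea in table form. This is not how the patch does it (the patch uses chained comparisons); the struct and function names are made up, and the table holds only a short excerpt of ranges that appear in the patch (Armenian, Georgian, Bopomofo).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: an inclusive range of code points.  */
struct sketch_range
{
  unsigned int first, last;
};

/* Excerpt only, sorted by code point: Armenian, Georgian, Bopomofo.  */
static const struct sketch_range sketch_ranges[] = {
  { 0x0531, 0x0556 },
  { 0x0561, 0x0587 },
  { 0x10a0, 0x10c5 },
  { 0x10d0, 0x10f6 },
  { 0x3105, 0x312c },
};

/* Return 1 if C lies in one of the ranges above, else 0.
   Binary search over the sorted, non-overlapping table.  */
static int
sketch_in_ranges (unsigned int c)
{
  size_t lo = 0;
  size_t hi = sizeof sketch_ranges / sizeof sketch_ranges[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c < sketch_ranges[mid].first)
        hi = mid;
      else if (c > sketch_ranges[mid].last)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}
```

A real table would also need a per-range marker for entries that only one of the two standards allows, which this sketch leaves out.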
A few issues need to be resolved with the Java compiler:
- Somehow, defining HAVE_AS_UTF8 (which the patch does) triggers
bugs in the mangler; it will now emit symbols like
_ZN4java4lang6Double8<clinit>Ev
- The Java mangler currently emits the number of characters for a
UTF-8 <source-name>; the ABI specifies that this ought to be the
byte length.
I'd appreciate it if some Java expert could help with resolving the first
issue; resolving the second one seems simple.
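To illustrate the second issue: the two lengths differ as soon as a name contains a non-ASCII character. The sketch below is not mangler code and the function name is made up; it counts characters in a UTF-8 string by skipping continuation bytes, so the result can be compared with strlen, which gives the byte length.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in the
   NUL-terminated UTF-8 string S.  Every byte that is not a
   continuation byte (10xxxxxx) starts a new character.  Assumes S
   is well-formed UTF-8.  */
static size_t
sketch_utf8_chars (const char *s)
{
  size_t n = 0;

  for (; *s; s++)
    if (((unsigned char) *s & 0xc0) != 0x80)
      n++;
  return n;
}
```

For the name "foo" followed by U+00C0 (bytes 0xC3 0x80), strlen gives 5 while the character count is 4; as described above, the length prefix of the <source-name> should then be 5.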
Any comments appreciated,
Martin
2002-10-27 Martin v. Löwis <loewis@informatik.hu-berlin.de>
* c-lex.c (is_extended_char, utf8_extend_token): Remove.
* cpplex.c (identifier_ucs_p, utf8_extend_token,
ucn_extend_token, utf8_to_char, maybe_read_ucs_reader): New functions.
(parse_slow): Add utf8 parameter. Parse UCS names.
(parse_identifier, parse_number): Adjust.
(_cpp_lex_direct): Parse UCS names.
(cpp_output_token): Print UCS names.
(cpp_spell_token, cpp_output_token): Unparse extended characters.
* cpplib.h (NODE_USES_EXTENDED_CHARACTERS): New flag.
* configure.in (HAVE_AS_UTF8): New test.
* configure, config.in: Rebuilt.
Index: c-lex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/c-lex.c,v
retrieving revision 1.190
diff -u -r1.190 c-lex.c
--- c-lex.c 16 Sep 2002 16:36:31 -0000 1.190
+++ c-lex.c 28 Nov 2002 22:50:07 -0000
@@ -356,314 +356,6 @@
(const char *) NODE_NAME (node));
}
-#if 0 /* not yet */
-/* Returns nonzero if C is a universal-character-name. Give an error if it
- is not one which may appear in an identifier, as per [extendid].
-
- Note that extended character support in identifiers has not yet been
- implemented. It is my personal opinion that this is not a desirable
- feature. Portable code cannot count on support for more than the basic
- identifier character set. */
-
-static inline int
-is_extended_char (c)
- int c;
-{
-#ifdef TARGET_EBCDIC
- return 0;
-#else
- /* ASCII. */
- if (c < 0x7f)
- return 0;
-
- /* None of the valid chars are outside the Basic Multilingual Plane (the
- low 16 bits). */
- if (c > 0xffff)
- {
- error ("universal-character-name '\\U%08x' not valid in identifier", c);
- return 1;
- }
-
- /* Latin */
- if ((c >= 0x00c0 && c <= 0x00d6)
- || (c >= 0x00d8 && c <= 0x00f6)
- || (c >= 0x00f8 && c <= 0x01f5)
- || (c >= 0x01fa && c <= 0x0217)
- || (c >= 0x0250 && c <= 0x02a8)
- || (c >= 0x1e00 && c <= 0x1e9a)
- || (c >= 0x1ea0 && c <= 0x1ef9))
- return 1;
-
- /* Greek */
- if ((c == 0x0384)
- || (c >= 0x0388 && c <= 0x038a)
- || (c == 0x038c)
- || (c >= 0x038e && c <= 0x03a1)
- || (c >= 0x03a3 && c <= 0x03ce)
- || (c >= 0x03d0 && c <= 0x03d6)
- || (c == 0x03da)
- || (c == 0x03dc)
- || (c == 0x03de)
- || (c == 0x03e0)
- || (c >= 0x03e2 && c <= 0x03f3)
- || (c >= 0x1f00 && c <= 0x1f15)
- || (c >= 0x1f18 && c <= 0x1f1d)
- || (c >= 0x1f20 && c <= 0x1f45)
- || (c >= 0x1f48 && c <= 0x1f4d)
- || (c >= 0x1f50 && c <= 0x1f57)
- || (c == 0x1f59)
- || (c == 0x1f5b)
- || (c == 0x1f5d)
- || (c >= 0x1f5f && c <= 0x1f7d)
- || (c >= 0x1f80 && c <= 0x1fb4)
- || (c >= 0x1fb6 && c <= 0x1fbc)
- || (c >= 0x1fc2 && c <= 0x1fc4)
- || (c >= 0x1fc6 && c <= 0x1fcc)
- || (c >= 0x1fd0 && c <= 0x1fd3)
- || (c >= 0x1fd6 && c <= 0x1fdb)
- || (c >= 0x1fe0 && c <= 0x1fec)
- || (c >= 0x1ff2 && c <= 0x1ff4)
- || (c >= 0x1ff6 && c <= 0x1ffc))
- return 1;
-
- /* Cyrillic */
- if ((c >= 0x0401 && c <= 0x040d)
- || (c >= 0x040f && c <= 0x044f)
- || (c >= 0x0451 && c <= 0x045c)
- || (c >= 0x045e && c <= 0x0481)
- || (c >= 0x0490 && c <= 0x04c4)
- || (c >= 0x04c7 && c <= 0x04c8)
- || (c >= 0x04cb && c <= 0x04cc)
- || (c >= 0x04d0 && c <= 0x04eb)
- || (c >= 0x04ee && c <= 0x04f5)
- || (c >= 0x04f8 && c <= 0x04f9))
- return 1;
-
- /* Armenian */
- if ((c >= 0x0531 && c <= 0x0556)
- || (c >= 0x0561 && c <= 0x0587))
- return 1;
-
- /* Hebrew */
- if ((c >= 0x05d0 && c <= 0x05ea)
- || (c >= 0x05f0 && c <= 0x05f4))
- return 1;
-
- /* Arabic */
- if ((c >= 0x0621 && c <= 0x063a)
- || (c >= 0x0640 && c <= 0x0652)
- || (c >= 0x0670 && c <= 0x06b7)
- || (c >= 0x06ba && c <= 0x06be)
- || (c >= 0x06c0 && c <= 0x06ce)
- || (c >= 0x06e5 && c <= 0x06e7))
- return 1;
-
- /* Devanagari */
- if ((c >= 0x0905 && c <= 0x0939)
- || (c >= 0x0958 && c <= 0x0962))
- return 1;
-
- /* Bengali */
- if ((c >= 0x0985 && c <= 0x098c)
- || (c >= 0x098f && c <= 0x0990)
- || (c >= 0x0993 && c <= 0x09a8)
- || (c >= 0x09aa && c <= 0x09b0)
- || (c == 0x09b2)
- || (c >= 0x09b6 && c <= 0x09b9)
- || (c >= 0x09dc && c <= 0x09dd)
- || (c >= 0x09df && c <= 0x09e1)
- || (c >= 0x09f0 && c <= 0x09f1))
- return 1;
-
- /* Gurmukhi */
- if ((c >= 0x0a05 && c <= 0x0a0a)
- || (c >= 0x0a0f && c <= 0x0a10)
- || (c >= 0x0a13 && c <= 0x0a28)
- || (c >= 0x0a2a && c <= 0x0a30)
- || (c >= 0x0a32 && c <= 0x0a33)
- || (c >= 0x0a35 && c <= 0x0a36)
- || (c >= 0x0a38 && c <= 0x0a39)
- || (c >= 0x0a59 && c <= 0x0a5c)
- || (c == 0x0a5e))
- return 1;
-
- /* Gujarati */
- if ((c >= 0x0a85 && c <= 0x0a8b)
- || (c == 0x0a8d)
- || (c >= 0x0a8f && c <= 0x0a91)
- || (c >= 0x0a93 && c <= 0x0aa8)
- || (c >= 0x0aaa && c <= 0x0ab0)
- || (c >= 0x0ab2 && c <= 0x0ab3)
- || (c >= 0x0ab5 && c <= 0x0ab9)
- || (c == 0x0ae0))
- return 1;
-
- /* Oriya */
- if ((c >= 0x0b05 && c <= 0x0b0c)
- || (c >= 0x0b0f && c <= 0x0b10)
- || (c >= 0x0b13 && c <= 0x0b28)
- || (c >= 0x0b2a && c <= 0x0b30)
- || (c >= 0x0b32 && c <= 0x0b33)
- || (c >= 0x0b36 && c <= 0x0b39)
- || (c >= 0x0b5c && c <= 0x0b5d)
- || (c >= 0x0b5f && c <= 0x0b61))
- return 1;
-
- /* Tamil */
- if ((c >= 0x0b85 && c <= 0x0b8a)
- || (c >= 0x0b8e && c <= 0x0b90)
- || (c >= 0x0b92 && c <= 0x0b95)
- || (c >= 0x0b99 && c <= 0x0b9a)
- || (c == 0x0b9c)
- || (c >= 0x0b9e && c <= 0x0b9f)
- || (c >= 0x0ba3 && c <= 0x0ba4)
- || (c >= 0x0ba8 && c <= 0x0baa)
- || (c >= 0x0bae && c <= 0x0bb5)
- || (c >= 0x0bb7 && c <= 0x0bb9))
- return 1;
-
- /* Telugu */
- if ((c >= 0x0c05 && c <= 0x0c0c)
- || (c >= 0x0c0e && c <= 0x0c10)
- || (c >= 0x0c12 && c <= 0x0c28)
- || (c >= 0x0c2a && c <= 0x0c33)
- || (c >= 0x0c35 && c <= 0x0c39)
- || (c >= 0x0c60 && c <= 0x0c61))
- return 1;
-
- /* Kannada */
- if ((c >= 0x0c85 && c <= 0x0c8c)
- || (c >= 0x0c8e && c <= 0x0c90)
- || (c >= 0x0c92 && c <= 0x0ca8)
- || (c >= 0x0caa && c <= 0x0cb3)
- || (c >= 0x0cb5 && c <= 0x0cb9)
- || (c >= 0x0ce0 && c <= 0x0ce1))
- return 1;
-
- /* Malayalam */
- if ((c >= 0x0d05 && c <= 0x0d0c)
- || (c >= 0x0d0e && c <= 0x0d10)
- || (c >= 0x0d12 && c <= 0x0d28)
- || (c >= 0x0d2a && c <= 0x0d39)
- || (c >= 0x0d60 && c <= 0x0d61))
- return 1;
-
- /* Thai */
- if ((c >= 0x0e01 && c <= 0x0e30)
- || (c >= 0x0e32 && c <= 0x0e33)
- || (c >= 0x0e40 && c <= 0x0e46)
- || (c >= 0x0e4f && c <= 0x0e5b))
- return 1;
-
- /* Lao */
- if ((c >= 0x0e81 && c <= 0x0e82)
- || (c == 0x0e84)
- || (c == 0x0e87)
- || (c == 0x0e88)
- || (c == 0x0e8a)
- || (c == 0x0e0d)
- || (c >= 0x0e94 && c <= 0x0e97)
- || (c >= 0x0e99 && c <= 0x0e9f)
- || (c >= 0x0ea1 && c <= 0x0ea3)
- || (c == 0x0ea5)
- || (c == 0x0ea7)
- || (c == 0x0eaa)
- || (c == 0x0eab)
- || (c >= 0x0ead && c <= 0x0eb0)
- || (c == 0x0eb2)
- || (c == 0x0eb3)
- || (c == 0x0ebd)
- || (c >= 0x0ec0 && c <= 0x0ec4)
- || (c == 0x0ec6))
- return 1;
-
- /* Georgian */
- if ((c >= 0x10a0 && c <= 0x10c5)
- || (c >= 0x10d0 && c <= 0x10f6))
- return 1;
-
- /* Hiragana */
- if ((c >= 0x3041 && c <= 0x3094)
- || (c >= 0x309b && c <= 0x309e))
- return 1;
-
- /* Katakana */
- if ((c >= 0x30a1 && c <= 0x30fe))
- return 1;
-
- /* Bopmofo */
- if ((c >= 0x3105 && c <= 0x312c))
- return 1;
-
- /* Hangul */
- if ((c >= 0x1100 && c <= 0x1159)
- || (c >= 0x1161 && c <= 0x11a2)
- || (c >= 0x11a8 && c <= 0x11f9))
- return 1;
-
- /* CJK Unified Ideographs */
- if ((c >= 0xf900 && c <= 0xfa2d)
- || (c >= 0xfb1f && c <= 0xfb36)
- || (c >= 0xfb38 && c <= 0xfb3c)
- || (c == 0xfb3e)
- || (c >= 0xfb40 && c <= 0xfb41)
- || (c >= 0xfb42 && c <= 0xfb44)
- || (c >= 0xfb46 && c <= 0xfbb1)
- || (c >= 0xfbd3 && c <= 0xfd3f)
- || (c >= 0xfd50 && c <= 0xfd8f)
- || (c >= 0xfd92 && c <= 0xfdc7)
- || (c >= 0xfdf0 && c <= 0xfdfb)
- || (c >= 0xfe70 && c <= 0xfe72)
- || (c == 0xfe74)
- || (c >= 0xfe76 && c <= 0xfefc)
- || (c >= 0xff21 && c <= 0xff3a)
- || (c >= 0xff41 && c <= 0xff5a)
- || (c >= 0xff66 && c <= 0xffbe)
- || (c >= 0xffc2 && c <= 0xffc7)
- || (c >= 0xffca && c <= 0xffcf)
- || (c >= 0xffd2 && c <= 0xffd7)
- || (c >= 0xffda && c <= 0xffdc)
- || (c >= 0x4e00 && c <= 0x9fa5))
- return 1;
-
- error ("universal-character-name '\\u%04x' not valid in identifier", c);
- return 1;
-#endif
-}
-
-/* Add the UTF-8 representation of C to the token_buffer. */
-
-static void
-utf8_extend_token (c)
- int c;
-{
- int shift, mask;
-
- if (c <= 0x0000007f)
- {
- extend_token (c);
- return;
- }
- else if (c <= 0x000007ff)
- shift = 6, mask = 0xc0;
- else if (c <= 0x0000ffff)
- shift = 12, mask = 0xe0;
- else if (c <= 0x001fffff)
- shift = 18, mask = 0xf0;
- else if (c <= 0x03ffffff)
- shift = 24, mask = 0xf8;
- else
- shift = 30, mask = 0xfc;
-
- extend_token (mask | (c >> shift));
- do
- {
- shift -= 6;
- extend_token ((unsigned char) (0x80 | (c >> shift)));
- }
- while (shift);
-}
-#endif
int
c_lex (value)
Index: configure.in
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/configure.in,v
retrieving revision 1.626
diff -u -r1.626 configure.in
--- configure.in 26 Nov 2002 20:08:07 -0000 1.626
+++ configure.in 28 Nov 2002 22:50:13 -0000
@@ -1889,6 +1889,22 @@
fi
AC_MSG_RESULT($gcc_cv_as_tls)
+AC_MSG_CHECKING(assembler support for UTF-8 identifiers)
+gcc_cv_as_utf8="no"
+if test x$gcc_cv_as != x; then
+ echo fooab:|tr ab '\303\200' > conftest.s
+ if $gcc_cv_as --fatal-warnings -o conftest.o conftest.s > /dev/null 2>&1
+ then
+ gcc_cv_as_utf8=yes
+ fi
+ rm -rf conftest.s
+fi
+if test "$gcc_cv_as_utf8" = yes; then
+ AC_DEFINE(HAVE_AS_UTF8, 1,
+ [Define if your assembler supports UTF-8 bytes in identifiers])
+fi
+AC_MSG_RESULT($gcc_cv_as_utf8)
+
case "$target" in
# All TARGET_ABI_OSF targets.
alpha*-*-osf* | alpha*-*-linux* | alpha*-*-*bsd*)
Index: cpplex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplex.c,v
retrieving revision 1.215
diff -u -p -r1.215 cpplex.c
--- cpplex.c 26 Sep 2002 22:25:12 -0000 1.215
+++ cpplex.c 28 Nov 2002 23:04:54 -0000
@@ -71,7 +71,7 @@ static void adjust_column PARAMS ((cpp_r
static int skip_whitespace PARAMS ((cpp_reader *, cppchar_t));
static cpp_hashnode *parse_identifier PARAMS ((cpp_reader *));
static uchar *parse_slow PARAMS ((cpp_reader *, const uchar *, int,
- unsigned int *));
+ unsigned int *, unsigned int *));
static void parse_number PARAMS ((cpp_reader *, cpp_string *, int));
static int unescaped_terminator_p PARAMS ((cpp_reader *, const uchar *));
static void parse_string PARAMS ((cpp_reader *, cpp_token *, cppchar_t));
@@ -82,10 +82,16 @@ static bool continue_after_nul PARAMS ((
static int name_p PARAMS ((cpp_reader *, const cpp_string *));
static int maybe_read_ucs PARAMS ((cpp_reader *, const unsigned char **,
const unsigned char *, cppchar_t *));
+static int maybe_read_ucs_reader PARAMS ((cpp_reader *, cppchar_t *));
static tokenrun *next_tokenrun PARAMS ((tokenrun *));
static unsigned int hex_digit_value PARAMS ((unsigned int));
static _cpp_buff *new_buff PARAMS ((size_t));
+static bool identifier_ucs_p PARAMS ((cpp_reader *, cppchar_t, int));
+static void utf8_extend_token PARAMS ((struct obstack *, int));
+static void ucn_extend_token PARAMS ((struct obstack *, int));
+static cppchar_t utf8_to_char PARAMS((const unsigned char **));
+
/* Utility routine:
@@ -161,6 +167,673 @@ trigraph_p (pfile)
return accept;
}
+/* Returns nonzero if C is a universal-character-name. Give an error
+ if it is not one which may appear in an identifier, as per C++98
+ Annex E [extendid], and C99 Annex D. */
+
+static bool
+identifier_ucs_p (pfile, c, allow_digits)
+ cpp_reader *pfile;
+ cppchar_t c;
+ int allow_digits;
+{
+#ifdef TARGET_EBCDIC
+ return 0;
+#else
+ int cxx98 = CPP_OPTION (pfile, cplusplus);
+ int c99 = CPP_OPTION (pfile, c99);
+
+ /* ASCII. */
+ if (c < 0x7f)
+ return 0;
+
+ /* None of the valid chars are outside the Basic Multilingual Plane (the
+ low 16 bits). */
+ if (c > 0xffff)
+ {
+ cpp_error_with_line (pfile, DL_ERROR,
+ pfile->line, 1, /* XXX */
+ "universal-character-name '\\U%08x' not valid in identifier", (int)c);
+ return 0;
+ }
+
+#define NOTIN_C99(code) if(c==code && c99) goto fail
+#define NOTIN_CXX98(code) if(c==code && cxx98) goto fail
+
+ /* Latin */
+ if ((c == 0x00aa)
+ || (c == 0x00ba)
+ || (c >= 0x00c0 && c <= 0x00d6)
+ || (c >= 0x00d8 && c <= 0x00f6)
+ || (c >= 0x00f8 && c <= 0x01f5)
+ || (c >= 0x01fa && c <= 0x0217)
+ || (c >= 0x0250 && c <= 0x02a8)
+ || (c >= 0x1e00 && c <= 0x1e9b)
+ || (c >= 0x1ea0 && c <= 0x1ef9)
+ || (c == 0x207F))
+ {
+ NOTIN_CXX98(0x00aa);
+ NOTIN_CXX98(0x00ba);
+ NOTIN_CXX98(0x1e9b);
+ NOTIN_CXX98(0x207f);
+ return 1;
+ }
+
+ /* Greek */
+ if ((c == 0x0384)
+ || (c >= 0x0388 && c <= 0x038a)
+ || (c == 0x038c)
+ || (c >= 0x038e && c <= 0x03a1)
+ || (c >= 0x03a3 && c <= 0x03ce)
+ || (c >= 0x03d0 && c <= 0x03d6)
+ || (c == 0x03da)
+ || (c == 0x03dc)
+ || (c == 0x03de)
+ || (c == 0x03e0)
+ || (c >= 0x03e2 && c <= 0x03f3)
+ || (c >= 0x1f00 && c <= 0x1f15)
+ || (c >= 0x1f18 && c <= 0x1f1d)
+ || (c >= 0x1f20 && c <= 0x1f45)
+ || (c >= 0x1f48 && c <= 0x1f4d)
+ || (c >= 0x1f50 && c <= 0x1f57)
+ || (c == 0x1f59)
+ || (c == 0x1f5b)
+ || (c == 0x1f5d)
+ || (c >= 0x1f5f && c <= 0x1f7d)
+ || (c >= 0x1f80 && c <= 0x1fb4)
+ || (c >= 0x1fb6 && c <= 0x1fbc)
+ || (c >= 0x1fc2 && c <= 0x1fc4)
+ || (c >= 0x1fc6 && c <= 0x1fcc)
+ || (c >= 0x1fd0 && c <= 0x1fd3)
+ || (c >= 0x1fd6 && c <= 0x1fdb)
+ || (c >= 0x1fe0 && c <= 0x1fec)
+ || (c >= 0x1ff2 && c <= 0x1ff4)
+ || (c >= 0x1ff6 && c <= 0x1ffc))
+ {
+ NOTIN_C99(0x0384);
+ return 1;
+ }
+
+ /* Cyrillic */
+ if ((c >= 0x0401 && c <= 0x044f)
+ || (c >= 0x0451 && c <= 0x045c)
+ || (c >= 0x045e && c <= 0x0481)
+ || (c >= 0x0490 && c <= 0x04c4)
+ || (c >= 0x04c7 && c <= 0x04c8)
+ || (c >= 0x04cb && c <= 0x04cc)
+ || (c >= 0x04d0 && c <= 0x04eb)
+ || (c >= 0x04ee && c <= 0x04f5)
+ || (c >= 0x04f8 && c <= 0x04f9))
+ {
+ NOTIN_C99(0x040d);
+ NOTIN_CXX98(0x040e);
+ return 1;
+ }
+
+ /* Armenian */
+ if ((c >= 0x0531 && c <= 0x0556)
+ || (c >= 0x0561 && c <= 0x0587))
+ {
+ return 1;
+ }
+
+ /* Hebrew */
+ if ((c >= 0x05B0 && c <= 0x05B9)
+ || (c >= 0x05BB&& c <= 0x05BD)
+ || (c == 0x05BF)
+ || (c >= 0x05C1 && c <= 0x05C2))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x05d0 && c <= 0x05ea)
+ || (c >= 0x05f0 && c <= 0x05f4))
+ {
+ NOTIN_C99(0x05f3);
+ NOTIN_C99(0x05f4);
+ return 1;
+ }
+
+ /* Arabic */
+ if ((c >= 0x06d0 && c <= 0x06dc)
+ || (c >= 0x06ea && c <= 0x06ed))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0621 && c <= 0x063a)
+ || (c >= 0x0640 && c <= 0x0652)
+ || (c >= 0x0670 && c <= 0x06b7)
+ || (c >= 0x06ba && c <= 0x06be)
+ || (c >= 0x06c0 && c <= 0x06ce)
+ || (c >= 0x06e5 && c <= 0x06e8))
+ {
+ NOTIN_CXX98(0x06e8);
+ return 1;
+ }
+
+ /* Devanagari */
+ if ((c >= 0x0901 && c <= 0x0903)
+ || (c >= 0x093e && c <= 0x094d)
+ || (c >= 0x0950 && c <= 0x0952))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0905 && c <= 0x0939)
+ || (c >= 0x0958 && c <= 0x0963))
+ {
+ NOTIN_CXX98(0x0963);
+ return 1;
+ }
+
+ /* Bengali */
+ if ((c >= 0x0981 && c <= 0x0983)
+ || (c >= 0x09be && c <= 0x09c4)
+ || (c >= 0x09c7 && c <= 0x09c8)
+ || (c >= 0x09cb && c <= 0x09cd))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0985 && c <= 0x098c)
+ || (c >= 0x098f && c <= 0x0990)
+ || (c >= 0x0993 && c <= 0x09a8)
+ || (c >= 0x09aa && c <= 0x09b0)
+ || (c == 0x09b2)
+ || (c >= 0x09b6 && c <= 0x09b9)
+ || (c >= 0x09dc && c <= 0x09dd)
+ || (c >= 0x09df && c <= 0x09e3)
+ || (c >= 0x09f0 && c <= 0x09f1))
+ {
+ NOTIN_CXX98(0x09e2);
+ NOTIN_CXX98(0x09e3);
+ return 1;
+ }
+
+ /* Gurmukhi */
+ if ((c == 0x0a02)
+ || (c >= 0x0a3e && c <= 0x0a42)
+ || (c >= 0x0a47 && c <= 0x0a48)
+ || (c >= 0x0a4b && c <= 0x0a4d)
+ || (c == 0x0a74))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0a05 && c <= 0x0a0a)
+ || (c >= 0x0a0f && c <= 0x0a10)
+ || (c >= 0x0a13 && c <= 0x0a28)
+ || (c >= 0x0a2a && c <= 0x0a30)
+ || (c >= 0x0a32 && c <= 0x0a33)
+ || (c >= 0x0a35 && c <= 0x0a36)
+ || (c >= 0x0a38 && c <= 0x0a39)
+ || (c >= 0x0a59 && c <= 0x0a5c)
+ || (c == 0x0a5e))
+ {
+ return 1;
+ }
+
+ /* Gujarati */
+ if ((c == 0x0a02)
+ || (c >= 0x0a81 && c <= 0x0a81)
+ || (c >= 0x0abd && c <= 0x0ac5)
+ || (c >= 0x0ac7 && c <= 0x0ac9)
+ || (c >= 0x0acb && c <= 0x0acd)
+ || (c == 0x0ad0))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0a85 && c <= 0x0a8b)
+ || (c == 0x0a8d)
+ || (c >= 0x0a8f && c <= 0x0a91)
+ || (c >= 0x0a93 && c <= 0x0aa8)
+ || (c >= 0x0aaa && c <= 0x0ab0)
+ || (c >= 0x0ab2 && c <= 0x0ab3)
+ || (c >= 0x0ab5 && c <= 0x0ab9)
+ || (c == 0x0ad0)
+ || (c == 0x0ae0))
+ {
+ NOTIN_CXX98(0x0ad0);
+ return 1;
+ }
+
+ /* Oriya */
+ if ((c >= 0x0b01 && c <= 0x0b03)
+ || (c >= 0x0b3e && c <= 0x0b43)
+ || (c >= 0x0b47 && c <= 0x0b48)
+ || (c >= 0x0b4b && c <= 0x0b4d))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0b05 && c <= 0x0b0c)
+ || (c >= 0x0b0f && c <= 0x0b10)
+ || (c >= 0x0b13 && c <= 0x0b28)
+ || (c >= 0x0b2a && c <= 0x0b30)
+ || (c >= 0x0b32 && c <= 0x0b33)
+ || (c >= 0x0b36 && c <= 0x0b39)
+ || (c >= 0x0b5c && c <= 0x0b5d)
+ || (c >= 0x0b5f && c <= 0x0b61))
+ {
+ return 1;
+ }
+
+ /* Tamil */
+ if ((c >= 0x0b82 && c <= 0x0b83)
+ || (c >= 0x0bbe && c <= 0x0bc2)
+ || (c >= 0x0bc6 && c <= 0x0bc8)
+ || (c >= 0x0bca && c <= 0x0bcd))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0b85 && c <= 0x0b8a)
+ || (c >= 0x0b8e && c <= 0x0b90)
+ || (c >= 0x0b92 && c <= 0x0b95)
+ || (c >= 0x0b99 && c <= 0x0b9a)
+ || (c == 0x0b9c)
+ || (c >= 0x0b9e && c <= 0x0b9f)
+ || (c >= 0x0ba3 && c <= 0x0ba4)
+ || (c >= 0x0ba8 && c <= 0x0baa)
+ || (c >= 0x0bae && c <= 0x0bb5)
+ || (c >= 0x0bb7 && c <= 0x0bb9))
+ {
+ return 1;
+ }
+
+ /* Telugu */
+ if ((c >= 0x0c01 && c <= 0x0c03)
+ || (c >= 0x0c3e && c <= 0x0c44)
+ || (c >= 0x0c46 && c <= 0x0c48)
+ || (c >= 0x0c4a && c <= 0x0c4d))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0c05 && c <= 0x0c0c)
+ || (c >= 0x0c0e && c <= 0x0c10)
+ || (c >= 0x0c12 && c <= 0x0c28)
+ || (c >= 0x0c2a && c <= 0x0c33)
+ || (c >= 0x0c35 && c <= 0x0c39)
+ || (c >= 0x0c60 && c <= 0x0c61))
+ {
+ return 1;
+ }
+
+ /* Kannada */
+ if ((c >= 0x0c82 && c <= 0x0c83)
+ || (c >= 0x0cbe && c <= 0x0cc4)
+ || (c >= 0x0cc6 && c <= 0x0cc8)
+ || (c >= 0x0cca && c <= 0x0ccd)
+ || (c == 0x0cde))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0c85 && c <= 0x0c8c)
+ || (c >= 0x0c8e && c <= 0x0c90)
+ || (c >= 0x0c92 && c <= 0x0ca8)
+ || (c >= 0x0caa && c <= 0x0cb3)
+ || (c >= 0x0cb5 && c <= 0x0cb9)
+ || (c >= 0x0ce0 && c <= 0x0ce1))
+ {
+ return 1;
+ }
+
+ /* Malayalam */
+ if ((c >= 0x0d02 && c <= 0x0d03)
+ || (c >= 0x0d3e && c <= 0x0d43)
+ || (c >= 0x0d46 && c <= 0x0d48)
+ || (c >= 0x0d4a && c <= 0x0d4d))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0d05 && c <= 0x0d0c)
+ || (c >= 0x0d0e && c <= 0x0d10)
+ || (c >= 0x0d12 && c <= 0x0d28)
+ || (c >= 0x0d2a && c <= 0x0d39)
+ || (c >= 0x0d60 && c <= 0x0d61))
+ {
+ return 1;
+ }
+
+ /* Thai */
+ if ((c >= 0x0e34 && c <= 0x0e3a)
+ || (c >= 0x0e47 && c <= 0x0e4e))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0e01 && c <= 0x0e33)
+ || (c >= 0x0e40 && c <= 0x0e46)
+ || (c >= 0x0e4f && c <= 0x0e5b))
+ {
+ NOTIN_CXX98(0x0e31);
+ return 1;
+ }
+
+ /* Lao */
+ if ((c >= 0x0eb4 && c <= 0x0eb9)
+ || (c >= 0x0ebb && c <= 0x0ebc)
+ || (c >= 0x0ec8 && c <= 0x0ecc)
+ || (c >= 0x0edc && c <= 0x0edd))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x0e81 && c <= 0x0e82)
+ || (c == 0x0e84)
+ || (c == 0x0e87)
+ || (c == 0x0e88)
+ || (c == 0x0e8a)
+ || (c == 0x0e8d) /* C++ DR 131 */
+ || (c >= 0x0e94 && c <= 0x0e97)
+ || (c >= 0x0e99 && c <= 0x0e9f)
+ || (c >= 0x0ea1 && c <= 0x0ea3)
+ || (c == 0x0ea5)
+ || (c == 0x0ea7)
+ || (c == 0x0eaa)
+ || (c == 0x0eab)
+ || (c >= 0x0ead && c <= 0x0eb3)
+ || (c == 0x0ebd)
+ || (c >= 0x0ec0 && c <= 0x0ec4)
+ || (c == 0x0ec6))
+ {
+ NOTIN_C99(0x0eaf);
+ NOTIN_CXX98(0x0eb1);
+ return 1;
+ }
+
+ /* Tibetan */
+ if ((c == 0x0f00)
+ || (c >= 0x0f18 && c <= 0x0f19)
+ || (c == 0x0f35)
+ || (c == 0x0f37)
+ || (c == 0x0f39)
+ || (c >= 0x0f3e && c <= 0x0f47)
+ || (c >= 0x0f49 && c <= 0x0f69)
+ || (c >= 0x0f71 && c <= 0x0f84)
+ || (c >= 0x0f86 && c <= 0x0f8b)
+ || (c >= 0x0f90 && c <= 0x0f95)
+ || (c == 0x0f97)
+ || (c >= 0x0f99 && c <= 0x0fad)
+ || (c >= 0x0fb1 && c <= 0x0fb7)
+ || (c == 0x0fb9))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+
+ /* Georgian */
+ if ((c >= 0x10a0 && c <= 0x10c5)
+ || (c >= 0x10d0 && c <= 0x10f6))
+ {
+ return 1;
+ }
+
+ /* Hiragana */
+ if ((c >= 0x3041 && c <= 0x3094)
+ || (c >= 0x309b && c <= 0x309e))
+ {
+ NOTIN_C99(0x309d);
+ NOTIN_C99(0x309e);
+ return 1;
+ }
+
+ /* Katakana */
+ if ((c >= 0x30a1 && c <= 0x30fe))
+ {
+ if (c99
+ && ((c >= 0x30f7 && c <= 0x30fa)
+ || (c == 0x30fd)
+ || (c == 0x30fe)))
+ goto fail;
+ return 1;
+ }
+
+ /* Bopomofo */
+ if ((c >= 0x3105 && c <= 0x312c))
+ {
+ return 1;
+ }
+
+ /* Hangul */
+ if (c >= 0xac00 && c <= 0xd7a3)
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+ if ((c >= 0x1100 && c <= 0x1159)
+ || (c >= 0x1161 && c <= 0x11a2)
+ || (c >= 0x11a8 && c <= 0x11f9))
+ {
+ if (c99)
+ goto fail;
+ return 1;
+ }
+
+
+ /* CJK Unified Ideographs */
+ if (c >= 0x4e00 && c <= 0x9f45)
+ {
+ return 1;
+ }
+ if ((c >= 0xf900 && c <= 0xfa2d)
+ || (c >= 0xfb1f && c <= 0xfb36)
+ || (c >= 0xfb38 && c <= 0xfb3c)
+ || (c == 0xfb3e)
+ || (c >= 0xfb40 && c <= 0xfb41)
+ || (c >= 0xfb42 && c <= 0xfb44)
+ || (c >= 0xfb46 && c <= 0xfbb1)
+ || (c >= 0xfbd3 && c <= 0xfd3f)
+ || (c >= 0xfd50 && c <= 0xfd8f)
+ || (c >= 0xfd92 && c <= 0xfdc7)
+ || (c >= 0xfdf0 && c <= 0xfdfb)
+ || (c >= 0xfe70 && c <= 0xfe72)
+ || (c == 0xfe74)
+ || (c >= 0xfe76 && c <= 0xfefc)
+ || (c >= 0xff21 && c <= 0xff3a)
+ || (c >= 0xff41 && c <= 0xff5a)
+ || (c >= 0xff66 && c <= 0xffbe)
+ || (c >= 0xffc2 && c <= 0xffc7)
+ || (c >= 0xffca && c <= 0xffcf)
+ || (c >= 0xffd2 && c <= 0xffd7)
+ || (c >= 0xffda && c <= 0xffdc))
+ {
+ if (c99)
+ goto fail;
+ return 1;
+ }
+
+ /* Digits */
+ if((c >= 0x0660 && c <= 0x0669)
+ || (c >= 0x06f0 && c <= 0x06f9)
+ || (c >= 0x0966 && c <= 0x096f)
+ || (c >= 0x09e6 && c <= 0x09ef)
+ || (c >= 0x0a66 && c <= 0x0a6f)
+ || (c >= 0x0ae6 && c <= 0x0aef)
+ || (c >= 0x0b66 && c <= 0x0b6f)
+ || (c >= 0x0be7 && c <= 0x0bef)
+ || (c >= 0x0c66 && c <= 0x0c6f)
+ || (c >= 0x0ce6 && c <= 0x0cef)
+ || (c >= 0x0d66 && c <= 0x0d6f)
+ || (c >= 0x0e50 && c <= 0x0e59)
+ || (c >= 0x0ed0 && c <= 0x0ed9)
+ || (c >= 0x0f20 && c <= 0x0f33))
+ {
+ if (!allow_digits || cxx98)
+ goto fail;
+ return 1;
+ }
+
+ /* Special characters */
+ if ((c == 0x00b5)
+ || (c == 0x00b7)
+ || (c >= 0x02b0 && c <= 0x02b8)
+ || (c == 0x02bb)
+ || (c >= 0x02bd && c <= 0x02c1)
+ || (c >= 0x02d0 && c <= 0x02d1)
+ || (c >= 0x02e0 && c <= 0x02e4)
+ || (c == 0x037a)
+ || (c == 0x0559)
+ || (c == 0x093d)
+ || (c == 0x0b3d)
+ || (c == 0x1fbe)
+ || (c >= 0x203f && c <= 0x2040)
+ || (c == 0x2102)
+ || (c == 0x2107)
+ || (c >= 0x210a && c <= 0x2113)
+ || (c == 0x2115)
+ || (c >= 0x2118 && c <= 0x211d)
+ || (c == 0x2124)
+ || (c == 0x2126)
+ || (c == 0x2128)
+ || (c >= 0x212a && c <= 0x2131)
+ || (c >= 0x2133 && c <= 0x2138)
+ || (c >= 0x2160 && c <= 0x2182)
+ || (c >= 0x3005 && c <= 0x3007)
+ || (c >= 0x3021 && c <= 0x3029))
+ {
+ if (cxx98)
+ goto fail;
+ return 1;
+ }
+
+ fail:
+ cpp_error_with_line (pfile, DL_ERROR,
+ pfile->line, 1, /* XXX */
+ "universal-character-name '\\u%04x' not valid in identifier", c);
+ return 0;
+#endif
+}
+
+/* Add the UTF-8 representation of C to the obstack STACK. */
+
+static void
+utf8_extend_token (stack, c)
+ struct obstack *stack;
+ int c;
+{
+ int shift, mask;
+
+ if (c <= 0x0000007f)
+ {
+ obstack_1grow (stack, c);
+ return;
+ }
+ else if (c <= 0x000007ff)
+ shift = 6, mask = 0xc0;
+ else if (c <= 0x0000ffff)
+ shift = 12, mask = 0xe0;
+ else if (c <= 0x001fffff)
+ shift = 18, mask = 0xf0;
+ else if (c <= 0x03ffffff)
+ shift = 24, mask = 0xf8;
+ else
+ shift = 30, mask = 0xfc;
+
+ obstack_1grow (stack, mask | (c >> shift));
+ do
+ {
+ shift -= 6;
+ obstack_1grow (stack, (unsigned char) (0x80 | ((c >> shift) & 0x3f)));
+ }
+ while (shift);
+}
+
+/* Put the UCN form onto the obstack. */
+
+static void
+ucn_extend_token (stack, c)
+ struct obstack *stack;
+ int c;
+{
+ int len;
+ obstack_1grow (stack, '\\');
+ if (c < 0x10000)
+ {
+ obstack_1grow (stack, 'u');
+ len = 4;
+ }
+ else
+ {
+ obstack_1grow (stack, 'U');
+ len = 8;
+ }
+ while (len--)
+ {
+ int d = (c >> 4*len) & 0xF;
+ if (d < 10)
+ obstack_1grow (stack, '0' + d);
+ else
+ obstack_1grow (stack, 'a' + d - 10);
+ }
+}
+
+static cppchar_t
+utf8_to_char (pos)
+ const unsigned char **pos;
+{
+ cppchar_t result = 0;
+ const unsigned char *s = *pos;
+ if (*s < 128)
+ {
+ result = *s;
+ *pos += 1;
+ }
+ else if (*s < 0xc0)
+ {
+ /* Cannot occur as first byte */
+ abort();
+ }
+ else if (*s < 0xE0)
+ {
+ result = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
+ *pos += 2;
+ }
+ else if (*s < 0xF0)
+ {
+ result =
+ ((s[0] & 0xf) << 12) +
+ ((s[1] & 0x3f) << 6) +
+ (s[2] & 0x3f);
+ *pos += 3;
+ }
+ else if (*s < 0xF8)
+ {
+ result =
+ ((s[0] & 0x7) << 18) +
+ ((s[1] & 0x3f) << 12) +
+ ((s[2] & 0x3f) << 6) +
+ (s[3] & 0x3f);
+ *pos += 4;
+ }
+ else
+ {
+ /* Other codes are reserved. */
+ abort ();
+ }
+ return result;
+}
+
/* Skips any escaped newlines introduced by '?' or a '\\', assumed to
lie in buffer->cur[-1]. Returns the next byte, which will be in
buffer->cur[-1]. This routine performs preprocessing stages 1 and
@@ -451,11 +1124,19 @@ parse_identifier (pfile)
/* Check for slow-path cases. */
if (*cur == '?' || *cur == '\\' || *cur == '$')
{
- unsigned int len;
+ unsigned int len, utf8;
- base = parse_slow (pfile, cur, 0, &len);
+ base = parse_slow (pfile, cur, 0, &len, &utf8);
result = (cpp_hashnode *)
ht_lookup (pfile->hash_table, base, len, HT_ALLOCED);
+ if (utf8)
+ {
+ result->flags |= NODE_USES_EXTENDED_CHARACTERS;
+#ifndef HAVE_AS_UTF8
+ cpp_error (pfile, DL_ERROR,
+ "Non-ASCII identifiers not supported by your assembler");
+#endif
+ }
}
else
{
@@ -493,11 +1174,12 @@ parse_identifier (pfile)
pointer to the token's NUL-terminated spelling in permanent
storage, and sets PLEN to its length. */
static uchar *
-parse_slow (pfile, cur, number_p, plen)
+parse_slow (pfile, cur, number_p, plen, utf8)
cpp_reader *pfile;
const uchar *cur;
int number_p;
unsigned int *plen;
+ unsigned int *utf8;
{
cpp_buffer *buffer = pfile->buffer;
const uchar *base = buffer->cur - 1;
@@ -516,12 +1198,33 @@ parse_slow (pfile, cur, number_p, plen)
prevc = cur[-1];
c = *cur++;
buffer->cur = cur;
+ *utf8 = 0;
for (;;)
{
/* Potential escaped newline? */
buffer->backup_to = buffer->cur - 1;
if (c == '?' || c == '\\')
- c = skip_escaped_newlines (pfile);
+ c = skip_escaped_newlines (pfile);
+
+ if (c == '\\' && (*buffer->cur == 'u'
+ || *buffer->cur == 'U'))
+ {
+ cur = buffer->cur - 1;
+ c = *buffer->cur++;
+ if (maybe_read_ucs_reader (pfile, &c) == 0
+ && identifier_ucs_p (pfile, c, 1))
+ {
+ if (number_p)
+ ucn_extend_token (stack, c);
+ else
+ utf8_extend_token (stack, c);
+ c = *buffer->cur++;
+ *utf8 = 1;
+ continue;
+ }
+ buffer->cur = cur;
+ c = *buffer->cur++;
+ }
if (!is_idchar (c))
{
@@ -570,6 +1273,7 @@ parse_number (pfile, number, leading_per
int leading_period;
{
const uchar *cur;
+ unsigned int unused;
/* Fast-path loop. Skim over a normal number.
N.B. ISIDNUM does not include $. */
@@ -579,7 +1283,8 @@ parse_number (pfile, number, leading_per
/* Check for slow-path cases. */
if (*cur == '?' || *cur == '\\' || *cur == '$')
- number->text = parse_slow (pfile, cur, 1 + leading_period, &number->len);
+ number->text = parse_slow (pfile, cur, 1 + leading_period,
+ &number->len, &unused);
else
{
const uchar *base = pfile->buffer->cur - 1;
@@ -1025,7 +1730,24 @@ _cpp_lex_direct (pfile)
if (c == '?')
result->type = CPP_QUERY;
else if (c == '\\')
- goto random_char;
+ {
+ const unsigned char *pos = buffer->cur;
+
+ c = *buffer->cur++;
+ if ((c == 'u' || c == 'U')
+ && maybe_read_ucs_reader (pfile, &c) == 0
+ && identifier_ucs_p (pfile, c, 0))
+ {
+ buffer->cur = pos;
+ goto start_ident;
+ }
+ else
+ {
+ c = '\\';
+ buffer->cur = pos;
+ goto random_char;
+ }
+ }
else
goto trigraph;
break;
@@ -1402,8 +2124,35 @@ cpp_spell_token (pfile, token, buffer)
spell_ident:
case SPELL_IDENT:
- memcpy (buffer, NODE_NAME (token->val.node), NODE_LEN (token->val.node));
- buffer += NODE_LEN (token->val.node);
+ if ((token->val.node->flags & NODE_USES_EXTENDED_CHARACTERS) == 0)
+ {
+ memcpy (buffer, NODE_NAME (token->val.node),
+ NODE_LEN (token->val.node));
+ buffer += NODE_LEN (token->val.node);
+ }
+ else
+ {
+ const unsigned char *s = NODE_NAME (token->val.node);
+ int len = NODE_LEN (token->val.node);
+ while (len)
+ {
+ if (*s < 128)
+ {
+ *buffer++ = *s++;
+ len--;
+ }
+ else
+ {
+ const unsigned char *old = s;
+ cppchar_t code = utf8_to_char (&s);
+ if (code < 0x10000)
+ buffer += sprintf ((char*)buffer, "\\u%.4x", code);
+ else
+ buffer += sprintf ((char*)buffer, "\\U%.8x", code);
+ len -= s - old;
+ }
+ }
+ }
break;
case SPELL_NUMBER:
@@ -1503,7 +2252,32 @@ cpp_output_token (token, fp)
spell_ident:
case SPELL_IDENT:
- fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
+ if ((token->val.node->flags & NODE_USES_EXTENDED_CHARACTERS) == 0)
+ fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
+ else
+ {
+ const unsigned char *s = NODE_NAME (token->val.node);
+ int len = NODE_LEN (token->val.node);
+ while (len)
+ {
+ if (*s < 128)
+ {
+ fputc (*s, fp);
+ s++;
+ len--;
+ }
+ else
+ {
+ const unsigned char *old = s;
+ cppchar_t code = utf8_to_char (&s);
+ if (code < 0x10000)
+ fprintf (fp, "\\u%.4x", code);
+ else
+ fprintf (fp, "\\U%.8x", code);
+ len -= s - old;
+ }
+ }
+ }
break;
case SPELL_NUMBER:
@@ -1738,6 +2512,63 @@ maybe_read_ucs (pfile, pstr, limit, pc)
#endif
*pstr = p;
+ *pc = code;
+ return 0;
+}
+
+/* Like maybe_read_ucs, but always read the data from a parser. */
+
+static int
+maybe_read_ucs_reader (pfile, pc)
+ cpp_reader *pfile;
+ cppchar_t *pc;
+{
+ unsigned int code = 0;
+ cppchar_t c = *pc;
+ unsigned int length;
+
+ /* Only attempt to interpret a UCS for C++ and C99. */
+ if (! (CPP_OPTION (pfile, cplusplus) || CPP_OPTION (pfile, c99)))
+ return 1;
+
+ if (CPP_WTRADITIONAL (pfile))
+ cpp_error (pfile, DL_WARNING,
+ "the meaning of '\\%c' is different in traditional C", c);
+
+ length = (c == 'u' ? 4: 8);
+
+ for (; length; length--)
+ {
+ c = get_effective_char (pfile);
+ if (ISXDIGIT (c))
+ code = (code << 4) + hex_digit_value (c);
+ else
+ {
+ cpp_error (pfile, DL_ERROR,
+ "non-hex digit '%c' in universal-character-name", c);
+ /* We shouldn't skip in case there are multibyte chars. */
+ break;
+ }
+ }
+
+#ifdef TARGET_EBCDIC
+ cpp_error (pfile, DL_ERROR, "universal-character-name on EBCDIC target");
+ code = 0x3f; /* EBCDIC invalid character */
+#else
+ /* True extended characters are OK. */
+ if (code >= 0xa0
+ && !(code & 0x80000000)
+ && !(code >= 0xD800 && code <= 0xDFFF))
+ ;
+ /* The standard permits $, @ and ` to be specified as UCNs. We use
+ hex escapes so that this also works with EBCDIC hosts. */
+ else if (code == 0x24 || code == 0x40 || code == 0x60)
+ ;
+ /* Don't give another error if one occurred above. */
+ else if (length == 0)
+ cpp_error (pfile, DL_ERROR, "universal-character-name out of range");
+#endif
+
*pc = code;
return 0;
}
Index: cpplib.h
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplib.h,v
retrieving revision 1.237
diff -u -r1.237 cpplib.h
--- cpplib.h 26 Sep 2002 22:25:12 -0000 1.237
+++ cpplib.h 28 Nov 2002 22:50:15 -0000
@@ -443,6 +443,7 @@
#define NODE_DIAGNOSTIC (1 << 3) /* Possible diagnostic when lexed. */
#define NODE_WARN (1 << 4) /* Warn if redefined or undefined. */
#define NODE_DISABLED (1 << 5) /* A disabled macro. */
+#define NODE_USES_EXTENDED_CHARACTERS (1 << 6) /* Node has UTF-8 bytes in it */
/* Different flavors of hash node. */
enum node_type