Universal Character Names, v2

Martin v. Löwis martin@v.loewis.de
Thu Nov 28 15:43:00 GMT 2002


This is the second version of my UCN patch. It incorporates all
comments from the previous patch (AFAIR).

Specifically, the changes relative to the previous patch are:
- Update character sets for C99, and C++ DR 131.
- Support escaped newlines in the middle of a UCN. This is done
  through the addition of the maybe_read_ucs_reader function, which
  uses get_effective_char internally.
- Support UCNs in numbers. In the internal representation, such
  a number still has the UCN in it, i.e. no conversion to UTF-8
  takes place. Such numbers will only be valid if they are pasted
  with an identifier.
- Support pasting of names that have UCNs in them. For that,
  cpp_spell_token had to be updated.
- Check for assembler UTF-8 support, and reject UCNs if no such
  support is available. As a side effect, gcj will automatically
  use UTF-8 mangling where g++ supports UCNs.
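
As an illustration of the two spellings mentioned in the list above
(UTF-8 inside identifiers, the literal UCN inside numbers), here is a
minimal standalone sketch. It is not code from the patch: utf8_encode
and ucn_spell are hypothetical names, and they write to plain buffers
rather than to an obstack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the patch): store the UTF-8 form of
   code point C in OUT, which must hold at least 4 bytes.  Returns the
   number of bytes written.  */
static size_t
utf8_encode (unsigned long c, unsigned char *out)
{
  assert (c <= 0x10ffff);
  if (c <= 0x7f)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c <= 0x7ff)
    {
      out[0] = (unsigned char) (0xc0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3f));
      return 2;
    }
  if (c <= 0xffff)
    {
      out[0] = (unsigned char) (0xe0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (c & 0x3f));
      return 3;
    }
  out[0] = (unsigned char) (0xf0 | (c >> 18));
  out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3f));
  out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
  out[3] = (unsigned char) (0x80 | (c & 0x3f));
  return 4;
}

/* Hypothetical helper (not from the patch): store the UCN spelling of
   C, \uXXXX or \UXXXXXXXX, as a NUL-terminated string in OUT, which
   must hold at least 11 bytes.  Returns the spelling's length.  */
static size_t
ucn_spell (unsigned long c, char *out)
{
  assert (c <= 0x10ffff);
  return (size_t) sprintf (out, c < 0x10000 ? "\\u%04lx" : "\\U%08lx", c);
}
```

For U+00C0, utf8_encode yields the two bytes 0xc3 0x80 (the same bytes
the configure test feeds to the assembler), and ucn_spell yields the
six characters \u00c0.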

I have considered the following comments, but chose to take a
different approach:
- I have not put the test function for characters in libiberty.
  It is quite specific to C and C++, and only ever used in the
  preprocessor.
- I have decided not to deviate from the C and C++ standards for
  character tests. Reviewers commented that they dislike the approach
  taken by the standards committees, and that the relevant Unicode
  specification should be taken into account instead. I disagree, as I
  consider the approach of giving explicit lists quite reasonable.
  More importantly, I think that standards conformance should be
  valued quite highly unless specific user demands require ignoring
  or extending the standards; that is not the case for this
  specific issue.

A few issues need to be resolved with the Java compiler:
- Somehow, defining HAVE_AS_UTF8 (which the patch does) triggers
  bugs in the mangler; it will now emit symbols like
  
    _ZN4java4lang6Double8<clinit>Ev

- The Java mangler currently emits the number of characters for a
  UTF-8 <source-name>; the ABI specifies that this ought to be the
  byte length.

I'd appreciate it if some Java expert could help with resolving the
first issue; resolving the second one seems simple.
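
To make the second point concrete, here is a minimal sketch of the
difference between the two lengths, assuming the name is already UTF-8
encoded. utf8_char_count and mangle_source_name are hypothetical
helpers, not functions of the Java mangler; the latter only shows the
<length><identifier> shape of a <source-name> using the byte length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: number of characters (code points) in the
   UTF-8 string S.  Every byte that is not a continuation byte
   (10xxxxxx) starts a new character.  */
static size_t
utf8_char_count (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s; s++)
    if (((unsigned char) *s & 0xc0) != 0x80)
      n++;
  return n;
}

/* Hypothetical helper: write NAME prefixed with its length in bytes
   into OUT.  Returns the number of bytes written (excluding the NUL),
   or -1 if OUT is too small.  */
static int
mangle_source_name (const char *name, char *out, size_t outsize)
{
  int n;

  assert (name != NULL && out != NULL);
  n = snprintf (out, outsize, "%lu%s", (unsigned long) strlen (name), name);
  return (n < 0 || (size_t) n >= outsize) ? -1 : n;
}
```

For the three-byte name consisting of U+00C0 followed by 'b', the
character count is 2 while the byte length, and hence the length
prefix, is 3.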

Any comments appreciated,

Martin

2002-10-27  Martin v. Löwis  <loewis@informatik.hu-berlin.de>

	* c-lex.c (is_extended_char, utf8_extend_token): Remove.
	* cpplex.c (identifier_ucs_p, utf8_extend_token, 
	ucn_extend_token, utf8_to_char, maybe_read_ucs_reader): New functions.
	(parse_slow): Add utf8 parameter. Parse UCS names.
	(parse_identifier, parse_number): Adjust.
	(_cpp_lex_direct): Parse UCS names.
	(cpp_output_token): Print UCS names.
	(cpp_spell_token, cpp_output_token): Unparse extended characters.
	* cpplib.h (NODE_USES_EXTENDED_CHARACTERS): New flag.
	* configure.in (HAVE_AS_UTF8): New test.
	* configure, config.in: Rebuilt.

Index: c-lex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/c-lex.c,v
retrieving revision 1.190
diff -u -r1.190 c-lex.c
--- c-lex.c	16 Sep 2002 16:36:31 -0000	1.190
+++ c-lex.c	28 Nov 2002 22:50:07 -0000
@@ -356,314 +356,6 @@
 			 (const char *) NODE_NAME (node));
 }
 
-#if 0 /* not yet */
-/* Returns nonzero if C is a universal-character-name.  Give an error if it
-   is not one which may appear in an identifier, as per [extendid].
-
-   Note that extended character support in identifiers has not yet been
-   implemented.  It is my personal opinion that this is not a desirable
-   feature.  Portable code cannot count on support for more than the basic
-   identifier character set.  */
-
-static inline int
-is_extended_char (c)
-     int c;
-{
-#ifdef TARGET_EBCDIC
-  return 0;
-#else
-  /* ASCII.  */
-  if (c < 0x7f)
-    return 0;
-
-  /* None of the valid chars are outside the Basic Multilingual Plane (the
-     low 16 bits).  */
-  if (c > 0xffff)
-    {
-      error ("universal-character-name '\\U%08x' not valid in identifier", c);
-      return 1;
-    }
-  
-  /* Latin */
-  if ((c >= 0x00c0 && c <= 0x00d6)
-      || (c >= 0x00d8 && c <= 0x00f6)
-      || (c >= 0x00f8 && c <= 0x01f5)
-      || (c >= 0x01fa && c <= 0x0217)
-      || (c >= 0x0250 && c <= 0x02a8)
-      || (c >= 0x1e00 && c <= 0x1e9a)
-      || (c >= 0x1ea0 && c <= 0x1ef9))
-    return 1;
-
-  /* Greek */
-  if ((c == 0x0384)
-      || (c >= 0x0388 && c <= 0x038a)
-      || (c == 0x038c)
-      || (c >= 0x038e && c <= 0x03a1)
-      || (c >= 0x03a3 && c <= 0x03ce)
-      || (c >= 0x03d0 && c <= 0x03d6)
-      || (c == 0x03da)
-      || (c == 0x03dc)
-      || (c == 0x03de)
-      || (c == 0x03e0)
-      || (c >= 0x03e2 && c <= 0x03f3)
-      || (c >= 0x1f00 && c <= 0x1f15)
-      || (c >= 0x1f18 && c <= 0x1f1d)
-      || (c >= 0x1f20 && c <= 0x1f45)
-      || (c >= 0x1f48 && c <= 0x1f4d)
-      || (c >= 0x1f50 && c <= 0x1f57)
-      || (c == 0x1f59)
-      || (c == 0x1f5b)
-      || (c == 0x1f5d)
-      || (c >= 0x1f5f && c <= 0x1f7d)
-      || (c >= 0x1f80 && c <= 0x1fb4)
-      || (c >= 0x1fb6 && c <= 0x1fbc)
-      || (c >= 0x1fc2 && c <= 0x1fc4)
-      || (c >= 0x1fc6 && c <= 0x1fcc)
-      || (c >= 0x1fd0 && c <= 0x1fd3)
-      || (c >= 0x1fd6 && c <= 0x1fdb)
-      || (c >= 0x1fe0 && c <= 0x1fec)
-      || (c >= 0x1ff2 && c <= 0x1ff4)
-      || (c >= 0x1ff6 && c <= 0x1ffc))
-    return 1;
-
-  /* Cyrillic */
-  if ((c >= 0x0401 && c <= 0x040d)
-      || (c >= 0x040f && c <= 0x044f)
-      || (c >= 0x0451 && c <= 0x045c)
-      || (c >= 0x045e && c <= 0x0481)
-      || (c >= 0x0490 && c <= 0x04c4)
-      || (c >= 0x04c7 && c <= 0x04c8)
-      || (c >= 0x04cb && c <= 0x04cc)
-      || (c >= 0x04d0 && c <= 0x04eb)
-      || (c >= 0x04ee && c <= 0x04f5)
-      || (c >= 0x04f8 && c <= 0x04f9))
-    return 1;
-
-  /* Armenian */
-  if ((c >= 0x0531 && c <= 0x0556)
-      || (c >= 0x0561 && c <= 0x0587))
-    return 1;
-
-  /* Hebrew */
-  if ((c >= 0x05d0 && c <= 0x05ea)
-      || (c >= 0x05f0 && c <= 0x05f4))
-    return 1;
-
-  /* Arabic */
-  if ((c >= 0x0621 && c <= 0x063a)
-      || (c >= 0x0640 && c <= 0x0652)
-      || (c >= 0x0670 && c <= 0x06b7)
-      || (c >= 0x06ba && c <= 0x06be)
-      || (c >= 0x06c0 && c <= 0x06ce)
-      || (c >= 0x06e5 && c <= 0x06e7))
-    return 1;
-
-  /* Devanagari */
-  if ((c >= 0x0905 && c <= 0x0939)
-      || (c >= 0x0958 && c <= 0x0962))
-    return 1;
-
-  /* Bengali */
-  if ((c >= 0x0985 && c <= 0x098c)
-      || (c >= 0x098f && c <= 0x0990)
-      || (c >= 0x0993 && c <= 0x09a8)
-      || (c >= 0x09aa && c <= 0x09b0)
-      || (c == 0x09b2)
-      || (c >= 0x09b6 && c <= 0x09b9)
-      || (c >= 0x09dc && c <= 0x09dd)
-      || (c >= 0x09df && c <= 0x09e1)
-      || (c >= 0x09f0 && c <= 0x09f1))
-    return 1;
-
-  /* Gurmukhi */
-  if ((c >= 0x0a05 && c <= 0x0a0a)
-      || (c >= 0x0a0f && c <= 0x0a10)
-      || (c >= 0x0a13 && c <= 0x0a28)
-      || (c >= 0x0a2a && c <= 0x0a30)
-      || (c >= 0x0a32 && c <= 0x0a33)
-      || (c >= 0x0a35 && c <= 0x0a36)
-      || (c >= 0x0a38 && c <= 0x0a39)
-      || (c >= 0x0a59 && c <= 0x0a5c)
-      || (c == 0x0a5e))
-    return 1;
-
-  /* Gujarati */
-  if ((c >= 0x0a85 && c <= 0x0a8b)
-      || (c == 0x0a8d)
-      || (c >= 0x0a8f && c <= 0x0a91)
-      || (c >= 0x0a93 && c <= 0x0aa8)
-      || (c >= 0x0aaa && c <= 0x0ab0)
-      || (c >= 0x0ab2 && c <= 0x0ab3)
-      || (c >= 0x0ab5 && c <= 0x0ab9)
-      || (c == 0x0ae0))
-    return 1;
-
-  /* Oriya */
-  if ((c >= 0x0b05 && c <= 0x0b0c)
-      || (c >= 0x0b0f && c <= 0x0b10)
-      || (c >= 0x0b13 && c <= 0x0b28)
-      || (c >= 0x0b2a && c <= 0x0b30)
-      || (c >= 0x0b32 && c <= 0x0b33)
-      || (c >= 0x0b36 && c <= 0x0b39)
-      || (c >= 0x0b5c && c <= 0x0b5d)
-      || (c >= 0x0b5f && c <= 0x0b61))
-    return 1;
-
-  /* Tamil */
-  if ((c >= 0x0b85 && c <= 0x0b8a)
-      || (c >= 0x0b8e && c <= 0x0b90)
-      || (c >= 0x0b92 && c <= 0x0b95)
-      || (c >= 0x0b99 && c <= 0x0b9a)
-      || (c == 0x0b9c)
-      || (c >= 0x0b9e && c <= 0x0b9f)
-      || (c >= 0x0ba3 && c <= 0x0ba4)
-      || (c >= 0x0ba8 && c <= 0x0baa)
-      || (c >= 0x0bae && c <= 0x0bb5)
-      || (c >= 0x0bb7 && c <= 0x0bb9))
-    return 1;
-
-  /* Telugu */
-  if ((c >= 0x0c05 && c <= 0x0c0c)
-      || (c >= 0x0c0e && c <= 0x0c10)
-      || (c >= 0x0c12 && c <= 0x0c28)
-      || (c >= 0x0c2a && c <= 0x0c33)
-      || (c >= 0x0c35 && c <= 0x0c39)
-      || (c >= 0x0c60 && c <= 0x0c61))
-    return 1;
-
-  /* Kannada */
-  if ((c >= 0x0c85 && c <= 0x0c8c)
-      || (c >= 0x0c8e && c <= 0x0c90)
-      || (c >= 0x0c92 && c <= 0x0ca8)
-      || (c >= 0x0caa && c <= 0x0cb3)
-      || (c >= 0x0cb5 && c <= 0x0cb9)
-      || (c >= 0x0ce0 && c <= 0x0ce1))
-    return 1;
-
-  /* Malayalam */
-  if ((c >= 0x0d05 && c <= 0x0d0c)
-      || (c >= 0x0d0e && c <= 0x0d10)
-      || (c >= 0x0d12 && c <= 0x0d28)
-      || (c >= 0x0d2a && c <= 0x0d39)
-      || (c >= 0x0d60 && c <= 0x0d61))
-    return 1;
-
-  /* Thai */
-  if ((c >= 0x0e01 && c <= 0x0e30)
-      || (c >= 0x0e32 && c <= 0x0e33)
-      || (c >= 0x0e40 && c <= 0x0e46)
-      || (c >= 0x0e4f && c <= 0x0e5b))
-    return 1;
-
-  /* Lao */
-  if ((c >= 0x0e81 && c <= 0x0e82)
-      || (c == 0x0e84)
-      || (c == 0x0e87)
-      || (c == 0x0e88)
-      || (c == 0x0e8a)
-      || (c == 0x0e0d)
-      || (c >= 0x0e94 && c <= 0x0e97)
-      || (c >= 0x0e99 && c <= 0x0e9f)
-      || (c >= 0x0ea1 && c <= 0x0ea3)
-      || (c == 0x0ea5)
-      || (c == 0x0ea7)
-      || (c == 0x0eaa)
-      || (c == 0x0eab)
-      || (c >= 0x0ead && c <= 0x0eb0)
-      || (c == 0x0eb2)
-      || (c == 0x0eb3)
-      || (c == 0x0ebd)
-      || (c >= 0x0ec0 && c <= 0x0ec4)
-      || (c == 0x0ec6))
-    return 1;
-
-  /* Georgian */
-  if ((c >= 0x10a0 && c <= 0x10c5)
-      || (c >= 0x10d0 && c <= 0x10f6))
-    return 1;
-
-  /* Hiragana */
-  if ((c >= 0x3041 && c <= 0x3094)
-      || (c >= 0x309b && c <= 0x309e))
-    return 1;
-
-  /* Katakana */
-  if ((c >= 0x30a1 && c <= 0x30fe))
-    return 1;
-
-  /* Bopmofo */
-  if ((c >= 0x3105 && c <= 0x312c))
-    return 1;
-
-  /* Hangul */
-  if ((c >= 0x1100 && c <= 0x1159)
-      || (c >= 0x1161 && c <= 0x11a2)
-      || (c >= 0x11a8 && c <= 0x11f9))
-    return 1;
-
-  /* CJK Unified Ideographs */
-  if ((c >= 0xf900 && c <= 0xfa2d)
-      || (c >= 0xfb1f && c <= 0xfb36)
-      || (c >= 0xfb38 && c <= 0xfb3c)
-      || (c == 0xfb3e)
-      || (c >= 0xfb40 && c <= 0xfb41)
-      || (c >= 0xfb42 && c <= 0xfb44)
-      || (c >= 0xfb46 && c <= 0xfbb1)
-      || (c >= 0xfbd3 && c <= 0xfd3f)
-      || (c >= 0xfd50 && c <= 0xfd8f)
-      || (c >= 0xfd92 && c <= 0xfdc7)
-      || (c >= 0xfdf0 && c <= 0xfdfb)
-      || (c >= 0xfe70 && c <= 0xfe72)
-      || (c == 0xfe74)
-      || (c >= 0xfe76 && c <= 0xfefc)
-      || (c >= 0xff21 && c <= 0xff3a)
-      || (c >= 0xff41 && c <= 0xff5a)
-      || (c >= 0xff66 && c <= 0xffbe)
-      || (c >= 0xffc2 && c <= 0xffc7)
-      || (c >= 0xffca && c <= 0xffcf)
-      || (c >= 0xffd2 && c <= 0xffd7)
-      || (c >= 0xffda && c <= 0xffdc)
-      || (c >= 0x4e00 && c <= 0x9fa5))
-    return 1;
-
-  error ("universal-character-name '\\u%04x' not valid in identifier", c);
-  return 1;
-#endif
-}
-
-/* Add the UTF-8 representation of C to the token_buffer.  */
-
-static void
-utf8_extend_token (c)
-     int c;
-{
-  int shift, mask;
-
-  if      (c <= 0x0000007f)
-    {
-      extend_token (c);
-      return;
-    }
-  else if (c <= 0x000007ff)
-    shift = 6, mask = 0xc0;
-  else if (c <= 0x0000ffff)
-    shift = 12, mask = 0xe0;
-  else if (c <= 0x001fffff)
-    shift = 18, mask = 0xf0;
-  else if (c <= 0x03ffffff)
-    shift = 24, mask = 0xf8;
-  else
-    shift = 30, mask = 0xfc;
-
-  extend_token (mask | (c >> shift));
-  do
-    {
-      shift -= 6;
-      extend_token ((unsigned char) (0x80 | (c >> shift)));
-    }
-  while (shift);
-}
-#endif
 

 int
 c_lex (value)
Index: configure.in
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/configure.in,v
retrieving revision 1.626
diff -u -r1.626 configure.in
--- configure.in	26 Nov 2002 20:08:07 -0000	1.626
+++ configure.in	28 Nov 2002 22:50:13 -0000
@@ -1889,6 +1889,22 @@
 fi
 AC_MSG_RESULT($gcc_cv_as_tls)
 
+AC_MSG_CHECKING(assembler support for UTF-8 identifiers)
+gcc_cv_as_utf8="no"
+if test x$gcc_cv_as != x; then
+  echo fooab:|tr ab '\303\200' > conftest.s
+  if $gcc_cv_as --fatal-warnings -o conftest.o conftest.s > /dev/null 2>&1
+  then
+    gcc_cv_as_utf8=yes
+  fi
+  rm -rf conftest.s
+fi
+if test "$gcc_cv_as_utf8" = yes; then
+  AC_DEFINE(HAVE_AS_UTF8, 1,
+            [Define if your assembler supports UTF-8 bytes in identifiers])
+fi
+AC_MSG_RESULT($gcc_cv_as_utf8)
+
 case "$target" in
   # All TARGET_ABI_OSF targets.
   alpha*-*-osf* | alpha*-*-linux* | alpha*-*-*bsd*)
Index: cpplex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplex.c,v
retrieving revision 1.215
diff -u -p -r1.215 cpplex.c
--- cpplex.c	26 Sep 2002 22:25:12 -0000	1.215
+++ cpplex.c	28 Nov 2002 23:04:54 -0000
@@ -71,7 +71,7 @@ static void adjust_column PARAMS ((cpp_r
 static int skip_whitespace PARAMS ((cpp_reader *, cppchar_t));
 static cpp_hashnode *parse_identifier PARAMS ((cpp_reader *));
 static uchar *parse_slow PARAMS ((cpp_reader *, const uchar *, int,
-				  unsigned int *));
+				  unsigned int *, unsigned int *));
 static void parse_number PARAMS ((cpp_reader *, cpp_string *, int));
 static int unescaped_terminator_p PARAMS ((cpp_reader *, const uchar *));
 static void parse_string PARAMS ((cpp_reader *, cpp_token *, cppchar_t));
@@ -82,10 +82,16 @@ static bool continue_after_nul PARAMS ((
 static int name_p PARAMS ((cpp_reader *, const cpp_string *));
 static int maybe_read_ucs PARAMS ((cpp_reader *, const unsigned char **,
 				   const unsigned char *, cppchar_t *));
+static int maybe_read_ucs_reader PARAMS ((cpp_reader *, cppchar_t *));
 static tokenrun *next_tokenrun PARAMS ((tokenrun *));
 
 static unsigned int hex_digit_value PARAMS ((unsigned int));
 static _cpp_buff *new_buff PARAMS ((size_t));
+static bool identifier_ucs_p PARAMS ((cpp_reader *, cppchar_t, int));
+static void utf8_extend_token PARAMS ((struct obstack *, int));
+static void ucn_extend_token PARAMS ((struct obstack *, int));
+static cppchar_t utf8_to_char PARAMS((const unsigned char **));
+
 
 /* Utility routine:
 
@@ -161,6 +167,673 @@ trigraph_p (pfile)
   return accept;
 }
 
+/* Returns nonzero if C is a universal-character-name.  Give an error
+   if it is not one which may appear in an identifier, as per C++98
+   Annex E [extendid], and C99 Annex D.  */
+
+static bool
+identifier_ucs_p (pfile, c, allow_digits)
+     cpp_reader *pfile;
+     cppchar_t c;
+     int allow_digits;
+{
+#ifdef TARGET_EBCDIC
+  return 0;
+#else
+  int cxx98 = CPP_OPTION (pfile, cplusplus);
+  int c99 = CPP_OPTION (pfile, c99);
+
+  /* ASCII.  */
+  if (c < 0x7f)
+    return 0;
+
+  /* None of the valid chars are outside the Basic Multilingual Plane (the
+     low 16 bits).  */
+  if (c > 0xffff)
+    {
+      cpp_error_with_line (pfile, DL_ERROR,
+                           pfile->line, 1, /* XXX */
+                           "universal-character-name '\\U%08x' not valid in identifier", (int)c);
+      return 0;
+    }
+
+#define NOTIN_C99(code) if(c==code && c99) goto fail
+#define NOTIN_CXX98(code) if(c==code && cxx98) goto fail
+  
+  /* Latin */
+  if ((c == 0x00aa)
+      || (c == 0x00ba)
+      || (c >= 0x00c0 && c <= 0x00d6)
+      || (c >= 0x00d8 && c <= 0x00f6)
+      || (c >= 0x00f8 && c <= 0x01f5)
+      || (c >= 0x01fa && c <= 0x0217)
+      || (c >= 0x0250 && c <= 0x02a8)
+      || (c >= 0x1e00 && c <= 0x1e9b)
+      || (c >= 0x1ea0 && c <= 0x1ef9)
+      || (c == 0x207F))
+    {
+      NOTIN_CXX98(0x00aa);
+      NOTIN_CXX98(0x00ba);
+      NOTIN_CXX98(0x1e9b);
+      NOTIN_CXX98(0x207f);
+      return 1;
+    }
+
+  /* Greek */
+  if ((c == 0x0384)
+      || (c >= 0x0388 && c <= 0x038a)
+      || (c == 0x038c)
+      || (c >= 0x038e && c <= 0x03a1)
+      || (c >= 0x03a3 && c <= 0x03ce)
+      || (c >= 0x03d0 && c <= 0x03d6)
+      || (c == 0x03da)
+      || (c == 0x03dc)
+      || (c == 0x03de)
+      || (c == 0x03e0)
+      || (c >= 0x03e2 && c <= 0x03f3)
+      || (c >= 0x1f00 && c <= 0x1f15)
+      || (c >= 0x1f18 && c <= 0x1f1d)
+      || (c >= 0x1f20 && c <= 0x1f45)
+      || (c >= 0x1f48 && c <= 0x1f4d)
+      || (c >= 0x1f50 && c <= 0x1f57)
+      || (c == 0x1f59)
+      || (c == 0x1f5b)
+      || (c == 0x1f5d)
+      || (c >= 0x1f5f && c <= 0x1f7d)
+      || (c >= 0x1f80 && c <= 0x1fb4)
+      || (c >= 0x1fb6 && c <= 0x1fbc)
+      || (c >= 0x1fc2 && c <= 0x1fc4)
+      || (c >= 0x1fc6 && c <= 0x1fcc)
+      || (c >= 0x1fd0 && c <= 0x1fd3)
+      || (c >= 0x1fd6 && c <= 0x1fdb)
+      || (c >= 0x1fe0 && c <= 0x1fec)
+      || (c >= 0x1ff2 && c <= 0x1ff4)
+      || (c >= 0x1ff6 && c <= 0x1ffc))
+    {
+      NOTIN_C99(0x0384);
+      return 1;
+    }
+
+  /* Cyrillic */
+  if ((c >= 0x0401 && c <= 0x044f)
+      || (c >= 0x0451 && c <= 0x045c)
+      || (c >= 0x045e && c <= 0x0481)
+      || (c >= 0x0490 && c <= 0x04c4)
+      || (c >= 0x04c7 && c <= 0x04c8)
+      || (c >= 0x04cb && c <= 0x04cc)
+      || (c >= 0x04d0 && c <= 0x04eb)
+      || (c >= 0x04ee && c <= 0x04f5)
+      || (c >= 0x04f8 && c <= 0x04f9))
+    {
+      NOTIN_C99(0x040d);
+      NOTIN_CXX98(0x040e);
+      return 1;
+    }
+
+  /* Armenian */
+  if ((c >= 0x0531 && c <= 0x0556)
+      || (c >= 0x0561 && c <= 0x0587))
+    {
+      return 1;
+    }
+
+  /* Hebrew */
+  if ((c >= 0x05B0 && c <= 0x05B9)
+      || (c >= 0x05BB && c <= 0x05BD)
+      || (c == 0x05BF)
+      || (c >= 0x05C1 && c <= 0x05C2))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x05d0 && c <= 0x05ea)
+      || (c >= 0x05f0 && c <= 0x05f4))
+    {
+      NOTIN_C99(0x05f3);
+      NOTIN_C99(0x05f4);
+      return 1;
+    }
+
+  /* Arabic */
+  if ((c >= 0x06d0 && c <= 0x06dc)
+      || (c >= 0x06ea && c <= 0x06ed))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0621 && c <= 0x063a)
+      || (c >= 0x0640 && c <= 0x0652)
+      || (c >= 0x0670 && c <= 0x06b7)
+      || (c >= 0x06ba && c <= 0x06be)
+      || (c >= 0x06c0 && c <= 0x06ce)
+      || (c >= 0x06e5 && c <= 0x06e8))
+    {
+      NOTIN_CXX98(0x06e8);
+      return 1;
+    }
+
+  /* Devanagari */
+  if ((c >= 0x0901 && c <= 0x0903)
+      || (c >= 0x093e && c <= 0x094d)
+      || (c >= 0x0950 && c <= 0x0952))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0905 && c <= 0x0939)
+      || (c >= 0x0958 && c <= 0x0963))
+    {
+      NOTIN_CXX98(0x0963);
+      return 1;
+    }
+
+  /* Bengali */
+  if ((c >= 0x0981 && c <= 0x0983)
+      || (c >= 0x09be && c <= 0x09c4)
+      || (c >= 0x09c7 && c <= 0x09c8)
+      || (c >= 0x09cb && c <= 0x09cd))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0985 && c <= 0x098c)
+      || (c >= 0x098f && c <= 0x0990)
+      || (c >= 0x0993 && c <= 0x09a8)
+      || (c >= 0x09aa && c <= 0x09b0)
+      || (c == 0x09b2)
+      || (c >= 0x09b6 && c <= 0x09b9)
+      || (c >= 0x09dc && c <= 0x09dd)
+      || (c >= 0x09df && c <= 0x09e3)
+      || (c >= 0x09f0 && c <= 0x09f1))
+    {
+      NOTIN_CXX98(0x09e2);
+      NOTIN_CXX98(0x09e3);
+      return 1;
+    }
+
+  /* Gurmukhi */
+  if ((c == 0x0a02)
+      || (c >= 0x0a3e && c <= 0x0a42)
+      || (c >= 0x0a47 && c <= 0x0a48)
+      || (c >= 0x0a4b && c <= 0x0a4d)
+      || (c == 0x0a74))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0a05 && c <= 0x0a0a)
+      || (c >= 0x0a0f && c <= 0x0a10)
+      || (c >= 0x0a13 && c <= 0x0a28)
+      || (c >= 0x0a2a && c <= 0x0a30)
+      || (c >= 0x0a32 && c <= 0x0a33)
+      || (c >= 0x0a35 && c <= 0x0a36)
+      || (c >= 0x0a38 && c <= 0x0a39)
+      || (c >= 0x0a59 && c <= 0x0a5c)
+      || (c == 0x0a5e))
+    {
+      return 1;
+    }
+
+  /* Gujarati */
+  if ((c == 0x0a02)
+      || (c >= 0x0a81 && c <= 0x0a81)
+      || (c >= 0x0abd && c <= 0x0ac5)
+      || (c >= 0x0ac7 && c <= 0x0ac9)
+      || (c >= 0x0acb && c <= 0x0acd)
+      || (c == 0x0ad0))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0a85 && c <= 0x0a8b)
+      || (c == 0x0a8d)
+      || (c >= 0x0a8f && c <= 0x0a91)
+      || (c >= 0x0a93 && c <= 0x0aa8)
+      || (c >= 0x0aaa && c <= 0x0ab0)
+      || (c >= 0x0ab2 && c <= 0x0ab3)
+      || (c >= 0x0ab5 && c <= 0x0ab9)
+      || (c == 0x0ad0)
+      || (c == 0x0ae0))
+    {
+      NOTIN_CXX98(0x0ad0);
+      return 1;
+    }
+
+  /* Oriya */
+  if ((c >= 0x0b01 && c <= 0x0b03)
+      || (c >= 0x0b3e && c <= 0x0b43)
+      || (c >= 0x0b47 && c <= 0x0b48)
+      || (c >= 0x0b4b && c <= 0x0b4d))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0b05 && c <= 0x0b0c)
+      || (c >= 0x0b0f && c <= 0x0b10)
+      || (c >= 0x0b13 && c <= 0x0b28)
+      || (c >= 0x0b2a && c <= 0x0b30)
+      || (c >= 0x0b32 && c <= 0x0b33)
+      || (c >= 0x0b36 && c <= 0x0b39)
+      || (c >= 0x0b5c && c <= 0x0b5d)
+      || (c >= 0x0b5f && c <= 0x0b61))
+    {
+      return 1;
+    }
+
+  /* Tamil */
+  if ((c >= 0x0b82 && c <= 0x0b83)
+      || (c >= 0x0bbe && c <= 0x0bc2)
+      || (c >= 0x0bc6 && c <= 0x0bc8)
+      || (c >= 0x0bca && c <= 0x0bcd))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0b85 && c <= 0x0b8a)
+      || (c >= 0x0b8e && c <= 0x0b90)
+      || (c >= 0x0b92 && c <= 0x0b95)
+      || (c >= 0x0b99 && c <= 0x0b9a)
+      || (c == 0x0b9c)
+      || (c >= 0x0b9e && c <= 0x0b9f)
+      || (c >= 0x0ba3 && c <= 0x0ba4)
+      || (c >= 0x0ba8 && c <= 0x0baa)
+      || (c >= 0x0bae && c <= 0x0bb5)
+      || (c >= 0x0bb7 && c <= 0x0bb9))
+    {
+      return 1;
+    }
+
+  /* Telugu */
+  if ((c >= 0x0c01 && c <= 0x0c03)
+      || (c >= 0x0c3e && c <= 0x0c44)
+      || (c >= 0x0c46 && c <= 0x0c48)
+      || (c >= 0x0c4a && c <= 0x0c4d))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0c05 && c <= 0x0c0c)
+      || (c >= 0x0c0e && c <= 0x0c10)
+      || (c >= 0x0c12 && c <= 0x0c28)
+      || (c >= 0x0c2a && c <= 0x0c33)
+      || (c >= 0x0c35 && c <= 0x0c39)
+      || (c >= 0x0c60 && c <= 0x0c61))
+    {
+      return 1;
+    }
+
+  /* Kannada */
+  if ((c >= 0x0c82 && c <= 0x0c83)
+      || (c >= 0x0cbe && c <= 0x0cc4)
+      || (c >= 0x0cc6 && c <= 0x0cc8)
+      || (c >= 0x0cca && c <= 0x0ccd)
+      || (c == 0x0cde))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0c85 && c <= 0x0c8c)
+      || (c >= 0x0c8e && c <= 0x0c90)
+      || (c >= 0x0c92 && c <= 0x0ca8)
+      || (c >= 0x0caa && c <= 0x0cb3)
+      || (c >= 0x0cb5 && c <= 0x0cb9)
+      || (c >= 0x0ce0 && c <= 0x0ce1))
+    {
+      return 1;
+    }
+
+  /* Malayalam */
+  if ((c >= 0x0d02 && c <= 0x0d03)
+      || (c >= 0x0d3e && c <= 0x0d43)
+      || (c >= 0x0d46 && c <= 0x0d48)
+      || (c >= 0x0d4a && c <= 0x0d4d))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0d05 && c <= 0x0d0c)
+      || (c >= 0x0d0e && c <= 0x0d10)
+      || (c >= 0x0d12 && c <= 0x0d28)
+      || (c >= 0x0d2a && c <= 0x0d39)
+      || (c >= 0x0d60 && c <= 0x0d61))
+    {
+      return 1;
+    }
+
+  /* Thai */
+  if ((c >= 0x0e34 && c <= 0x0e3a)
+      || (c >= 0x0e47 && c <= 0x0e4e))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0e01 && c <= 0x0e33)
+      || (c >= 0x0e40 && c <= 0x0e46)
+      || (c >= 0x0e4f && c <= 0x0e5b))
+    {
+      NOTIN_CXX98(0x0e31);
+      return 1;
+    }
+
+  /* Lao */
+  if ((c >= 0x0eb4 && c <= 0x0eb9)
+      || (c >= 0x0ebb && c <= 0x0ebc)
+      || (c >= 0x0ec8 && c <= 0x0ecc)
+      || (c >= 0x0edc && c <= 0x0edd))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0e81 && c <= 0x0e82)
+      || (c == 0x0e84)
+      || (c == 0x0e87)
+      || (c == 0x0e88)
+      || (c == 0x0e8a)
+      || (c == 0x0e8d) /* C++ DR 131 */
+      || (c >= 0x0e94 && c <= 0x0e97)
+      || (c >= 0x0e99 && c <= 0x0e9f)
+      || (c >= 0x0ea1 && c <= 0x0ea3)
+      || (c == 0x0ea5)
+      || (c == 0x0ea7)
+      || (c == 0x0eaa)
+      || (c == 0x0eab)
+      || (c >= 0x0ead && c <= 0x0eb3)
+      || (c == 0x0ebd)
+      || (c >= 0x0ec0 && c <= 0x0ec4)
+      || (c == 0x0ec6))
+    {
+      NOTIN_C99(0x0eaf);
+      NOTIN_CXX98(0x0eb1);
+      return 1;
+    }
+
+  /* Tibetan */
+  if ((c == 0x0f00)
+      || (c >= 0x0f18 && c <= 0x0f19)
+      || (c == 0x0f35)
+      || (c == 0x0f37)
+      || (c == 0x0f39)
+      || (c >= 0x0f3e && c <= 0x0f47)
+      || (c >= 0x0f49 && c <= 0x0f69)
+      || (c >= 0x0f71 && c <= 0x0f84)
+      || (c >= 0x0f86 && c <= 0x0f8b)
+      || (c >= 0x0f90 && c <= 0x0f95)
+      || (c == 0x0f97)
+      || (c >= 0x0f99 && c <= 0x0fad)
+      || (c >= 0x0fb1 && c <= 0x0fb7)
+      || (c == 0x0fb9))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+
+  /* Georgian */
+  if ((c >= 0x10a0 && c <= 0x10c5)
+      || (c >= 0x10d0 && c <= 0x10f6))
+    {
+      return 1;
+    }
+
+  /* Hiragana */
+  if ((c >= 0x3041 && c <= 0x3094)
+      || (c >= 0x309b && c <= 0x309e))
+    {
+      NOTIN_C99(0x309d);
+      NOTIN_C99(0x309e);
+      return 1;
+    }
+
+  /* Katakana */
+  if ((c >= 0x30a1 && c <= 0x30fe))
+    {
+      if (c99
+	  && ((c >= 0x30f7 && c <= 0x30fa)
+	      || (c == 0x30fd)
+	      || (c == 0x30fe)))
+	  goto fail;
+      return 1;
+    }
+
+  /* Bopomofo */
+  if ((c >= 0x3105 && c <= 0x312c))
+    {
+      return 1;
+    }
+
+  /* Hangul */
+  if (c >= 0xac00 && c <= 0xd7a3)
+    {
+      if (cxx98)
+	goto fail;
+      return 1;
+    }
+  if ((c >= 0x1100 && c <= 0x1159)
+      || (c >= 0x1161 && c <= 0x11a2)
+      || (c >= 0x11a8 && c <= 0x11f9))
+    {
+      if (c99)
+	goto fail;
+      return 1;
+    }
+
+
+  /* CJK Unified Ideographs */
+  if (c >= 0x4e00 && c <= 0x9f45)
+    {
+      return 1;
+    }
+  if ((c >= 0xf900 && c <= 0xfa2d)
+      || (c >= 0xfb1f && c <= 0xfb36)
+      || (c >= 0xfb38 && c <= 0xfb3c)
+      || (c == 0xfb3e)
+      || (c >= 0xfb40 && c <= 0xfb41)
+      || (c >= 0xfb42 && c <= 0xfb44)
+      || (c >= 0xfb46 && c <= 0xfbb1)
+      || (c >= 0xfbd3 && c <= 0xfd3f)
+      || (c >= 0xfd50 && c <= 0xfd8f)
+      || (c >= 0xfd92 && c <= 0xfdc7)
+      || (c >= 0xfdf0 && c <= 0xfdfb)
+      || (c >= 0xfe70 && c <= 0xfe72)
+      || (c == 0xfe74)
+      || (c >= 0xfe76 && c <= 0xfefc)
+      || (c >= 0xff21 && c <= 0xff3a)
+      || (c >= 0xff41 && c <= 0xff5a)
+      || (c >= 0xff66 && c <= 0xffbe)
+      || (c >= 0xffc2 && c <= 0xffc7)
+      || (c >= 0xffca && c <= 0xffcf)
+      || (c >= 0xffd2 && c <= 0xffd7)
+      || (c >= 0xffda && c <= 0xffdc))
+    {
+      if (c99)
+	goto fail;
+      return 1;
+    }
+
+  /* Digits */
+  if((c >= 0x0660 && c <= 0x0669)
+     || (c >= 0x06f0 && c <= 0x06f9)
+     || (c >= 0x0966 && c <= 0x096f)
+     || (c >= 0x09e6 && c <= 0x09ef)
+     || (c >= 0x0a66 && c <= 0x0a6f)
+     || (c >= 0x0ae6 && c <= 0x0aef)
+     || (c >= 0x0b66 && c <= 0x0b6f)
+     || (c >= 0x0be7 && c <= 0x0bef)
+     || (c >= 0x0c66 && c <= 0x0c6f)
+     || (c >= 0x0ce6 && c <= 0x0cef)
+     || (c >= 0x0d66 && c <= 0x0d6f)
+     || (c >= 0x0e50 && c <= 0x0e59)
+     || (c >= 0x0ed0 && c <= 0x0ed9)
+     || (c >= 0x0f20 && c <= 0x0f33))
+    {
+      if (!allow_digits || cxx98)
+	goto fail;
+      return 1;
+    }
+
+  /* Special characters */
+  if ((c == 0x00b5)
+      || (c == 0x00b7)
+      || (c >= 0x02b0 && c <= 0x02b8)
+      || (c == 0x02bb)
+      || (c >= 0x02bd && c <= 0x02c1)
+      || (c >= 0x02d0 && c <= 0x02d1)
+      || (c >= 0x02e0 && c <= 0x02e4)
+      || (c == 0x037a)
+      || (c == 0x0559)
+      || (c == 0x093d)
+      || (c == 0x0b3d)
+      || (c == 0x1fbe)
+      || (c >= 0x203f && c <= 0x2040)
+      || (c == 0x2102)
+      || (c == 0x2107)
+      || (c >= 0x210a && c <= 0x2113)
+      || (c == 0x2115)
+      || (c >= 0x2118 && c <= 0x211d)
+      || (c == 0x2124)
+      || (c == 0x2126)
+      || (c == 0x2128)
+      || (c >= 0x212a && c <= 0x2131)
+      || (c >= 0x2133 && c <= 0x2138)
+      || (c >= 0x2160 && c <= 0x2182)
+      || (c >= 0x3005 && c <= 0x3007)
+      || (c >= 0x3021 && c <= 0x3029))
+    {
+      if (cxx98)
+	goto fail;
+      return 1;
+    }
+
+    fail:
+  cpp_error_with_line (pfile, DL_ERROR,
+                       pfile->line, 1, /* XXX */
+                       "universal-character-name '\\u%04x' not valid in identifier", (int)c);
+  return 0;
+#endif
+}
+
+/* Add the UTF-8 representation of C to the obstack STACK.  */
+
+static void
+utf8_extend_token (stack, c)
+     struct obstack *stack;
+     int c;
+{
+  int shift, mask;
+
+  if      (c <= 0x0000007f)
+    {
+      obstack_1grow (stack, c);
+      return;
+    }
+  else if (c <= 0x000007ff)
+    shift = 6, mask = 0xc0;
+  else if (c <= 0x0000ffff)
+    shift = 12, mask = 0xe0;
+  else if (c <= 0x001fffff)
+    shift = 18, mask = 0xf0;
+  else if (c <= 0x03ffffff)
+    shift = 24, mask = 0xf8;
+  else
+    shift = 30, mask = 0xfc;
+
+  obstack_1grow (stack, mask | (c >> shift));
+  do
+    {
+      shift -= 6;
+      obstack_1grow (stack, (unsigned char) (0x80 | ((c >> shift) & 0x3f)));
+    }
+  while (shift);
+}
+
+/* Put the UCN form onto the obstack. */
+
+static void
+ucn_extend_token (stack, c)
+     struct obstack *stack;
+     int c;
+{
+  int len;
+  obstack_1grow (stack, '\\');
+  if (c < 0x10000)
+    {
+      obstack_1grow (stack, 'u');
+      len = 4;
+    }
+  else
+    {
+      obstack_1grow (stack, 'U');
+      len = 8;
+    }
+  while (len--)
+    {
+      int d = (c >> 4*len) & 0xF;
+      if (d < 10)
+	obstack_1grow (stack, '0' + d);
+      else
+	obstack_1grow (stack, 'a' + d - 10);
+    }
+}
+
+static cppchar_t
+utf8_to_char (pos)
+     const unsigned char **pos;
+{
+  cppchar_t result = 0;
+  const unsigned char *s = *pos;
+  if (*s < 128)
+    {
+      result = *s;
+      *pos += 1;
+    }
+  else if (*s < 0xc0)
+    {
+      /* Cannot occur as first byte */
+      abort();
+    }
+  else if (*s < 0xE0)
+    {
+      result = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
+      *pos += 2;
+    }
+  else if (*s < 0xF0)
+    {
+      result =
+        ((s[0] & 0xf) << 12) +
+        ((s[1] & 0x3f) << 6) +
+        (s[2] & 0x3f);
+      *pos += 3;
+    }
+  else if (*s < 0xF8)
+    {
+      result =
+        ((s[0] & 0x7) << 18) +
+        ((s[1] & 0x3f) << 12) +
+        ((s[2] & 0x3f) << 6) +
+        (s[3] & 0x3f);
+      *pos += 4;
+    }
+  else
+    {
+      /* Other codes are reserved. */
+      abort ();
+    }
+  return result;
+}
+
 /* Skips any escaped newlines introduced by '?' or a '\\', assumed to
    lie in buffer->cur[-1].  Returns the next byte, which will be in
    buffer->cur[-1].  This routine performs preprocessing stages 1 and
@@ -451,11 +1124,19 @@ parse_identifier (pfile)
   /* Check for slow-path cases.  */
   if (*cur == '?' || *cur == '\\' || *cur == '$')
     {
-      unsigned int len;
+      unsigned int len, utf8;
 
-      base = parse_slow (pfile, cur, 0, &len);
+      base = parse_slow (pfile, cur, 0, &len, &utf8);
       result = (cpp_hashnode *)
 	ht_lookup (pfile->hash_table, base, len, HT_ALLOCED);
+      if (utf8)
+	{
+	  result->flags |= NODE_USES_EXTENDED_CHARACTERS;
+#ifndef HAVE_AS_UTF8
+	  cpp_error (pfile, DL_ERROR, 
+		     "Non-ASCII identifiers not supported by your assembler");
+#endif
+	}
     }
   else
     {
@@ -493,11 +1174,12 @@ parse_identifier (pfile)
    pointer to the token's NUL-terminated spelling in permanent
    storage, and sets PLEN to its length.  */
 static uchar *
-parse_slow (pfile, cur, number_p, plen)
+parse_slow (pfile, cur, number_p, plen, utf8)
      cpp_reader *pfile;
      const uchar *cur;
      int number_p;
      unsigned int *plen;
+     unsigned int *utf8;
 {
   cpp_buffer *buffer = pfile->buffer;
   const uchar *base = buffer->cur - 1;
@@ -516,12 +1198,33 @@ parse_slow (pfile, cur, number_p, plen)
   prevc = cur[-1];
   c = *cur++;
   buffer->cur = cur;
+  *utf8 = 0;
   for (;;)
     {
       /* Potential escaped newline?  */
       buffer->backup_to = buffer->cur - 1;
       if (c == '?' || c == '\\')
-	c = skip_escaped_newlines (pfile);
+	  c = skip_escaped_newlines (pfile);
+
+      if (c == '\\' && (*buffer->cur == 'u'
+                        || *buffer->cur == 'U'))
+        {
+          cur = buffer->cur - 1;
+          c = *buffer->cur++;
+          if (maybe_read_ucs_reader (pfile, &c) == 0
+              && identifier_ucs_p (pfile, c, 1))
+            {
+	      if (number_p)
+		ucn_extend_token (stack, c);
+	      else
+		utf8_extend_token (stack, c);
+              c = *buffer->cur++;
+              *utf8 = 1;
+              continue;
+            }
+          buffer->cur = cur;
+          c = *buffer->cur++;
+        }
 
       if (!is_idchar (c))
 	{
@@ -570,6 +1273,7 @@ parse_number (pfile, number, leading_per
      int leading_period;
 {
   const uchar *cur;
+  unsigned int unused;
 
   /* Fast-path loop.  Skim over a normal number.
      N.B. ISIDNUM does not include $.  */
@@ -579,7 +1283,8 @@ parse_number (pfile, number, leading_per
 
   /* Check for slow-path cases.  */
   if (*cur == '?' || *cur == '\\' || *cur == '$')
-    number->text = parse_slow (pfile, cur, 1 + leading_period, &number->len);
+    number->text = parse_slow (pfile, cur, 1 + leading_period,
+			       &number->len, &unused);
   else
     {
       const uchar *base = pfile->buffer->cur - 1;
@@ -1025,7 +1730,24 @@ _cpp_lex_direct (pfile)
       if (c == '?')
 	result->type = CPP_QUERY;
       else if (c == '\\')
-	goto random_char;
+        {
+          const unsigned char *pos = buffer->cur;
+          
+          c = *buffer->cur++;
+          if ((c == 'u' || c == 'U')
+              && maybe_read_ucs_reader (pfile, &c) == 0
+              && identifier_ucs_p (pfile, c, 0))
+            {
+              buffer->cur = pos;
+              goto start_ident;
+            }
+          else
+            {
+              c = '\\';
+              buffer->cur = pos;
+              goto random_char;
+            }
+        }
       else
 	goto trigraph;
       break;
@@ -1402,8 +2124,35 @@ cpp_spell_token (pfile, token, buffer)
 
     spell_ident:
     case SPELL_IDENT:
-      memcpy (buffer, NODE_NAME (token->val.node), NODE_LEN (token->val.node));
-      buffer += NODE_LEN (token->val.node);
+      if ((token->val.node->flags & NODE_USES_EXTENDED_CHARACTERS) == 0)
+	{
+	  memcpy (buffer, NODE_NAME (token->val.node),
+		  NODE_LEN (token->val.node));
+	  buffer += NODE_LEN (token->val.node);
+	}
+      else
+	{
+          const unsigned char *s = NODE_NAME (token->val.node);
+          int len = NODE_LEN (token->val.node);
+          while (len)
+            {
+              if (*s < 128)
+                {
+                  *buffer++ = *s++;
+                  len--;
+                }
+              else
+                {
+                  const unsigned char *old = s;
+                  cppchar_t code = utf8_to_char (&s);
+                  if (code < 0x10000)
+                    buffer += sprintf ((char*)buffer, "\\u%.4x", code);
+                  else
+                    buffer += sprintf ((char*)buffer, "\\U%.8x", code);
+                  len -= s - old;
+                }
+            }
+	}
       break;
 
     case SPELL_NUMBER:
@@ -1503,7 +2252,32 @@ cpp_output_token (token, fp)
 
     spell_ident:
     case SPELL_IDENT:
-      fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
+      if ((token->val.node->flags & NODE_USES_EXTENDED_CHARACTERS) == 0)
+        fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
+      else
+        {
+          const unsigned char *s = NODE_NAME (token->val.node);
+          int len = NODE_LEN (token->val.node);
+          while (len)
+            {
+              if (*s < 128)
+                {
+                  fputc (*s, fp);
+		  s++;
+                  len--;
+                }
+              else
+                {
+                  const unsigned char *old = s;
+                  cppchar_t code = utf8_to_char (&s);
+                  if (code < 0x10000)
+                    fprintf (fp, "\\u%.4x", code);
+                  else
+                    fprintf (fp, "\\U%.8x", code);
+                  len -= s - old;
+                }
+            }
+        }
     break;
 
     case SPELL_NUMBER:
@@ -1738,6 +2512,63 @@ maybe_read_ucs (pfile, pstr, limit, pc)
 #endif
 
   *pstr = p;
+  *pc = code;
+  return 0;
+}
+
+/* Like maybe_read_ucs, but read the digits from the reader PFILE.  */
+
+static int
+maybe_read_ucs_reader (pfile, pc)
+     cpp_reader *pfile;
+     cppchar_t *pc;
+{
+  unsigned int code = 0;
+  cppchar_t c = *pc;
+  unsigned int length;
+
+  /* Only attempt to interpret a UCS for C++ and C99.  */
+  if (! (CPP_OPTION (pfile, cplusplus) || CPP_OPTION (pfile, c99)))
+    return 1;
+
+  if (CPP_WTRADITIONAL (pfile))
+    cpp_error (pfile, DL_WARNING,
+	       "the meaning of '\\%c' is different in traditional C", c);
+
+  length = (c == 'u' ? 4 : 8);
+
+  for (; length; length--)
+    {
+      c = get_effective_char (pfile);
+      if (ISXDIGIT (c))
+	code = (code << 4) + hex_digit_value (c);
+      else
+	{
+	  cpp_error (pfile, DL_ERROR,
+		     "non-hex digit '%c' in universal-character-name", c);
+	  /* We shouldn't skip in case there are multibyte chars.  */
+	  break;
+	}
+    }
+
+#ifdef TARGET_EBCDIC
+  cpp_error (pfile, DL_ERROR, "universal-character-name on EBCDIC target");
+  code = 0x3f;  /* EBCDIC invalid character */
+#else
+  /* True extended characters are OK.  */
+  if (code >= 0xa0
+      && !(code & 0x80000000)
+      && !(code >= 0xD800 && code <= 0xDFFF))
+    ;
+  /* The standard permits $, @ and ` to be specified as UCNs.  We use
+     hex escapes so that this also works with EBCDIC hosts.  */
+  else if (code == 0x24 || code == 0x40 || code == 0x60)
+    ;
+  /* Don't give another error if one occurred above.  */
+  else if (length == 0)
+    cpp_error (pfile, DL_ERROR, "universal-character-name out of range");
+#endif
+
   *pc = code;
   return 0;
 }
Index: cpplib.h
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplib.h,v
retrieving revision 1.237
diff -u -r1.237 cpplib.h
--- cpplib.h	26 Sep 2002 22:25:12 -0000	1.237
+++ cpplib.h	28 Nov 2002 22:50:15 -0000
@@ -443,6 +443,7 @@
 #define NODE_DIAGNOSTIC (1 << 3)	/* Possible diagnostic when lexed.  */
 #define NODE_WARN	(1 << 4)	/* Warn if redefined or undefined.  */
 #define NODE_DISABLED	(1 << 5)	/* A disabled macro.  */
+#define NODE_USES_EXTENDED_CHARACTERS (1 << 6) /* Name has UTF-8 bytes.  */
 
 /* Different flavors of hash node.  */
 enum node_type
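
To illustrate the respelling that cpp_spell_token and cpp_output_token
do for nodes flagged NODE_USES_EXTENDED_CHARACTERS, here is a minimal
standalone sketch that can be compiled outside cpplib. The sketch_*
names and the SKETCH_BAD_BYTE sentinel are made up for this example and
are not part of the patch. Unlike utf8_to_char, the sketch reports a bad
lead byte to the caller instead of calling abort, and it assumes the
continuation bytes are well formed, as they are when the lexer produced
the name itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sentinel for a lead byte that cannot start a sequence.  */
#define SKETCH_BAD_BYTE 0xFFFFFFFFul

/* Decode one UTF-8 sequence at *POS and advance *POS past it.
   Returns the code point, or SKETCH_BAD_BYTE for a continuation
   byte or a reserved lead byte (0xF8 and up).  */
static unsigned long
sketch_utf8_to_char (const unsigned char **pos)
{
  const unsigned char *s;
  unsigned long result;

  assert (pos != NULL && *pos != NULL);
  s = *pos;

  if (s[0] < 0x80)
    {
      result = s[0];
      *pos += 1;
    }
  else if (s[0] < 0xC0)
    return SKETCH_BAD_BYTE;
  else if (s[0] < 0xE0)
    {
      result = ((unsigned long) (s[0] & 0x1f) << 6) | (s[1] & 0x3f);
      *pos += 2;
    }
  else if (s[0] < 0xF0)
    {
      result = ((unsigned long) (s[0] & 0x0f) << 12)
	       | ((unsigned long) (s[1] & 0x3f) << 6)
	       | (s[2] & 0x3f);
      *pos += 3;
    }
  else if (s[0] < 0xF8)
    {
      result = ((unsigned long) (s[0] & 0x07) << 18)
	       | ((unsigned long) (s[1] & 0x3f) << 12)
	       | ((unsigned long) (s[2] & 0x3f) << 6)
	       | (s[3] & 0x3f);
      *pos += 4;
    }
  else
    return SKETCH_BAD_BYTE;

  return result;
}

/* Copy the LEN bytes of NAME to OUT, respelling every non-ASCII
   character as \uXXXX (below 0x10000) or \UXXXXXXXX.  OUT needs room
   for 3 * LEN + 1 bytes.  Returns the number of bytes written, not
   counting the terminating NUL, or -1 on a bad lead byte.  */
static int
sketch_spell_ident (const unsigned char *name, size_t len, char *out)
{
  const unsigned char *s = name;
  const unsigned char *end = name + len;
  char *p = out;

  while (s < end)
    {
      if (*s < 0x80)
	*p++ = (char) *s++;
      else
	{
	  unsigned long code = sketch_utf8_to_char (&s);

	  if (code == SKETCH_BAD_BYTE)
	    return -1;
	  if (code < 0x10000)
	    p += sprintf (p, "\\u%.4lx", code);
	  else
	    p += sprintf (p, "\\U%.8lx", code);
	}
    }
  *p = '\0';
  return (int) (p - out);
}
```

For example, the five bytes 'c', 'a', 'f', 0xC3, 0xA9 come out as
caf\u00e9, which is the spelling that token pasting then sees.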


