This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Universal Character Names, v2


This is the second version of my UCN patch. It addresses all review
comments on the previous version (as far as I recall).

Specifically, the changes relative to the previous patch are:
- Update the character sets for C99 and C++ DR 131.
- Support escaped newlines in the middle of a UCN. This is done
  by adding the maybe_read_ucs_reader function, which uses
  get_effective_char internally.
- Support UCNs in numbers. In the internal representation, such
  a number still has the UCN in it, i.e. no conversion to UTF-8
  takes place. Such numbers are only valid if they are pasted
  with an identifier.
- Support pasting of names that have UCNs in them. For that,
  cpp_spell_token had to be updated.
- Check for assembler UTF-8 support, and reject UCNs if no such
  support is available. As a side effect, gcj will automatically
  use UTF-8 mangling where g++ supports UCNs.
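
For illustration only (this is not part of the patch): a minimal,
standalone sketch of the two spellings mentioned above, UTF-8 for
identifiers and the retained UCN form for numbers. The names encode_utf8
and spell_ucn are hypothetical. Unlike the patch's utf8_extend_token,
this sketch writes to caller-supplied buffers instead of an obstack, and
it rejects surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: encode code point C as UTF-8 into OUT, which
   must have room for 4 bytes.  Returns the number of bytes written,
   or 0 if C is a surrogate or lies above U+10FFFF.  */
static size_t
encode_utf8 (unsigned long c, unsigned char *out)
{
  if (c <= 0x7f)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c <= 0x7ff)
    {
      out[0] = (unsigned char) (0xc0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3f));
      return 2;
    }
  if (c <= 0xffff)
    {
      if (c >= 0xd800 && c <= 0xdfff)
        return 0;
      out[0] = (unsigned char) (0xe0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (c & 0x3f));
      return 3;
    }
  if (c <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (c & 0x3f));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: spell C as a UCN (backslash, then 'u' and four
   hex digits, or 'U' and eight hex digits) into OUT.  Returns what
   snprintf returns, i.e. the length of the spelling.  */
static int
spell_ucn (unsigned long c, char *out, size_t size)
{
  if (c < 0x10000)
    return snprintf (out, size, "\\u%04lx", c);
  return snprintf (out, size, "\\U%08lx", c);
}
```

For U+00C0, encode_utf8 produces the bytes 0xC3 0x80, which are the same
two bytes (octal 303 200) that the configure test writes into conftest.s.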

I have considered the following comments, but chose to take a
different approach:
- I have not put the test function for characters in libiberty.
  It is quite specific to C and C++, and only ever used in the
  preprocessor.
- I have decided not to deviate from the C and C++ standards for
  character tests. Reviewers commented that they dislike the approach
  taken by the standards committees, and that the relevant Unicode
  specification should be taken into account instead. I disagree, as I
  consider the approach of giving explicit lists quite reasonable.
  More importantly, I think that standards conformance should be
  valued quite highly unless specific user demands require ignoring
  or extending the standards; that is not the case here.
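
As a side note on the explicit-list approach: the patch spells the lists
out as chains of comparisons, but the same data could also be kept in a
sorted table. The following is only a sketch under that assumption, not
what the patch does. The names ucn_range, id_ranges and in_id_ranges are
hypothetical, and the table holds just the few ranges that the patch
accepts unconditionally for both C99 and C++98.

```c
#include <stddef.h>

/* Hypothetical table entry: an inclusive range of code points.  */
struct ucn_range
{
  unsigned int first;
  unsigned int last;
};

/* Sorted, non-overlapping ranges.  Only a small excerpt is shown; the
   values are copied from the Armenian, Georgian and Bopomofo tests in
   the patch.  */
static const struct ucn_range id_ranges[] = {
  { 0x0531, 0x0556 },	/* Armenian */
  { 0x0561, 0x0587 },
  { 0x10a0, 0x10c5 },	/* Georgian */
  { 0x10d0, 0x10f6 },
  { 0x3105, 0x312c }	/* Bopomofo */
};

/* Return 1 if C falls inside one of the ranges above, else 0.
   Binary search over the half-open index interval [lo, hi).  */
static int
in_id_ranges (unsigned int c)
{
  size_t lo = 0;
  size_t hi = sizeof id_ranges / sizeof id_ranges[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (c < id_ranges[mid].first)
	hi = mid;
      else if (c > id_ranges[mid].last)
	lo = mid + 1;
      else
	return 1;
    }
  return 0;
}
```

A real table would also need a per-range marker for the entries that are
valid in only one of the two standards, which this excerpt leaves out.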

A few issues need to be resolved with the Java compiler:
- Somehow, defining HAVE_AS_UTF8 (which the patch does) triggers
  bugs in the mangler; it now emits symbols like
  
    _ZN4java4lang6Double8<clinit>Ev

- The Java mangler currently emits the number of characters for a
  UTF-8 <source-name>; the ABI specifies that this ought to be the
  byte length.
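
To make the difference between the two lengths concrete, here is a
small standalone sketch; it is not taken from the Java mangler. The
function name utf8_char_count is hypothetical, and the input is assumed
to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in the
   NUL-terminated UTF-8 string S.  Every byte that is not a
   continuation byte (bit pattern 10xxxxxx) starts a new character.
   Assumes S is valid UTF-8.  */
static size_t
utf8_char_count (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xc0) != 0x80)
      n++;
  return n;
}
```

For the name "caf\303\251" (the last character is U+00E9, two bytes in
UTF-8), utf8_char_count gives 4 while strlen gives 5. The byte length,
5, is the number that belongs in front of the <source-name>.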

I'd appreciate it if a Java expert could help resolve the first
issue; resolving the second one seems simple.

Any comments appreciated,

Martin

2002-10-27  Martin v. Löwis  <loewis@informatik.hu-berlin.de>

	* c-lex.c (is_extended_char, utf8_extend_token): Remove.
	* cpplex.c (identifier_ucs_p, utf8_extend_token, 
	ucn_extend_token, utf8_to_char, maybe_read_ucs_reader): New functions.
	(parse_slow): Add utf8 parameter. Parse UCS names.
	(parse_identifier, parse_number): Adjust.
	(_cpp_lex_direct): Parse UCS names.
	(cpp_output_token): Print UCS names.
	(cpp_spell_token, cpp_output_token): Unparse extended characters.
	* cpplib.h (NODE_USES_EXTENDED_CHARACTERS): New flag.
	* configure.in (HAVE_AS_UTF8): New test.
	* configure, config.in: Rebuilt.

Index: c-lex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/c-lex.c,v
retrieving revision 1.190
diff -u -r1.190 c-lex.c
--- c-lex.c	16 Sep 2002 16:36:31 -0000	1.190
+++ c-lex.c	28 Nov 2002 22:50:07 -0000
@@ -356,314 +356,6 @@
 			 (const char *) NODE_NAME (node));
 }
 
-#if 0 /* not yet */
-/* Returns nonzero if C is a universal-character-name.  Give an error if it
-   is not one which may appear in an identifier, as per [extendid].
-
-   Note that extended character support in identifiers has not yet been
-   implemented.  It is my personal opinion that this is not a desirable
-   feature.  Portable code cannot count on support for more than the basic
-   identifier character set.  */
-
-static inline int
-is_extended_char (c)
-     int c;
-{
-#ifdef TARGET_EBCDIC
-  return 0;
-#else
-  /* ASCII.  */
-  if (c < 0x7f)
-    return 0;
-
-  /* None of the valid chars are outside the Basic Multilingual Plane (the
-     low 16 bits).  */
-  if (c > 0xffff)
-    {
-      error ("universal-character-name '\\U%08x' not valid in identifier", c);
-      return 1;
-    }
-  
-  /* Latin */
-  if ((c >= 0x00c0 && c <= 0x00d6)
-      || (c >= 0x00d8 && c <= 0x00f6)
-      || (c >= 0x00f8 && c <= 0x01f5)
-      || (c >= 0x01fa && c <= 0x0217)
-      || (c >= 0x0250 && c <= 0x02a8)
-      || (c >= 0x1e00 && c <= 0x1e9a)
-      || (c >= 0x1ea0 && c <= 0x1ef9))
-    return 1;
-
-  /* Greek */
-  if ((c == 0x0384)
-      || (c >= 0x0388 && c <= 0x038a)
-      || (c == 0x038c)
-      || (c >= 0x038e && c <= 0x03a1)
-      || (c >= 0x03a3 && c <= 0x03ce)
-      || (c >= 0x03d0 && c <= 0x03d6)
-      || (c == 0x03da)
-      || (c == 0x03dc)
-      || (c == 0x03de)
-      || (c == 0x03e0)
-      || (c >= 0x03e2 && c <= 0x03f3)
-      || (c >= 0x1f00 && c <= 0x1f15)
-      || (c >= 0x1f18 && c <= 0x1f1d)
-      || (c >= 0x1f20 && c <= 0x1f45)
-      || (c >= 0x1f48 && c <= 0x1f4d)
-      || (c >= 0x1f50 && c <= 0x1f57)
-      || (c == 0x1f59)
-      || (c == 0x1f5b)
-      || (c == 0x1f5d)
-      || (c >= 0x1f5f && c <= 0x1f7d)
-      || (c >= 0x1f80 && c <= 0x1fb4)
-      || (c >= 0x1fb6 && c <= 0x1fbc)
-      || (c >= 0x1fc2 && c <= 0x1fc4)
-      || (c >= 0x1fc6 && c <= 0x1fcc)
-      || (c >= 0x1fd0 && c <= 0x1fd3)
-      || (c >= 0x1fd6 && c <= 0x1fdb)
-      || (c >= 0x1fe0 && c <= 0x1fec)
-      || (c >= 0x1ff2 && c <= 0x1ff4)
-      || (c >= 0x1ff6 && c <= 0x1ffc))
-    return 1;
-
-  /* Cyrillic */
-  if ((c >= 0x0401 && c <= 0x040d)
-      || (c >= 0x040f && c <= 0x044f)
-      || (c >= 0x0451 && c <= 0x045c)
-      || (c >= 0x045e && c <= 0x0481)
-      || (c >= 0x0490 && c <= 0x04c4)
-      || (c >= 0x04c7 && c <= 0x04c8)
-      || (c >= 0x04cb && c <= 0x04cc)
-      || (c >= 0x04d0 && c <= 0x04eb)
-      || (c >= 0x04ee && c <= 0x04f5)
-      || (c >= 0x04f8 && c <= 0x04f9))
-    return 1;
-
-  /* Armenian */
-  if ((c >= 0x0531 && c <= 0x0556)
-      || (c >= 0x0561 && c <= 0x0587))
-    return 1;
-
-  /* Hebrew */
-  if ((c >= 0x05d0 && c <= 0x05ea)
-      || (c >= 0x05f0 && c <= 0x05f4))
-    return 1;
-
-  /* Arabic */
-  if ((c >= 0x0621 && c <= 0x063a)
-      || (c >= 0x0640 && c <= 0x0652)
-      || (c >= 0x0670 && c <= 0x06b7)
-      || (c >= 0x06ba && c <= 0x06be)
-      || (c >= 0x06c0 && c <= 0x06ce)
-      || (c >= 0x06e5 && c <= 0x06e7))
-    return 1;
-
-  /* Devanagari */
-  if ((c >= 0x0905 && c <= 0x0939)
-      || (c >= 0x0958 && c <= 0x0962))
-    return 1;
-
-  /* Bengali */
-  if ((c >= 0x0985 && c <= 0x098c)
-      || (c >= 0x098f && c <= 0x0990)
-      || (c >= 0x0993 && c <= 0x09a8)
-      || (c >= 0x09aa && c <= 0x09b0)
-      || (c == 0x09b2)
-      || (c >= 0x09b6 && c <= 0x09b9)
-      || (c >= 0x09dc && c <= 0x09dd)
-      || (c >= 0x09df && c <= 0x09e1)
-      || (c >= 0x09f0 && c <= 0x09f1))
-    return 1;
-
-  /* Gurmukhi */
-  if ((c >= 0x0a05 && c <= 0x0a0a)
-      || (c >= 0x0a0f && c <= 0x0a10)
-      || (c >= 0x0a13 && c <= 0x0a28)
-      || (c >= 0x0a2a && c <= 0x0a30)
-      || (c >= 0x0a32 && c <= 0x0a33)
-      || (c >= 0x0a35 && c <= 0x0a36)
-      || (c >= 0x0a38 && c <= 0x0a39)
-      || (c >= 0x0a59 && c <= 0x0a5c)
-      || (c == 0x0a5e))
-    return 1;
-
-  /* Gujarati */
-  if ((c >= 0x0a85 && c <= 0x0a8b)
-      || (c == 0x0a8d)
-      || (c >= 0x0a8f && c <= 0x0a91)
-      || (c >= 0x0a93 && c <= 0x0aa8)
-      || (c >= 0x0aaa && c <= 0x0ab0)
-      || (c >= 0x0ab2 && c <= 0x0ab3)
-      || (c >= 0x0ab5 && c <= 0x0ab9)
-      || (c == 0x0ae0))
-    return 1;
-
-  /* Oriya */
-  if ((c >= 0x0b05 && c <= 0x0b0c)
-      || (c >= 0x0b0f && c <= 0x0b10)
-      || (c >= 0x0b13 && c <= 0x0b28)
-      || (c >= 0x0b2a && c <= 0x0b30)
-      || (c >= 0x0b32 && c <= 0x0b33)
-      || (c >= 0x0b36 && c <= 0x0b39)
-      || (c >= 0x0b5c && c <= 0x0b5d)
-      || (c >= 0x0b5f && c <= 0x0b61))
-    return 1;
-
-  /* Tamil */
-  if ((c >= 0x0b85 && c <= 0x0b8a)
-      || (c >= 0x0b8e && c <= 0x0b90)
-      || (c >= 0x0b92 && c <= 0x0b95)
-      || (c >= 0x0b99 && c <= 0x0b9a)
-      || (c == 0x0b9c)
-      || (c >= 0x0b9e && c <= 0x0b9f)
-      || (c >= 0x0ba3 && c <= 0x0ba4)
-      || (c >= 0x0ba8 && c <= 0x0baa)
-      || (c >= 0x0bae && c <= 0x0bb5)
-      || (c >= 0x0bb7 && c <= 0x0bb9))
-    return 1;
-
-  /* Telugu */
-  if ((c >= 0x0c05 && c <= 0x0c0c)
-      || (c >= 0x0c0e && c <= 0x0c10)
-      || (c >= 0x0c12 && c <= 0x0c28)
-      || (c >= 0x0c2a && c <= 0x0c33)
-      || (c >= 0x0c35 && c <= 0x0c39)
-      || (c >= 0x0c60 && c <= 0x0c61))
-    return 1;
-
-  /* Kannada */
-  if ((c >= 0x0c85 && c <= 0x0c8c)
-      || (c >= 0x0c8e && c <= 0x0c90)
-      || (c >= 0x0c92 && c <= 0x0ca8)
-      || (c >= 0x0caa && c <= 0x0cb3)
-      || (c >= 0x0cb5 && c <= 0x0cb9)
-      || (c >= 0x0ce0 && c <= 0x0ce1))
-    return 1;
-
-  /* Malayalam */
-  if ((c >= 0x0d05 && c <= 0x0d0c)
-      || (c >= 0x0d0e && c <= 0x0d10)
-      || (c >= 0x0d12 && c <= 0x0d28)
-      || (c >= 0x0d2a && c <= 0x0d39)
-      || (c >= 0x0d60 && c <= 0x0d61))
-    return 1;
-
-  /* Thai */
-  if ((c >= 0x0e01 && c <= 0x0e30)
-      || (c >= 0x0e32 && c <= 0x0e33)
-      || (c >= 0x0e40 && c <= 0x0e46)
-      || (c >= 0x0e4f && c <= 0x0e5b))
-    return 1;
-
-  /* Lao */
-  if ((c >= 0x0e81 && c <= 0x0e82)
-      || (c == 0x0e84)
-      || (c == 0x0e87)
-      || (c == 0x0e88)
-      || (c == 0x0e8a)
-      || (c == 0x0e0d)
-      || (c >= 0x0e94 && c <= 0x0e97)
-      || (c >= 0x0e99 && c <= 0x0e9f)
-      || (c >= 0x0ea1 && c <= 0x0ea3)
-      || (c == 0x0ea5)
-      || (c == 0x0ea7)
-      || (c == 0x0eaa)
-      || (c == 0x0eab)
-      || (c >= 0x0ead && c <= 0x0eb0)
-      || (c == 0x0eb2)
-      || (c == 0x0eb3)
-      || (c == 0x0ebd)
-      || (c >= 0x0ec0 && c <= 0x0ec4)
-      || (c == 0x0ec6))
-    return 1;
-
-  /* Georgian */
-  if ((c >= 0x10a0 && c <= 0x10c5)
-      || (c >= 0x10d0 && c <= 0x10f6))
-    return 1;
-
-  /* Hiragana */
-  if ((c >= 0x3041 && c <= 0x3094)
-      || (c >= 0x309b && c <= 0x309e))
-    return 1;
-
-  /* Katakana */
-  if ((c >= 0x30a1 && c <= 0x30fe))
-    return 1;
-
-  /* Bopmofo */
-  if ((c >= 0x3105 && c <= 0x312c))
-    return 1;
-
-  /* Hangul */
-  if ((c >= 0x1100 && c <= 0x1159)
-      || (c >= 0x1161 && c <= 0x11a2)
-      || (c >= 0x11a8 && c <= 0x11f9))
-    return 1;
-
-  /* CJK Unified Ideographs */
-  if ((c >= 0xf900 && c <= 0xfa2d)
-      || (c >= 0xfb1f && c <= 0xfb36)
-      || (c >= 0xfb38 && c <= 0xfb3c)
-      || (c == 0xfb3e)
-      || (c >= 0xfb40 && c <= 0xfb41)
-      || (c >= 0xfb42 && c <= 0xfb44)
-      || (c >= 0xfb46 && c <= 0xfbb1)
-      || (c >= 0xfbd3 && c <= 0xfd3f)
-      || (c >= 0xfd50 && c <= 0xfd8f)
-      || (c >= 0xfd92 && c <= 0xfdc7)
-      || (c >= 0xfdf0 && c <= 0xfdfb)
-      || (c >= 0xfe70 && c <= 0xfe72)
-      || (c == 0xfe74)
-      || (c >= 0xfe76 && c <= 0xfefc)
-      || (c >= 0xff21 && c <= 0xff3a)
-      || (c >= 0xff41 && c <= 0xff5a)
-      || (c >= 0xff66 && c <= 0xffbe)
-      || (c >= 0xffc2 && c <= 0xffc7)
-      || (c >= 0xffca && c <= 0xffcf)
-      || (c >= 0xffd2 && c <= 0xffd7)
-      || (c >= 0xffda && c <= 0xffdc)
-      || (c >= 0x4e00 && c <= 0x9fa5))
-    return 1;
-
-  error ("universal-character-name '\\u%04x' not valid in identifier", c);
-  return 1;
-#endif
-}
-
-/* Add the UTF-8 representation of C to the token_buffer.  */
-
-static void
-utf8_extend_token (c)
-     int c;
-{
-  int shift, mask;
-
-  if      (c <= 0x0000007f)
-    {
-      extend_token (c);
-      return;
-    }
-  else if (c <= 0x000007ff)
-    shift = 6, mask = 0xc0;
-  else if (c <= 0x0000ffff)
-    shift = 12, mask = 0xe0;
-  else if (c <= 0x001fffff)
-    shift = 18, mask = 0xf0;
-  else if (c <= 0x03ffffff)
-    shift = 24, mask = 0xf8;
-  else
-    shift = 30, mask = 0xfc;
-
-  extend_token (mask | (c >> shift));
-  do
-    {
-      shift -= 6;
-      extend_token ((unsigned char) (0x80 | (c >> shift)));
-    }
-  while (shift);
-}
-#endif
 
 int
 c_lex (value)
Index: configure.in
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/configure.in,v
retrieving revision 1.626
diff -u -r1.626 configure.in
--- configure.in	26 Nov 2002 20:08:07 -0000	1.626
+++ configure.in	28 Nov 2002 22:50:13 -0000
@@ -1889,6 +1889,22 @@
 fi
 AC_MSG_RESULT($gcc_cv_as_tls)
 
+AC_MSG_CHECKING(assembler support for UTF-8 identifiers)
+gcc_cv_as_utf8="no"
+if test x$gcc_cv_as != x; then
+  echo fooab:|tr ab '\303\200' > conftest.s
+  if $gcc_cv_as --fatal-warnings -o conftest.o conftest.s > /dev/null 2>&1
+  then
+    gcc_cv_as_utf8=yes
+  fi
+  rm -rf conftest.s
+fi
+if test "$gcc_cv_as_utf8" = yes; then
+  AC_DEFINE(HAVE_AS_UTF8, 1,
+            [Define if your assembler supports UTF-8 bytes in identifiers])
+fi
+AC_MSG_RESULT($gcc_cv_as_utf8)
+
 case "$target" in
   # All TARGET_ABI_OSF targets.
   alpha*-*-osf* | alpha*-*-linux* | alpha*-*-*bsd*)
Index: cpplex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplex.c,v
retrieving revision 1.215
diff -u -p -r1.215 cpplex.c
--- cpplex.c	26 Sep 2002 22:25:12 -0000	1.215
+++ cpplex.c	28 Nov 2002 23:04:54 -0000
@@ -71,7 +71,7 @@ static void adjust_column PARAMS ((cpp_r
 static int skip_whitespace PARAMS ((cpp_reader *, cppchar_t));
 static cpp_hashnode *parse_identifier PARAMS ((cpp_reader *));
 static uchar *parse_slow PARAMS ((cpp_reader *, const uchar *, int,
-				  unsigned int *));
+				  unsigned int *, unsigned int *));
 static void parse_number PARAMS ((cpp_reader *, cpp_string *, int));
 static int unescaped_terminator_p PARAMS ((cpp_reader *, const uchar *));
 static void parse_string PARAMS ((cpp_reader *, cpp_token *, cppchar_t));
@@ -82,10 +82,16 @@ static bool continue_after_nul PARAMS ((
 static int name_p PARAMS ((cpp_reader *, const cpp_string *));
 static int maybe_read_ucs PARAMS ((cpp_reader *, const unsigned char **,
 				   const unsigned char *, cppchar_t *));
+static int maybe_read_ucs_reader PARAMS ((cpp_reader *, cppchar_t *));
 static tokenrun *next_tokenrun PARAMS ((tokenrun *));
 
 static unsigned int hex_digit_value PARAMS ((unsigned int));
 static _cpp_buff *new_buff PARAMS ((size_t));
+static bool identifier_ucs_p PARAMS ((cpp_reader *, cppchar_t, int));
+static void utf8_extend_token PARAMS ((struct obstack *, int));
+static void ucn_extend_token PARAMS ((struct obstack *, int));
+static cppchar_t utf8_to_char PARAMS((const unsigned char **));
+
 
 /* Utility routine:
 
@@ -161,6 +167,673 @@ trigraph_p (pfile)
   return accept;
 }
 
+/* Returns nonzero if C is an extended character that may appear in an
+   identifier, per C++98 Annex E [extendid] and C99 Annex D.  Otherwise
+   returns zero, after giving an error unless C < 0x7f.  */
+
+static bool
+identifier_ucs_p (pfile, c, allow_digits)
+     cpp_reader *pfile;
+     cppchar_t c;
+     int allow_digits;
+{
+#ifdef TARGET_EBCDIC
+  return 0;
+#else
+  int cxx98 = CPP_OPTION (pfile, cplusplus);
+  int c99 = CPP_OPTION (pfile, c99);
+
+  /* ASCII.  */
+  if (c < 0x7f)
+    return 0;
+
+  /* None of the valid chars are outside the Basic Multilingual Plane (the
+     low 16 bits).  */
+  if (c > 0xffff)
+    {
+      cpp_error_with_line (pfile, DL_ERROR,
+                           pfile->line, 1, /* XXX */
+                           "universal-character-name '\\U%08x' not valid in identifier", (int)c);
+      return 0;
+    }
+
+#define NOTIN_C99(code) if(c==code && c99) goto fail
+#define NOTIN_CXX98(code) if(c==code && cxx98) goto fail
+  
+  /* Latin */
+  if ((c == 0x00aa)
+      || (c == 0x00ba)
+      || (c >= 0x00c0 && c <= 0x00d6)
+      || (c >= 0x00d8 && c <= 0x00f6)
+      || (c >= 0x00f8 && c <= 0x01f5)
+      || (c >= 0x01fa && c <= 0x0217)
+      || (c >= 0x0250 && c <= 0x02a8)
+      || (c >= 0x1e00 && c <= 0x1e9b)
+      || (c >= 0x1ea0 && c <= 0x1ef9)
+      || (c == 0x207F))
+    {
+      NOTIN_CXX98(0x00aa);
+      NOTIN_CXX98(0x00ba);
+      NOTIN_CXX98(0x1e9b);
+      NOTIN_CXX98(0x207f);
+      return 1;
+    }
+
+  /* Greek */
+  if ((c == 0x0384)
+      || (c >= 0x0388 && c <= 0x038a)
+      || (c == 0x038c)
+      || (c >= 0x038e && c <= 0x03a1)
+      || (c >= 0x03a3 && c <= 0x03ce)
+      || (c >= 0x03d0 && c <= 0x03d6)
+      || (c == 0x03da)
+      || (c == 0x03dc)
+      || (c == 0x03de)
+      || (c == 0x03e0)
+      || (c >= 0x03e2 && c <= 0x03f3)
+      || (c >= 0x1f00 && c <= 0x1f15)
+      || (c >= 0x1f18 && c <= 0x1f1d)
+      || (c >= 0x1f20 && c <= 0x1f45)
+      || (c >= 0x1f48 && c <= 0x1f4d)
+      || (c >= 0x1f50 && c <= 0x1f57)
+      || (c == 0x1f59)
+      || (c == 0x1f5b)
+      || (c == 0x1f5d)
+      || (c >= 0x1f5f && c <= 0x1f7d)
+      || (c >= 0x1f80 && c <= 0x1fb4)
+      || (c >= 0x1fb6 && c <= 0x1fbc)
+      || (c >= 0x1fc2 && c <= 0x1fc4)
+      || (c >= 0x1fc6 && c <= 0x1fcc)
+      || (c >= 0x1fd0 && c <= 0x1fd3)
+      || (c >= 0x1fd6 && c <= 0x1fdb)
+      || (c >= 0x1fe0 && c <= 0x1fec)
+      || (c >= 0x1ff2 && c <= 0x1ff4)
+      || (c >= 0x1ff6 && c <= 0x1ffc))
+    {
+      NOTIN_C99(0x0384);
+      return 1;
+    }
+
+  /* Cyrillic */
+  if ((c >= 0x0401 && c <= 0x044f)
+      || (c >= 0x0451 && c <= 0x045c)
+      || (c >= 0x045e && c <= 0x0481)
+      || (c >= 0x0490 && c <= 0x04c4)
+      || (c >= 0x04c7 && c <= 0x04c8)
+      || (c >= 0x04cb && c <= 0x04cc)
+      || (c >= 0x04d0 && c <= 0x04eb)
+      || (c >= 0x04ee && c <= 0x04f5)
+      || (c >= 0x04f8 && c <= 0x04f9))
+    {
+      NOTIN_C99(0x040d);
+      NOTIN_CXX98(0x040e);
+      return 1;
+    }
+
+  /* Armenian */
+  if ((c >= 0x0531 && c <= 0x0556)
+      || (c >= 0x0561 && c <= 0x0587))
+    {
+      return 1;
+    }
+
+  /* Hebrew */
+  if ((c >= 0x05B0 && c <= 0x05B9)
+      || (c >= 0x05BB && c <= 0x05BD)
+      || (c == 0x05BF)
+      || (c >= 0x05C1 && c <= 0x05C2))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x05d0 && c <= 0x05ea)
+      || (c >= 0x05f0 && c <= 0x05f4))
+    {
+      NOTIN_C99(0x05f3);
+      NOTIN_C99(0x05f4);
+      return 1;
+    }
+
+  /* Arabic */
+  if ((c >= 0x06d0 && c <= 0x06dc)
+      || (c >= 0x06ea && c <= 0x06ed))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0621 && c <= 0x063a)
+      || (c >= 0x0640 && c <= 0x0652)
+      || (c >= 0x0670 && c <= 0x06b7)
+      || (c >= 0x06ba && c <= 0x06be)
+      || (c >= 0x06c0 && c <= 0x06ce)
+      || (c >= 0x06e5 && c <= 0x06e8))
+    {
+      NOTIN_CXX98(0x06e8);
+      return 1;
+    }
+
+  /* Devanagari */
+  if ((c >= 0x0901 && c <= 0x0903)
+      || (c >= 0x093e && c <= 0x094d)
+      || (c >= 0x0950 && c <= 0x0952))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0905 && c <= 0x0939)
+      || (c >= 0x0958 && c <= 0x0963))
+    {
+      NOTIN_CXX98(0x0963);
+      return 1;
+    }
+
+  /* Bengali */
+  if ((c >= 0x0981 && c <= 0x0983)
+      || (c >= 0x09be && c <= 0x09c4)
+      || (c >= 0x09c7 && c <= 0x09c8)
+      || (c >= 0x09cb && c <= 0x09cd))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0985 && c <= 0x098c)
+      || (c >= 0x098f && c <= 0x0990)
+      || (c >= 0x0993 && c <= 0x09a8)
+      || (c >= 0x09aa && c <= 0x09b0)
+      || (c == 0x09b2)
+      || (c >= 0x09b6 && c <= 0x09b9)
+      || (c >= 0x09dc && c <= 0x09dd)
+      || (c >= 0x09df && c <= 0x09e3)
+      || (c >= 0x09f0 && c <= 0x09f1))
+    {
+      NOTIN_CXX98(0x09e2);
+      NOTIN_CXX98(0x09e3);
+      return 1;
+    }
+
+  /* Gurmukhi */
+  if ((c == 0x0a02)
+      || (c >= 0x0a3e && c <= 0x0a42)
+      || (c >= 0x0a47 && c <= 0x0a48)
+      || (c >= 0x0a4b && c <= 0x0a4d)
+      || (c == 0x0a74))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0a05 && c <= 0x0a0a)
+      || (c >= 0x0a0f && c <= 0x0a10)
+      || (c >= 0x0a13 && c <= 0x0a28)
+      || (c >= 0x0a2a && c <= 0x0a30)
+      || (c >= 0x0a32 && c <= 0x0a33)
+      || (c >= 0x0a35 && c <= 0x0a36)
+      || (c >= 0x0a38 && c <= 0x0a39)
+      || (c >= 0x0a59 && c <= 0x0a5c)
+      || (c == 0x0a5e))
+    {
+      return 1;
+    }
+
+  /* Gujarati */
+  if ((c == 0x0a02)
+      || (c >= 0x0a81 && c <= 0x0a81)
+      || (c >= 0x0abd && c <= 0x0ac5)
+      || (c >= 0x0ac7 && c <= 0x0ac9)
+      || (c >= 0x0acb && c <= 0x0acd)
+      || (c == 0x0ad0))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0a85 && c <= 0x0a8b)
+      || (c == 0x0a8d)
+      || (c >= 0x0a8f && c <= 0x0a91)
+      || (c >= 0x0a93 && c <= 0x0aa8)
+      || (c >= 0x0aaa && c <= 0x0ab0)
+      || (c >= 0x0ab2 && c <= 0x0ab3)
+      || (c >= 0x0ab5 && c <= 0x0ab9)
+      || (c == 0x0ad0)
+      || (c == 0x0ae0))
+    {
+      NOTIN_CXX98(0x0ad0);
+      return 1;
+    }
+
+  /* Oriya */
+  if ((c >= 0x0b01 && c <= 0x0b03)
+      || (c >= 0x0b3e && c <= 0x0b43)
+      || (c >= 0x0b47 && c <= 0x0b48)
+      || (c >= 0x0b4b && c <= 0x0b4d))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0b05 && c <= 0x0b0c)
+      || (c >= 0x0b0f && c <= 0x0b10)
+      || (c >= 0x0b13 && c <= 0x0b28)
+      || (c >= 0x0b2a && c <= 0x0b30)
+      || (c >= 0x0b32 && c <= 0x0b33)
+      || (c >= 0x0b36 && c <= 0x0b39)
+      || (c >= 0x0b5c && c <= 0x0b5d)
+      || (c >= 0x0b5f && c <= 0x0b61))
+    {
+      return 1;
+    }
+
+  /* Tamil */
+  if ((c >= 0x0b82 && c <= 0x0b83)
+      || (c >= 0x0bbe && c <= 0x0bc2)
+      || (c >= 0x0bc6 && c <= 0x0bc8)
+      || (c >= 0x0bca && c <= 0x0bcd))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0b85 && c <= 0x0b8a)
+      || (c >= 0x0b8e && c <= 0x0b90)
+      || (c >= 0x0b92 && c <= 0x0b95)
+      || (c >= 0x0b99 && c <= 0x0b9a)
+      || (c == 0x0b9c)
+      || (c >= 0x0b9e && c <= 0x0b9f)
+      || (c >= 0x0ba3 && c <= 0x0ba4)
+      || (c >= 0x0ba8 && c <= 0x0baa)
+      || (c >= 0x0bae && c <= 0x0bb5)
+      || (c >= 0x0bb7 && c <= 0x0bb9))
+    {
+      return 1;
+    }
+
+  /* Telugu */
+  if ((c >= 0x0c01 && c <= 0x0c03)
+      || (c >= 0x0c3e && c <= 0x0c44)
+      || (c >= 0x0c46 && c <= 0x0c48)
+      || (c >= 0x0c4a && c <= 0x0c4d))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0c05 && c <= 0x0c0c)
+      || (c >= 0x0c0e && c <= 0x0c10)
+      || (c >= 0x0c12 && c <= 0x0c28)
+      || (c >= 0x0c2a && c <= 0x0c33)
+      || (c >= 0x0c35 && c <= 0x0c39)
+      || (c >= 0x0c60 && c <= 0x0c61))
+    {
+      return 1;
+    }
+
+  /* Kannada */
+  if ((c >= 0x0c82 && c <= 0x0c83)
+      || (c >= 0x0cbe && c <= 0x0cc4)
+      || (c >= 0x0cc6 && c <= 0x0cc8)
+      || (c >= 0x0cca && c <= 0x0ccd)
+      || (c == 0x0cde))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0c85 && c <= 0x0c8c)
+      || (c >= 0x0c8e && c <= 0x0c90)
+      || (c >= 0x0c92 && c <= 0x0ca8)
+      || (c >= 0x0caa && c <= 0x0cb3)
+      || (c >= 0x0cb5 && c <= 0x0cb9)
+      || (c >= 0x0ce0 && c <= 0x0ce1))
+    {
+      return 1;
+    }
+
+  /* Malayalam */
+  if ((c >= 0x0d02 && c <= 0x0d03)
+      || (c >= 0x0d3e && c <= 0x0d43)
+      || (c >= 0x0d46 && c <= 0x0d48)
+      || (c >= 0x0d4a && c <= 0x0d4d))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0d05 && c <= 0x0d0c)
+      || (c >= 0x0d0e && c <= 0x0d10)
+      || (c >= 0x0d12 && c <= 0x0d28)
+      || (c >= 0x0d2a && c <= 0x0d39)
+      || (c >= 0x0d60 && c <= 0x0d61))
+    {
+      return 1;
+    }
+
+  /* Thai */
+  if ((c >= 0x0e34 && c <= 0x0e3a)
+      || (c >= 0x0e47 && c <= 0x0e4e))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0e01 && c <= 0x0e33)
+      || (c >= 0x0e40 && c <= 0x0e46)
+      || (c >= 0x0e4f && c <= 0x0e5b))
+    {
+      NOTIN_CXX98(0x0e31);
+      return 1;
+    }
+
+  /* Lao */
+  if ((c >= 0x0eb4 && c <= 0x0eb9)
+      || (c >= 0x0ebb && c <= 0x0ebc)
+      || (c >= 0x0ec8 && c <= 0x0ecc)
+      || (c >= 0x0edc && c <= 0x0edd))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+  if ((c >= 0x0e81 && c <= 0x0e82)
+      || (c == 0x0e84)
+      || (c == 0x0e87)
+      || (c == 0x0e88)
+      || (c == 0x0e8a)
+      || (c == 0x0e8d) /* C++ DR 131 */
+      || (c >= 0x0e94 && c <= 0x0e97)
+      || (c >= 0x0e99 && c <= 0x0e9f)
+      || (c >= 0x0ea1 && c <= 0x0ea3)
+      || (c == 0x0ea5)
+      || (c == 0x0ea7)
+      || (c == 0x0eaa)
+      || (c == 0x0eab)
+      || (c >= 0x0ead && c <= 0x0eb3)
+      || (c == 0x0ebd)
+      || (c >= 0x0ec0 && c <= 0x0ec4)
+      || (c == 0x0ec6))
+    {
+      NOTIN_C99(0x0eaf);
+      NOTIN_CXX98(0x0eb1);
+      return 1;
+    }
+
+  /* Tibetan */
+  if ((c == 0x0f00)
+      || (c >= 0x0f18 && c <= 0x0f19)
+      || (c == 0x0f35)
+      || (c == 0x0f37)
+      || (c == 0x0f39)
+      || (c >= 0x0f3e && c <= 0x0f47)
+      || (c >= 0x0f49 && c <= 0x0f69)
+      || (c >= 0x0f71 && c <= 0x0f84)
+      || (c >= 0x0f86 && c <= 0x0f8b)
+      || (c >= 0x0f90 && c <= 0x0f95)
+      || (c == 0x0f97)
+      || (c >= 0x0f99 && c <= 0x0fad)
+      || (c >= 0x0fb1 && c <= 0x0fb7)
+      || (c == 0x0fb9))
+    {
+      if (cxx98)
+        goto fail;
+      return 1;
+    }
+
+  /* Georgian */
+  if ((c >= 0x10a0 && c <= 0x10c5)
+      || (c >= 0x10d0 && c <= 0x10f6))
+    {
+      return 1;
+    }
+
+  /* Hiragana */
+  if ((c >= 0x3041 && c <= 0x3094)
+      || (c >= 0x309b && c <= 0x309e))
+    {
+      NOTIN_C99(0x309d);
+      NOTIN_C99(0x309e);
+      return 1;
+    }
+
+  /* Katakana */
+  if ((c >= 0x30a1 && c <= 0x30fe))
+    {
+      if (c99
+	  && ((c >= 0x30f7 && c <= 0x30fa)
+	      || (c == 0x30fd)
+	      || (c == 0x30fe)))
+	  goto fail;
+      return 1;
+    }
+
+  /* Bopomofo */
+  if ((c >= 0x3105 && c <= 0x312c))
+    {
+      return 1;
+    }
+
+  /* Hangul */
+  if (c >= 0xac00 && c <= 0xd7a3)
+    {
+      if (cxx98)
+	goto fail;
+      return 1;
+    }
+  if ((c >= 0x1100 && c <= 0x1159)
+      || (c >= 0x1161 && c <= 0x11a2)
+      || (c >= 0x11a8 && c <= 0x11f9))
+    {
+      if (c99)
+	goto fail;
+      return 1;
+    }
+
+
+  /* CJK Unified Ideographs */
+  if (c >= 0x4e00 && c <= 0x9f45)
+    {
+      return 1;
+    }
+  if ((c >= 0xf900 && c <= 0xfa2d)
+      || (c >= 0xfb1f && c <= 0xfb36)
+      || (c >= 0xfb38 && c <= 0xfb3c)
+      || (c == 0xfb3e)
+      || (c >= 0xfb40 && c <= 0xfb41)
+      || (c >= 0xfb42 && c <= 0xfb44)
+      || (c >= 0xfb46 && c <= 0xfbb1)
+      || (c >= 0xfbd3 && c <= 0xfd3f)
+      || (c >= 0xfd50 && c <= 0xfd8f)
+      || (c >= 0xfd92 && c <= 0xfdc7)
+      || (c >= 0xfdf0 && c <= 0xfdfb)
+      || (c >= 0xfe70 && c <= 0xfe72)
+      || (c == 0xfe74)
+      || (c >= 0xfe76 && c <= 0xfefc)
+      || (c >= 0xff21 && c <= 0xff3a)
+      || (c >= 0xff41 && c <= 0xff5a)
+      || (c >= 0xff66 && c <= 0xffbe)
+      || (c >= 0xffc2 && c <= 0xffc7)
+      || (c >= 0xffca && c <= 0xffcf)
+      || (c >= 0xffd2 && c <= 0xffd7)
+      || (c >= 0xffda && c <= 0xffdc))
+    {
+      if (c99)
+	goto fail;
+      return 1;
+    }
+
+  /* Digits */
+  if((c >= 0x0660 && c <= 0x0669)
+     || (c >= 0x06f0 && c <= 0x06f9)
+     || (c >= 0x0966 && c <= 0x096f)
+     || (c >= 0x09e6 && c <= 0x09ef)
+     || (c >= 0x0a66 && c <= 0x0a6f)
+     || (c >= 0x0ae6 && c <= 0x0aef)
+     || (c >= 0x0b66 && c <= 0x0b6f)
+     || (c >= 0x0be7 && c <= 0x0bef)
+     || (c >= 0x0c66 && c <= 0x0c6f)
+     || (c >= 0x0ce6 && c <= 0x0cef)
+     || (c >= 0x0d66 && c <= 0x0d6f)
+     || (c >= 0x0e50 && c <= 0x0e59)
+     || (c >= 0x0ed0 && c <= 0x0ed9)
+     || (c >= 0x0f20 && c <= 0x0f33))
+    {
+      if (!allow_digits || cxx98)
+	goto fail;
+      return 1;
+    }
+
+  /* Special characters */
+  if ((c == 0x00b5)
+      || (c == 0x00b7)
+      || (c >= 0x02b0 && c <= 0x02b8)
+      || (c == 0x02bb)
+      || (c >= 0x02bd && c <= 0x02c1)
+      || (c >= 0x02d0 && c <= 0x02d1)
+      || (c >= 0x02e0 && c <= 0x02e4)
+      || (c == 0x037a)
+      || (c == 0x0559)
+      || (c == 0x093d)
+      || (c == 0x0b3d)
+      || (c == 0x1fbe)
+      || (c >= 0x203f && c <= 0x2040)
+      || (c == 0x2102)
+      || (c == 0x2107)
+      || (c >= 0x210a && c <= 0x2113)
+      || (c == 0x2115)
+      || (c >= 0x2118 && c <= 0x211d)
+      || (c == 0x2124)
+      || (c == 0x2126)
+      || (c == 0x2128)
+      || (c >= 0x212a && c <= 0x2131)
+      || (c >= 0x2133 && c <= 0x2138)
+      || (c >= 0x2160 && c <= 0x2182)
+      || (c >= 0x3005 && c <= 0x3007)
+      || (c >= 0x3021 && c <= 0x3029))
+    {
+      if (cxx98)
+	goto fail;
+      return 1;
+    }
+
+    fail:
+  cpp_error_with_line (pfile, DL_ERROR,
+                       pfile->line, 1, /* XXX */
+                       "universal-character-name '\\u%04x' not valid in identifier", (int)c);
+  return 0;
+#endif
+}
+
+/* Add the UTF-8 representation of C to the obstack STACK.  */
+
+static void
+utf8_extend_token (stack, c)
+     struct obstack *stack;
+     int c;
+{
+  int shift, mask;
+
+  if      (c <= 0x0000007f)
+    {
+      obstack_1grow (stack, c);
+      return;
+    }
+  else if (c <= 0x000007ff)
+    shift = 6, mask = 0xc0;
+  else if (c <= 0x0000ffff)
+    shift = 12, mask = 0xe0;
+  else if (c <= 0x001fffff)
+    shift = 18, mask = 0xf0;
+  else if (c <= 0x03ffffff)
+    shift = 24, mask = 0xf8;
+  else
+    shift = 30, mask = 0xfc;
+
+  obstack_1grow (stack, mask | (c >> shift));
+  do
+    {
+      shift -= 6;
+      obstack_1grow (stack, (unsigned char) (0x80 | ((c >> shift) & 0x3f)));
+    }
+  while (shift);
+}
+
+/* Put the UCN form onto the obstack. */
+
+static void
+ucn_extend_token (stack, c)
+     struct obstack *stack;
+     int c;
+{
+  int len;
+  obstack_1grow (stack, '\\');
+  if (c < 0x10000)
+    {
+      obstack_1grow (stack, 'u');
+      len = 4;
+    }
+  else
+    {
+      obstack_1grow (stack, 'U');
+      len = 8;
+    }
+  while (len--)
+    {
+      int d = (c >> 4*len) & 0xF;
+      if (d < 10)
+	obstack_1grow (stack, '0' + d);
+      else
+	obstack_1grow (stack, 'a' + d - 10);
+    }
+}
+
+static cppchar_t
+utf8_to_char (pos)
+     const unsigned char **pos;
+{
+  cppchar_t result = 0;
+  const unsigned char *s = *pos;
+  if (*s < 128)
+    {
+      result = *s;
+      *pos += 1;
+    }
+  else if (*s < 0xc0)
+    {
+      /* Cannot occur as first byte */
+      abort();
+    }
+  else if (*s < 0xE0)
+    {
+      result = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
+      *pos += 2;
+    }
+  else if (*s < 0xF0)
+    {
+      result =
+        ((s[0] & 0xf) << 12) +
+        ((s[1] & 0x3f) << 6) +
+        (s[2] & 0x3f);
+      *pos += 3;
+    }
+  else if (*s < 0xF8)
+    {
+      result =
+        ((s[0] & 0x7) << 18) +
+        ((s[1] & 0x3f) << 12) +
+        ((s[2] & 0x3f) << 6) +
+        (s[3] & 0x3f);
+      *pos += 4;
+    }
+  else
+    {
+      /* Other codes are reserved. */
+      abort ();
+    }
+  return result;
+}
+
 /* Skips any escaped newlines introduced by '?' or a '\\', assumed to
    lie in buffer->cur[-1].  Returns the next byte, which will be in
    buffer->cur[-1].  This routine performs preprocessing stages 1 and
@@ -451,11 +1124,19 @@ parse_identifier (pfile)
   /* Check for slow-path cases.  */
   if (*cur == '?' || *cur == '\\' || *cur == '$')
     {
-      unsigned int len;
+      unsigned int len, utf8;
 
-      base = parse_slow (pfile, cur, 0, &len);
+      base = parse_slow (pfile, cur, 0, &len, &utf8);
       result = (cpp_hashnode *)
 	ht_lookup (pfile->hash_table, base, len, HT_ALLOCED);
+      if (utf8)
+	{
+	  result->flags |= NODE_USES_EXTENDED_CHARACTERS;
+#ifndef HAVE_AS_UTF8
+	  cpp_error (pfile, DL_ERROR, 
+		     "Non-ASCII identifiers not supported by your assembler");
+#endif
+	}
     }
   else
     {
@@ -493,11 +1174,12 @@ parse_identifier (pfile)
    pointer to the token's NUL-terminated spelling in permanent
    storage, and sets PLEN to its length.  */
 static uchar *
-parse_slow (pfile, cur, number_p, plen)
+parse_slow (pfile, cur, number_p, plen, utf8)
      cpp_reader *pfile;
      const uchar *cur;
      int number_p;
      unsigned int *plen;
+     unsigned int *utf8;
 {
   cpp_buffer *buffer = pfile->buffer;
   const uchar *base = buffer->cur - 1;
@@ -516,12 +1198,33 @@ parse_slow (pfile, cur, number_p, plen)
   prevc = cur[-1];
   c = *cur++;
   buffer->cur = cur;
+  *utf8 = 0;
   for (;;)
     {
       /* Potential escaped newline?  */
       buffer->backup_to = buffer->cur - 1;
       if (c == '?' || c == '\\')
-	c = skip_escaped_newlines (pfile);
+	  c = skip_escaped_newlines (pfile);
+
+      if (c == '\\' && (*buffer->cur == 'u'
+                        || *buffer->cur == 'U'))
+        {
+          cur = buffer->cur - 1;
+          c = *buffer->cur++;
+          if (maybe_read_ucs_reader (pfile, &c) == 0
+              && identifier_ucs_p (pfile, c, 1))
+            {
+	      if (number_p)
+		ucn_extend_token (stack, c);
+	      else
+		utf8_extend_token (stack, c);
+              c = *buffer->cur++;
+              *utf8 = 1;
+              continue;
+            }
+          buffer->cur = cur;
+          c = *buffer->cur++;
+        }
 
       if (!is_idchar (c))
 	{
@@ -570,6 +1273,7 @@ parse_number (pfile, number, leading_per
      int leading_period;
 {
   const uchar *cur;
+  unsigned int unused;
 
   /* Fast-path loop.  Skim over a normal number.
      N.B. ISIDNUM does not include $.  */
@@ -579,7 +1283,8 @@ parse_number (pfile, number, leading_per
 
   /* Check for slow-path cases.  */
   if (*cur == '?' || *cur == '\\' || *cur == '$')
-    number->text = parse_slow (pfile, cur, 1 + leading_period, &number->len);
+    number->text = parse_slow (pfile, cur, 1 + leading_period, 
+			       &number->len, &unused);
   else
     {
       const uchar *base = pfile->buffer->cur - 1;
@@ -1025,7 +1730,24 @@ _cpp_lex_direct (pfile)
       if (c == '?')
 	result->type = CPP_QUERY;
       else if (c == '\\')
-	goto random_char;
+        {
+          const unsigned char *pos = buffer->cur;
+          
+          c = *buffer->cur++;
+          if ((c == 'u' || c == 'U')
+              && maybe_read_ucs_reader (pfile, &c) == 0
+              && identifier_ucs_p (pfile, c, 0))
+            {
+              buffer->cur = pos;
+              goto start_ident;
+            }
+          else
+            {
+              c = '\\';
+              buffer->cur = pos;
+              goto random_char;
+            }
+        }
       else
 	goto trigraph;
       break;
@@ -1402,8 +2124,35 @@ cpp_spell_token (pfile, token, buffer)
 
     spell_ident:
     case SPELL_IDENT:
-      memcpy (buffer, NODE_NAME (token->val.node), NODE_LEN (token->val.node));
-      buffer += NODE_LEN (token->val.node);
+      if ((token->val.node->flags & NODE_USES_EXTENDED_CHARACTERS) == 0)
+	{
+	  memcpy (buffer, NODE_NAME (token->val.node), 
+		  NODE_LEN (token->val.node));
+	  buffer += NODE_LEN (token->val.node);
+	}
+      else
+	{
+          const unsigned char *s = NODE_NAME (token->val.node);
+          int len = NODE_LEN (token->val.node);
+          while (len)
+            {
+              if (*s < 128)
+                {
+                  *buffer++ = *s++;
+                  len--;
+                }
+              else
+                {
+                  const unsigned char *old = s;
+                  cppchar_t code = utf8_to_char (&s);
+                  if (code < 0x10000)
+                    buffer += sprintf ((char*)buffer, "\\u%.4x", code);
+                  else
+                    buffer += sprintf ((char*)buffer, "\\U%.8x", code);
+                  len -= s - old;
+                }
+            }
+	}
       break;
 
     case SPELL_NUMBER:
@@ -1503,7 +2252,32 @@ cpp_output_token (token, fp)
 
     spell_ident:
     case SPELL_IDENT:
-      fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
+      if ((token->val.node->flags & NODE_USES_EXTENDED_CHARACTERS) == 0)
+        fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
+      else
+        {
+          const unsigned char *s = NODE_NAME (token->val.node);
+          int len = NODE_LEN (token->val.node);
+          while (len)
+            {
+              if (*s < 128)
+                {
+                  fputc (*s, fp);
+		  s++;
+                  len--;
+                }
+              else
+                {
+                  const unsigned char *old = s;
+                  cppchar_t code = utf8_to_char (&s);
+                  if (code < 0x10000)
+                    fprintf (fp, "\\u%.4x", code);
+                  else
+                    fprintf (fp, "\\U%.8x", code);
+                  len -= s - old;
+                }
+            }
+        }
     break;
 
     case SPELL_NUMBER:
@@ -1738,6 +2512,63 @@ maybe_read_ucs (pfile, pstr, limit, pc)
 #endif
 
   *pstr = p;
+  *pc = code;
+  return 0;
+}
+
+/* Like maybe_read_ucs, but always reads the data from the reader PFILE.  */
+
+static int
+maybe_read_ucs_reader (pfile, pc)
+     cpp_reader *pfile;
+     cppchar_t *pc;
+{
+  unsigned int code = 0;
+  cppchar_t c = *pc;
+  unsigned int length;
+
+  /* Only attempt to interpret a UCS for C++ and C99.  */
+  if (! (CPP_OPTION (pfile, cplusplus) || CPP_OPTION (pfile, c99)))
+    return 1;
+
+  if (CPP_WTRADITIONAL (pfile))
+    cpp_error (pfile, DL_WARNING,
+	       "the meaning of '\\%c' is different in traditional C", c);
+
+  length = (c == 'u' ? 4 : 8);
+
+  for (; length; length--)
+    {
+      c = get_effective_char (pfile);
+      if (ISXDIGIT (c))
+	code = (code << 4) + hex_digit_value (c);
+      else
+	{
+	  cpp_error (pfile, DL_ERROR,
+		     "non-hex digit '%c' in universal-character-name", c);
+	  /* We shouldn't skip in case there are multibyte chars.  */
+	  break;
+	}
+    }
+
+#ifdef TARGET_EBCDIC
+  cpp_error (pfile, DL_ERROR, "universal-character-name on EBCDIC target");
+  code = 0x3f;  /* EBCDIC invalid character */
+#else
+  /* True extended characters are OK.  */
+  if (code >= 0xa0
+      && !(code & 0x80000000)
+      && !(code >= 0xD800 && code <= 0xDFFF))
+    ;
+  /* The standard permits $, @ and ` to be specified as UCNs.  We use
+     hex escapes so that this also works with EBCDIC hosts.  */
+  else if (code == 0x24 || code == 0x40 || code == 0x60)
+    ;
+  /* Don't give another error if one occurred above.  */
+  else if (length == 0)
+    cpp_error (pfile, DL_ERROR, "universal-character-name out of range");
+#endif
+
   *pc = code;
   return 0;
 }
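
As a rough, standalone illustration of the checks maybe_read_ucs_reader
performs (this is not part of the patch), the sketch below parses the
hex digits of a UCN from a plain string and applies the same range
test. The names ucn_code_in_range and parse_ucn are hypothetical. The
sketch assumes the input contains no escaped newlines, which the patch
handles through get_effective_char, and it reports failure through its
return value instead of calling cpp_error.

```c
#include <ctype.h>
#include <stdbool.h>

/* Same range test as in the patch: accept true extended characters
   (0xa0 and above, high bit clear, not a surrogate) plus the three
   basic characters $, @ and `.  */
static bool
ucn_code_in_range (unsigned int code)
{
  if (code >= 0xa0
      && !(code & 0x80000000u)
      && !(code >= 0xd800 && code <= 0xdfff))
    return true;
  return code == 0x24 || code == 0x40 || code == 0x60;
}

/* S points at the 'u' or 'U' that follows the backslash.  On success,
   store the value in *CODE and return the number of characters
   consumed (5 or 9).  Return 0 for a malformed or out-of-range UCN.  */
static int
parse_ucn (const char *s, unsigned int *code)
{
  unsigned int value = 0;
  int digits, i;

  if (*s != 'u' && *s != 'U')
    return 0;
  digits = (*s == 'u') ? 4 : 8;

  for (i = 1; i <= digits; i++)
    {
      unsigned char c = (unsigned char) s[i];

      /* A NUL terminator is not a hex digit, so this never reads
         past the end of the string.  */
      if (!isxdigit (c))
        return 0;
      value = (value << 4) + (isdigit (c) ? c - '0' : tolower (c) - 'a' + 10);
    }

  if (!ucn_code_in_range (value))
    return 0;
  *code = value;
  return digits + 1;
}
```

With these definitions, parse_ucn ("u00e9", &code) consumes five
characters and yields 0xe9, while "u0041" (the letter A) is rejected
because basic characters other than $, @ and ` may not be written as
UCNs.
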
Index: cpplib.h
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplib.h,v
retrieving revision 1.237
diff -u -r1.237 cpplib.h
--- cpplib.h	26 Sep 2002 22:25:12 -0000	1.237
+++ cpplib.h	28 Nov 2002 22:50:15 -0000
@@ -443,6 +443,7 @@
 #define NODE_DIAGNOSTIC (1 << 3)	/* Possible diagnostic when lexed.  */
 #define NODE_WARN	(1 << 4)	/* Warn if redefined or undefined.  */
 #define NODE_DISABLED	(1 << 5)	/* A disabled macro.  */
+#define NODE_USES_EXTENDED_CHARACTERS (1 << 6) /* Node has UTF-8 bytes in it.  */
 
 /* Different flavors of hash node.  */
 enum node_type
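
For readers who want to see the respelling in isolation: identifiers
flagged with NODE_USES_EXTENDED_CHARACTERS keep their extended
characters as UTF-8, and cpp_spell_token and cpp_output_token turn
those bytes back into UCNs. The sketch below shows that round trip
outside of cpplib. It is an illustration under assumptions, not the
code from the patch: ucs_t, decode_utf8 and spell_with_ucns are
hypothetical names, the decoder uses a loop where utf8_to_char spells
out the shifts, and invalid lead bytes are reported through a return
value where the patch calls abort ().

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for cpplib's cppchar_t.  */
typedef unsigned int ucs_t;
#define INVALID ((ucs_t) -1)

/* Decode one UTF-8 sequence at *POS (which must be before END) and
   advance *POS past it.  Return INVALID for a byte that cannot start
   a sequence or for a sequence cut off by END.  As in the patch,
   continuation bytes are masked but not validated.  */
static ucs_t
decode_utf8 (const unsigned char **pos, const unsigned char *end)
{
  const unsigned char *s = *pos;
  size_t need, i;
  ucs_t result;

  if (s[0] < 0x80)
    { need = 1; result = s[0]; }
  else if (s[0] < 0xc0 || s[0] >= 0xf8)
    return INVALID;
  else if (s[0] < 0xe0)
    { need = 2; result = s[0] & 0x1f; }
  else if (s[0] < 0xf0)
    { need = 3; result = s[0] & 0x0f; }
  else
    { need = 4; result = s[0] & 0x07; }

  if ((size_t) (end - s) < need)
    return INVALID;
  for (i = 1; i < need; i++)
    result = (result << 6) | (s[i] & 0x3f);
  *pos = s + need;
  return result;
}

/* Write the LEN bytes at NAME to OUT, replacing each non-ASCII
   character with \uXXXX or \UXXXXXXXX.  Return the length written,
   or -1 if the input is invalid or OUT is too small.  */
static int
spell_with_ucns (const unsigned char *name, size_t len,
                 char *out, size_t out_size)
{
  const unsigned char *s = name, *end = name + len;
  size_t used = 0;

  if (out_size == 0)
    return -1;
  out[0] = '\0';

  while (s < end)
    {
      ucs_t code = decode_utf8 (&s, end);
      int n;

      if (code == INVALID)
        return -1;
      if (code < 0x80)
        n = snprintf (out + used, out_size - used, "%c", (int) code);
      else if (code < 0x10000)
        n = snprintf (out + used, out_size - used, "\\u%.4x", code);
      else
        n = snprintf (out + used, out_size - used, "\\U%.8x", code);

      /* snprintf returns the length it wanted to write; treat
         truncation as failure.  */
      if (n < 0 || (size_t) n >= out_size - used)
        return -1;
      used += (size_t) n;
    }
  return (int) used;
}
```

Given the five bytes 'c', 'a', 'f', 0xc3, 0xa9, spell_with_ucns
produces the nine-character spelling caf\u00e9, which is the form a
pasted or stringified identifier would take.
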

