This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


[PATCH] utf-16 and utf-32 support in C and C++


Oracle has a full copyright assignment in place with the FSF.

This patch implements support for UTF-16 and UTF-32 character data
types in C and C++, based on the ISO/IEC draft technical report for C
(ISO/IEC JTC1 SC22 WG14 N1040) and the proposal for C++ (ISO/IEC JTC1
SC22 WG21 N2249).  Neither proposal specifies a byte order for UTF-16.
This implementation uses the target endianness to determine whether
UTF-16BE or UTF-16LE is used.
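
To illustrate what the byte order choice means for a single code unit,
here is a minimal sketch.  It is not code from the patch, and the helper
names host_is_little_endian and store_utf16_unit are hypothetical.  It
only shows how the same 16-bit code unit is laid out in memory as
UTF-16LE versus UTF-16BE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true when the machine running this code stores the least
// significant byte first (little-endian).
inline bool
host_is_little_endian ()
{
  const std::uint16_t probe = 0x0001;
  unsigned char first;
  std::memcpy (&first, &probe, 1);
  return first == 0x01;
}

// Hypothetical helper: write the two bytes of one UTF-16 code unit
// into OUT, least significant byte first when LITTLE_ENDIAN_P is set.
inline void
store_utf16_unit (std::uint16_t unit, bool little_endian_p,
		  unsigned char *out)
{
  assert (out != NULL);
  if (little_endian_p)
    {
      out[0] = static_cast<unsigned char> (unit & 0xff);
      out[1] = static_cast<unsigned char> (unit >> 8);
    }
  else
    {
      out[0] = static_cast<unsigned char> (unit >> 8);
      out[1] = static_cast<unsigned char> (unit & 0xff);
    }
}
```

Passing host_is_little_endian () as the second argument reproduces the
in-memory layout of a native 16-bit integer, which corresponds to the
target == host case.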

Support is added for the following wide character datatypes (internal
for C, primitive types for C++) with the given underlying data types:

	char16_t		short unsigned int
	char32_t		unsigned int

Support is added to the tokenizer to accept the following new character
and string literal notations:

	u'c-char-sequence'	char16_t character literal (UTF-16)
	U'c-char-sequence'	char32_t character literal (UTF-32)

	u"s-char-sequence"	array of char16_t (UTF-16)
	U"s-char-sequence"	array of char32_t (UTF-32)
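
As a quick illustration of the notations above, the following sketch
checks the element types and array lengths of the new literals.  It
assumes a compiler that accepts this syntax as it was later standardized
in C++11.  The helpers array_length and utf16_abc_bytes are hypothetical
names used only for this example.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements in an array, such as the one a string literal
// denotes (the count includes the terminating null element).
template <typename T, std::size_t N>
constexpr std::size_t
array_length (const T (&)[N])
{
  return N;
}

// The character literals have the width of the corresponding type.
static_assert (sizeof (u'a') == sizeof (char16_t), "u'...' is a char16_t");
static_assert (sizeof (U'a') == sizeof (char32_t), "U'...' is a char32_t");

// The string literals are arrays of the corresponding type.
static_assert (array_length (u"abc") == 4, "3 code units + terminator");
static_assert (array_length (U"abc") == 4, "3 code units + terminator");

// Hypothetical helper: storage in bytes occupied by the literal u"abc".
inline std::size_t
utf16_abc_bytes ()
{
  return sizeof (u"abc");
}
```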

The aforementioned proposals do not specifically state what should be
done when a UTF-16 (char16_t) character literal contains a 32-bit
universal character (\Unnnnnnnn).  This implementation will issue an
error about the constant being too long.
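
The reason such a literal cannot fit is that a code point outside the
BMP needs two UTF-16 code units (a surrogate pair), while a character
literal holds one.  A minimal sketch of that arithmetic follows.  The
functions utf16_units_needed and utf16_surrogates are hypothetical and
not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of UTF-16 code units needed for the
// code point CP.  Returns 0 for values that cannot be encoded
// (surrogate code points, or anything above U+10FFFF).
inline int
utf16_units_needed (std::uint32_t cp)
{
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;
  if (cp > 0x10FFFF)
    return 0;
  return cp < 0x10000 ? 1 : 2;
}

// Split a code point outside the BMP into its surrogate pair.
inline void
utf16_surrogates (std::uint32_t cp, char16_t &high, char16_t &low)
{
  assert (utf16_units_needed (cp) == 2);
  const std::uint32_t v = cp - 0x10000;	/* 20 significant bits */
  high = static_cast<char16_t> (0xD800 + (v >> 10));
  low = static_cast<char16_t> (0xDC00 + (v & 0x3FF));
}
```

Any code point for which utf16_units_needed returns 2 is one that a
single char16_t character literal cannot represent.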

Support is added to the C parser and the C++ parser to handle the
following concatenations of string literals:

	 "a" u"b"	-> u"ab"
	u"a"  "b"	-> u"ab"
	u"a" u"b"	-> u"ab"

	 "a" U"b"	-> U"ab"
	U"a"  "b"	-> U"ab"
	U"a" U"b"	-> U"ab"

The proposals do not exclude additional concatenation rules.  This
implementation also accepts the following concatenations.  The rationale
is that concatenating strings yields a string of the widest type
involved, according to the ascending order:
char - char16_t - char32_t - wchar_t.

	u"a" U"b"	-> U"ab"
	U"a" u"b"	-> U"ab"
	u"a" L"b"	-> L"ab"
	L"a" u"b"	-> L"ab"
	U"a" L"b"	-> L"ab"
	L"a" U"b"	-> L"ab"
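
The concatenation rules listed above can be sketched as a small state
update.  The switch below mirrors the one this patch adds to lex_string
in gcc/c-lex.c, but StringKind, concat_kind and concat_all are
hypothetical stand-ins for illustration, not names from the patch.

```cpp
#include <cassert>

// Hypothetical stand-ins for CPP_STRING, CPP_STRING16, CPP_STRING32
// and CPP_WSTRING, listed in ascending order of precedence.
enum StringKind { NARROW, UTF16, UTF32, WIDE };

// Kind of the result after appending a literal of kind NEXT to a
// sequence whose accumulated kind is ACC.
inline StringKind
concat_kind (StringKind acc, StringKind next)
{
  switch (next)
    {
    case WIDE:
      return WIDE;			/* L"..." always wins.  */
    case UTF32:
      return acc != WIDE ? UTF32 : acc;	/* U"..." beats all but L.  */
    case UTF16:
      return acc == NARROW ? UTF16 : acc; /* u"..." beats only "...".  */
    case NARROW:
    default:
      return acc;			/* "..." never changes the kind.  */
    }
}

// Fold concat_kind over N adjacent literals (N must be at least 1).
inline StringKind
concat_all (const StringKind *kinds, int n)
{
  assert (n > 0);
  StringKind acc = kinds[0];
  for (int i = 1; i < n; i++)
    acc = concat_kind (acc, kinds[i]);
  return acc;
}
```

Because the enumerators are declared in ascending order, concat_kind
returns the larger of its two arguments, so the result does not depend
on the order in which the literals appear.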

Changes were also needed in some parts of the tokenizer and the parser
to change the existing logic from distinguishing between non-wide and
wide character to supporting characters of varying widths.
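
One place where this matters is the computation of the number of array
elements from a string's byte length, which this patch changes in
fix_string_type.  The following is a rough sketch of that idea only.
ElemKind, element_size and element_count are hypothetical names, and
sizeof on the host types stands in for the TYPE_PRECISION /
BITS_PER_UNIT computation on the target types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tags for the four element types a string can have.
enum ElemKind { ELEM_CHAR, ELEM_CHAR16, ELEM_CHAR32, ELEM_WCHAR };

// Size in bytes of one element of the given kind (host sizes are an
// assumption of this sketch; the compiler uses target sizes).
inline std::size_t
element_size (ElemKind kind)
{
  switch (kind)
    {
    case ELEM_CHAR16:
      return sizeof (char16_t);
    case ELEM_CHAR32:
      return sizeof (char32_t);
    case ELEM_WCHAR:
      return sizeof (wchar_t);
    case ELEM_CHAR:
    default:
      return sizeof (char);
    }
}

// Number of elements in a string of BYTE_LENGTH bytes, terminator
// included, instead of the former two-way wide/non-wide choice.
inline std::size_t
element_count (std::size_t byte_length, ElemKind kind)
{
  const std::size_t size = element_size (kind);
  assert (byte_length % size == 0);
  return byte_length / size;
}
```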

Testcases:
----------
This patch adds testcases for all functionality described above.  The
test cases ensure that the literals are parsed correctly, and that the
resulting values are correct.  The tests also ensure that the width of
the character literals is correct.  All combinations of string
concatenation are exercised as well.  Finally, tests were added to
ensure that errors are flagged for empty characters (u'' and U''),
warnings for constants that are too long (u'ab', U'ab' and u"\Unnnnnnnn"
where \Unnnnnnnn is outside the BMP), and warnings for implicit truncation
of values (char16_t c = U'\Unnnnnnnn' or char32_t c = u'\Unnnnnnnn'
where \Unnnnnnnn is outside the BMP).

ChangeLog entries:
------------------
libcpp/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        * include/cpp-id-data.h (UC): Was U, conflicts with U"..." literal.
        * include/cpplib.h (CHAR16, CHAR32, STRING16, STRING32): New tokens.
        (cpp_interpret_string): Update prototype.
        (cpp_interpret_string_notranslate): Idem.
        * charset.c (init_iconv_desc): New width member in cset_converter.
        (cpp_init_iconv): Add support for char{16,32}_cset_desc.
        (convert_ucn): Idem.
        (emit_numeric_escape): Idem.
        (convert_hex): Idem.
        (convert_oct): Idem.
        (convert_escape): Idem.
        (convertor_for_type): New function.
        (cpp_interpret_string): Use convertor_for_type, support u and U prefix.
        (cpp_interpret_string_notranslate): Match changed prototype.
        (wide_str_to_charconst): Use convertor_for_type.
        (cpp_interpret_charconst): Add support for CPP_CHAR{16,32}.
        * directives.c (linemarker_dir): Macro U changed to UC.
        (parse_include): Idem.
        (register_pragma_1): Idem.
        (restore_registered_pragmas): Idem.
        (get__Pragma_string): Support CPP_STRING{16,32}.
        * expr.c (eval_token): Support CPP_CHAR{16,32}.
        * internal.h (struct cset_converter) <width>: New field.
        (struct cpp_reader) <char16_cset_desc>: Idem.
        (struct cpp_reader) <char32_cset_desc>: Idem.
        * lex.c (digraph_spellings): Macro U changed to UC.
        (OP, TK): Idem.
        (lex_string): Add support for u'...', U'...', u"..." and U"...".
        (_cpp_lex_direct): Idem.
        * macro.c (_cpp_builtin_macro_text): Macro U changed to UC.
        (stringify_arg): Support CPP_CHAR{16,32} and CPP_STRING{16,32}.

gcc/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>
          
        * c-common.c (CHAR16_TYPE, CHAR32_TYPE): New macros.
        (fname_as_string): Match updated cpp_interpret_string prototype.
        (fix_string_type): Support char16_t* and char32_t*.
        (c_common_nodes_and_builtins): Add char16_t and char32_t (and
        derivative) nodes.
        (c_parse_error): Support CPP_CHAR{16,32}.
        * c-common.h (RID_CHAR16, RID_CHAR32): New elements. 
        (enum c_tree_index) <CTI_CHAR16_TYPE, CTI_SIGNED_CHAR16_TYPE,
        CTI_UNSIGNED_CHAR16_TYPE, CTI_CHAR32_TYPE, CTI_SIGNED_CHAR32_TYPE,
        CTI_UNSIGNED_CHAR32_TYPE, CTI_CHAR16_ARRAY_TYPE,
        CTI_CHAR32_ARRAY_TYPE>: New elements.
        (char16_type_node, signed_char16_type_node, unsigned_char16_type_node,
        char32_type_node, signed_char32_type_node, unsigned_char32_type_node,
        char16_array_type_node, char32_array_type_node): New defines.
        * c-lex.c (cb_ident): Match updated cpp_interpret_string prototype.
        (c_lex_with_flags): Support CPP_CHAR{16,32} and CPP_STRING{16,32}.
        (lex_string): Support CPP_STRING{16,32}, match updated
        cpp_interpret_string and cpp_interpret_string_notranslate prototypes.
        (lex_charconst): Support CPP_CHAR{16,32}.
        * c-parser.c (c_parser_postfix_expression): Support CPP_CHAR{16,32}
        and CPP_STRING{16,32}.

gcc/cp/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        * parser.c (cp_lexer_next_token_is_decl_specifier_ke): Support
        RID_CHAR{16,32}.
        (cp_lexer_print_token): Support CPP_STRING{16,32}.
        (cp_parser_is_string_literal): Idem.
        (cp_parser_string_literal): Idem.
        (cp_parser_primary_expression): Support CPP_CHAR{16,32} and
        CPP_STRING{16,32}.
        (cp_parser_simple_type_specifier): Support RID_CHAR{16,32}. 
        * tree.c (char_type_p): Support char16_t and char32_t as char types.

gcc/testsuite/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        Tests for char16_t and char32_t support.
        * g++.dg/other/utf16-1.C: New
        * g++.dg/other/utf16-2.C: New
        * g++.dg/other/utf16-3.C: New
        * g++.dg/other/utf16-4.C: New
        * g++.dg/other/utf32-1.C: New
        * g++.dg/other/utf32-2.C: New
        * g++.dg/other/utf32-3.C: New
        * g++.dg/other/utf32-4.C: New
        * gcc.dg/utf16-1.c: New
        * gcc.dg/utf16-2.c: New
        * gcc.dg/utf16-3.c: New
        * gcc.dg/utf16-4.c: New
        * gcc.dg/utf32-1.c: New
        * gcc.dg/utf32-2.c: New
        * gcc.dg/utf32-3.c: New
        * gcc.dg/utf32-4.c: New

Bootstrapping and testing:
--------------------------
The source tree was built on the following platforms (target == host):

	i686-linux
	x86_64-linux
	ppc64-linux

Builds were done for both the unpatched tree and the patched tree, and
testsuite (make -k check) summary results were verified to be identical,
except for the added tests in the patched tree.  This was done to ensure
that the patch does not introduce regressions.

Index: gcc/c-lex.c
===================================================================
--- gcc/c-lex.c	(revision 133117)
+++ gcc/c-lex.c	(working copy)
@@ -174,7 +174,7 @@ cb_ident (cpp_reader * ARG_UNUSED (pfile
     {
       /* Convert escapes in the string.  */
       cpp_string cstr = { 0, 0 };
-      if (cpp_interpret_string (pfile, str, 1, &cstr, false))
+      if (cpp_interpret_string (pfile, str, 1, &cstr, CPP_STRING))
 	{
 	  ASM_OUTPUT_IDENT (asm_out_file, (const char *) cstr.text);
 	  free (CONST_CAST (unsigned char *, cstr.text));
@@ -361,6 +361,8 @@ c_lex_with_flags (tree *value, location_
 
 	    case CPP_STRING:
 	    case CPP_WSTRING:
+	    case CPP_STRING16:
+	    case CPP_STRING32:
 	      type = lex_string (tok, value, true, true);
 	      break;
 
@@ -410,11 +412,15 @@ c_lex_with_flags (tree *value, location_
 
     case CPP_CHAR:
     case CPP_WCHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
       *value = lex_charconst (tok);
       break;
 
     case CPP_STRING:
     case CPP_WSTRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
       if ((lex_flags & C_LEX_RAW_STRINGS) == 0)
 	{
 	  type = lex_string (tok, value, false,
@@ -822,12 +828,12 @@ interpret_fixed (const cpp_token *token,
   return value;
 }
 
-/* Convert a series of STRING and/or WSTRING tokens into a tree,
-   performing string constant concatenation.  TOK is the first of
-   these.  VALP is the location to write the string into.  OBJC_STRING
-   indicates whether an '@' token preceded the incoming token.
+/* Convert a series of STRING, WSTRING, STRING16 and/or STRING32 tokens
+   into a tree, performing string constant concatenation.  TOK is the
+   first of these.  VALP is the location to write the string into.
+   OBJC_STRING indicates whether an '@' token preceded the incoming token.
    Returns the CPP token type of the result (CPP_STRING, CPP_WSTRING,
-   or CPP_OBJC_STRING).
+   CPP_STRING32, CPP_STRING16, or CPP_OBJC_STRING).
 
    This is unfortunately more work than it should be.  If any of the
    strings in the series has an L prefix, the result is a wide string
@@ -842,19 +848,16 @@ static enum cpp_ttype
 lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 {
   tree value;
-  bool wide = false;
   size_t concats = 0;
   struct obstack str_ob;
   cpp_string istr;
+  enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
   cpp_string *strs = &str;
 
-  if (tok->type == CPP_WSTRING)
-    wide = true;
-
  retry:
   tok = cpp_get_token (parse_in);
   switch (tok->type)
@@ -873,10 +876,21 @@ lex_string (const cpp_token *tok, tree *
       break;
 
     case CPP_WSTRING:
-      wide = true;
-      /* FALLTHROUGH */
+      type = CPP_WSTRING;
+      goto concat;
+
+    case CPP_STRING32:
+      if (type != CPP_WSTRING)
+	type = CPP_STRING32;
+      goto concat;
+
+    case CPP_STRING16:
+      if (type == CPP_STRING)
+	type = CPP_STRING16;
+      goto concat;
 
     case CPP_STRING:
+  concat:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
@@ -899,7 +913,7 @@ lex_string (const cpp_token *tok, tree *
 
   if ((translate
        ? cpp_interpret_string : cpp_interpret_string_notranslate)
-      (parse_in, strs, concats + 1, &istr, wide))
+      (parse_in, strs, concats + 1, &istr, type))
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
@@ -909,22 +923,50 @@ lex_string (const cpp_token *tok, tree *
       /* Callers cannot generally handle error_mark_node in this context,
 	 so return the empty string instead.  cpp_interpret_string has
 	 issued an error.  */
-      if (wide)
-	value = build_string (TYPE_PRECISION (wchar_type_node)
-			      / TYPE_PRECISION (char_type_node),
-			      "\0\0\0");  /* widest supported wchar_t
-					     is 32 bits */
-      else
-	value = build_string (1, "");
+      switch (type) {
+	default:
+	case CPP_STRING:
+	  value = build_string (1, "");
+	  break;
+	case CPP_STRING16:
+	  value = build_string (TYPE_PRECISION (char16_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0");  /* char16_t is 16 bits */
+	  break;
+	case CPP_STRING32:
+	  value = build_string (TYPE_PRECISION (char32_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0\0\0");  /* char32_t is 32 bits */
+	  break;
+	case CPP_WSTRING:
+	  value = build_string (TYPE_PRECISION (wchar_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0\0\0");  /* widest supported wchar_t
+					       is 32 bits */
+	  break;
+      }
     }
 
-  TREE_TYPE (value) = wide ? wchar_array_type_node : char_array_type_node;
+  switch (type) {
+    default:
+    case CPP_STRING:
+      TREE_TYPE (value) = char_array_type_node;
+      break;
+    case CPP_STRING16:
+      TREE_TYPE (value) = char16_array_type_node;
+      break;
+    case CPP_STRING32:
+      TREE_TYPE (value) = char32_array_type_node;
+      break;
+    case CPP_WSTRING:
+      TREE_TYPE (value) = wchar_array_type_node;
+  }
   *valp = fix_string_type (value);
 
   if (concats)
     obstack_free (&str_ob, 0);
 
-  return objc_string ? CPP_OBJC_STRING : wide ? CPP_WSTRING : CPP_STRING;
+  return objc_string ? CPP_OBJC_STRING : type;
 }
 
 /* Converts a (possibly wide) character constant token into a tree.  */
@@ -941,6 +983,10 @@ lex_charconst (const cpp_token *token)
 
   if (token->type == CPP_WCHAR)
     type = wchar_type_node;
+  else if (token->type == CPP_CHAR32)
+    type = char32_type_node;
+  else if (token->type == CPP_CHAR16)
+    type = char16_type_node;
   /* In C, a character constant has type 'int'.
      In C++ 'char', but multi-char charconsts have type 'int'.  */
   else if (!c_dialect_cxx () || chars_seen > 1)
Index: gcc/cp/tree.c
===================================================================
--- gcc/cp/tree.c	(revision 133117)
+++ gcc/cp/tree.c	(working copy)
@@ -2474,6 +2474,8 @@ char_type_p (tree type)
   return (same_type_p (type, char_type_node)
 	  || same_type_p (type, unsigned_char_type_node)
 	  || same_type_p (type, signed_char_type_node)
+	  || same_type_p (type, char16_type_node)
+	  || same_type_p (type, char32_type_node)
 	  || same_type_p (type, wchar_type_node));
 }
 
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c	(revision 133117)
+++ gcc/cp/parser.c	(working copy)
@@ -556,6 +556,8 @@ cp_lexer_next_token_is_decl_specifier_ke
     case RID_TYPENAME:
       /* Simple type specifiers.  */
     case RID_CHAR:
+    case RID_CHAR16:
+    case RID_CHAR32:
     case RID_WCHAR:
     case RID_BOOL:
     case RID_SHORT:
@@ -789,6 +791,8 @@ cp_lexer_print_token (FILE * stream, cp_
       break;
 
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       fprintf (stream, " \"%s\"", TREE_STRING_POINTER (token->u.value));
       break;
@@ -2033,7 +2037,10 @@ cp_parser_parsing_tentatively (cp_parser
 static bool
 cp_parser_is_string_literal (cp_token* token)
 {
-  return (token->type == CPP_STRING || token->type == CPP_WSTRING);
+  return (token->type == CPP_STRING ||
+	  token->type == CPP_STRING16 ||
+	  token->type == CPP_STRING32 ||
+	  token->type == CPP_WSTRING);
 }
 
 /* Returns nonzero if TOKEN is the indicated KEYWORD.  */
@@ -2861,11 +2868,11 @@ static tree
 cp_parser_string_literal (cp_parser *parser, bool translate, bool wide_ok)
 {
   tree value;
-  bool wide = false;
   size_t count;
   struct obstack str_ob;
   cpp_string str, istr, *strs;
   cp_token *tok;
+  enum cpp_ttype type;
 
   tok = cp_lexer_peek_token (parser->lexer);
   if (!cp_parser_is_string_literal (tok))
@@ -2874,6 +2881,8 @@ cp_parser_string_literal (cp_parser *par
       return error_mark_node;
     }
 
+  type = tok->type;
+
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   if (!cp_parser_is_string_literal
@@ -2884,8 +2893,6 @@ cp_parser_string_literal (cp_parser *par
       str.text = (const unsigned char *)TREE_STRING_POINTER (tok->u.value);
       str.len = TREE_STRING_LENGTH (tok->u.value);
       count = 1;
-      if (tok->type == CPP_WSTRING)
-	wide = true;
 
       strs = &str;
     }
@@ -2900,8 +2907,24 @@ cp_parser_string_literal (cp_parser *par
 	  count++;
 	  str.text = (const unsigned char *)TREE_STRING_POINTER (tok->u.value);
 	  str.len = TREE_STRING_LENGTH (tok->u.value);
-	  if (tok->type == CPP_WSTRING)
-	    wide = true;
+
+	  switch (tok->type) {
+	    case CPP_STRING:
+	      break;
+	    case CPP_STRING16:
+	      if (type == CPP_STRING)
+		type = CPP_STRING16;
+
+	      break;
+	    case CPP_STRING32:
+	      if (type != CPP_WSTRING)
+		type = CPP_STRING32;
+
+	      break;
+	    case CPP_WSTRING:
+	      type = CPP_WSTRING;
+	      break;
+	  }
 
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
 
@@ -2912,19 +2935,34 @@ cp_parser_string_literal (cp_parser *par
       strs = (cpp_string *) obstack_finish (&str_ob);
     }
 
-  if (wide && !wide_ok)
+  if (type != CPP_STRING && !wide_ok)
     {
       cp_parser_error (parser, "a wide string is invalid in this context");
-      wide = false;
+      type = CPP_STRING;
     }
 
   if ((translate ? cpp_interpret_string : cpp_interpret_string_notranslate)
-      (parse_in, strs, count, &istr, wide))
+      (parse_in, strs, count, &istr, type))
     {
       value = build_string (istr.len, (const char *)istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
 
-      TREE_TYPE (value) = wide ? wchar_array_type_node : char_array_type_node;
+      switch (type) {
+	default:
+	case CPP_STRING:
+	  TREE_TYPE (value) = char_array_type_node;
+	  break;
+	case CPP_STRING16:
+	  TREE_TYPE (value) = char16_array_type_node;
+	  break;
+	case CPP_STRING32:
+	  TREE_TYPE (value) = char32_array_type_node;
+	  break;
+	case CPP_WSTRING:
+	  TREE_TYPE (value) = wchar_array_type_node;
+	  break;
+      }
+
       value = fix_string_type (value);
     }
   else
@@ -3079,6 +3117,8 @@ cp_parser_primary_expression (cp_parser 
 	   string-literal
 	   boolean-literal  */
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
     case CPP_WCHAR:
     case CPP_NUMBER:
       token = cp_lexer_consume_token (parser->lexer);
@@ -3130,6 +3170,8 @@ cp_parser_primary_expression (cp_parser 
       return token->u.value;
 
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       /* ??? Should wide strings be allowed when parser->translate_strings_p
 	 is false (i.e. in attributes)?  If not, we can kill the third
@@ -10762,6 +10804,12 @@ cp_parser_simple_type_specifier (cp_pars
 	decl_specs->explicit_char_p = true;
       type = char_type_node;
       break;
+    case RID_CHAR16:
+      type = char16_type_node;
+      break;
+    case RID_CHAR32:
+      type = char32_type_node;
+      break;
     case RID_WCHAR:
       type = wchar_type_node;
       break;
Index: gcc/c-common.c
===================================================================
--- gcc/c-common.c	(revision 133117)
+++ gcc/c-common.c	(working copy)
@@ -66,6 +66,14 @@ cpp_reader *parse_in;		/* Declared in c-
 #define PID_TYPE "int"
 #endif
 
+#ifndef CHAR16_TYPE
+#define CHAR16_TYPE "short unsigned int"
+#endif
+
+#ifndef CHAR32_TYPE
+#define CHAR32_TYPE "unsigned int"
+#endif
+
 #ifndef WCHAR_TYPE
 #define WCHAR_TYPE "int"
 #endif
@@ -123,6 +131,13 @@ cpp_reader *parse_in;		/* Declared in c-
 	tree signed_wchar_type_node;
 	tree unsigned_wchar_type_node;
 
+	tree char16_type_node;
+	tree signed_char16_type_node;
+	tree unsigned_char16_type_node;
+	tree char32_type_node;
+	tree signed_char32_type_node;
+	tree unsigned_char32_type_node;
+
 	tree float_type_node;
 	tree double_type_node;
 	tree long_double_type_node;
@@ -174,6 +189,16 @@ cpp_reader *parse_in;		/* Declared in c-
 
 	tree wchar_array_type_node;
 
+   Type `char16_t[SOMENUMBER]' or something like it.
+   Used when a UTF-16 string literal is created.
+
+	tree char16_array_type_node;
+
+   Type `char32_t[SOMENUMBER]' or something like it.
+   Used when a UTF-32 string literal is created.
+
+	tree char32_array_type_node;
+
    Type `int ()' -- used for implicit declaration of functions.
 
 	tree default_function_type;
@@ -777,7 +802,7 @@ fname_as_string (int pretty_p)
   strname.text = (unsigned char *) namep;
   strname.len = len - 1;
 
-  if (cpp_interpret_string (parse_in, &strname, 1, &cstr, false))
+  if (cpp_interpret_string (parse_in, &strname, 1, &cstr, CPP_STRING))
     {
       XDELETEVEC (namep);
       return (const char *) cstr.text;
@@ -857,14 +882,28 @@ fname_decl (unsigned int rid, tree id)
 tree
 fix_string_type (tree value)
 {
-  const int wchar_bytes = TYPE_PRECISION (wchar_type_node) / BITS_PER_UNIT;
-  const int wide_flag = TREE_TYPE (value) == wchar_array_type_node;
+  const bool wide = TREE_TYPE (value)
+		    && TREE_TYPE (value) != char_array_type_node;
   int length = TREE_STRING_LENGTH (value);
   int nchars;
   tree e_type, i_type, a_type;
 
   /* Compute the number of elements, for the array type.  */
-  nchars = wide_flag ? length / wchar_bytes : length;
+  if (wide) {
+    if (TREE_TYPE (value) == char16_array_type_node) {
+      nchars = length / (TYPE_PRECISION (char16_type_node) / BITS_PER_UNIT);
+      e_type = char16_type_node;
+    } else if (TREE_TYPE (value) == char32_array_type_node) {
+      nchars = length / (TYPE_PRECISION (char32_type_node) / BITS_PER_UNIT);
+      e_type = char32_type_node;
+    } else {
+      nchars = length / (TYPE_PRECISION (wchar_type_node) / BITS_PER_UNIT);
+      e_type = wchar_type_node;
+    }
+  } else {
+    nchars = length;
+    e_type = char_type_node;
+  }
 
   /* C89 2.2.4.1, C99 5.2.4.1 (Translation limits).  The analogous
      limit in C++98 Annex B is very large (65536) and is not normative,
@@ -899,7 +938,6 @@ fix_string_type (tree value)
      construct the matching unqualified array type first.  The C front
      end does not require this, but it does no harm, so we do it
      unconditionally.  */
-  e_type = wide_flag ? wchar_type_node : char_type_node;
   i_type = build_index_type (build_int_cst (NULL_TREE, nchars - 1));
   a_type = build_array_type (e_type, i_type);
   if (c_dialect_cxx() || warn_write_strings)
@@ -3625,6 +3663,8 @@ c_define_builtins (tree va_list_ref_type
 void
 c_common_nodes_and_builtins (void)
 {
+  int char16_type_size;
+  int char32_type_size;
   int wchar_type_size;
   tree array_domain_type;
   tree va_list_ref_type_node;
@@ -3874,6 +3914,50 @@ c_common_nodes_and_builtins (void)
   wchar_array_type_node
     = build_array_type (wchar_type_node, array_domain_type);
 
+  /* Define 'char16_t', `signed char16_t' and `unsigned char16_t'.  */
+  char16_type_node = get_identifier (CHAR16_TYPE);
+  char16_type_node = TREE_TYPE (identifier_global_value (char16_type_node));
+  char16_type_size = TYPE_PRECISION (char16_type_node);
+  if (c_dialect_cxx ())
+    {
+      if (TYPE_UNSIGNED (char16_type_node))
+	char16_type_node = make_unsigned_type (char16_type_size);
+      else
+	char16_type_node = make_signed_type (char16_type_size);
+      record_builtin_type (RID_CHAR16, "char16_t", char16_type_node);
+    }
+  else
+    {
+      signed_char16_type_node = c_common_signed_type (char16_type_node);
+      unsigned_char16_type_node = c_common_unsigned_type (char16_type_node);
+    }
+
+  /* This is for UTF-16 string constants.  */
+  char16_array_type_node
+    = build_array_type (char16_type_node, array_domain_type);
+
+  /* Define 'char32_t', `signed char32_t' and `unsigned char32_t'.  */
+  char32_type_node = get_identifier (CHAR32_TYPE);
+  char32_type_node = TREE_TYPE (identifier_global_value (char32_type_node));
+  char32_type_size = TYPE_PRECISION (char32_type_node);
+  if (c_dialect_cxx ())
+    {
+      if (TYPE_UNSIGNED (char32_type_node))
+	char32_type_node = make_unsigned_type (char32_type_size);
+      else
+	char32_type_node = make_signed_type (char32_type_size);
+      record_builtin_type (RID_CHAR32, "char32_t", char32_type_node);
+    }
+  else
+    {
+      signed_char32_type_node = c_common_signed_type (char32_type_node);
+      unsigned_char32_type_node = c_common_unsigned_type (char32_type_node);
+    }
+
+  /* This is for UTF-32 string constants.  */
+  char32_array_type_node
+    = build_array_type (char32_type_node, array_domain_type);
+
   wint_type_node =
     TREE_TYPE (identifier_global_value (get_identifier (WINT_TYPE)));
 
@@ -6652,20 +6736,38 @@ c_parse_error (const char *gmsgid, enum 
 
   if (token == CPP_EOF)
     message = catenate_messages (gmsgid, " at end of input");
-  else if (token == CPP_CHAR || token == CPP_WCHAR)
+  else if (token == CPP_CHAR || token == CPP_WCHAR || token == CPP_CHAR16
+	   || token == CPP_CHAR32)
     {
       unsigned int val = TREE_INT_CST_LOW (value);
-      const char *const ell = (token == CPP_CHAR) ? "" : "L";
+      const char *prefix;
+
+      switch (token) {
+	default:
+	  prefix = "";
+	  break;
+	case CPP_WCHAR:
+	  prefix = "L";
+	  break;
+	case CPP_CHAR16:
+	  prefix = "u";
+	  break;
+	case CPP_CHAR32:
+	  prefix = "U";
+	  break;
+      }
+
       if (val <= UCHAR_MAX && ISGRAPH (val))
 	message = catenate_messages (gmsgid, " before %s'%c'");
       else
 	message = catenate_messages (gmsgid, " before %s'\\x%x'");
 
-      error (message, ell, val);
+      error (message, prefix, val);
       free (message);
       message = NULL;
     }
-  else if (token == CPP_STRING || token == CPP_WSTRING)
+  else if (token == CPP_STRING || token == CPP_WSTRING || token == CPP_STRING16
+	   || token == CPP_STRING32)
     message = catenate_messages (gmsgid, " before string constant");
   else if (token == CPP_NUMBER)
     message = catenate_messages (gmsgid, " before numeric constant");
Index: gcc/c-common.h
===================================================================
--- gcc/c-common.h	(revision 133117)
+++ gcc/c-common.h	(working copy)
@@ -85,7 +85,7 @@ enum rid
   RID_NEW,      RID_OFFSETOF, RID_OPERATOR,
   RID_THIS,     RID_THROW,    RID_TRUE,
   RID_TRY,      RID_TYPENAME, RID_TYPEID,
-  RID_USING,
+  RID_USING,    RID_CHAR16,   RID_CHAR32,
 
   /* casts */
   RID_CONSTCAST, RID_DYNCAST, RID_REINTCAST, RID_STATCAST,
@@ -143,6 +143,12 @@ extern GTY ((length ("(int) RID_MAX"))) 
 
 enum c_tree_index
 {
+    CTI_CHAR16_TYPE,
+    CTI_SIGNED_CHAR16_TYPE,
+    CTI_UNSIGNED_CHAR16_TYPE,
+    CTI_CHAR32_TYPE,
+    CTI_SIGNED_CHAR32_TYPE,
+    CTI_UNSIGNED_CHAR32_TYPE,
     CTI_WCHAR_TYPE,
     CTI_SIGNED_WCHAR_TYPE,
     CTI_UNSIGNED_WCHAR_TYPE,
@@ -155,6 +161,8 @@ enum c_tree_index
     CTI_WIDEST_UINT_LIT_TYPE,
 
     CTI_CHAR_ARRAY_TYPE,
+    CTI_CHAR16_ARRAY_TYPE,
+    CTI_CHAR32_ARRAY_TYPE,
     CTI_WCHAR_ARRAY_TYPE,
     CTI_INT_ARRAY_TYPE,
     CTI_STRING_TYPE,
@@ -190,6 +198,12 @@ struct c_common_identifier GTY(())
   struct cpp_hashnode node;
 };
 
+#define char16_type_node		c_global_trees[CTI_CHAR16_TYPE]
+#define signed_char16_type_node		c_global_trees[CTI_SIGNED_CHAR16_TYPE]
+#define unsigned_char16_type_node	c_global_trees[CTI_UNSIGNED_CHAR16_TYPE]
+#define char32_type_node		c_global_trees[CTI_CHAR32_TYPE]
+#define signed_char32_type_node		c_global_trees[CTI_SIGNED_CHAR32_TYPE]
+#define unsigned_char32_type_node	c_global_trees[CTI_UNSIGNED_CHAR32_TYPE]
 #define wchar_type_node			c_global_trees[CTI_WCHAR_TYPE]
 #define signed_wchar_type_node		c_global_trees[CTI_SIGNED_WCHAR_TYPE]
 #define unsigned_wchar_type_node	c_global_trees[CTI_UNSIGNED_WCHAR_TYPE]
@@ -206,6 +220,8 @@ struct c_common_identifier GTY(())
 #define truthvalue_false_node		c_global_trees[CTI_TRUTHVALUE_FALSE]
 
 #define char_array_type_node		c_global_trees[CTI_CHAR_ARRAY_TYPE]
+#define char16_array_type_node		c_global_trees[CTI_CHAR16_ARRAY_TYPE]
+#define char32_array_type_node		c_global_trees[CTI_CHAR32_ARRAY_TYPE]
 #define wchar_array_type_node		c_global_trees[CTI_WCHAR_ARRAY_TYPE]
 #define int_array_type_node		c_global_trees[CTI_INT_ARRAY_TYPE]
 #define string_type_node		c_global_trees[CTI_STRING_TYPE]
Index: gcc/c-parser.c
===================================================================
--- gcc/c-parser.c	(revision 133117)
+++ gcc/c-parser.c	(working copy)
@@ -5168,12 +5168,16 @@ c_parser_postfix_expression (c_parser *p
     {
     case CPP_NUMBER:
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
     case CPP_WCHAR:
       expr.value = c_parser_peek_token (parser)->value;
       expr.original_code = ERROR_MARK;
       c_parser_consume_token (parser);
       break;
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       expr.value = c_parser_peek_token (parser)->value;
       expr.original_code = STRING_CST;
Index: libcpp/macro.c
===================================================================
--- libcpp/macro.c	(revision 133117)
+++ libcpp/macro.c	(working copy)
@@ -158,7 +158,7 @@ _cpp_builtin_macro_text (cpp_reader *pfi
 		  {
 		    cpp_errno (pfile, CPP_DL_WARNING,
 			"could not determine file timestamp");
-		    pbuffer->timestamp = U"\"??? ??? ?? ??:??:?? ????\"";
+		    pbuffer->timestamp = UC"\"??? ??? ?? ??:??:?? ????\"";
 		  }
 	      }
 	  }
@@ -256,8 +256,8 @@ _cpp_builtin_macro_text (cpp_reader *pfi
 	      cpp_errno (pfile, CPP_DL_WARNING,
 			 "could not determine date and time");
 		
-	      pfile->date = U"\"??? ?? ????\"";
-	      pfile->time = U"\"??:??:??\"";
+	      pfile->date = UC"\"??? ?? ????\"";
+	      pfile->time = UC"\"??:??:??\"";
 	    }
 	}
 
@@ -375,8 +375,10 @@ stringify_arg (cpp_reader *pfile, macro_
 	  continue;
 	}
 
-      escape_it = (token->type == CPP_STRING || token->type == CPP_WSTRING
-		   || token->type == CPP_CHAR || token->type == CPP_WCHAR);
+      escape_it = (token->type == CPP_STRING || token->type == CPP_CHAR
+		   || token->type == CPP_WSTRING || token->type == CPP_WCHAR
+		   || token->type == CPP_STRING32 || token->type == CPP_CHAR32
+		   || token->type == CPP_STRING16 || token->type == CPP_CHAR16);
 
       /* Room for each char being written in octal, initial space and
 	 final quote and NUL.  */
Index: libcpp/directives.c
===================================================================
--- libcpp/directives.c	(revision 133117)
+++ libcpp/directives.c	(working copy)
@@ -188,7 +188,7 @@ DIRECTIVE_TABLE
    did use this notation in its preprocessed output.  */
 static const directive linemarker_dir =
 {
-  do_linemarker, U"#", 1, KANDR, IN_I
+  do_linemarker, UC"#", 1, KANDR, IN_I
 };
 
 #define SEEN_EOL() (pfile->cur_token[-1].type == CPP_EOF)
@@ -689,7 +689,7 @@ parse_include (cpp_reader *pfile, int *p
       const unsigned char *dir;
 
       if (pfile->directive == &dtable[T_PRAGMA])
-	dir = U"pragma dependency";
+	dir = UC"pragma dependency";
       else
 	dir = pfile->directive->name;
       cpp_error (pfile, CPP_DL_ERROR, "#%s expects \"FILENAME\" or <FILENAME>",
@@ -1077,7 +1077,7 @@ register_pragma_1 (cpp_reader *pfile, co
 
   if (space)
     {
-      node = cpp_lookup (pfile, U space, strlen (space));
+      node = cpp_lookup (pfile, UC space, strlen (space));
       entry = lookup_pragma_entry (*chain, node);
       if (!entry)
 	{
@@ -1106,7 +1106,7 @@ register_pragma_1 (cpp_reader *pfile, co
     }
 
   /* Check for duplicates.  */
-  node = cpp_lookup (pfile, U name, strlen (name));
+  node = cpp_lookup (pfile, UC name, strlen (name));
   entry = lookup_pragma_entry (*chain, node);
   if (entry == NULL)
     {
@@ -1254,7 +1254,7 @@ restore_registered_pragmas (cpp_reader *
     {
       if (pe->is_nspace)
 	sd = restore_registered_pragmas (pfile, pe->u.space, sd);
-      pe->pragma = cpp_lookup (pfile, U *sd, strlen (*sd));
+      pe->pragma = cpp_lookup (pfile, UC *sd, strlen (*sd));
       free (*sd);
       sd++;
     }
@@ -1483,7 +1483,8 @@ get__Pragma_string (cpp_reader *pfile)
   string = get_token_no_padding (pfile);
   if (string->type == CPP_EOF)
     _cpp_backup_tokens (pfile, 1);
-  if (string->type != CPP_STRING && string->type != CPP_WSTRING)
+  if (string->type != CPP_STRING && string->type != CPP_WSTRING
+      && string->type != CPP_STRING32 && string->type != CPP_STRING16)
     return NULL;
 
   paren = get_token_no_padding (pfile);
Index: libcpp/include/cpplib.h
===================================================================
--- libcpp/include/cpplib.h	(revision 133117)
+++ libcpp/include/cpplib.h	(working copy)
@@ -123,10 +123,14 @@ struct _cpp_file;
 									\
   TK(CHAR,		LITERAL) /* 'char' */				\
   TK(WCHAR,		LITERAL) /* L'char' */				\
+  TK(CHAR16,		LITERAL) /* u'char' */				\
+  TK(CHAR32,		LITERAL) /* U'char' */				\
   TK(OTHER,		LITERAL) /* stray punctuation */		\
 									\
   TK(STRING,		LITERAL) /* "string" */				\
   TK(WSTRING,		LITERAL) /* L"string" */			\
+  TK(STRING16,		LITERAL) /* u"string" */			\
+  TK(STRING32,		LITERAL) /* U"string" */			\
   TK(OBJC_STRING,	LITERAL) /* @"string" - Objective-C */		\
   TK(HEADER_NAME,	LITERAL) /* <stdio.h> in #include */		\
 									\
@@ -703,10 +707,10 @@ extern cppchar_t cpp_interpret_charconst
 /* Evaluate a vector of CPP_STRING or CPP_WSTRING tokens.  */
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
-				  cpp_string *, bool);
+				  cpp_string *, enum cpp_ttype);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
-					      cpp_string *, bool);
+					      cpp_string *, enum cpp_ttype);
 
 /* Convert a host character constant to the execution character set.  */
 extern cppchar_t cpp_host_to_exec_charset (cpp_reader *, cppchar_t);
Index: libcpp/include/cpp-id-data.h
===================================================================
--- libcpp/include/cpp-id-data.h	(revision 133117)
+++ libcpp/include/cpp-id-data.h	(working copy)
@@ -22,7 +22,7 @@ Foundation, 51 Franklin Street, Fifth Fl
 typedef unsigned char uchar;
 #endif
 
-#define U (const unsigned char *)  /* Intended use: U"string" */
+#define UC (const unsigned char *)  /* Intended use: UC"string" */
 
 /* Chained list of answers to an assertion.  */
 struct answer GTY(())
Index: libcpp/expr.c
===================================================================
--- libcpp/expr.c	(revision 133117)
+++ libcpp/expr.c	(working copy)
@@ -691,6 +691,8 @@ eval_token (cpp_reader *pfile, const cpp
 
     case CPP_WCHAR:
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
       {
 	cppchar_t cc = cpp_interpret_charconst (pfile, token,
 						&temp, &unsignedp);
@@ -849,6 +851,8 @@ _cpp_parse_expr (cpp_reader *pfile)
 	case CPP_NUMBER:
 	case CPP_CHAR:
 	case CPP_WCHAR:
+	case CPP_CHAR16:
+	case CPP_CHAR32:
 	case CPP_NAME:
 	case CPP_HASH:
 	  if (!want_value)
Index: libcpp/internal.h
===================================================================
--- libcpp/internal.h	(revision 133117)
+++ libcpp/internal.h	(working copy)
@@ -48,6 +48,7 @@ struct cset_converter
 {
   convert_f func;
   iconv_t cd;
+  int width;
 };
 
 #define BITS_PER_CPPCHAR_T (CHAR_BIT * sizeof (cppchar_t))
@@ -399,6 +400,14 @@ struct cpp_reader
   struct cset_converter narrow_cset_desc;
 
   /* Descriptor for converting from the source character set to the
+     UTF-16 execution character set.  */
+  struct cset_converter char16_cset_desc;
+
+  /* Descriptor for converting from the source character set to the
+     UTF-32 execution character set.  */
+  struct cset_converter char32_cset_desc;
+
+  /* Descriptor for converting from the source character set to the
      wide execution character set.  */
   struct cset_converter wide_cset_desc;
 
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 133117)
+++ libcpp/lex.c	(working copy)
@@ -39,10 +39,10 @@ struct token_spelling
 };
 
 static const unsigned char *const digraph_spellings[] =
-{ U"%:", U"%:%:", U"<:", U":>", U"<%", U"%>" };
+{ UC"%:", UC"%:%:", UC"<:", UC":>", UC"<%", UC"%>" };
 
-#define OP(e, s) { SPELL_OPERATOR, U s  },
-#define TK(e, s) { SPELL_ ## s,    U #e },
+#define OP(e, s) { SPELL_OPERATOR, UC s  },
+#define TK(e, s) { SPELL_ ## s,    UC #e },
 static const struct token_spelling token_spellings[N_TTYPES] = { TTYPE_TABLE };
 #undef OP
 #undef TK
@@ -611,8 +611,8 @@ create_literal (cpp_reader *pfile, cpp_t
 
 /* Lexes a string, character constant, or angle-bracketed header file
    name.  The stored string contains the spelling, including opening
-   quote and leading any leading 'L'.  It returns the type of the
-   literal, or CPP_OTHER if it was not properly terminated.
+   quote and any leading 'L', 'u' or 'U'.  It returns the type of the
+   literal, or CPP_OTHER if it was not properly terminated.
 
    The spelling is NUL-terminated, but it is not guaranteed that this
    is the first NUL since embedded NULs are preserved.  */
@@ -626,12 +626,17 @@ lex_string (cpp_reader *pfile, cpp_token
 
   cur = base;
   terminator = *cur++;
-  if (terminator == 'L')
+  if (terminator == 'L' || terminator == 'u' || terminator == 'U')
     terminator = *cur++;
   if (terminator == '\"')
-    type = *base == 'L' ? CPP_WSTRING: CPP_STRING;
+    type = *base == 'L' ? CPP_WSTRING
+			: *base == 'U' ? CPP_STRING32
+				       : *base == 'u' ? CPP_STRING16
+						      : CPP_STRING;
   else if (terminator == '\'')
-    type = *base == 'L' ? CPP_WCHAR: CPP_CHAR;
+    type = *base == 'L' ? CPP_WCHAR
+			: *base == 'U' ? CPP_CHAR32
+				       : *base == 'u' ? CPP_CHAR16 : CPP_CHAR;
   else
     terminator = '>', type = CPP_HEADER_NAME;
 
@@ -965,7 +970,9 @@ _cpp_lex_direct (cpp_reader *pfile)
       }
 
     case 'L':
-      /* 'L' may introduce wide characters or strings.  */
+    case 'u':
+    case 'U':
+      /* 'L', 'u' or 'U' may introduce wide characters or strings.  */
       if (*buffer->cur == '\'' || *buffer->cur == '"')
 	{
 	  lex_string (pfile, result, buffer->cur - 1);
@@ -977,12 +984,12 @@ _cpp_lex_direct (cpp_reader *pfile)
     case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
     case 'g': case 'h': case 'i': case 'j': case 'k': case 'l':
     case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
-    case 's': case 't': case 'u': case 'v': case 'w': case 'x':
+    case 's': case 't':           case 'v': case 'w': case 'x':
     case 'y': case 'z':
     case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
     case 'G': case 'H': case 'I': case 'J': case 'K':
     case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
-    case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
+    case 'S': case 'T':           case 'V': case 'W': case 'X':
     case 'Y': case 'Z':
       result->type = CPP_NAME;
       {
Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c	(revision 133117)
+++ libcpp/charset.c	(working copy)
@@ -642,6 +642,7 @@ init_iconv_desc (cpp_reader *pfile, cons
     {
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
+      ret.width = -1;
       return ret;
     }
 
@@ -655,6 +656,7 @@ init_iconv_desc (cpp_reader *pfile, cons
       {
 	ret.func = conversion_tab[i].func;
 	ret.cd = conversion_tab[i].fake_cd;
+	ret.width = -1;
 	return ret;
       }
 
@@ -663,6 +665,7 @@ init_iconv_desc (cpp_reader *pfile, cons
     {
       ret.func = convert_using_iconv;
       ret.cd = iconv_open (to, from);
+      ret.width = -1;
 
       if (ret.cd == (iconv_t) -1)
 	{
@@ -683,6 +686,7 @@ init_iconv_desc (cpp_reader *pfile, cons
 		 from, to);
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
+      ret.width = -1;
     }
   return ret;
 }
@@ -716,7 +720,17 @@ cpp_init_iconv (cpp_reader *pfile)
     wcset = default_wcset;
 
   pfile->narrow_cset_desc = init_iconv_desc (pfile, ncset, SOURCE_CHARSET);
+  pfile->narrow_cset_desc.width = CPP_OPTION (pfile, char_precision);
+  pfile->char16_cset_desc = init_iconv_desc (pfile,
+					     be ? "UTF-16BE" : "UTF-16LE",
+					     SOURCE_CHARSET);
+  pfile->char16_cset_desc.width = 16;
+  pfile->char32_cset_desc = init_iconv_desc (pfile,
+					     be ? "UTF-32BE" : "UTF-32LE",
+					     SOURCE_CHARSET);
+  pfile->char32_cset_desc.width = 32;
   pfile->wide_cset_desc = init_iconv_desc (pfile, wcset, SOURCE_CHARSET);
+  pfile->wide_cset_desc.width = CPP_OPTION (pfile, wchar_precision);
 }
 
 /* Destroy iconv(3) descriptors set up by cpp_init_iconv, if necessary.  */
@@ -1051,15 +1065,13 @@ _cpp_valid_ucn (cpp_reader *pfile, const
    An advanced pointer is returned.  Issues all relevant diagnostics.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   cppchar_t ucn;
   uchar buf[6];
   uchar *bufp = buf;
   size_t bytesleft = 6;
   int rval;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
   from++;  /* Skip u/U.  */
@@ -1086,14 +1098,15 @@ convert_ucn (cpp_reader *pfile, const uc
    function issues no diagnostics and never fails.  */
 static void
 emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
-		     struct _cpp_strbuf *tbuf, bool wide)
+		     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
-  if (wide)
+  size_t width = cvt.width;
+
+  if (width != CPP_OPTION (pfile, char_precision))
     {
       /* We have to render this into the target byte order, which may not
 	 be our byte order.  */
       bool bigend = CPP_OPTION (pfile, bytes_big_endian);
-      size_t width = CPP_OPTION (pfile, wchar_precision);
       size_t cwidth = CPP_OPTION (pfile, char_precision);
       size_t cmask = width_to_mask (cwidth);
       size_t nbwc = width / cwidth;
@@ -1136,12 +1149,11 @@ emit_numeric_escape (cpp_reader *pfile, 
    number.  You can, e.g. generate surrogate pairs this way.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
-  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
-		  : CPP_OPTION (pfile, char_precision));
+  size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
   if (CPP_WTRADITIONAL (pfile))
@@ -1174,7 +1186,7 @@ convert_hex (cpp_reader *pfile, const uc
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, wide);
+  emit_numeric_escape (pfile, n, tbuf, cvt);
 
   return from;
 }
@@ -1187,12 +1199,11 @@ convert_hex (cpp_reader *pfile, const uc
    number.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
-  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
-		  : CPP_OPTION (pfile, char_precision));
+  size_t width = cvt.width;
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
@@ -1213,7 +1224,7 @@ convert_oct (cpp_reader *pfile, const uc
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, wide);
+  emit_numeric_escape (pfile, n, tbuf, cvt);
 
   return from;
 }
@@ -1224,7 +1235,7 @@ convert_oct (cpp_reader *pfile, const uc
    pointer.  Handles all relevant diagnostics.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, bool wide)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1236,23 +1247,21 @@ convert_escape (cpp_reader *pfile, const
 #endif
 
   uchar c;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
 
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, wide);
+      return convert_ucn (pfile, from, limit, tbuf, cvt);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, wide);
+      return convert_hex (pfile, from, limit, tbuf, cvt);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, wide);
+      return convert_oct (pfile, from, limit, tbuf, cvt);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1312,6 +1321,26 @@ convert_escape (cpp_reader *pfile, const
   return from + 1;
 }
 
+/* TYPE is a token type.  The return value is the conversion needed to
+   convert from source to execution character set for the given type.  */
+static struct cset_converter
+convertor_for_type (cpp_reader *pfile, enum cpp_ttype type)
+{
+  switch (type) {
+    default:
+	return pfile->narrow_cset_desc;
+    case CPP_CHAR16:
+    case CPP_STRING16:
+	return pfile->char16_cset_desc;
+    case CPP_CHAR32:
+    case CPP_STRING32:
+	return pfile->char32_cset_desc;
+    case CPP_WCHAR:
+    case CPP_WSTRING:
+	return pfile->wide_cset_desc;
+  }
+}
+
 /* FROM is an array of cpp_string structures of length COUNT.  These
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
@@ -1320,13 +1349,12 @@ convert_escape (cpp_reader *pfile, const
    false for failure.  */
 bool
 cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to, bool wide)
+		      cpp_string *to, enum cpp_ttype type)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
+  struct cset_converter cvt = convertor_for_type (pfile, type);
 
   tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
   tbuf.text = XNEWVEC (uchar, tbuf.asize);
@@ -1335,7 +1363,7 @@ cpp_interpret_string (cpp_reader *pfile,
   for (i = 0; i < count; i++)
     {
       p = from[i].text;
-      if (*p == 'L') p++;
+      if (*p == 'L' || *p == 'u' || *p == 'U') p++;
       p++; /* Skip leading quote.  */
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
@@ -1354,12 +1382,12 @@ cpp_interpret_string (cpp_reader *pfile,
 	  if (p == limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, wide);
+	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
 	}
     }
   /* NUL-terminate the 'to' buffer and translate it to a cpp_string
      structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, wide);
+  emit_numeric_escape (pfile, 0, &tbuf, cvt);
   tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
   to->text = tbuf.text;
   to->len = tbuf.len;
@@ -1375,7 +1403,8 @@ cpp_interpret_string (cpp_reader *pfile,
    in a string, but do not perform character set conversion.  */
 bool
 cpp_interpret_string_notranslate (cpp_reader *pfile, const cpp_string *from,
-				  size_t count,	cpp_string *to, bool wide)
+				  size_t count,	cpp_string *to,
+				  enum cpp_ttype type ATTRIBUTE_UNUSED)
 {
   struct cset_converter save_narrow_cset_desc = pfile->narrow_cset_desc;
   bool retval;
@@ -1383,7 +1412,7 @@ cpp_interpret_string_notranslate (cpp_re
   pfile->narrow_cset_desc.func = convert_no_conversion;
   pfile->narrow_cset_desc.cd = (iconv_t) -1;
 
-  retval = cpp_interpret_string (pfile, from, count, to, wide);
+  retval = cpp_interpret_string (pfile, from, count, to, CPP_STRING);
 
   pfile->narrow_cset_desc = save_narrow_cset_desc;
   return retval;
@@ -1462,13 +1491,14 @@ narrow_str_to_charconst (cpp_reader *pfi
 /* Subroutine of cpp_interpret_charconst which performs the conversion
    to a number, for wide strings.  STR is the string structure returned
    by cpp_interpret_string.  PCHARS_SEEN and UNSIGNEDP are as for
-   cpp_interpret_charconst.  */
+   cpp_interpret_charconst.  TYPE is the token type.  */
 static cppchar_t
 wide_str_to_charconst (cpp_reader *pfile, cpp_string str,
-		       unsigned int *pchars_seen, int *unsignedp)
+		       unsigned int *pchars_seen, int *unsignedp,
+		       enum cpp_ttype type)
 {
   bool bigend = CPP_OPTION (pfile, bytes_big_endian);
-  size_t width = CPP_OPTION (pfile, wchar_precision);
+  size_t width = convertor_for_type (pfile, type).width;
   size_t cwidth = CPP_OPTION (pfile, char_precision);
   size_t mask = width_to_mask (width);
   size_t cmask = width_to_mask (cwidth);
@@ -1490,7 +1520,7 @@ wide_str_to_charconst (cpp_reader *pfile
   /* Wide character constants have type wchar_t, and a single
      character exactly fills a wchar_t, so a multi-character wide
      character constant is guaranteed to overflow.  */
-  if (off > 0)
+  if (str.len > nbwc * 2)
     cpp_error (pfile, CPP_DL_WARNING,
 	       "character constant too long for its type");
 
@@ -1518,20 +1548,21 @@ cpp_interpret_charconst (cpp_reader *pfi
 			 unsigned int *pchars_seen, int *unsignedp)
 {
   cpp_string str = { 0, 0 };
-  bool wide = (token->type == CPP_WCHAR);
+  bool wide = (token->type != CPP_CHAR);
   cppchar_t result;
 
-  /* an empty constant will appear as L'' or '' */
+  /* an empty constant will appear as L'', u'', U'' or '' */
   if (token->val.str.len == (size_t) (2 + wide))
     {
       cpp_error (pfile, CPP_DL_ERROR, "empty character constant");
       return 0;
     }
-  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, wide))
+  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, token->type))
     return 0;
 
   if (wide)
-    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp);
+    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp,
+				    token->type);
   else
     result = narrow_str_to_charconst (pfile, str, pchars_seen, unsignedp);
 
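For readers who want to experiment with the prefix handling outside of
libcpp, the classification done in the lex_string hunk can be sketched
as a standalone function.  This is only an illustration: the enum
lit_type and the function classify_literal are hypothetical names, not
part of the patch, and returning LIT_OTHER for unrecognized input is a
simplification (lex_string treats that case as a header name).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CPP_* token types used in the patch.  */
enum lit_type
{
  LIT_CHAR, LIT_WCHAR, LIT_CHAR16, LIT_CHAR32,
  LIT_STRING, LIT_WSTRING, LIT_STRING16, LIT_STRING32,
  LIT_OTHER
};

/* Classify SPELLING, which starts with an optional L/u/U prefix
   followed by the opening quote.  Anything else yields LIT_OTHER.  */
static enum lit_type
classify_literal (const char *spelling)
{
  char prefix = 0;

  assert (spelling != NULL);

  /* Remember the prefix, if any, and step over it.  */
  if (*spelling == 'L' || *spelling == 'u' || *spelling == 'U')
    prefix = *spelling++;

  if (*spelling == '"')
    switch (prefix)
      {
      case 'L': return LIT_WSTRING;
      case 'u': return LIT_STRING16;
      case 'U': return LIT_STRING32;
      default:  return LIT_STRING;
      }

  if (*spelling == '\'')
    switch (prefix)
      {
      case 'L': return LIT_WCHAR;
      case 'u': return LIT_CHAR16;
      case 'U': return LIT_CHAR32;
      default:  return LIT_CHAR;
      }

  return LIT_OTHER;
}
```

An identifier that merely begins with 'u' or 'U' is not affected,
because a quote must follow the prefix immediately for the literal
path to be taken.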

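The width field added to struct cset_converter lets the numeric escape
code emit a 16-bit or 32-bit code unit in the target byte order.  A
minimal sketch of that idea follows.  It assumes 8-bit bytes, whereas
the patch works in terms of char_precision, and put_code_unit is a
hypothetical helper name that does not appear in libcpp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the low WIDTH bits of VALUE into OUT as WIDTH / 8 bytes,
   most significant byte first if BIG_ENDIAN is nonzero, least
   significant byte first otherwise.  Returns the number of bytes
   written.  Assumes 8-bit bytes.  */
static size_t
put_code_unit (uint32_t value, unsigned width, int big_endian,
	       unsigned char *out)
{
  size_t nbytes = width / 8;
  size_t i;

  assert (width == 8 || width == 16 || width == 32);
  assert (out != NULL);

  for (i = 0; i < nbytes; i++)
    {
      /* Peel off the low byte and place it at the position the
	 requested byte order calls for.  */
      size_t pos = big_endian ? nbytes - 1 - i : i;
      out[pos] = (unsigned char) (value & 0xff);
      value >>= 8;
    }
  return nbytes;
}
```

With this helper, U+20AC at width 16 is stored as the bytes 20 AC for a
big-endian target and AC 20 for a little-endian one, which corresponds
to the choice between UTF-16BE and UTF-16LE made in cpp_init_iconv.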
