Fix cpplib handling of 5-byte UTF-8 sequences

Joseph S. Myers joseph@codesourcery.com
Sun May 3 12:01:00 GMT 2009


This patch fixes a bug I found in cpplib's internal character set
conversions that affects 5-byte UTF-8 sequences for characters between
U+03000000 and U+03FFFFFF.  (ISO C and C++ use normative references to
ISO 10646, which defines UTF-8 encodings for characters up to
U+7FFFFFFF, rather than to Unicode, where encodings of characters above
U+0010FFFF are invalid; cpplib follows ISO 10646 here.)  It appears this
bug has been present since
<http://gcc.gnu.org/ml/gcc-patches/2003-07/msg01054.html>, which added
the internal character set conversions.
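
To illustrate why the mask matters, here is a minimal standalone sketch,
not the cpplib code itself.  The mask and pattern values are the ones in
the patch; the names masks_old, masks_new, decode_one and
check_5byte_example are hypothetical, and the decoding loop is my
assumption about how such tables are typically used.  It skips checks a
real decoder needs, such as rejecting overlong forms.

```c
#include <assert.h>
#include <stddef.h>

/* Table values are taken from the patch; the array names are illustrative.  */
static const unsigned char masks_old[6] = { 0x7F, 0x1F, 0x0F, 0x07, 0x02, 0x01 };
static const unsigned char masks_new[6] = { 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
static const unsigned char patns[6]     = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

/* Hypothetical helper, not cpplib's one_utf8_to_cppchar.  Decode one
   ISO 10646 style UTF-8 sequence (1 to 6 bytes) from BUF using the
   lead-byte MASKS.  Returns the number of bytes consumed, or 0 if the
   sequence is invalid or truncated.  */
static size_t
decode_one (const unsigned char *buf, size_t len,
            const unsigned char masks[6], unsigned long *cp)
{
  size_t n, i;
  unsigned long c;

  if (len == 0)
    return 0;

  /* Find the sequence length whose lead-byte pattern matches.  */
  for (n = 1; n <= 6; n++)
    if ((buf[0] & (unsigned char) ~masks[n - 1]) == patns[n - 1])
      break;
  if (n > 6 || n > len)
    return 0;

  /* Payload bits of the lead byte, then 6 bits per continuation byte.  */
  c = buf[0] & masks[n - 1];
  for (i = 1; i < n; i++)
    {
      if ((buf[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (buf[i] & 0x3F);
    }
  *cp = c;
  return n;
}

/* Self-check using the byte sequence from the new test case.  */
void
check_5byte_example (void)
{
  static const unsigned char seq[5] = { 0xFB, 0xBF, 0xBF, 0xBF, 0xBF };
  unsigned long cp = 0;

  /* With mask 0x03 both payload bits of 0xFB survive.  */
  assert (decode_one (seq, sizeof seq, masks_new, &cp) == 5);
  assert (cp == 0x03FFFFFFUL);

  /* With mask 0x02 the lead byte 0xFB matches no pattern in this sketch.  */
  assert (decode_one (seq, sizeof seq, masks_old, &cp) == 0);
}
```

In this sketch the old mask causes the lead byte 0xFB to be rejected
outright, because the low payload bit is treated as part of the pattern.
Whether cpplib fails in exactly the same way depends on the surrounding
code, which the patch context does not show.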

Bootstrapped with no regressions on x86_64-unknown-linux-gnu.  Applied
to mainline.

libcpp:
2009-05-03  Joseph Myers  <joseph@codesourcery.com>

	* charset.c (one_utf8_to_cppchar): Correct mask used for 5-byte
	UTF-8 sequences.

gcc/testsuite:
2009-05-03  Joseph Myers  <joseph@codesourcery.com>

	* gcc.dg/cpp/utf8-5byte-1.c: New test.

Index: gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c
===================================================================
--- gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c	(revision 0)
@@ -0,0 +1,17 @@
+/* Test for bug in conversions from 5-byte UTF-8 sequences in
+   cpplib.  */
+/* { dg-do run { target { 4byte_wchar_t } } } */
+/* { dg-options "-std=gnu99" } */
+
+extern void abort (void);
+extern void exit (int);
+
+__WCHAR_TYPE__ ws[] = L"û¿¿¿¿";
+
+int
+main (void)
+{
+  if (ws[0] != L'\U03FFFFFF' || ws[1] != 0)
+    abort ();
+  exit (0);
+}
Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c	(revision 147065)
+++ libcpp/charset.c	(working copy)
@@ -169,7 +169,7 @@ static inline int
 one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
 		     cppchar_t *cp)
 {
-  static const uchar masks[6] = { 0x7F, 0x1F, 0x0F, 0x07, 0x02, 0x01 };
+  static const uchar masks[6] = { 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
   static const uchar patns[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
 
   cppchar_t c;

-- 
Joseph S. Myers
joseph@codesourcery.com

