This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


extended characters (was Re: The integrated preprocessor)

To: Zack Weinberg <zack at wolery dot cumb dot org>
Subject: extended characters (was Re: The integrated preprocessor)
From: Jason Merrill <jason at redhat dot com>
Date: 24 Aug 2000 23:49:44 -0700
Cc: gcc-patches at gcc dot gnu dot org
References: <20000823000521.E15699@wolery.cumb.org><u9vgwqhxk5.fsf@yorick.soma.redhat.com><20000824220722.R17776@wolery.cumb.org>

>>>>> Zack Weinberg <zack@wolery.cumb.org> writes:

 >> Why do you say before is_extended_char that "Portable code cannot count on
 >> support for more than the basic identifier character set"?  There are lots
 >> of features specified by the standard that portable code can't count on
 >> these days, but that doesn't stop us from implementing them.  True, the
 >> interpretation of source code written in, say, S-JIS will depend on the
 >> compiler running in the appropriate locale, but that does not affect
 >> extended characters written using \u.  And our Japanese customers have
 >> already paid for extended source character set support in string and
 >> character constants.
 >> 
 >> I don't mind you removing the partial support that was there, but I have a
 >> problem with declaring it an undesirable feature.

 > This is only my personal opinion (as it said in the comment) and I
 > don't mean to inflict it on GCC.  I believe I haven't disabled any
 > functionality; if I understood the code correctly, extended chars in
 > identifiers were not implemented yet.

Yep, that's why I said "partial support".  Perhaps I was too generous.  :)

 > \u, \U escapes should still work in strings and character constants (in
 > fact, they should now work in C, too).  Multibyte chars in strings and
 > character constants still work if and only if subsequent bytes of a
 > multibyte char cannot be confused with single bytes in 7-bit ASCII
 > (true of UTF-8, not of S-JIS).  That wouldn't be too hard to change, and
 > Neil Booth wants to rewrite cpplib's lexer again, so I'll ask him if he
 > can put in real mbchar support while he's at it.
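
The condition in the quoted paragraph can be made concrete with a small
sketch.  The helper name and the choice of sample character are
illustrative assumptions, not anything in cpplib: KATAKANA LETTER SO
(U+30BD) is E3 82 BD in UTF-8 but 83 5C in Shift-JIS, and 5C is the ASCII
backslash, which is exactly the kind of trailing byte a byte-oriented
lexer would misread inside a string constant.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if any byte after the first in an
   encoded character falls in the 7-bit ASCII range (0x00..0x7F),
   0 otherwise.  */
static int trail_byte_looks_ascii(const unsigned char *mbchar, size_t len)
{
    size_t i;

    for (i = 1; i < len; i++)
        if (mbchar[i] < 0x80)
            return 1;
    return 0;
}

/* KATAKANA LETTER SO (U+30BD) in two encodings.  */
static const unsigned char so_utf8[] = { 0xE3, 0x82, 0xBD };
static const unsigned char so_sjis[] = { 0x83, 0x5C };  /* 0x5C is '\\' */
```

Calling trail_byte_looks_ascii on so_utf8 reports no ASCII-looking trailing
byte, while so_sjis does have one.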

FWIW, the C++ standard says that multibyte chars and \[uU] escapes are
handled equivalently, as if all multibyte chars were converted to the
equivalent escape in phase 1.  The C99 standard is not as clear on this
point, but can be interpreted similarly.
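
As a rough illustration of that phase 1 model, the sketch below spells an
already-decoded code point as the equivalent \u or \U escape.  The function
name is hypothetical, decoding the multibyte input is assumed to have
happened elsewhere, and no check is made that the code point is one the
standards actually allow in a universal-character-name.

```c
#include <stdio.h>

/* Hypothetical helper: write the \uXXXX or \UXXXXXXXX spelling of code
   point cp into buf.  Returns the number of characters written, not
   counting the terminating NUL, or -1 if buf is too small.  */
static int format_ucn(unsigned long cp, char *buf, size_t bufsize)
{
    int n;

    if (cp <= 0xFFFFUL)
        n = snprintf(buf, bufsize, "\\u%04lX", cp);   /* four hex digits */
    else
        n = snprintf(buf, bufsize, "\\U%08lX", cp);   /* eight hex digits */

    return (n < 0 || (size_t)n >= bufsize) ? -1 : n;
}
```

With a 16-byte buffer, U+30BD comes out as the six characters \u30BD and
U+1F600 as the ten characters \U0001F600.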

Note that the current handling of multibyte chars in narrow
character/string constants is broken; we decode a multibyte character into
a wchar_t and then try to stuff it back into a single char.  Instead, we
should convert to multibyte characters in the execution character set.
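
A minimal sketch of why the narrowing loses information, assuming 8-bit
chars and, purely for the sake of the example, a UTF-8 execution character
set.  The names are made up for illustration and do not correspond to
anything in the GCC sources.

```c
#include <wchar.h>

/* The broken approach: forcing a decoded wide character into a single
   narrow char keeps only its low-order bits.  */
static unsigned char stuff_into_char(wchar_t wc)
{
    return (unsigned char)wc;
}

/* The same character, U+30BD, as a multibyte sequence in an execution
   character set that is assumed here to be UTF-8.  */
static const char so_in_utf8[] = "\xE3\x82\xBD";
```

Here stuff_into_char applied to 0x30BD yields the single byte 0xBD, whereas
the multibyte form needs three bytes.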

Also, we currently assume that the source and execution character sets are
the same, which is not valid.  Speaking of which, you seem to have left
MAP_CHARACTER support out of lex_string.

And I assumed in writing the read_ucs bits that wchar_t was always
host-endian UCS-4, which turns out to be unsafe unless __STDC_ISO_10646__
is defined.
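
For reference, a guard along these lines is one way code can find out
whether that assumption holds.  The function name is hypothetical; only the
__STDC_ISO_10646__ macro comes from the standard.

```c
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
   ISO 10646 code points, 0 if it makes no such promise.  */
static int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

When this returns 0, code that reads a \u escape cannot simply store the
code point in a wchar_t and expect it to mean the same character.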

I think it makes sense for the preprocessor to handle at least the frontend
conversion from multibyte chars to \u escapes.  It probably makes sense for
the preprocessor to go ahead and convert directly to UTF-8 (which is how I
think we should encode extended characters in identifiers) or the execution
character set (for string and character constants).  What do you think?
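
To make the UTF-8 option concrete, here is a sketch of the conversion from
a code point to UTF-8 bytes.  It assumes the present-day definition of
UTF-8 (at most four bytes, nothing above U+10FFFF, no surrogates), and the
function name is hypothetical rather than anything in cpplib.

```c
#include <stddef.h>

/* Encode code point cp as UTF-8 into out, which must have room for at
   least 4 bytes.  Returns the number of bytes written, or 0 if cp is a
   surrogate or lies above U+10FFFF.  */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every byte this produces for a non-ASCII code point has its high bit set,
so an identifier encoded this way cannot collide with basic source
characters.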

 > I see that the C++ standard spells out which Unicode characters are
 > acceptable in identifiers; the C standard (as I understand it) left
 > that to the implementation, and that was the major reason why I didn't
 > like the idea.

The C99 standard also has a list of acceptable characters, in Annex D.
Looking at it now, I see that though both standards claim to have gotten
this list from the same place, there are differences, and the C standard's
version is clearly more correct.  For instance, the first character in the
Greek section is 0386 in the C standard (capital Alpha with an accent),
while in the C++ standard it is listed as 0384 (an accent).  There are
other, similar discrepancies.

Ulrich believes, though I disagree, that these lists are not exclusive.  He
also believes, and I agree, that they are too restrictive: they do not
allow for future language additions to 10646.

Jason

Follow-Ups:
- Re: extended characters (was Re: The integrated preprocessor)
  - From: Jason Merrill

References:
- The integrated preprocessor
  - From: Zack Weinberg
- Re: The integrated preprocessor
  - From: Jason Merrill
- Re: The integrated preprocessor
  - From: Zack Weinberg
