This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Sat, 12 Dec 1998 11:18:00 +0100
From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>
> I have misgivings about having GCC support multiple locales
> simultaneously.
gcc/g++ process strictly-conforming input that is already in the base
character set (plus \u escapes), in a way that the standards
mandate. Object files are then UTF-8 (or U escapes for C++).
But this would mean that \u escapes wouldn't have their intended
effect in non-UTF-8 locales. E.g. "\u00b5" would turn into a two-byte
multibyte character string, which is incorrect for the common ISO
8859/1 encoding where it is represented by a single byte.
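
To make the size mismatch concrete, here is a minimal sketch in C. The
helper names encode_utf8_short and encode_latin1 are hypothetical and
not part of GCC or the proposed patch. They encode U+00B5 both ways,
assuming only the standard UTF-8 bit layout and the fact that ISO
8859/1 maps U+0000..U+00FF to the identical byte value.

```c
#include <stddef.h>

/* Hypothetical helper: encode a code point below U+0800 as UTF-8.
   Returns the number of bytes written, or 0 if cp is out of range. */
size_t encode_utf8_short(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {              /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp >= 0x800)              /* would need three or more bytes */
        return 0;
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte  110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* trail byte 10xxxxxx */
    return 2;
}

/* Hypothetical helper: encode a code point as ISO 8859/1.
   Returns 1 on success, 0 if the code point is not representable. */
size_t encode_latin1(unsigned cp, unsigned char out[1])
{
    if (cp > 0xFF)
        return 0;
    out[0] = (unsigned char)cp;   /* Latin-1 byte equals the code point */
    return 1;
}
```

For U+00B5 the first helper yields the two bytes 0xC2 0xB5, while the
second yields the single byte 0xB5 that a Latin-1 user would expect.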
gcc/g++ also process input based on the current locale
Yes. But the current locale should affect the processing of \u
escapes, as well as the recognition of multibyte characters.
and pass the input unmodified to the output.
This is largely correct for multibyte characters (though their bytes
may need escaping to satisfy some assemblers). I think \u will need
to be translated, though, if possible -- unless the assembler handles
\u, which is not true for gas at least.
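
As a rough illustration of the byte escaping mentioned above, the
following C sketch rewrites a byte string so that it could sit inside
an assembler string literal using octal escapes. The function name
escape_for_asm is hypothetical, and the choice of which bytes to
escape is an assumption made for the example, not a description of
what GCC actually emits.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: copy n bytes from in to out, replacing
   non-printable bytes, non-ASCII bytes, '"' and '\\' with three-digit
   octal escapes.  Returns the output length (excluding the NUL), or -1
   if out (of capacity cap) is too small. */
int escape_for_asm(const unsigned char *in, size_t n, char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        int needs_escape = b < 0x20 || b >= 0x7F || b == '"' || b == '\\';
        size_t need = needs_escape ? 4 : 1;   /* "\ooo" is four chars */

        if (used + need + 1 > cap)            /* keep room for the NUL */
            return -1;
        if (needs_escape)
            snprintf(out + used, cap - used, "\\%03o", (unsigned)b);
        else
            out[used] = (char)b;
        used += need;
    }
    out[used] = '\0';
    return (int)used;
}
```

With this sketch, the UTF-8 bytes 0xC2 0xB5 come out as the eight
ASCII characters \302\265, so the assembly file stays plain text even
though the string data is not ASCII.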
There is no interworking between the two (i.e., characters in the
current locale are not at all related to \u escapes).
I'm not sure that this is a good idea, partly for the reasons
described above. It would mean that \u escapes would turn into
gibberish in the vast majority of locales in practical use today.
This means that the compiler, in locale-aware mode, would not be
strictly conforming, but so what?
Actually, draft C9x allows the behavior that you propose, because it
says that the relationship between multibyte chars and \u is
implementation defined. I lobbied for this design freedom; earlier
C9x drafts required closer conformance to Unicode (and my impression
is that C++ still requires it). I was hoping that this freedom would
let GCC (or at least cpp :-) function in a locale-invariant way. But
if we go this route, we have several problems:
* We won't handle \u the way that users will expect.
* We're limited to locales whose multibyte encodings never use ASCII
bytes -- and this rules out several popular encodings.
* We'll have to disable the checking for identifier spellings in
multibyte chars, since we won't know which multibyte chars are
letters and/or digits.
* In general, assembly language files will not be text files.
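
The point about encodings that reuse ASCII bytes can be sketched as
follows. Shift_JIS is used here purely as an illustration (the list
above does not name an encoding): in it, the trail byte of a two-byte
character may be 0x5C, the ASCII backslash. The function names are
hypothetical, and the lead-byte ranges are the commonly cited ones,
simplified for the example.

```c
#include <stddef.h>

/* Assumed Shift_JIS lead-byte ranges (simplified). */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Locale-blind scan: every 0x5C byte is taken to be a backslash. */
size_t count_backslash_bytes(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] == '\\')
            count++;
    return count;
}

/* Encoding-aware scan: a 0x5C that is the trail byte of a two-byte
   character is part of that character, not a backslash. */
size_t count_backslash_chars_sjis(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sjis_is_lead(s[i]) && i + 1 < n) {
            i++;                  /* skip the trail byte */
            continue;
        }
        if (s[i] == '\\')
            count++;
    }
    return count;
}
```

For the byte sequence 0x95 0x5C 0x6E (a two-byte character followed
by 'n'), the locale-blind scan reports one backslash and would treat
the tail as the escape \n, whereas the encoding-aware scan reports
none. A compiler that ignores the locale has only the first view.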