This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: thoughts on martin's proposed patch for GCC and UTF-8


   Date: Sat, 12 Dec 1998 11:18:00 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > I have misgivings about having GCC support multiple locales
   > simultaneously.

   gcc/g++ process strictly-conforming input that is already in the base
   character set (plus \u escapes), in a way that the standards
   mandate. Object files are then UTF-8 (or U escapes for C++).

But this would mean that \u escapes wouldn't have their intended
effect in non-UTF-8 locales.  E.g. "\u00b5" would turn into a two-byte
multibyte character string, which is incorrect for the common ISO
8859/1 encoding where it is represented by a single byte.
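
To make the size difference concrete, here is a small sketch that
encodes U+00B5 both ways.  Python is used purely for illustration; it
is not part of the proposal, and the variable names are made up.

```python
# U+00B5 (MICRO SIGN) is the character named by "\u00b5" in the example above.
micro = "\u00b5"

utf8_bytes = micro.encode("utf-8")          # what a UTF-8-only object file would hold
latin1_bytes = micro.encode("iso-8859-1")   # what an ISO 8859/1 locale expects

print(utf8_bytes.hex(), len(utf8_bytes))      # c2b5 2
print(latin1_bytes.hex(), len(latin1_bytes))  # b5 1
```

A program running in an ISO 8859/1 locale that printed the two-byte
form would show two unrelated characters instead of the micro sign.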

   gcc/g++ also process input based on the current locale

Yes.  But the current locale should affect the processing of \u
escapes, as well as the recognition of multibyte characters.

   and pass the input unmodified to the output.

This is largely correct for multibyte characters (though their bytes
may need escaping to satisfy some assemblers).  I think \u will need
to be translated, though, if possible -- unless the assembler handles
\u, which is not true for gas at least.
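
A rough sketch of what "translated, if possible" could look like,
again in Python for illustration only.  Both helper names are
hypothetical and nothing here is an existing GCC interface; the octal
spelling is just one assumed way to keep the assembler input plain
ASCII.

```python
def translate_ucn(code_point, charset):
    """Return the bytes for code_point in charset, or None if it has no mapping.

    Hypothetical helper for illustration; not a GCC interface.
    """
    try:
        return chr(code_point).encode(charset)
    except UnicodeEncodeError:
        return None


def asm_escape(data):
    """Spell each byte as a three-digit octal escape so the text stays ASCII."""
    return "".join("\\%03o" % b for b in data)


micro = translate_ucn(0x00B5, "iso-8859-1")  # representable: a single byte
euro = translate_ucn(0x20AC, "iso-8859-1")   # no mapping in ISO 8859/1

print(asm_escape(micro))  # \265
print(euro)               # None
```

The None case is the "if possible" part: the compiler would still
need some policy for \u escapes that the locale's charset cannot
represent.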

   There is no interworking between the two (i.e., characters in the
   current locale are not at all related to \u escapes).

I'm not sure that this is a good idea, partly for the reasons
described above.  It would mean that \u escapes would turn into
gibberish in the vast majority of locales in practical use today.

   This means that the compiler, in locale-aware mode, would not be
   strictly conforming, but so what?

Actually, draft C9x allows the behavior that you propose, because it
says that the relationship between multibyte chars and \u is
implementation defined.  I lobbied for this design freedom; earlier
C9x drafts required closer conformance to Unicode (and my impression
is that C++ still requires it).  I was hoping that this freedom would
let GCC (or at least cpp :-) function in a locale-invariant way.  But
if we go this route, we have several problems:

* We won't handle \u the way that users will expect.

* We're limited to locales whose multibyte encodings never use ASCII
  bytes -- and this rules out several popular encodings.

* We'll have to disable the checking for identifier spellings in
  multibyte chars, since we won't know which multibyte chars are
  letters and/or digits.

* In general, assembly language files will not be text files.
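
As an illustration of the second point, Shift_JIS is one encoding
whose multibyte characters can contain ASCII bytes.  The sketch below
(Python, for illustration only; the choice of U+30BD is mine) shows a
character whose second byte is 0x5C, the ASCII backslash, whereas its
UTF-8 form contains no ASCII bytes at all.

```python
# U+30BD (KATAKANA LETTER SO) is a well-known Shift_JIS problem character.
so = "\u30bd"

sjis = so.encode("shift_jis")
utf8 = so.encode("utf-8")

print(sjis.hex())  # 835c
print(utf8.hex())  # e382bd

# A byte-oriented scanner that ignores the locale sees a backslash here.
print(b"\\" in sjis)                  # True
print(any(b < 0x80 for b in utf8))    # False
```

A lexer that passes such bytes through without knowing the encoding
would treat the 0x5C as the start of an escape sequence inside a
string literal.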

