This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: thoughts on martin's proposed patch for GCC and UTF-8

To: martin at mira dot isdn dot cs dot tu-berlin dot de
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert <eggert at twinsun dot com>
Date: Tue, 15 Dec 1998 21:59:21 -0800 (PST)
CC: bothner at cygnus dot com, gcc2 at gnu dot org, egcs at cygnus dot com
References: <199812100702.XAA26400@cygnus.com> <199812120323.TAA10442@shade.twinsun.com> <199812121018.LAA02558@mira.isdn.cs.tu-berlin.de>

   Date: Sat, 12 Dec 1998 11:18:00 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > I have misgivings about having GCC support multiple locales
   > simultaneously.

   gcc/g++ process strictly-conforming input that is already in the base
   character set (plus \u escapes) in a way that the standards
   mandate. Object files are then UTF-8 (or U escapes for C++).

But this would mean that \u escapes wouldn't have their intended
effect in non-UTF-8 locales.  E.g. "\u00b5" would turn into a two-byte
multibyte character string, which is incorrect for the common ISO
8859/1 encoding where it is represented by a single byte.
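
To make the byte counts concrete, here is a minimal sketch in C. The
function names encode_utf8 and encode_latin1 are hypothetical helpers
written for this illustration; they are not part of GCC or of the
proposed patch.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* ISO 8859-1 maps code points 0..0xFF to the byte of the same value.
   Returns 1 on success, or 0 if cp has no ISO 8859-1 representation. */
size_t encode_latin1(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp > 0xFF)
        return 0;
    out[0] = (unsigned char)cp;
    return 1;
}
```

For U+00B5, encode_utf8 produces the two bytes 0xC2 0xB5, while
encode_latin1 produces the single byte 0xB5.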

   gcc/g++ also process input based on the current locale

Yes.  But the current locale should affect the processing of \u
escapes, as well as the recognition of multibyte characters.

   and pass the input unmodified to the output.

This is largely correct for multibyte characters (though their bytes
may need escaping to satisfy some assemblers).  I think \u will need
to be translated, though, if possible -- unless the assembler handles
\u, which is not true for gas at least.
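
As a rough illustration of the byte escaping mentioned above, the
sketch below rewrites every byte that is not plain printable ASCII as
a three-digit octal escape, a form that gas accepts inside string
literals. The function escape_for_asm is a hypothetical helper for
this example, not GCC's actual output routine.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n bytes from in to out as a NUL-terminated string, replacing
   quotes, backslashes, control bytes, and non-ASCII bytes with \ooo
   octal escapes.  Returns the length written (excluding the NUL),
   or -1 if out is too small. */
int escape_for_asm(const unsigned char *in, size_t n,
                   char *out, size_t outsz)
{
    size_t used = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    if (outsz == 0)
        return -1;
    for (i = 0; i < n; i++) {
        unsigned char c = in[i];
        int plain = c >= 0x20 && c < 0x7F && c != '"' && c != '\\';
        size_t need = plain ? 1 : 4;

        if (used + need + 1 > outsz)    /* keep room for the NUL */
            return -1;
        if (plain) {
            out[used] = (char)c;
        } else {
            out[used]     = '\\';
            out[used + 1] = (char)('0' + ((c >> 6) & 7));
            out[used + 2] = (char)('0' + ((c >> 3) & 7));
            out[used + 3] = (char)('0' + (c & 7));
        }
        used += need;
    }
    out[used] = '\0';
    return (int)used;
}
```

Given the UTF-8 bytes 0xC2 0xB5 followed by 's', this yields the text
\302\265s, which stays pure ASCII in the assembly file.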

   There is no interworking between the two (i.e., characters in the
   current locale are not at all related to \u escapes).

I'm not sure that this is a good idea, partly for the reasons
described above.  It would mean that \u escapes would turn into
gibberish in the vast majority of locales in practical use today.

   This means that the compiler, in locale-aware mode, would not be
   strictly conforming, but so what?

Actually, draft C9x allows the behavior that you propose, because it
says that the relationship between multibyte chars and \u is
implementation defined.  I lobbied for this design freedom; earlier
C9x drafts required closer conformance to Unicode (and my impression
is that C++ still requires it).  I was hoping that this freedom would
let GCC (or at least cpp :-) function in a locale-invariant way.  But
if we go this route, we have several problems:

* We won't handle \u the way that users will expect.

* We're limited to locales whose multibyte encodings never use ASCII
  bytes -- and this rules out several popular encodings.

* We'll have to disable the checking for identifier spellings in
  multibyte chars, since we won't know which multibyte chars are
  letters and/or digits.

* In general, assembly language files will not be text files.
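
The second problem in the list above can be sketched in C, taking
Shift-JIS as one example of an encoding whose multibyte characters
may contain ASCII byte values: the second byte of a two-byte
character can be 0x5C, the ASCII backslash.  The function names and
the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are assumptions made
for this illustration, not code from GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Locale-invariant scan: count every byte equal to backslash. */
size_t count_backslash_bytes(const unsigned char *s, size_t n)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < n; i++)
        if (s[i] == 0x5C)
            count++;
    return count;
}

/* Assumed Shift-JIS lead-byte ranges for two-byte characters. */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Encoding-aware scan: skip the trail byte of each two-byte
   character, so that only genuine backslash characters count. */
size_t count_backslash_chars_sjis(const unsigned char *s, size_t n)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < n; i++) {
        if (sjis_is_lead(s[i]) && i + 1 < n)
            i++;                    /* skip the trail byte */
        else if (s[i] == 0x5C)
            count++;
    }
    return count;
}
```

For the bytes 0x95 0x5C 'n' (one two-byte character followed by the
letter n), the locale-invariant scan reports a backslash and so would
see a spurious \n escape, while the encoding-aware scan reports none.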

