This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Re: c/3804: Extended ASCII "wide" characters not behaving with UTF-8 locale

To: Neil Booth <neil at daikokuya dot demon dot co dot uk>
Subject: Re: c/3804: Extended ASCII "wide" characters not behaving with UTF-8 locale
From: Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk>
Date: Wed, 25 Jul 2001 23:27:10 +0100
cc: gcc-bugs at gcc dot gnu dot org

Neil Booth wrote on 2001-07-25 22:10 UTC:
> Markus Kuhn wrote:-
> 
> > All GNU applications really should be designed such that if you decide
> > one morning to follow Plan9 and finally turn everything into UTF-8, then
> > all you have to do is add something like "export LC_CTYPE=en_GB.UTF-8" to
> > /etc/profile, run iconv once over all your plaintext files and filenames,
> > and everything works from then on right out of the box in UTF-8 now.
> > Everything, except for gcc it seems. Sad. :-(
> 
> But what if you're compiling some Japanese software?  Do you want
> GCC's diagnostics in Japanese?  I doubt it.

I use LC_CTYPE to specify the encoding and LC_MESSAGES to specify the
language of the messages. The two are perfectly orthogonal and
LANG=en_GB LC_CTYPE=ja_JP.EUC should allow me to compile an EUC encoded
source file while getting English error messages, without having to
generate an en_GB.EUC locale first.

POSIX admittedly does not support different encodings for input files
and error messages, but neither does my terminal emulator, so that issue
is irrelevant. It is perfectly ok and sane to require and expect that
source code and error messages always have the same encoding.

> The way I see it, the locale should determine the diagnostics.

The locale comes in several categories. LANG sets them all at once, but
you can specify them individually with the various LC_*. The language of
the diagnostics is determined by LC_MESSAGES. At least, that is how
POSIX.2 and I see it.

For instance, I use LANG=en_GB.UTF-8 LC_COLLATE=C in everyday usage,
because I prefer the naive ASCII sorting order with all uppercase words
first in "ls".
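
The precedence between these variables can be sketched in a few lines of Python. This is only an illustration of the POSIX rule (LC_ALL first, then the category's own LC_* variable, then LANG); the helper name effective_locale is made up for this example and is not part of glibc or GCC, and the fallback to "C" is an assumption, since POSIX leaves that default implementation-defined.

```python
# Sketch of POSIX locale-category resolution. effective_locale is a
# hypothetical helper for illustration, not a real libc or GCC interface.

def effective_locale(category, env):
    """Return the locale name that applies to one category, e.g. "LC_CTYPE"."""
    # Precedence: LC_ALL overrides everything, then the category's own
    # variable, then LANG. Empty values count as unset.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:
            return value
    # Assumption: fall back to "C" when nothing is set (POSIX leaves the
    # actual default implementation-defined).
    return "C"


# EUC-encoded source with English messages, as in the example further up.
env = {"LANG": "en_GB", "LC_CTYPE": "ja_JP.EUC"}
print(effective_locale("LC_CTYPE", env))     # ja_JP.EUC
print(effective_locale("LC_MESSAGES", env))  # en_GB
```

With LANG=en_GB.UTF-8 LC_COLLATE=C the same function yields "C" for LC_COLLATE and "en_GB.UTF-8" for every other category.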

> The character sets of input files should be specified independently,
> particularly since different files will have different charsets
> (e.g. Japanese EUC-jp or Shift-JIS files #including ASCII system
> headers).

The system headers case is easy: System-wide headers should be in ASCII
only for a long time to come, and an ASCII file is already both a valid
UTF-8 and at the same time a valid EUC-JP file. The interpretation of
ASCII files is locale invariant.
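
The locale invariance of ASCII files is easy to check. The following Python sketch decodes the same bytes under several encodings and compares the results; the helper name decodes_identically and the sample header text are invented for this illustration, and "euc_jp" is the codec name used by Python's standard library.

```python
# Sketch: a pure-ASCII file reads the same under ASCII, UTF-8 and EUC-JP,
# while EUC-JP encoded Japanese text does not survive being read as UTF-8.
# decodes_identically is a hypothetical helper for illustration only.

def decodes_identically(data, encodings=("ascii", "utf-8", "euc_jp")):
    """True if `data` decodes to the same text under every listed encoding."""
    results = set()
    for enc in encodings:
        try:
            results.add(data.decode(enc))
        except UnicodeDecodeError:
            # Not even valid in this encoding, so certainly not invariant.
            return False
    return len(results) == 1


header = b"#include <stdio.h>\nint main(void) { return 0; }\n"
print(decodes_identically(header))                         # True

japanese = "日本語".encode("euc_jp")
print(decodes_identically(japanese, ("utf-8", "euc_jp")))  # False
```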

POSIX systems should not support ASCII-incompatible locales (such as
national 7-bit ISO 646 variants) in which ASCII bytes can have non-ASCII
meanings. It is just an endless source of trouble. EUC-JP is fine and,
together with UTF-8, it is the only Japanese locale really suited for use
under POSIX. Shift-JIS should best disappear. It is really an email-only
format, nothing to do with locales. Fortunately, glibc 2.2 has only
ASCII-compatible locales, so ASCII files are guaranteed to be locale
invariant.
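
To see why the national ISO 646 variants cause trouble, consider what C source looks like when its bytes are interpreted in the German variant (DIN 66003). The Python sketch below uses a hand-written table of the eight positions where that variant differs from ASCII; the function name read_as_din_66003 is invented for this illustration, and no standard codec is assumed.

```python
# Sketch: the same 7-bit bytes read under the German ISO 646 variant.
# The table lists the positions where DIN 66003 differs from ASCII;
# read_as_din_66003 is a hypothetical helper for illustration only.

DIN_66003 = {
    0x40: "§",  # '@' in ASCII
    0x5B: "Ä",  # '['
    0x5C: "Ö",  # '\'
    0x5D: "Ü",  # ']'
    0x7B: "ä",  # '{'
    0x7C: "ö",  # '|'
    0x7D: "ü",  # '}'
    0x7E: "ß",  # '~'
}


def read_as_din_66003(data):
    """Interpret 7-bit bytes using the German national variant."""
    if any(b > 0x7F for b in data):
        raise ValueError("ISO 646 is a 7-bit code")
    return "".join(DIN_66003.get(b, chr(b)) for b in data)


source = b"int a[2] = {1, 2};"
print(read_as_din_66003(source))  # int aÄ2Ü = ä1, 2ü;
```

Brackets and braces, which C cannot do without, are exactly the bytes that change meaning, so such a file is not locale invariant.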

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

References:
- Re: c/3804: Extended ASCII "wide" characters not behaving with UTF-8 locale
  - From: Neil Booth
