gcc ignores locale (no UTF-8 source code supported)

Markus Kuhn Markus.Kuhn@cl.cam.ac.uk
Sat Sep 23 09:35:00 GMT 2000

"Martin v. Loewis" wrote on 2000-09-22 19:34 UTC:
> > It seems that gcc ignores the locale and does not use glibc's multi-byte
> > decoding functions to read in wide-string literals. :-(
> I believe that gcc rightfully ignores the locale.

I strongly disagree for the reasons outlined below.

> The C standard says
> that input files are mapped to the source character set in an
> implementation-defined way; nowhere it says that environment settings
> of the user operating the compiler should be taken into account.

If gcc runs on a POSIX system, then the POSIX spec also comes into play
and POSIX applications should clearly determine the character encoding
in all their input/output streams based on the locale setting, unless
some other way (e.g., MIME headers, command-line options,
implementation-defined source code pragmas for compilers, etc.) has been
used to override the current locale. POSIX already specifies what the
"implementation-defined way of determining the source character set"
that the C standard refers to actually is.
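
To illustrate what "determining the encoding from the locale" means in
practice, here is a minimal sketch of how a POSIX program could ask for
the character encoding of the user's locale. The function name
locale_codeset is made up for this example; setlocale() and
nl_langinfo(CODESET) are the standard POSIX calls. This is not what gcc
does, only what a locale-aware tool might do:

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: returns the name of the character encoding
   selected by the user's environment (LC_ALL, LC_CTYPE, LANG),
   for example "UTF-8" or "ISO-8859-1". */
const char *locale_codeset(void)
{
    /* "" means: take the locale from the environment. If the
       environment names an unknown locale, setlocale() fails and the
       program simply stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");

    /* CODESET names the encoding of the current LC_CTYPE locale. */
    return nl_langinfo(CODESET);
}
```

A tool could then use the returned name as the default input encoding,
unless a command-line option or a pragma in the file overrides it.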

> It would be wrong to take such settings into account: the results of
> invoking the compiler would not be reproducible anymore, and it would
> not be possible to mix header files that are written in different
> encodings - who says that header files on a system have an encoding
> that necessarily matches the environment settings of some user?

First of all: Encodings are trivial to convert into each other (simply
use iconv, recode, etc.). Users on POSIX systems have to make an effort
to keep all their files in the same encoding, namely the encoding
specified in their locale. The rapid proliferation of UTF-8 will make
this actually feasible in the near future, because UTF-8 can
practically be used in place of all other encodings. The fathers of Unix
already decided back in 1992 (Plan 9) that this is the only real way
to go, and I hope the GNU/Linux world will follow soon.

I hope that one day in the not too distant future I can simply place into
/etc/profile the line

  export LANG=C.UTF-8

then convert all my plain text files on my installation to UTF-8, and
from then on never have to worry about the restrictions of ASCII or the
problems of switching between different encodings any more. Sounds like
a promising idea to me, but it clearly requires also that gcc -- like
any other POSIX application that has to know the file encodings -- will
honor the locale setting.

> I believe that characters outside the basic character set (i.e. ASCII)
> should not be used in portable software.

The authors of the C standard made it very clear that they want to
support the ISO 10646 repertoire in source code, and I hope that this
will soon become common practice.

> If you absolutely have to
> have non-ASCII characters in your source code, you should use
> universal character names, i.e.
> wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n");

Please not!!! If I run on a beautiful modern system with full UTF-8
support, then I definitely want to make full use of this encoding in my
development environment. Hex escape sequences like the above one should
soon be seen as an emergency fallback mechanism for use in cases
where archaic environments (such as gcc 2.95 ;-) have to be maintained.
In such situations, a trivial recoding program can be used to convert
the normal UTF-8 source code into an ugly and user-unfriendly emergency
fallback such as L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" when files are
transmitted to the archaic system. You must not confuse the emergency
hack (hex fallbacks) with the daily usage on modern systems (UTF-8).
Gettext() only makes sense if support for multilingual messages is a
requirement. If I am a Thai student writing UTF-8 C source code for a
Thai programming class, then I want to use the Thai alphabet in
variables, comments, and wide-string literals just like you use ASCII.

I am convinced that

  a) people will use lots of non-ASCII text in C source code (even
     English-speaking people will find en/em-dashes, curly quotation marks
     and mathematical symbols a highly desirable extension beyond ASCII)
  b) people will prefer to have these characters UTF-8 encoded in their
     development environment such that they see in the text editor the
     actual characters and not the hex fallback
  c) people will find it trivial to use a 5-line Perl script to
     convert L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"
     in case they encounter a (hopefully soon very rare) environment
     that can't handle ISO 10646 characters. It's just like how they
     already find it trivial to convert {[]}^~ into trigraphs when they
     encounter a (thank God already exceedingly rare) system that does
     not handle all ISO 646 IRV characters.
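
The conversion in item c) could be sketched in C roughly as follows.
This is only an illustration under stated assumptions: the function
names decode_utf8 and utf8_to_ucn are invented here, the input is
assumed to be valid UTF-8, and every non-ASCII character is escaped
blindly, whether it sits in a string literal, a comment, or an
identifier. A real recoding tool would want stricter validation (this
one does not reject overlong forms or surrogates):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s and
   store the code point in *cp. Returns the number of bytes consumed,
   or 0 if the sequence is malformed. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;              /* stray continuation or bad lead byte */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also stops at a premature '\0' */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical converter: copy the NUL-terminated UTF-8 string `in`
   to `out`, replacing each non-ASCII character with a universal
   character name (\uXXXX, or \UXXXXXXXX above U+FFFF).
   Returns 0 on success, -1 on malformed input or if `out` is too small. */
int utf8_to_ucn(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    while (*s != '\0') {
        unsigned long cp;
        size_t n = decode_utf8(s, &cp);
        int w;

        if (n == 0)
            return -1;
        if (cp < 0x80)
            w = snprintf(out + used, outsize - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            w = snprintf(out + used, outsize - used, "\\u%04lX", cp);
        else
            w = snprintf(out + used, outsize - used, "\\U%08lX", cp);

        /* snprintf reports the length it needed; reject truncation. */
        if (w < 0 || (size_t)w >= outsize - used)
            return -1;
        used += (size_t)w;
        s += n;
    }
    return 0;
}
```

Fed the UTF-8 bytes of Schöne Grüße! this writes
Sch\u00F6ne Gr\u00FC\u00DFe! into the output buffer.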

Please please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
ugly and hopefully unnecessary as trigraphs, not as common or even
recommendable practice! Otherwise you will just reveal yourself as an
ASCII chauvinist and I shall condemn you to years of maximally portable
trigraph usage ... ;-)


P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: < http://www.cl.cam.ac.uk/~mgk25/ >
