Re: gcc ignores locale (no UTF-8 source code supported)


"Martin v. Loewis" wrote on 2000-09-23 19:17 UTC:
> > POSIX specifies already what the "implementation-defined way of
> > determining the source character set" is that the C standard refers
> > to.
> 
> Can you please point to the exact chapter and verse of Posix that
> specifies that the C compiler must consider environment variables when
> reading source code?

POSIX.2 requires the interpretation of LANG for most of its own
applications and thereby sets an example of good implementation
practice that other applications should follow as well. [I can
provide the holy words of IEEE when I'm back in the departmental
library on Monday to read the precious scripture. :]

> I can easily imagine that gcc supports a -futf-8 option some day (or
> -fencoding=utf8). I hope it will never consider LANG when reading
> source code, though. That is evil.

Your -futf-8 just adds yet another entry to my long list of non-standard
command-line options for telling an application to use UTF-8, which you
can find at

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

UTF-8 will never fly if switching from ASCII to UTF-8 requires users to
memorize and specify two dozen different special command-line options
from now on. That is what the locale environment variables are for,
and they work beautifully. Unix was fundamentally built around the
idea that files and pipes are not typed, so every effort should be made
to move towards one single globally acceptable character encoding that
will hopefully soon be as ubiquitous as ASCII.
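
Just to illustrate (a minimal sketch, not a description of anything
gcc does today): a locale-aware tool needs no new command-line option
at all, because POSIX already provides setlocale() and
nl_langinfo(CODESET) to ask which encoding the user's environment has
selected:

  #include <stdio.h>
  #include <string.h>
  #include <locale.h>      /* setlocale() */
  #include <langinfo.h>    /* nl_langinfo(), CODESET */

  int main(void)
  {
      const char *codeset;

      /* Adopt whatever encoding LANG/LC_CTYPE select. */
      setlocale(LC_CTYPE, "");

      /* nl_langinfo(CODESET) returns e.g. "UTF-8" under en_GB.UTF-8. */
      codeset = nl_langinfo(CODESET);

      if (strcmp(codeset, "UTF-8") == 0)
          printf("reading source files as UTF-8\n");
      else
          printf("locale codeset is %s\n", codeset);
      return 0;
  }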

A far cleaner solution for your needs is to extend the existing option
-pedantic to issue a warning whenever it encounters a character outside
the portable character set (such as @ or ü) in the source code. This
way, you can easily check your code if you insist on bible-proof
character portability. [Feel free to add -tripedantic which adds
warnings whenever {[]}~^ and other characters that are not present in
all ISO 646 variants occur. (Oh yes, also -morsepedantic and
-baudotpedantic for guaranteed shortwave and telex compatibility of C
source code. You really can't trust the average telegraph operator with
any non-portable characters in your precious wired C source code.)]
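
To make the idea concrete, here is a hypothetical stand-alone checker
in the spirit of such a -pedantic extension (my own sketch, not gcc
code): it reports every byte outside the ISO C basic source character
set, so '@', '$', '`' and the bytes of a UTF-8 encoded 'ü' are all
flagged:

  #include <stdio.h>
  #include <string.h>

  /* The ISO C basic source character set: letters, digits, space,
   * a fixed set of punctuation, and a few control characters. */
  static int in_basic_source_set(int c)
  {
      static const char punct[] = "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";

      if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
          (c >= '0' && c <= '9'))
          return 1;
      if (c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\n')
          return 1;
      return c != '\0' && strchr(punct, c) != NULL;
  }

  int main(int argc, char **argv)
  {
      FILE *f = (argc > 1) ? fopen(argv[1], "rb") : stdin;
      long line = 1;
      int c;

      if (f == NULL) {
          perror(argv[1]);
          return 1;
      }
      while ((c = getc(f)) != EOF) {
          if (!in_basic_source_set(c))
              fprintf(stderr, "%ld: byte 0x%02X is outside the portable "
                      "character set\n", line, (unsigned)c);
          if (c == '\n')
              line++;
      }
      return 0;
  }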

> I guess you don't type UTF-8 bytes byte-for-byte into the files;

Usually not, but sometimes (rarely, on old ASCII terminals) I do
indeed, and that is usually not much less convenient than \u escapes.
Why should "\u00a1" be more readable than "<C2> <A1>" (which is what
less produces in ASCII mode when it sees UTF-8) or \302\241 (which is
what emacs shows in ASCII mode)? All these hex forms are equally
unfriendly and suitable only for emergency use.
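
In case the equivalence is not obvious: all three spellings name the
same two bytes, since U+00A1 (INVERTED EXCLAMATION MARK) encodes in
UTF-8 as 0xC2 0xA1, i.e. octal \302 \241. A trivial C program
demonstrates it:

  #include <stdio.h>

  int main(void)
  {
      const unsigned char raw[] = { 0xC2, 0xA1 };  /* <C2> <A1> in less */
      const char *octal = "\302\241";              /* emacs' ASCII-mode view */

      /* Both spellings name the same two bytes: */
      printf("%02X %02X\n", (unsigned)raw[0], (unsigned)raw[1]);
      printf("%02X %02X\n", (unsigned)(unsigned char)octal[0],
                            (unsigned)(unsigned char)octal[1]);
      /* Both lines print: C2 A1 */
      return 0;
  }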

> instead, your editor is capable of producing them on a key
> stroke. Just tell your beautiful modern system to produce
> universal-character-names when you type the keys. So the line above
> would *display* with umlauts, even though the file uses a MBCS
> encoding (namely, \u escapes).

But this works ONLY if ALL the applications handling the source code
(editor, file viewer, CASE tools, diff, CVS web browser, etc. etc.
etc.) are familiar with C syntax and apply a rather non-trivial and
very language-specific tokenization process before they can display
Unicode characters adequately. UTF-8, on the other hand, can be safely
and robustly interpreted by components as ignorant as the terminal
emulator, without any of the programs involved in the processing
pipeline having to know the slightest bit about C's token syntax. I
want "cat test.c" to keep working in a user-friendly way when
non-ASCII characters are present. UTF-8 allows this; \u escapes
certainly do not.

> An advanced editor (such as Emacs) is capable of dealing with multiple
> encodings, it certainly could associate C files (and C++ and Java and
> Tcl) with an encoding unicode-escape or such. Maybe it is time to
> further improve your system.

But this would just restrict me to one single kitchen-sink tool such
as Emacs, and I would still find non-ASCII characters in my source
code being treated as third-class citizens by the many, many other
tools that I use besides Emacs (wdiff, gdb, tcl tools for cvs, etc.
etc. etc.). Paste a few lines of C code into your mailer and (unless
you operate completely within a single product such as Emacs) the
\uXXXX will show up again. It is far more likely that both your C
editor and your email editor will understand UTF-8 in the near future
than that both are identical to Emacs.

> > You must not confuse the emergency hack (hex fallbacks) with the
> > daily usage on modern systems (UTF-8).
> 
> Why is one multibyte encoding capable of expressing full Unicode
> (UTF-8) more modern than another one (universal character names)?

Should be obvious: UTF-8 does not require a C scanner to be processed,
but the C universal character names do. UTF-8 can be easily and safely
integrated into such dumb things as terminal emulators and can be used
end-to-end in a processing pipeline in which ASCII sequences carry
special semantics, because a multibyte UTF-8 sequence never contains
spurious ASCII bytes. Universal character names are C specific and not at
all universal. Fortran, Ada95, TeX and XML all have their own
independent "universal character name" equivalents, yet all of them
could process UTF-8 smoothly.
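
If you doubt how little machinery that takes, here is a complete
UTF-8 decoder sketched in a few lines of C (illustrative only; a
production decoder would also reject overlong and otherwise malformed
sequences). Note that no knowledge of any programming language's
token syntax appears anywhere:

  #include <stdio.h>

  /* Decode one UTF-8 sequence starting at s, store the code point in
   * *cp, and return the number of bytes consumed (0 on a bad lead
   * byte). Assumes the continuation bytes are present and valid. */
  static int utf8_decode(const unsigned char *s, unsigned long *cp)
  {
      if (s[0] < 0x80) {                       /* plain ASCII */
          *cp = s[0];
          return 1;
      }
      if ((s[0] & 0xE0) == 0xC0) {             /* 2-byte sequence */
          *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3F);
          return 2;
      }
      if ((s[0] & 0xF0) == 0xE0) {             /* 3-byte sequence */
          *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
                | (s[2] & 0x3F);
          return 3;
      }
      if ((s[0] & 0xF8) == 0xF0) {             /* 4-byte sequence */
          *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
                | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3F);
          return 4;
      }
      return 0;                                /* malformed lead byte */
  }

  int main(void)
  {
      const unsigned char text[] = { 0xC2, 0xA1 };  /* U+00A1 in UTF-8 */
      unsigned long cp;

      if (utf8_decode(text, &cp))
          printf("U+%04lX\n", cp);                  /* prints: U+00A1 */
      return 0;
  }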

If you look at it this way (namely portability and interoperability
of tools), then UTF-8 quickly becomes the least common denominator
that enables portable exchange of non-ASCII content across tools and
platforms. \u sequences remain a C-specific fallback hack. We should
make the use of UTF-8 as easy and natural as possible. As natural as
ASCII.

> Just use the right text editor - not one that produces UTF-8,
> but one that produces universal character names. That way, you can
> have all the features you want, *and* your code will compile even if
> you take it with you when hired by a German company.

Much more likely is the scenario in which the German company is in any
case already using the same encoding as the Thai company: UTF-8.

> >   b) people will prefer to have these characters UTF-8 encoded in their
> >      development environment such that they see in the text editor the
> >      actual characters and not the hex fallback
> 
> People won't care about encodings as long as it works.

The problem is that I don't see how your "editor hides \u sequences
from the user" proposal would work conveniently. It will remain as
ugly as base64 as soon as you leave the confines of your editor. Don't
repeat the mistake that the email folks made with their base64 mess.
It simply does not scale across all the tools that you might want to
use to touch your source code.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

