This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: thoughts on martin's proposed patch for GCC and UTF-8

To: rms at gnu dot org
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert <eggert at twinsun dot com>
Date: Wed, 23 Dec 1998 19:20:03 -0800 (PST)
CC: zack at rabi dot columbia dot edu, amylaar at cygnus dot co dot uk, martin at mira dot isdn dot cs dot tu-berlin dot de, gcc2 at gnu dot org, egcs at cygnus dot com
References: <199812220428.XAA13457@blastula.phys.columbia.edu> <199812221058.CAA08826@shade.twinsun.com> <199812240116.SAA28796@wijiji.santafe.edu>

   Date: Wed, 23 Dec 1998 18:16:42 -0700 (MST)
   From: Richard Stallman <rms@gnu.org>

   It is pointless and buggy to translate strings to UTF-8 and then
   translate them back.

I agree, and my proposal doesn't do that for C.  String bytes are
copied straight through.

   It is pointless and mistaken to translate symbols to UTF-8.  The
   assembler won't accept them in UTF-8, and users who use other
   encodings wouldn't want them in UTF-8 anyway.

For non-GNU platforms like Solaris, we'll have to follow the
platform's convention in this area, so that GCC-compiled code can link
to non-GCC-compiled code.  Most likely we'll need a way to configure
the method GCC uses to output non-ASCII identifiers in assembly
language, as there probably won't be a universally accepted standard
method.  Possibly, some platforms will require symbols to be
translated to a canonical form (allowing cross-locale linking) and
other platforms will just use the symbol bytes as-is (disallowing
cross-locale linking); GCC will just have to go with the flow.

For GNU platforms, my understanding is that GAS allows arbitrary bytes
in symbols, so it is plausible to use UTF-8 for the canonical symbol
encoding.  If we go this route, assembler files will be UTF-8.  In
general, GCC will have to use \x escapes in strings to represent the
bytes of non-ASCII characters, so that string bytes are copied
straight through without loss of information.  This is not a major
loss, though: \x escapes will be required no matter what solution is
employed, since we want the assembler to be locale-independent.
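
To make the \x-escape idea concrete, here is a minimal sketch of how
string bytes might be written out in ASCII-only form.  It is not GCC's
actual output routine; the name escape_bytes and the buffer convention
are hypothetical.  It also assumes that the reader of the output may
treat every hex digit following \x as part of the escape (C does this,
and I believe GAS does too), so a literal hex digit that follows an
escaped byte is escaped as well.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy LEN bytes from SRC into DST as an
   ASCII-only string body.  Printable ASCII is copied as-is; every
   other byte, plus '"' and '\\', becomes \xHH.  DST must have room
   for 4 * LEN + 1 bytes.  Returns the number of bytes written, not
   counting the terminating NUL.  */
static size_t
escape_bytes (const unsigned char *src, size_t len, char *dst)
{
  size_t n = 0;
  int prev_was_escape = 0;

  assert (src != NULL && dst != NULL);

  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = src[i];
      int plain = c >= 0x20 && c < 0x7f && c != '"' && c != '\\';

      /* A hex digit right after \xHH could be read as part of that
         escape, so escape it too.  */
      if (!plain || (prev_was_escape && isxdigit (c)))
        {
          n += (size_t) sprintf (dst + n, "\\x%02x", (unsigned) c);
          prev_was_escape = 1;
        }
      else
        {
          dst[n++] = (char) c;
          prev_was_escape = 0;
        }
    }
  dst[n] = '\0';
  return n;
}
```

With this sketch, the three bytes C2 B5 6D (MICRO SIGN in UTF-8,
followed by `m') come out as the nine ASCII characters \xc2\xb5m, and
the original bytes can be recovered regardless of the locale in which
the assembler runs.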

Another possibility for GNU is to mangle symbols into some form of
ASCII.  To do this, we'll have to come up with a mangling method that
is compatible with existing C++ mangling, and which doesn't usurp
existing user identifier space.  You proposed a method, but someone
else found a problem with it (sorry, I don't recall the details).
Even if we solve the mangling problem, though, the ASCII-only
name-mangling method seems less useful than UTF-8 name mangling.
Neither mangling method allows an arbitrary native encoding
(e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, but at least the
UTF-8 mangling method allows UTF-8 to be used uniformly.

By the way, even if we don't care about linking from different
locales, GCC must still translate symbols to a canonical form.  For
example, suppose `@' denotes the character MICRO SIGN (Unicode
character 00b5).  Then `@' (1 character) and `\u00b5' (6 characters)
are different spellings of the same symbol, and GCC must unify the two
spellings.  This is true no matter how the symbol is represented in
assembly language output.
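
As a rough illustration of that unification step, the sketch below
rewrites each \uXXXX spelling in an identifier into UTF-8 bytes, so
that both spellings compare equal afterward.  It assumes the source
file itself is UTF-8 and that UTF-8 is the chosen canonical form; the
function names are hypothetical, and validation of which characters
are allowed in identifiers is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Encode code point CP (at most 0xFFFF here) as UTF-8 into OUT and
   return the number of bytes written (1 to 3).  */
static size_t
utf8_encode (unsigned cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (char) (0xE0 | (cp >> 12));
  out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Parse exactly four hex digits at S into *CP.  Returns 1 on success,
   0 otherwise (including when the string ends early).  */
static int
parse_hex4 (const char *s, unsigned *cp)
{
  unsigned v = 0;
  for (int i = 0; i < 4; i++)
    {
      unsigned char d = (unsigned char) s[i];
      if (!isxdigit (d))
        return 0;
      v = v * 16 + (isdigit (d) ? (unsigned) (d - '0')
                                : (unsigned) (tolower (d) - 'a' + 10));
    }
  *cp = v;
  return 1;
}

/* Hypothetical canonicalizer: copy SRC to DST, replacing each \uXXXX
   with the UTF-8 encoding of that code point.  The result is never
   longer than SRC, so DST needs strlen (SRC) + 1 bytes.  */
static void
canonicalize_identifier (const char *src, char *dst)
{
  assert (src != NULL && dst != NULL);
  while (*src)
    {
      unsigned cp;
      if (src[0] == '\\' && src[1] == 'u' && parse_hex4 (src + 2, &cp))
        {
          dst += utf8_encode (cp, dst);
          src += 6;
        }
      else
        *dst++ = *src++;
    }
  *dst = '\0';
}
```

Under these assumptions, an identifier spelled with \u00b5 and one
spelled with the raw UTF-8 bytes C2 B5 both canonicalize to the same
byte sequence, so a symbol-table lookup by bytes treats them as one
symbol.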

