This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Wed, 23 Dec 1998 18:16:42 -0700 (MST)
From: Richard Stallman <rms@gnu.org>
> It is pointless and buggy to translate strings to UTF-8 and then
> translate them back.
I agree, and my proposal doesn't do that for C. String bytes are
copied straight through.
> It is pointless and mistaken to translate symbols to UTF-8. The
> assembler won't accept them in UTF-8, and users who use other
> encodings wouldn't want them in UTF-8 anyway.
For non-GNU platforms like Solaris, we'll have to follow the
platform's convention in this area, so that GCC-compiled code can link
to non-GCC-compiled code. Most likely we'll need a way to configure
the method GCC uses to output non-ASCII identifiers in assembly
language, as there probably won't be a universally accepted standard
method. Possibly, some platforms will require symbols to be
translated to a canonical form (allowing cross-locale linking) and
other platforms will just use the symbol bytes as-is (disallowing
cross-locale linking); GCC will just have to go with the flow.
For GNU platforms, my understanding is that GAS allows arbitrary bytes
in symbols, so it is plausible to use UTF-8 for the canonical symbol
encoding. If we go this route, assembler files will be UTF-8. GCC
will then have to use \x escapes in strings to represent the bytes of
non-ASCII characters, so that string bytes are copied straight through
without loss of information. Requiring \x escapes is not a major
loss, though: they will be needed no matter what solution is employed,
since we want the assembler to be locale-independent.
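To make the \x-escape idea concrete, here is a minimal sketch in C of
how string bytes could be written out using only ASCII. The function
name escape_string_bytes and its interface are hypothetical, not part
of GCC. One detail the sketch has to handle: in C, and to my
understanding in GAS as well, a \x escape absorbs every hex digit that
follows it, so a hex-digit character right after an escape is escaped
too.

```c
#include <stddef.h>
#include <stdio.h>
#include <ctype.h>

/* Hypothetical helper (not a GCC function): write the LEN bytes of S
   into OUT as the body of a quoted string, using only printable ASCII.
   Returns the number of characters written, not counting the
   terminating NUL, or (size_t) -1 if OUT is too small.  */
size_t
escape_string_bytes (const unsigned char *s, size_t len,
                     char *out, size_t outsize)
{
  size_t n = 0;
  int prev_was_escape = 0;

  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = s[i];
      int plain = c >= 0x20 && c < 0x7f && c != '"' && c != '\\';

      /* A \x escape absorbs any hex digits that follow it, so a hex
         digit directly after an escape must be escaped as well.  */
      if (plain && prev_was_escape && isxdigit (c))
        plain = 0;

      if (plain)
        {
          if (n + 2 > outsize)      /* one character plus NUL */
            return (size_t) -1;
          out[n++] = (char) c;
          prev_was_escape = 0;
        }
      else
        {
          if (n + 5 > outsize)      /* "\xNN" plus NUL */
            return (size_t) -1;
          n += (size_t) snprintf (out + n, outsize - n, "\\x%02x", c);
          prev_was_escape = 1;
        }
    }

  if (n >= outsize)
    return (size_t) -1;
  out[n] = '\0';
  return n;
}
```

With this sketch, the three bytes C2 B5 61 (UTF-8 MICRO SIGN followed
by `a') come out as \xc2\xb5\x61, while a purely ASCII string such as
"hi" is copied unchanged. Either way the output does not depend on the
locale in which the assembler runs.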
Another possibility for GNU is to mangle symbols into some form of
ASCII. To do this, we'll have to come up with a mangling method that
is compatible with existing C++ mangling, and which doesn't usurp
existing user identifier space. You proposed a method, but someone
else found a problem with it (sorry, I don't recall the details).
Even if we solve the mangling problem, though, the ASCII-only
name-mangling method seems less useful than UTF-8 name mangling.
Neither mangling method allows an arbitrary native encoding
(e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, but at least the
UTF-8 mangling method allows UTF-8 to be used uniformly.
By the way, even if we don't care about linking from different
locales, GCC must still translate symbols to a canonical form. For
example, suppose `@' denotes the character MICRO SIGN (Unicode
character 00b5). Then `@' (1 character) and `\u00b5' (6 characters)
are different spellings of the same symbol, and GCC must unify the two
spellings. This is true no matter how the symbol is represented in
assembly language output.
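
As an illustration of unifying the two spellings, here is a minimal
sketch in C that rewrites each \uXXXX escape in an identifier as UTF-8
bytes. The name canonicalize_identifier is hypothetical, and the sketch
assumes both that UTF-8 is the chosen canonical form and that any
non-escaped bytes in the input are already UTF-8; a real compiler would
first have to convert from the source file's native encoding. Only
four-digit \u escapes are handled.

```c
#include <stddef.h>
#include <ctype.h>

/* Value of hex digit C; the caller guarantees isxdigit (C).  */
static unsigned
hex_value (unsigned char c)
{
  if (c >= '0' && c <= '9')
    return (unsigned) (c - '0');
  return (unsigned) (tolower (c) - 'a') + 10;
}

/* Hypothetical helper (not a GCC function): copy identifier SRC to OUT,
   replacing each \uXXXX escape with the UTF-8 encoding of that
   character.  All other bytes are copied unchanged and are ASSUMED to
   be UTF-8 already.  Returns the length of the result, or (size_t) -1
   on a malformed escape or if OUT is too small.  */
size_t
canonicalize_identifier (const char *src, char *out, size_t outsize)
{
  size_t n = 0;

  while (*src)
    {
      if (src[0] == '\\' && src[1] == 'u')
        {
          unsigned cp = 0;
          for (int i = 2; i < 6; i++)
            {
              unsigned char d = (unsigned char) src[i];
              if (!isxdigit (d))    /* also stops at the NUL */
                return (size_t) -1;
              cp = cp * 16 + hex_value (d);
            }
          src += 6;

          if (cp >= 0xd800 && cp <= 0xdfff)   /* surrogates: invalid */
            return (size_t) -1;
          if (n + 4 > outsize)      /* at most 3 bytes plus NUL */
            return (size_t) -1;

          if (cp < 0x80)
            out[n++] = (char) cp;
          else if (cp < 0x800)
            {
              out[n++] = (char) (0xc0 | (cp >> 6));
              out[n++] = (char) (0x80 | (cp & 0x3f));
            }
          else
            {
              out[n++] = (char) (0xe0 | (cp >> 12));
              out[n++] = (char) (0x80 | ((cp >> 6) & 0x3f));
              out[n++] = (char) (0x80 | (cp & 0x3f));
            }
        }
      else
        {
          if (n + 2 > outsize)      /* one byte plus NUL */
            return (size_t) -1;
          out[n++] = *src++;
        }
    }

  if (n >= outsize)
    return (size_t) -1;
  out[n] = '\0';
  return n;
}
```

Under these assumptions, the six-character spelling \u00b5 and the
directly typed MICRO SIGN (bytes C2 B5 in UTF-8) both canonicalize to
the same two bytes, so a symbol table keyed on the canonical form
treats them as one symbol.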