This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: thoughts on martin's proposed patch for GCC and UTF-8
   Date: Fri, 25 Dec 1998 03:07:56 -0500
   From: Richard Stallman <rms@gnu.org>

       Even if we solve the mangling problem, though, the ASCII-only
       name-mangling method seems less useful than UTF-8 name mangling.
       Neither mangling method allows an arbitrary native encoding
       (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly,

   ASCII-only name mangling ought to achieve that.  Could you
   please explain why you think it will not?

Here's what I was thinking:
* Unsafe native encodings can't be used in assembly-language strings.
The simplest way to handle this is to do what GCC currently does:
escape non-ASCII bytes in assembly-language strings using notation
like `\377'.
* Hence, if ASCII-only name mangling is also used, assembly language
files will contain only ASCII, regardless of the input encoding.
* This will work, but it's unfriendly for non-English writers, because
it means that assembly language uses ASCII instead of the native
encoding -- i.e. the native encoding isn't being used uniformly in
both source and assembly language output. E.g. suppose we have the
following code:
      const char message[] = "contents";
except that the words `message' and `contents' are in Japanese. A
Japanese reader would naturally desire to see something like the
following assembly language output:
      message:
              .asciz "contents"
except, of course, the words `message' and `contents' would be in
Japanese. Unfortunately, though, with ASCII name mangling, and with
string mangling as described above, the Japanese reader will see
something like the following instead:
      .x8c.x32.x9c.x41.x91.x32.xac.x90:
              .asciz "\200 \x309!\x240@\x201\\\x300\""
which is painful to work with.
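
As a rough illustration of the string escaping described in the first
point above, here is a minimal C sketch.  The function name
escape_asm_string and its buffer-based interface are hypothetical,
chosen only for this example; this is not GCC's actual code.  It passes
printable ASCII through unchanged and writes every other byte, as well
as `"' and `\', as a three-digit octal escape such as `\377'.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only (not GCC's implementation).
   Copy LEN bytes from SRC into DST as the body of an assembler string,
   writing any byte that is not printable ASCII, and also '"' and '\\',
   as a backslash followed by three octal digits.  DST is always
   NUL-terminated on success.  Returns the number of characters written
   (not counting the NUL), or (size_t)-1 if DST is too small.  */
size_t escape_asm_string(const unsigned char *src, size_t len,
                         char *dst, size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0)
        return (size_t)-1;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        int plain = c >= 0x20 && c < 0x7f && c != '"' && c != '\\';
        size_t need = plain ? 1 : 4;

        /* Reserve room for this item plus the terminating NUL.  */
        if (out + need + 1 > dstsize)
            return (size_t)-1;

        if (plain) {
            dst[out++] = (char)c;
        } else {
            /* Writes exactly four characters and a NUL, e.g. "\377".  */
            snprintf(dst + out, 5, "\\%03o", (unsigned)c);
            out += 4;
        }
    }
    dst[out] = '\0';
    return out;
}
```

With an approach like this the assembly output is pure ASCII whatever
the source encoding, which is the trade-off discussed above: it is safe
for any assembler, but non-ASCII text becomes unreadable.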
If GCC outputs bytes with the top bit on in assembly language
identifiers and strings, then at least safe encodings like UTF-8, ISO
8859, and EUC will yield the naturally desired assembly language
output. (Shift-JIS and other unsafe encodings may still yield
undesirable escapes in output, but this is no worse than the escapes
they already get.) I believe this is what is partly motivating
martin's proposed patch, and I'm sympathetic to this motivation.
   Date: Fri, 25 Dec 1998 03:09:25 -0500
   From: Richard Stallman <rms@gnu.org>

   the default mode should be not to convert, and in that case, GCC
   doesn't need to know what the encoding is (unless /u is used).

Even when not converting, GCC needs to know the input encoding if it's
an unsafe one like Shift-JIS or ISO-2022-JP (``unsafe'' meaning ``some
multibyte chars contain ASCII bytes'') -- otherwise GCC won't be able
to parse comments, strings, and identifiers correctly. Much (if not
most) east Asian text currently uses unsafe encodings, so this is not
a minor point.
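
To make the parsing problem concrete, here is a small C sketch.  It is
an illustration only, not GCC's lexer; the names sjis_is_lead and
find_string_end are hypothetical, and the lead-byte ranges are the ones
commonly given for Shift-JIS (vendor variants differ slightly).  In
Shift-JIS the second byte of a two-byte character may be an ASCII value
such as 0x5C (backslash), as in the pair 0x95 0x5C.  A scanner that
does not know the encoding takes that byte for the start of an escape
sequence.

```c
#include <stddef.h>

/* Shift-JIS lead bytes as commonly defined: 0x81..0x9F and 0xE0..0xFC.  */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/* Hypothetical scanner, for illustration only.  S points just past the
   opening '"' of a string literal.  Return the index of the closing
   '"', or LEN if none is found.  When SJIS is nonzero, each Shift-JIS
   lead byte is skipped together with its trail byte, so a trail byte
   equal to '\\' or '"' is never treated as syntax.  */
size_t find_string_end(const unsigned char *s, size_t len, int sjis)
{
    size_t i = 0;

    while (i < len) {
        if (sjis && sjis_is_lead(s[i]) && i + 1 < len)
            i += 2;             /* two-byte character */
        else if (s[i] == '\\' && i + 1 < len)
            i += 2;             /* backslash escape: skip escaped byte */
        else if (s[i] == '"')
            return i;           /* closing quote */
        else
            i++;
    }
    return len;
}
```

For the bytes 0x95 0x5C `"' ` ' `/' `"', the encoding-aware scan
(sjis nonzero) stops at index 2, the real closing quote.  The naive
scan (sjis zero) treats 0x5C as a backslash that escapes the quote and
runs on to index 5, swallowing source text that follows the string.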