This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: thoughts on martin's proposed patch for GCC and UTF-8

To: rms at gnu dot org
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert <eggert at twinsun dot com>
Date: Sat, 26 Dec 1998 00:34:38 -0800 (PST)
CC: zack at rabi dot columbia dot edu, bothner at cygnus dot com, amylaar at cygnus dot co dot uk, martin at mira dot isdn dot cs dot tu-berlin dot de, gcc2 at gnu dot org, egcs at cygnus dot com
References: <199812220245.SAA05358@cygnus.com> <199812220415.UAA08568@shade.twinsun.com> <199812250809.DAA05042@psilocin.gnu.org>

   Date: Fri, 25 Dec 1998 03:07:56 -0500
   From: Richard Stallman <rms@gnu.org>

       Even if we solve the mangling problem, though, the ASCII-only
       name-mangling method seems less useful than UTF-8 name mangling.
       Neither mangling method allows an arbitrary native encoding
       (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, 

   ASCII-only name mangling ought to achieve that.  Could you
   please explain why you think it will not?

Here's what I was thinking:

* Unsafe native encodings can't be used in assembly-language strings.
  The simplest way to handle this is to do what GCC currently does:
  escape non-ASCII bytes in assembly-language strings using notation
  like `\377'.

* Hence, if ASCII-only name mangling is also used, assembly language
  files will contain only ASCII, regardless of the input encoding.

* This will work, but it's unfriendly for non-English writers, because
  it means that assembly language uses ASCII instead of the native
  encoding -- i.e. the native encoding isn't being used uniformly in
  both source and assembly language output.  E.g. suppose we have the
  following code:

	const char message[] = "contents";

  except that the words `message' and `contents' are in Japanese.  A
  Japanese reader would naturally desire to see something like the
  following assembly language output:

	message:
		.asciz	"contents"

  except, of course, the words `message' and `contents' would be in
  Japanese.  Unfortunately, though, with ASCII name mangling, and with
  string mangling as described above, the Japanese reader will see
  something like the following instead:

	.x8c.x32.x9c.x41.x91.x32.xac.x90:
		.asciz "\200 \x309!\x240@\x201\\\x300\""

  which is painful to work with.
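
To make the two ASCII-only transformations above concrete, here is a
minimal sketch in Python.  The function names are hypothetical, and the
`.xNN' identifier scheme merely imitates the illustration above (unlike
that illustration, it leaves ASCII letters and digits alone); it is not
GCC's actual mangling.  The string escaping follows the `\377' octal
notation mentioned in the first point.

```python
def escape_asm_string(data: bytes) -> str:
    """Escape raw bytes for an assembler string so the result is pure ASCII."""
    out = []
    for b in data:
        if b in (0x22, 0x5C):        # '"' and '\' get a backslash prefix
            out.append("\\" + chr(b))
        elif 0x20 <= b < 0x7F:       # printable ASCII passes through unchanged
            out.append(chr(b))
        else:                        # anything else becomes a 3-digit octal escape
            out.append("\\%03o" % b)
    return "".join(out)


def mangle_ascii(data: bytes) -> str:
    """Hypothetical ASCII-only identifier mangling in the `.xNN' style."""
    out = []
    for b in data:
        if b < 0x80 and (chr(b).isalnum() or b == 0x5F):
            out.append(chr(b))       # ASCII letters, digits, and '_' are kept
        else:
            out.append(".x%02x" % b) # every other byte is spelled out in hex
    return "".join(out)


# Assumed input: Japanese words encoded as EUC-JP, one of the safe encodings.
name = "メッセージ".encode("euc_jp")
text = "内容".encode("euc_jp")
print(mangle_ascii(name) + ":")
print('\t.asciz\t"' + escape_asm_string(text) + '"')
```

The output contains only ASCII, which is exactly what makes it portable
and also what makes it unreadable to the Japanese reader.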

If GCC outputs bytes with the top bit on in assembly language
identifiers and strings, then at least safe encodings like UTF-8, ISO
8859, and EUC will yield the naturally desired assembly language
output.  (Shift-JIS and other unsafe encodings may still yield
undesirable escapes in output, but this is no worse than the escapes
they already get.)  I believe this is what is partly motivating
martin's proposed patch, and I'm sympathetic to this motivation.
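
The property that makes an encoding safe here can be checked
mechanically.  The following Python sketch (the helper name is
hypothetical, and it tests a given piece of text rather than proving
anything about an encoding as a whole) reports whether every non-ASCII
character encodes to bytes that all have the top bit on:

```python
def high_bit_only(text: str, encoding: str) -> bool:
    """True if each non-ASCII character of `text` encodes to bytes >= 0x80 only."""
    for ch in text:
        if ord(ch) < 0x80:
            continue                      # plain ASCII is not of interest here
        if any(b < 0x80 for b in ch.encode(encoding)):
            return False                  # an ASCII byte hides inside this character
    return True


for enc in ("utf-8", "euc_jp", "shift_jis", "iso2022_jp"):
    print(enc, high_bit_only("表", enc))
```

When the check holds, a tool that treats top-bit bytes as opaque
identifier or string content cannot mistake part of a character for a
quote, backslash, or newline.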

   Date: Fri, 25 Dec 1998 03:09:25 -0500
   From: Richard Stallman <rms@gnu.org>

   the default mode should be not to convert, and in that case, GCC
   doesn't need to know what the encoding is (unless /u is used).

Even when not converting, GCC needs to know the input encoding if it's
an unsafe one like Shift-JIS or ISO-2022-JP (``unsafe'' meaning ``some
multibyte chars contain ASCII bytes'') -- otherwise GCC won't be able
to parse comments, strings, and identifiers correctly.  Much (if not
most) East Asian text currently uses unsafe encodings, so this is not
a minor point.
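
A small Python sketch of the parsing problem, under stated assumptions:
both scanner functions are hypothetical and far simpler than a real
lexer, and the lead-byte ranges are the usual Shift-JIS ones
(0x81-0x9F and 0xE0-0xFC).  The character used has 0x5C, the ASCII
backslash, as its second Shift-JIS byte.

```python
def naive_literal_end(src: bytes, start: int) -> int:
    """Find the closing quote of the string literal opening at `start`,
    treating every byte as an ASCII character.  Returns -1 if none is found."""
    i = start + 1
    while i < len(src):
        if src[i] == 0x5C:        # backslash: skip the byte it escapes
            i += 2
        elif src[i] == 0x22:      # unescaped double quote closes the literal
            return i
        else:
            i += 1
    return -1


def sjis_literal_end(src: bytes, start: int) -> int:
    """Same scan, but step over two-byte Shift-JIS characters as a unit."""
    i = start + 1
    while i < len(src):
        b = src[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            i += 2                # lead byte: the trail byte is not ASCII syntax
        elif b == 0x5C:
            i += 2
        elif b == 0x22:
            return i
        else:
            i += 1
    return -1


src = 'x = "表";'.encode("shift_jis")
start = src.index(b'"')
print(naive_literal_end(src, start), sjis_literal_end(src, start))
```

The naive scan takes the trail byte for a backslash, skips the real
closing quote, and never finds the end of the literal; the scan that
knows the encoding does.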

