This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: thoughts on martin's proposed patch for GCC and UTF-8

To: bothner at cygnus dot com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert <eggert at twinsun dot com>
Date: Tue, 22 Dec 1998 02:34:37 -0800 (PST)
CC: gcc2 at gnu dot org, egcs at cygnus dot com
References: <199812220502.VAA10296@cygnus.com>

   Date: Mon, 21 Dec 1998 21:02:31 -0800
   From: Per Bothner <bothner@cygnus.com>

   > (3) GCC transliterates each \u escape in a string to the string's charset,
   >     which is specified as described in (1) above.

   Hm.  (1) above specifies the *file's* charset.  It does not follow
   that the *string's* charset is the same.  Certainly for Java, it
   would not be.

(1) also specifies the string's charset in C, because you can switch
charsets in the middle of a file e.g. with _Pragma ("charset Shift_JIS")
or whatever.

   What happens to:
	   wchar_t x = '\u1234';  /* or:  L'\u1234' */
   are these different from:
	   wchar_t x = (wchar_t) 0x1234;

Yes, e.g. the string's charset might specify JIS for wide characters.
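
To make the difference concrete, here is a rough sketch in C.  It
assumes the wide-string charset is JIS X 0208; the function name
translit_ucn_to_jis and the two-entry table are hypothetical
illustrations, not anything in GCC or in Martin's patch.  Under that
assumption a \u escape is looked up in the table, whereas a cast such
as (wchar_t) 0x1234 keeps its numeric value untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-entry excerpt of a Unicode -> JIS X 0208 table.
   A real table would have thousands of entries.  */
struct ucs_to_jis { uint32_t ucs; uint16_t jis; };

static const struct ucs_to_jis ucs_jis_table[] = {
    { 0x3042, 0x2422 },  /* HIRAGANA LETTER A */
    { 0x3044, 0x2424 },  /* HIRAGANA LETTER I */
};

/* Transliterate the code point named by a \u escape into the assumed
   wide-string charset.  Returns 0 when this excerpt has no mapping.  */
uint16_t translit_ucn_to_jis(uint32_t ucs)
{
    size_t n = sizeof ucs_jis_table / sizeof ucs_jis_table[0];
    for (size_t i = 0; i < n; i++)
        if (ucs_jis_table[i].ucs == ucs)
            return ucs_jis_table[i].jis;
    return 0;
}
```

So with this assumed charset, L'\u3042' would be stored as 0x2422,
while (wchar_t) 0x3042 would still be 0x3042.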

   I assume your proposal is that the string charset at least
   by default should be the file charset except for Java where
   the string charset is Unicode.

Yes.

   > If the input character set is a superset of UTF-8
   > (e.g. ISO-2022-JP), then the extra information is lost.

   I'm confused.  I thought that Unicode was specifically designed
   so that distinct characters in existing Japanese character
   standards were mapped into distinct Unicode characters.
   Did I misunderstand, or is ISO-2022-JP not one of the "source"
   character sets the Unicode designers used?

You understood correctly.  To some extent, ISO-2022-JP and Unicode are
competing standards.  ISO-2022-JP distinguishes between (say) the
Japanese and Chinese forms of the same character, whereas Unicode does
not.

Right now, my impression is that ISO-2022-JP is used more often in
the Japanese world than Unicode is.  This is certainly true for email.
Microsoft is pushing Unicode mightily in the DOS and NT domains,
though.

There is little call for distinguishing Chinese from Japanese in
identifiers.  So it's OK if GCC supports only the Unicode ``subset''
of ISO-2022-JP in identifiers.

If there are ISO-2022-JP partisans who are disturbed by this part of
my proposal, then I have some reassurance for them.  Rumor has it that
ISO 10646 might be officially extended so that it will become a
functional superset of ISO-2022-JP.  (This is the ``plane-14''
language-tagging effort.)  This will require more than 16 bits per
character, so it won't be Unicode, and presumably Java char and string
won't support it (unless Java is also extended); but C and C++ will
support plane-14, because they already have \u escapes for 32-bit
characters, and allow UTF-8 implementations (which also support
32-bit chars).  If and when the plane-14 proposal becomes a standard,
then C and C++ could distinguish between Chinese and Japanese in
identifiers under my proposal.
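
To illustrate the size argument, here is a minimal sketch of a UTF-8
encoder in C.  It assumes the original 1-to-6-byte UTF-8 scheme, which
covers values up to 0x7FFFFFFF; the function name utf8_encode is my
own, and the code does no checking beyond the range test.  Plane 14
occupies 0xE0000 through 0xEFFFF, so any code point there exceeds
0xFFFF and needs four bytes in UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one ISO 10646 code point as UTF-8, using the original
   1-to-6-byte scheme.  Writes to out, which must hold 6 bytes.
   Returns the number of bytes written, or 0 if c is out of range.  */
size_t utf8_encode(uint32_t c, unsigned char out[6])
{
    size_t n;            /* total length of the encoding */
    unsigned char lead;  /* marker bits of the first byte */

    if (c < 0x80) { out[0] = (unsigned char) c; return 1; }
    else if (c < 0x800)       { n = 2; lead = 0xC0; }
    else if (c < 0x10000)     { n = 3; lead = 0xE0; }
    else if (c < 0x200000)    { n = 4; lead = 0xF0; }
    else if (c < 0x4000000)   { n = 5; lead = 0xF8; }
    else if (c < 0x80000000u) { n = 6; lead = 0xFC; }
    else return 0;

    /* Fill continuation bytes from the end, six bits at a time.  */
    for (size_t i = n - 1; i > 0; i--) {
        out[i] = (unsigned char) (0x80 | (c & 0x3F));
        c >>= 6;
    }
    /* Whatever bits remain go into the lead byte.  */
    out[0] = (unsigned char) (lead | c);
    return n;
}
```

For example, utf8_encode (0x1234, buf) yields the three bytes
E1 88 B4, and utf8_encode (0xE0001, buf) yields the four bytes
F3 A0 80 81.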

Isn't internationalization fun?

References:
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Per Bothner
