This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: thoughts on martin's proposed patch for GCC and UTF-8
- To: bothner at cygnus dot com
- Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
- From: Paul Eggert <eggert at twinsun dot com>
- Date: Tue, 22 Dec 1998 02:34:37 -0800 (PST)
- CC: gcc2 at gnu dot org, egcs at cygnus dot com
- References: <199812220502.VAA10296@cygnus.com>
Date: Mon, 21 Dec 1998 21:02:31 -0800
From: Per Bothner <bothner@cygnus.com>
> (3) GCC transliterates each \u escape in a string to the string's charset,
> which is specified as described in (1) above.
Hm. (1) above specifies the *file's* charset. It does not follow
that the *string's* charset is the same. Certainly for Java, it
would not be.
(1) also specifies the string's charset in C, because you can switch
charsets in the middle of a file e.g. with _Pragma ("charset Shift_JIS")
or whatever.
What happens to:
wchar_t x = '\u1234'; /* or: L'\u1234' */
are these different from:
wchar_t x = (wchar_t) 0x1234;
Yes, e.g. the string's charset might specify JIS for wide characters.
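To make the distinction concrete, here is a minimal sketch in C of the behavior being described. It is not GCC code: the enum, the two-entry table, and the function names are all hypothetical, and the table holds only two sample Unicode-to-JIS X 0208 mappings. It assumes a \u escape is transliterated into the wide charset, while a cast is plain integer conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wide-character charsets; the names are illustrative only. */
enum wide_charset { WIDE_UNICODE, WIDE_JIS_X_0208 };

/* A tiny sample of Unicode -> JIS X 0208 mappings, not a full table. */
struct mapping { uint32_t ucs; uint32_t jis; };

static const struct mapping sample_table[] = {
    { 0x3042, 0x2422 },  /* HIRAGANA LETTER A */
    { 0x3044, 0x2424 },  /* HIRAGANA LETTER I */
};

/* Value a \u escape would take once transliterated to the given charset.
   Returns 0 when the sample table has no mapping for the code point. */
uint32_t translit_u_escape(uint32_t ucs, enum wide_charset cs)
{
    if (cs == WIDE_UNICODE)
        return ucs;
    for (size_t i = 0; i < sizeof sample_table / sizeof sample_table[0]; i++)
        if (sample_table[i].ucs == ucs)
            return sample_table[i].jis;
    return 0;
}

/* A cast such as (wchar_t) 0x3042 is integer conversion; the charset
   plays no part, so the number comes back unchanged. */
uint32_t cast_value(uint32_t n, enum wide_charset cs)
{
    (void) cs;  /* deliberately unused */
    return n;
}
```

Under these assumptions, translit_u_escape (0x3042, WIDE_JIS_X_0208) yields 0x2422, whereas cast_value (0x3042, WIDE_JIS_X_0208) stays 0x3042; the two agree only when the wide charset is Unicode.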
I assume your proposal is that the string charset at least
by default should be the file charset except for Java where
the string charset is Unicode.
Yes.
> If the input character set is a superset of UTF-8
> (e.g. ISO-2022-JP), then the extra information is lost.
I'm confused. I thought that Unicode was specifically designed
so that distinct characters in existing Japanese character
standards were mapped into distinct Unicode characters.
Did I misunderstand, or is ISO-2022-JP not one of the "source"
character sets the Unicode designers used?
You understood correctly. To some extent, ISO-2022-JP and Unicode are
competing standards. ISO-2022-JP distinguishes between (say) the
Japanese and Chinese forms of the same character, whereas Unicode does
not.
Right now, my impression is that ISO-2022-JP is used more often in
the Japanese world than Unicode is. This is certainly true for email.
Microsoft is pushing Unicode mightily in the DOS and NT domains,
though.
There is little call for distinguishing Chinese from Japanese in
identifiers. So it's OK if GCC supports only the Unicode ``subset''
of ISO-2022-JP in identifiers.
If there are ISO-2022-JP partisans who are disturbed by this part of
my proposal, then I have some reassurance for them. Rumor has it that
ISO 10646 might be officially extended so that it will become a
functional superset of ISO-2022-JP. (This is the ``plane-14''
language-tagging effort.) This will require more than 16 bits per
character, so it won't be Unicode, and presumably Java char and string
won't support it (unless Java is also extended); but C and C++ will
support plane-14, because they already have \u escapes for 32-bit
characters, and allow UTF-8 implementations (which also support
32-bit chars). If and when the plane-14 proposal becomes a standard,
then C and C++ could distinguish between Chinese and Japanese in
identifiers under my proposal.
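As a rough illustration of why plane 14 does not fit in 16 bits yet poses no problem for UTF-8, here is a small encoder sketch in C. The function names are hypothetical, U+E0001 is used only as a sample plane-14 code point, and the encoder follows the four-byte form of UTF-8 that tops out at U+10FFFF (an assumption; the longer forms permitted by older definitions are omitted).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[]. Returns the number of bytes
   written, or 0 for a surrogate or a value above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;  /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Nonzero when the code point cannot be held in a single 16-bit unit. */
int needs_more_than_16_bits(uint32_t cp)
{
    return cp > 0xFFFF;
}
```

With this sketch, utf8_encode (0xE0001, buf) writes the four bytes F3 A0 80 81, and needs_more_than_16_bits (0xE0001) is nonzero, which is the situation described above: a single 16-bit char cannot hold the value, but a UTF-8 byte stream can.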
Isn't internationalization fun?