This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: thoughts on martin's proposed patch for GCC and UTF-8
- To: bothner at cygnus dot com
- Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
- From: Martin von Loewis <martin at mira dot isdn dot cs dot tu-berlin dot de>
- Date: Mon, 28 Dec 1998 17:08:16 +0100
- CC: eggert at twinsun dot com, gcc2 at gnu dot org, egcs at cygnus dot com
- References: <199812220502.VAA10296@cygnus.com>
> I'm confused. I thought that Unicode was specifically designed
> so that dictinct characters in existing Japanese character
> standards were mapped into distinct Unicode characters.
Paul already answered that; I'd like to add to it from a different angle.
ISO 2022 uses escape sequences to switch between different character
sets. ISO-2022-JP combines four different character sets in this way.
Now, the character sets can overlap. In such cases, Unicode typically
unifies the overlapping characters, whereas ISO 2022 leaves them
as-is.
The argument is about which of the two is the right thing. For example,
there are four encodings for "LATIN CAPITAL LETTER A":
    ESC ( B A      (ASCII)
    ESC ( J A      (JIS X 0201)
    ESC $ @ # A    (JIS X 0208-1978)
    ESC $ B # A    (JIS X 0208-1983) (*)
Unicode has only one character here (U+0041). In other places, Unicode
probably was wrong to unify (Han Unification).
Not that I want to push a particular solution, but if the text is
converted to Unicode and encoded in UTF-8, we would get the following
for all four encodings:
A
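
To make the comparison concrete, here is a minimal Python sketch of a toy
decoder that tracks the four escape sequences listed above and applies the
unified view described in this message. The names DESIGNATIONS, UNIFIED,
decode_unified and SAMPLES are hypothetical, and the table covers only the
one character under discussion. It is an assumption taken from the message
that all four forms map to U+0041; real conversion tables may instead map
the two JIS X 0208 forms to the fullwidth letter U+FF21.

```python
ESC = b"\x1b"

# The four designations named in the message: escape sequence ->
# (character set label, bytes per character in that set).
DESIGNATIONS = {
    ESC + b"(B": ("ASCII", 1),
    ESC + b"(J": ("JIS X 0201", 1),
    ESC + b"$@": ("JIS X 0208-1978", 2),
    ESC + b"$B": ("JIS X 0208-1983", 2),
}

# Toy mapping table (assumption): every form of the letter is unified
# to U+0041, as the message describes. Only this one character is covered.
UNIFIED = {
    ("ASCII", b"A"): "\u0041",
    ("JIS X 0201", b"A"): "\u0041",
    ("JIS X 0208-1978", b"#A"): "\u0041",
    ("JIS X 0208-1983", b"#A"): "\u0041",
}


def decode_unified(data: bytes) -> str:
    """Decode a byte string, switching character sets on each escape."""
    charset, width = DESIGNATIONS[ESC + b"(B"]  # start out in ASCII
    out = []
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            # A three-byte escape sequence selects the active character set.
            charset, width = DESIGNATIONS[data[i:i + 3]]
            i += 3
            continue
        out.append(UNIFIED[(charset, data[i:i + width])])
        i += width
    return "".join(out)


SAMPLES = [ESC + b"(BA", ESC + b"(JA", ESC + b"$@#A", ESC + b"$B#A"]

for sample in SAMPLES:
    utf8 = decode_unified(sample).encode("utf-8")
    print(sample.hex(" "), "->", utf8.hex())
```

Because the designation persists until the next escape sequence, a second
"# A" following ESC $ B needs no escape of its own, which is the point of
the footnote (*).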
Regards,
Martin
(*) Somebody correct me if my tables are wrong. The three-byte
escape sequence can be omitted if the preceding characters are already
in this encoding.