This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: thoughts on martin's proposed patch for GCC and UTF-8


> I'm confused.  I thought that Unicode was specifically designed
> so that distinct characters in existing Japanese character
> standards were mapped into distinct Unicode characters.

Paul already answered that; I'd like to add a different angle.

ISO 2022 uses escape sequences to switch between different character
sets. ISO-2022-JP combines four different character sets in this way.
Now, there are potential overlaps between the character sets. In
such cases, Unicode typically unifies the overlapping characters,
whereas ISO 2022 leaves them as-is.

The argument is about which approach is the right one. For example,
there are four encodings for "LATIN CAPITAL LETTER A":
ESC ( B A         (ASCII)
ESC ( J A         (JIS X 0201)
ESC $ @ # A       (JIS X 0208-1978)
ESC $ B # A       (JIS X 0208-1983) (*)
Unicode has only one character here (U+0041). In other places, Unicode
was probably wrong to unify (Han Unification).

Not that I want to push a particular solution, but converted to Unicode
and encoded in UTF-8, we would get the following for all four encodings:
A
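
To make the comparison concrete, here is a minimal Python sketch. It
assumes Python's standard iso2022_jp codec accepts all four
designations listed above; the names SEQUENCES and to_utf8 are
hypothetical and only for illustration. Conversion tables commonly map
the JIS X 0208 letter to U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A)
rather than U+0041, so the sketch optionally applies NFKC normalization
to fold it. That folding is one possible reading of "unified", not
necessarily what any particular converter does.

```python
import unicodedata

ESC = b"\x1b"

# The four byte sequences from the list above, with the spaces removed.
SEQUENCES = {
    "ASCII": ESC + b"(B" + b"A",
    "JIS X 0201": ESC + b"(J" + b"A",
    "JIS X 0208-1978": ESC + b"$@" + b"#A",
    "JIS X 0208-1983": ESC + b"$B" + b"#A",
}


def to_utf8(raw: bytes, fold: bool = True) -> bytes:
    """Decode ISO-2022-JP bytes to Unicode, then re-encode as UTF-8.

    With fold=True, NFKC normalization is applied first, which turns
    compatibility characters such as a fullwidth letter into their
    ordinary counterparts (an assumption of this sketch).
    """
    text = raw.decode("iso2022_jp")
    if fold:
        text = unicodedata.normalize("NFKC", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    for name, raw in SEQUENCES.items():
        unfolded = to_utf8(raw, fold=False)
        folded = to_utf8(raw)
        print(f"{name:16} {raw!r:14} -> {unfolded!r} -> {folded!r}")
```

Running it prints, for each of the four sequences, the UTF-8 bytes
before and after folding, so one can see where a given conversion table
keeps the characters distinct.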

Regards,
Martin

(*) Somebody please correct me if my tables are wrong. The three-byte
escape sequence can be omitted if the preceding characters are already
in this encoding.

