This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
- To: "Nuesser, Wilhelm" <wilhelm dot nuesser at sap dot com>
- Subject: Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
- From: Jamie Lokier <egcs at tantalophile dot demon dot co dot uk>
- Date: Fri, 4 Aug 2000 15:31:01 +0200
- Cc: "'sap-list at redhat dot com'" <sap-list at redhat dot com>, "'gcc at gcc dot gnu dot org'" <gcc at gcc dot gnu dot org>, "'linux-utf8 at humbolt dot nl dot linux dot org'" <linux-utf8 at humbolt dot nl dot linux dot org>, "'libc-hacker at sources dot redhat dot com'" <libc-hacker at sources dot redhat dot com>, "Rohland, Hans-Christoph" <hans-christoph dot rohland at sap dot com>
- References: <816D93CCC927D31188570008C75D1DE1011A0BDF@dbwdfx1a.wdf.sap-ag.de>
Nuesser, Wilhelm wrote:
> PS: When UTF-8 is used, the complexity of variable width characters
> shows up with almost every commonly used language except pure 7-Bit
> ASCII. For a number of languages, the UTF-8 representation saves some
> storage when compared with UTF-16, but for Asian characters UTF-8
> requires 50% more storage than UTF-16. We do not consider UTF-8 as
> advantageous for text representation in the memory. It may be well
> suited for files where access is sequential but in general it is no
> universal solution.
The *complexity* of variable width characters shows up with UTF-16 too,
because characters outside the Basic Multilingual Plane need a surrogate
pair (two 16-bit units). So although space concerns may be a good reason
to choose UTF-16 for external representations (on disk), within a
program UTF-32 is simple and UTF-8 and UTF-16 are both more complex.
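
To illustrate the variable-width point, here is a minimal sketch of
decoding one character from UTF-16. The name utf16_decode is made up for
this example and is not a glibc or gcc function. It only shows that a
reader of UTF-16 must already handle both one-unit and two-unit
characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Decode one code point from a buffer of n UTF-16 code units.
   Returns the number of units consumed (1 or 2), or 0 if the
   input is empty, truncated, or has an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        /* Ordinary BMP character: a single 16-bit unit. */
        *cp = hi;
        return 1;
    }
    if (hi > 0xDBFF || n < 2)
        return 0;               /* stray low surrogate, or truncated pair */

    uint16_t lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;               /* high surrogate without a low surrogate */

    /* Combine the two 10-bit halves and add the 0x10000 offset. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                     | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

For example, the two units 0xD834 0xDD1E decode to the single code point
U+1D11E, so indexing a UTF-16 array by unit does not index it by
character.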
UTF-8 has the advantage that there is no endianness ambiguity, and it
has some other nice lexical properties (for example, ASCII bytes never
appear inside a multi-byte character).
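
As a rough sketch of what such lexical properties can look like in
practice (the choice of properties is my assumption about what is meant
here, and both function names are hypothetical): byte-wise comparison of
UTF-8 strings gives code point order, and searching for an ASCII byte
can never match in the middle of a multi-byte character. So the ordinary
C string functions keep working.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: compare two NUL-terminated UTF-8 strings in
   code point order. strcmp compares bytes as unsigned char, and the
   UTF-8 layout makes that order agree with code point order. */
int utf8_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Hypothetical helper: find an ASCII delimiter in UTF-8 text.
   Bytes below 0x80 only ever encode ASCII characters, so strchr
   cannot produce a false match inside a multi-byte sequence. */
const char *utf8_find_ascii(const char *s, char delim)
{
    assert(s != NULL && (unsigned char)delim < 0x80);
    return strchr(s, delim);
}
```

Neither function needs to know the byte order of the machine, because
UTF-8 is defined as a sequence of bytes rather than of 16-bit or 32-bit
units.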
This is why UTF-8 is the standard "unix" representation of large chars.
(Space is not a significant issue, provided you compress your text
files. Compressed UTF-8 should take about the same space as compressed
UTF-16.)
Therefore, it is good to have conversion functions between UTF-8, UTF-16
and UTF-32. It is perhaps a nice extension to have the compiler able to
parse UTF-16 and UTF-32 constant strings.
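
A conversion function of the kind meant above might look roughly like
this sketch, which encodes one UTF-32 code point as UTF-8. The name
utf8_encode is hypothetical, and the sketch assumes the current
definition of UTF-8 (at most four bytes, nothing above U+10FFFF, no
surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Combined with a UTF-16 decoder, a loop over this function is enough to
convert UTF-16 or UTF-32 data to UTF-8 before handing it to the usual
printf family.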
But I don't see the point in an extensive set of printfU16-style
functions. Standard unix text files use UTF-8 (or, unfortunately, often
ISO-8859-1). Non-standard formats like databases may use UTF-16, but
programs don't use printf to write to a database.
Btw, I prefer "UTF16" or "utf16" instead of "U16" ;-)
have a nice day,
-- Jamie