This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: Proposal for 2 Byte Unicode implementation in gcc and glibc

To: "Nuesser, Wilhelm" <wilhelm dot nuesser at sap dot com>
Subject: Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
From: Jamie Lokier <egcs at tantalophile dot demon dot co dot uk>
Date: Fri, 4 Aug 2000 15:31:01 +0200
Cc: "'sap-list at redhat dot com'" <sap-list at redhat dot com>, "'gcc at gcc dot gnu dot org'" <gcc at gcc dot gnu dot org>, "'linux-utf8 at humbolt dot nl dot linux dot org'" <linux-utf8 at humbolt dot nl dot linux dot org>, "'libc-hacker at sources dot redhat dot com'" <libc-hacker at sources dot redhat dot com>, "Rohland, Hans-Christoph" <hans-christoph dot rohland at sap dot com>
References: <816D93CCC927D31188570008C75D1DE1011A0BDF@dbwdfx1a.wdf.sap-ag.de>

Nuesser, Wilhelm wrote:
> PS: When UTF-8 is used, the complexity of variable width characters
> shows up with almost every commonly used language except pure 7-Bit
> ASCII. For a number of languages, the UTF-8 representation saves some
> storage when compared with UTF-16, but for Asian characters UTF-8
> requires 50% more storage than UTF-16. We do not consider UTF-8 as
> advantageous for text representation in the memory. It may be well
> suited for files where access is sequential but in general it is no
> universal solution.

The *complexity* of variable width characters shows up with UTF-16 too:
characters outside the 16-bit range need two code units (a surrogate
pair).  So although space concerns may be a good reason to choose UTF-16
for external representations (on disk), within a program UTF-32 is
simple and UTF-8 and UTF-16 are both more complex.
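
To make the UTF-16 point concrete, here is a minimal sketch of decoding
one character from UTF-16.  The name utf16_decode is hypothetical (it is
not a glibc or gcc function), and the error handling is only an
assumption about what a caller might want:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the UTF-16 code units s[0..n).
   Stores the code point in *cp and returns the number of 16-bit units
   consumed (1 or 2), or 0 if the input is empty or starts with an
   unpaired surrogate.  Hypothetical helper, for illustration only. */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    uint16_t u = s[0];
    if (u < 0xD800 || u > 0xDFFF) {
        /* Ordinary 16-bit character: one unit. */
        *cp = u;
        return 1;
    }
    if (u <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* High surrogate followed by low surrogate: two units. */
        *cp = 0x10000 + ((((uint32_t)u - 0xD800) << 10)
                         | ((uint32_t)s[1] - 0xDC00));
        return 2;
    }
    return 0;   /* unpaired surrogate */
}
```

Any code that indexes, truncates or steps through a UTF-16 string has to
go through something like this, so s[i] is not "character i" any more
than it is with UTF-8.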

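UTF-8 has the advantage that there is no endianness ambiguity, and it
has some other nice lexical properties: ASCII bytes never occur inside a
multibyte sequence, and bytewise comparison sorts strings in code point
order.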
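
One such lexical property is that lead bytes and continuation bytes
occupy disjoint ranges, so you can find character boundaries by looking
at single bytes.  A rough sketch, assuming the input is valid UTF-8
(utf8_length is a hypothetical name, and no validation is done):

```c
#include <assert.h>
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string by counting
   every byte that is not a continuation byte (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8.  Hypothetical helper. */
size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The loop works on plain bytes, so it behaves identically on big-endian
and little-endian machines and never needs a byte order mark.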

This is why UTF-8 is the standard "unix" representation of wide chars.
(Space is not a significant issue, provided you compress your text
files.  Compressed UTF-8 should take about the same space as compressed
UTF-16.)

Therefore, it is good to have conversion functions between UTF-8, UTF-16
and UTF-32.  It is perhaps a nice extension to have the compiler able to
parse UTF-16 and UTF-32 constant strings.
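
As a sketch of what one of those conversion functions might look like,
here is UTF-32 to UTF-8 for a single code point.  The name utf8_encode
and its interface are hypothetical, and I am assuming the code point
range U+0000..U+10FFFF (the range UTF-16 can reach), so at most four
bytes are produced:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode the code point cp as UTF-8 into out, which must have room for
   4 bytes.  Returns the number of bytes written, or 0 if cp is a
   surrogate or lies above U+10FFFF.  Hypothetical helper. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The three-byte branch is where the "50% more storage" figure for Asian
text comes from: a character such as U+4E2D takes three bytes here but
one 16-bit unit in UTF-16.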

But I don't see the point in an extensive set of printfU16-style
functions.  Standard unix text files use UTF-8 (or, unfortunately, often
ISO-8859-1).  Non-standard formats like databases may use UTF-16, but
databases don't use printf to write to the database.

Btw, I prefer "UTF16" or "utf16" instead of "U16" ;-)

have a nice day,
-- Jamie

Follow-Ups:
- Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  - From: Christoph Rohland

References:
- Proposal for 2 Byte Unicode implementation in gcc and glibc
  - From: Nuesser, Wilhelm
