This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Proposal for 2 Byte Unicode implementation in gcc and glibc


Nuesser, Wilhelm wrote:
> PS: When UTF-8 is used, the complexity of variable width characters
> shows up with almost every commonly used language except pure 7-Bit
> ASCII. For a number of languages, the UTF-8 representation saves some
> storage when compared with UTF-16, but for Asian characters UTF-8
> requires 50% more storage than UTF-16. We do not consider UTF-8 as
> advantageous for text representation in the memory. It may be well
> suited for files where access is sequential but in general it is no
> universal solution.

The *complexity* of variable width characters shows up with UTF-16 too:
characters outside the Basic Multilingual Plane take two 16-bit code
units (a surrogate pair).  So although space concerns may be a good
reason to choose UTF-16 for external representations (on disk), within a
program UTF-32 is simple and UTF-8/UTF-16 are both more complex.
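
To illustrate the point, here is a rough sketch of what reading one
character from a UTF-16 buffer involves.  The function name
utf16_decode and its return convention are made up for this example
(they are not part of gcc or glibc), and the buffer is assumed to hold
code units in native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from a buffer of
   native-endian UTF-16 code units.  Returns the number of code units
   consumed (1 or 2), or 0 if the input is empty or malformed. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *out)
{
    assert(s != NULL && out != NULL);

    if (len == 0)
        return 0;

    uint16_t u = s[0];

    /* Anything outside the surrogate range is a complete character. */
    if (u < 0xD800 || u > 0xDFFF) {
        *out = u;
        return 1;
    }

    /* A high surrogate must be followed by a low surrogate. */
    if (u <= 0xDBFF && len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *out = 0x10000
             + (((uint32_t)(u - 0xD800) << 10) | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }

    /* Lone low surrogate, or high surrogate without a partner. */
    return 0;
}
```

So code that indexes a UTF-16 string by code unit has the same kind of
"where does this character start" problem as UTF-8 code has, just less
often.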

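UTF-8 has the advantage that there is no endianness ambiguity, and has
some other nice lexical properties: ASCII bytes never occur inside a
multibyte sequence, and comparing strings byte by byte gives the same
order as comparing them by code point.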
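
A small sketch of the endianness problem, for what it is worth.  The
two function names below are invented for this example.  The same
UTF-16 code unit has two possible byte serializations, so a reader
needs to know (or guess) which one a file uses; a UTF-8 byte sequence
has only one form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: write one UTF-16 code unit as two bytes,
   most significant byte first (big-endian) or last (little-endian). */
void utf16_store_be(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

void utf16_store_le(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}
```

For U+20AC (the euro sign) the big-endian form is the bytes 20 AC and
the little-endian form is AC 20, whereas its UTF-8 form is always
E2 82 AC.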

This is why UTF-8 is the standard "unix" representation of large chars.
(Space is not a significant issue, provided you compress your text
files.  Compressed UTF-8 should take about the same space as compressed
UTF-16.)

Therefore, it is good to have conversion functions between UTF-8, UTF-16
and UTF-32.  It is perhaps a nice extension to have the compiler able to
parse UTF-16 and UTF-32 constant strings.
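
As an example of the kind of conversion function I mean, here is a
minimal sketch of the UTF-32 to UTF-8 direction for a single code
point.  The name utf8_encode and its return convention are
hypothetical, not an existing glibc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.  Returns the
   number of bytes written (1 to 4), or 0 if cp is a surrogate or is
   above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 16 bits: 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A UTF-16 to UTF-8 converter could be built by first combining any
surrogate pair into a single code point and then calling something
like this.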

But I don't see the point in an extensive set of printfU16-style
functions.  Standard unix text files use UTF-8 (or, unfortunately, often
ISO-8859-1).  Non-standard formats such as databases may use UTF-16, but
databases don't use printf to write to the database.

Btw, I prefer "UTF16" or "utf16" instead of "U16" ;-)

have a nice day,
-- Jamie
