This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Unicode (or any multibyte char support)


> Cc: gcc-patches@gcc.gnu.org
> From: Per Bothner <per@bothner.com>
> Date: 16 Apr 2000 17:39:09 -0700

> Geoff Keating <geoffk@cygnus.com> writes:
> 
> > You probably don't want to support Unicode, though.  Instead you want
> > to support the ISO version of the same thing, which has Unicode as a
> > subset and uses 32-bit characters.
> 
> (I just commented on something similar in the Guile mailing list ...)
> 
> As far as I know (please correct me if I'm wrong), the "ISO version of
> the same thing" is still just the 16-bit Unicode standard.  I.e.
> there are no standard characters whose values are >= 2**16.  However,
> that will change.  At that point, Unicode will be ready for those rare
> characters, because it defines an extension mechanism called "surrogate
> characters", where two 16-bit characters can encode character values
> up to 2**20.  This is much more than anyone has been contemplating.
> Thus there seems to be no good reason to use more than 16 bits for wchar_t.

It's not clear to me that this can work.  For instance,
does 'iswalpha()' on one of the surrogate characters return true or
false?
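
To make the question concrete, here is a minimal sketch. It assumes the
Unicode surrogate ranges (D800-DBFF for the first half, DC00-DFFF for
the second); the helper names are hypothetical and not part of any
standard library. The point is that classification only makes sense on
the combined value, which a 16-bit wchar_t cannot hold, and iswalpha()
only ever sees one wint_t at a time.

```c
#include <stdint.h>
#include <wctype.h>

/* Hypothetical helpers, for illustration only. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a surrogate pair into the single character value it encodes. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10)
                       | (uint32_t)(lo - 0xDC00));
}

/* Classify the pair by its combined value, not by either half.
   This is only meaningful where wint_t is wide enough to hold the
   combined value and the current locale knows about that character. */
int pair_is_alpha(uint16_t hi, uint16_t lo)
{
    return iswalpha((wint_t)combine_surrogates(hi, lo)) != 0;
}
```

What iswalpha() returns for a lone 0xD800 is left to the implementation
and locale, which is exactly the problem.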

I think the standard also requires that L"\U12345678" is one character
long.
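
A small sketch of how one might check this on a given compiler. It uses
U+10400 instead of the value above, on the assumption that some
compilers reject universal character names beyond U+10FFFF; the
function name is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* One character that does not fit in 16 bits. */
static const wchar_t sample[] = L"\U00010400";

/* Number of wchar_t elements the compiler used for that one character. */
size_t sample_length(void)
{
    return wcslen(sample);
}
```

With a 32-bit wchar_t one would expect sample_length() to be 1; with a
16-bit wchar_t the compiler has to either reject the literal or emit
more than one element.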

> But you may well argue that if you use surrogate characters, you no
> longer have O(1) random-access from character indexes to memory
> locations.  ...

Then you don't need wchar_t.  C has perfectly good multibyte
facilities for 'char'; you can use UTF-8 (as you mention in the part I
cut).
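
As a rough illustration of those facilities, the sketch below walks a
multibyte 'char' string one character at a time with mbrlen(). The
function name is hypothetical, and the result depends on the current
locale: the caller would need setlocale() with a UTF-8 locale for UTF-8
input to be counted per character.

```c
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string, according to the
   current locale.  Returns (size_t)-1 on an invalid or truncated
   sequence. */
size_t count_mb_chars(const char *s)
{
    mbstate_t st;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* null character; not expected here */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```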

You also lose the property that a single character can be stored in a
wchar_t.  Many applications like to process a string one character at
a time.

There are also no standard facilities for saying whether a wchar_t is
part of a multicharacter sequence.  There are such facilities for char.

Finally, ISO 10646 is explicitly referenced from the new C standard,
and there is a facility (the __STDC_ISO_10646__ macro) for saying that
it is supported.
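
A minimal sketch of testing for that support at compile time. The
function name is hypothetical; the macro, where defined, is an integer
of the form yyyymmL naming the ISO 10646 revision that wchar_t values
follow.

```c
/* Returns the yyyymm value of __STDC_ISO_10646__, or 0 if the
   implementation does not define it. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return (long)__STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```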

-- 
- Geoffrey Keating <geoffk@cygnus.com>
