This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Unicode (or any multibyte char support)
- To: per at bothner dot com
- Subject: Re: Unicode (or any multibyte char support)
- From: Geoff Keating <geoffk at cygnus dot com>
- Date: Mon, 17 Apr 2000 12:26:30 -0700
- CC: gcc-patches at gcc dot gnu dot org
- References: <200004161917.MAA18809@mail.wrs.com> <jmhfd1r6ei.fsf@envy.cygnus.com> <m2snwl1ryq.fsf@kelso.bothner.com>
> Cc: gcc-patches@gcc.gnu.org
> From: Per Bothner <per@bothner.com>
> Date: 16 Apr 2000 17:39:09 -0700
> Geoff Keating <geoffk@cygnus.com> writes:
>
> > You probably don't want to support Unicode, though. Instead you want
> > to support the ISO version of the same thing, which has Unicode as a
> > subset and uses 32-bit characters.
>
> (I just commented on something similar in the Guile mailing list ...)
>
> As far as I know (please correct me if I'm wrong), the "ISO version of
> the same thing" is still just the 16-bit Unicode standard. I.e.
> there are no standard characters whose values are >= 2**16. However,
> that will change. At that point, Unicode will be ready for those rare
> characters, because it defines an extension mechanism called "surrogate
> characters", where two 16-bit characters can encode character values
> up to 2**20. This is much more than anyone has been contemplating.
> Thus there seems to be no good reason to use more than 16 bits for wchar_t.
It's not clear to me that this can work. For instance,
does 'iswalpha()' on one of the surrogate characters return true or
false?
I think the standard also requires that L"\U12345678" is one character
long.
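To make the surrogate objection concrete, here is a minimal sketch of the surrogate-pair arithmetic as UTF-16 defines it. The function names are hypothetical and not part of any standard library. It shows that a value above 0xFFFF becomes two 16-bit units, so with a 16-bit wchar_t it cannot be one character long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a code point in 0x10000..0x10FFFF
   into a high and a low surrogate (two 16-bit units). */
void encode_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                 /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Hypothetical helper: recombine a surrogate pair into a code point. */
uint32_t decode_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000
         + ((uint32_t)(high - 0xD800) << 10)
         + (uint32_t)(low - 0xDC00);
}
```

Neither half of the pair is a character on its own, which is why a per-unit query such as iswalpha() has no obviously right answer for it.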
> But you may well argue that if you use surrogate characters, you no
> longer have O(1) random-access from character indexes to memory
> locations. ...
Then you don't need wchar_t. C has perfectly good multibyte
facilities for 'char'; you can use UTF-8 (as you mention in the part I
cut).
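As an illustration of the multibyte facilities for 'char', the following sketch walks a string with mbrlen() and counts characters rather than bytes. The helper name count_mb_chars is an assumption made for this example; the result depends on the current locale's encoding, which the program must select itself (for instance with setlocale(LC_CTYPE, "")).

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in a multibyte string
   under the current locale. Returns (size_t)-1 if the string holds
   an invalid or truncated multibyte sequence. */
size_t count_mb_chars(const char *s)
{
    mbstate_t st;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (remaining > 0) {
        size_t len = mbrlen(s, remaining, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (len == 0)
            break;                      /* embedded null character */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

In the default "C" locale every byte of an ASCII string counts as one character; under a UTF-8 locale a multi-byte sequence should count once. Note that finding the n-th character this way is a linear scan, the same cost as with surrogates.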
You also lose the semantic that you can store a single character in a
wchar_t. Many applications like to process a string one character at
a time.
There are also no standard facilities for saying whether a wchar_t is
part of a multicharacter sequence. There are such facilities for char.
Finally, ISO 10646 is explicitly referenced from the new C standard,
and there are facilities for saying that it is supported.
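The message does not name the facility. One such facility in C99 is the predefined macro __STDC_ISO_10646__, which an implementation defines (as a yyyymmL date) when wchar_t values are ISO 10646 code points. A minimal sketch of testing for it follows; the function name is hypothetical.

```c
/* Hypothetical helper: return the value of __STDC_ISO_10646__
   (a date of the form yyyymmL), or 0 if the implementation does
   not claim that wchar_t holds ISO 10646 values. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return (long)__STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

A nonzero result says wchar_t holds ISO 10646 values as of the given revision date; a zero result means the implementation makes no such promise.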
--
- Geoffrey Keating <geoffk@cygnus.com>