Unicode (or any multibyte char support)
Per Bothner
per@bothner.com
Sun Apr 16 17:42:00 GMT 2000
Geoff Keating <geoffk@cygnus.com> writes:
> You probably don't want to support Unicode, though. Instead you want
> to support the ISO version of the same thing, which has Unicode as a
> subset and uses 32-bit characters.
(I just commented on something similar in the Guile mailing list ...)
As far as I know (please correct me if I'm wrong), the "ISO version of
the same thing" is still just the 16-bit Unicode standard. I.e.
there are no standard characters whose values are >= 2**16. However,
that will change. At that point, Unicode will be ready for those rare
characters, because it defines an extension mechanism called "surrogate
characters", where two 16-bit characters can encode character values
up to 2**20. This is much more than anyone has been contemplating.
Thus there seems to be no good reason to use more than 16 bits for wchar_t.
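To make the surrogate mechanism concrete, here is a minimal sketch
(mine, not anything from the thread) of how a character value above
0xFFFF is split into two 16-bit code units and recombined; the
function names are made up, but the constants are the standard
surrogate ranges:

#include <stdio.h>

static void
encode_surrogates (unsigned long c, unsigned short out[2])
{
  unsigned long v = c - 0x10000;        /* the 20 significant bits */
  out[0] = 0xD800 | (v >> 10);          /* high surrogate */
  out[1] = 0xDC00 | (v & 0x3FF);        /* low surrogate */
}

static unsigned long
decode_surrogates (unsigned short hi, unsigned short lo)
{
  return 0x10000 + ((((unsigned long) hi & 0x3FF) << 10) | (lo & 0x3FF));
}

int
main (void)
{
  unsigned short pair[2];
  encode_surrogates (0x10400, pair);
  printf ("%#x %#x -> %#lx\n", pair[0], pair[1],
          decode_surrogates (pair[0], pair[1]));
  return 0;
}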
But you may well argue that if you use surrogate characters, you no
longer have O(1) random-access from character indexes to memory
locations. True, but does it matter? I would argue that there is no
useful operation that needs O(1) random-access from character indexes.
What you may need is to be able to remember a position in a string,
and you can use "magic cookies" implemented as byte offsets for that
(see the sketch after this paragraph).
The only real reason you might need O(1) random-access to characters
is for legacy code, but such legacy code is not going to handle
complex character set issues well anyway. What about combining forms?
Normalization? Bi-directional text? Mixing "double-width" Kanji with
"single-width" characters? Handling any of these is likely to break
code that thinks of a string as an array of characters.
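Here is what the "magic cookie" idea might look like for a 16-bit
string with surrogates; this is just a sketch (the names str16_pos
and str16_next are mine): a remembered position is simply a code-unit
offset, and stepping forward skips a whole surrogate pair, so
sequential processing stays cheap even though indexing by character
number is no longer O(1):

#include <stddef.h>

typedef size_t str16_pos;   /* opaque "cookie": an offset in 16-bit units */

/* Advance past one character, which is either a single 16-bit unit
   or a high/low surrogate pair.  */
static str16_pos
str16_next (const unsigned short *s, str16_pos pos)
{
  if (s[pos] >= 0xD800 && s[pos] <= 0xDBFF)   /* high surrogate */
    return pos + 2;
  return pos + 1;
}

/* Count characters by walking forward; no random access needed.  */
static size_t
str16_length (const unsigned short *s, size_t nunits)
{
  size_t count = 0;
  str16_pos pos = 0;
  while (pos < nunits)
    {
      pos = str16_next (s, pos);
      count++;
    }
  return count;
}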
Of course, once we realize that string processing must move away from
the array-of-characters model, the virtue of Unicode's "all
characters are encoded in a fixed 16 bits" becomes questionable as well,
and you might as well go to the UTF-8 variable-width multi-byte
encoding ... Still, there is nothing fundamentally wrong with Unicode's
16-bit almost-fixed-width encoding; there is just no real advantage
to using it over UTF-8, except compatibility with other languages
(such as Java) or system libraries.
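For comparison, forward iteration over UTF-8 works the same way; the
byte offset again serves as the position cookie.  This is only a
sketch of the decoding rules (error checking omitted, and utf8_next
is my own name):

#include <stddef.h>

/* Decode the character starting at byte offset POS into *CH and
   return the offset of the next character.  */
static size_t
utf8_next (const unsigned char *s, size_t pos, unsigned long *ch)
{
  unsigned char b = s[pos];
  if (b < 0x80)                        /* 1 byte: 0xxxxxxx */
    {
      *ch = b;
      return pos + 1;
    }
  if ((b & 0xE0) == 0xC0)              /* 2 bytes: 110xxxxx 10xxxxxx */
    {
      *ch = ((b & 0x1F) << 6) | (s[pos + 1] & 0x3F);
      return pos + 2;
    }
  if ((b & 0xF0) == 0xE0)              /* 3 bytes: covers all 16-bit values */
    {
      *ch = ((b & 0x0F) << 12) | ((s[pos + 1] & 0x3F) << 6)
            | (s[pos + 2] & 0x3F);
      return pos + 3;
    }
  /* 4 bytes: the values that need surrogate pairs in the 16-bit form */
  *ch = (((unsigned long) (b & 0x07)) << 18)
        | ((unsigned long) (s[pos + 1] & 0x3F) << 12)
        | ((s[pos + 2] & 0x3F) << 6)
        | (s[pos + 3] & 0x3F);
  return pos + 4;
}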
--
--Per Bothner
per@bothner.com http://www.bothner.com/~per/