Unicode (or any multibyte char support)

Per Bothner per@bothner.com
Sun Apr 16 17:42:00 GMT 2000


Geoff Keating <geoffk@cygnus.com> writes:

> You probably don't want to support Unicode, though.  Instead you want
> to support the ISO version of the same thing, which has Unicode as a
> subset and uses 32-bit characters.

(I just commented on something similar in the Guile mailing list ...)

As far as I know (please correct me if I'm wrong), the "ISO version of
the same thing" is still just the 16-bit Unicode standard.  I.e.
there are no standard characters whose values are >= 2**16.  However,
that will change.  At that point, Unicode will be ready for those rare
characters, because it defines an extension mechanism called "surrogate
characters", where two 16-bit characters together encode an additional
2**20 character values.  This is far more than anyone has been
contemplating.  Thus there seems to be no good reason to use more than
16 bits for wchar_t.
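To make the mechanism concrete, here is a minimal sketch of the
arithmetic (my own illustration, not part of any patch; the function
name is made up):

    /* Split a character value above 0xFFFF into a UTF-16 surrogate pair.
       The 20 bits left after subtracting 0x10000 are divided between a
       high surrogate (0xD800..0xDBFF) and a low surrogate (0xDC00..0xDFFF).  */
    #include <stdio.h>

    void
    encode_surrogate_pair (unsigned long c,
                           unsigned short *hi, unsigned short *lo)
    {
      unsigned long v = c - 0x10000;                 /* 20 significant bits */
      *hi = (unsigned short) (0xD800 | (v >> 10));   /* top 10 bits */
      *lo = (unsigned short) (0xDC00 | (v & 0x3FF)); /* bottom 10 bits */
    }

    int
    main (void)
    {
      unsigned short hi, lo;
      encode_surrogate_pair (0x10400, &hi, &lo);
      printf ("U+10400 -> 0x%04X 0x%04X\n", hi, lo); /* 0xD801 0xDC00 */
      return 0;
    }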

But you may well argue that if you use surrogate characters, you no
longer have O(1) random-access from character indexes to memory
locations.  True, but does it matter?  I would argue that there is no
useful operation that needs O(1) random-access from character indexes.
What you may need is to be able to remember a position in a string,
but for that you can use "magic cookies" implemented as byte offsets.
The only real reason you might need O(1) random-access to characters
is for legacy code, but such legacy code is not going to handle
complex character set issues well anyway.  What about combining forms?
Normalization?  Bi-directional text?  Mixing "double-width" Kanji with
"single-width" characters?  Handling any of these is likely to break
code that thinks of a string as an array of characters.
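As an illustration of that byte-offset style (again my own sketch, not
code anyone has proposed), a position can be kept as an offset into a
UTF-8 buffer, and stepping forward one character just skips the
continuation bytes:

    /* Advance a byte-offset "cookie" past one UTF-8 character.
       Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80.  */
    #include <stddef.h>

    size_t
    next_char_offset (const unsigned char *s, size_t len, size_t pos)
    {
      if (pos >= len)
        return len;
      pos++;                                       /* step over the lead byte */
      while (pos < len && (s[pos] & 0xC0) == 0x80)
        pos++;                                     /* skip continuation bytes */
      return pos;
    }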

Of course, once we realize that string processing must move away from
the array-of-characters model, then the virtue of Unicode's "all
characters are encoded in a fixed 16 bits" becomes questionable as well,
and you might as well go to the UTF-8 variable-width multi-byte
encoding ...  Still, there is nothing fundamentally wrong with Unicode's
16-bit almost-fixed-width encoding;  there is just no real advantage
to using it over UTF-8, except compatibility with other languages
(such as Java) or system libraries.
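For comparison, here is a sketch (mine, purely illustrative) of
encoding one character value as UTF-8; a 16-bit character takes one to
three bytes here where the fixed 16-bit form always takes two:

    /* Encode one character value as UTF-8; returns the number of
       bytes written to buf (at most 4).  */
    int
    utf8_encode (unsigned long c, unsigned char *buf)
    {
      if (c < 0x80)
        {
          buf[0] = (unsigned char) c;
          return 1;
        }
      else if (c < 0x800)
        {
          buf[0] = (unsigned char) (0xC0 | (c >> 6));
          buf[1] = (unsigned char) (0x80 | (c & 0x3F));
          return 2;
        }
      else if (c < 0x10000)
        {
          buf[0] = (unsigned char) (0xE0 | (c >> 12));
          buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          buf[2] = (unsigned char) (0x80 | (c & 0x3F));
          return 3;
        }
      else
        {
          buf[0] = (unsigned char) (0xF0 | (c >> 18));
          buf[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
          buf[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          buf[3] = (unsigned char) (0x80 | (c & 0x3F));
          return 4;
        }
    }
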
-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/~per/

