This is the mail archive of the
mailing list for the libstdc++ project.
RE: FW: Unicode and C++
- To: Shiv at pspl dot co dot in, Shiv Shankar Ramakrishnan <Shiv at pspl dot co dot in>
- Subject: RE: FW: Unicode and C++
- From: Stephen Webb <stephen at bregmasoft dot com>
- Date: Fri, 7 Jul 2000 09:12:32 -0400
- Cc: libstdc++ at sourceware dot cygnus dot com
- Organization: CyberSafe
- References: <firstname.lastname@example.org>
On Fri, 07 Jul 2000, Shiv Shankar Ramakrishnan wrote:
> |Not in the C++ standard, which leaves it implementation-defined.
> Oh! I wasn't aware of that. I guess that creates another royal mess ...
Not at all. I've done a lot of heavy-duty i18n work in C on Unixes (not just simple GUI stuff, but complex
collations and parsing). All of this software had to deal with not just the US and Europe (very simple stuff) but
China and Korea as well. Both of these countries have been using wide-character data sets for decades now, and have a
mountain of legacy code and data that do not, and likely will not, use Unicode. The C and C++ standards were
developed to address these very real situations rather than the still-nonstandard Unicode.
So the standard does not create a mess; it tries to support chaos in a structured and predictable fashion.
> |> It seems that for most of the living languages 16bit UTF-16 or the
> |> BMP plane of ISO-10646 is more than enough.
> |It is by far not enough. Assignments to plane 1 and plane 2 are in
> |progress; plane 14 is reserved for language tagging. See the Unicode
> |Consortium pages for details.
> I had a dekko at it. They are adding things like Klingon! and scholarly
> and ancient languages. That's why I said 'living languages' and it's a
> stated goal to fit all of the living languages in the BMP only. So it
> doesn't matter if you have to use surrogate pairs for the unusual
> langs. After all you do make things easy and fast for 95% of the case.
> But yes I do see the point in having one simple 32bit character. But it
> seems so extravagant for most data. Is it a political decision to have
> wchar_t as 32 bit due to old EUC stuff for *NIXes or is there a purely
> technical reason? After all if you can have UTF-8 favouring ASCII then
> why can't you have UCS-2 (UTF-16) favouring the BMP Unicode? Seems
> reasonable to me. I'm sure Europeans don't particularly relish UTF-8 for
> penalising Latin-1 to 2 bytes each.
Traditional Chinese requires more than 32767 characters (you can invent your own, and people do). It, and some
character sets used in Japan, require at least 3 bytes, so smarter implementations provide a 4-byte wchar_t. As
well, the basic addressable unit in most modern processors is 32 bits (an int), so in terms of speed and caching
there is no advantage to a 16-bit wchar_t over a 32-bit one. Memory is cheap. There is no penalty to Europeans in
using 32-bit wide characters, but there is a penalty to other locales in using 16-bit wide characters.
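To make the penalty concrete: with 32-bit wide characters the Nth character is just buf[N], but with 16-bit units a code point outside the BMP occupies two units (a surrogate pair), so even simple indexing needs a decode loop. A minimal sketch (mine, not from the standard library):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return the nth code point of a UTF-16 sequence.  High surrogates
// (0xD800-0xDBFF) signal that this code point spans two 16-bit units,
// so we cannot index directly the way we can with 32-bit wchar_t.
std::uint32_t nth_code_point_utf16(const std::vector<std::uint16_t>& s,
                                   std::size_t n) {
    std::size_t i = 0;
    while (true) {
        std::uint16_t u = s.at(i);
        bool pair = (u >= 0xD800 && u <= 0xDBFF);  // high surrogate?
        if (n == 0) {
            if (!pair) return u;                   // BMP character
            std::uint16_t lo = s.at(i + 1);        // low surrogate
            return 0x10000 + ((std::uint32_t(u) - 0xD800) << 10)
                           + (std::uint32_t(lo) - 0xDC00);
        }
        i += pair ? 2 : 1;                         // skip 1 or 2 units
        --n;
    }
}
```

With a 32-bit wchar_t the whole function collapses to `s.at(n)`, which is the speed argument above in miniature.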
As an aside, what we ended up doing for our i18n stuff was to use a multibyte character set (like UTF-8) for most
things such as file I/O and simple manipulations, but convert to a wide-character set for anything that required
manipulating character data specifically, such as collation. Multibyte codes are fine for string manipulation but
useless and very error-prone for character manipulation.
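That workflow can be sketched with the standard C conversion and collation routines (a minimal sketch, not our actual code; it assumes the process locale has been set with setlocale so that mbstowcs knows the multibyte encoding):

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Convert a multibyte string (e.g. UTF-8 or EUC) to wide characters.
// A wide string never has more characters than the multibyte string
// has bytes, so mb.size() + 1 is always a sufficient buffer.
std::wstring to_wide(const std::string& mb) {
    std::vector<wchar_t> buf(mb.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), mb.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();  // invalid multibyte sequence
    return std::wstring(buf.data(), n);
}

// Collate per the current locale's rules: convert to wide characters
// first, then compare.  Comparing the raw multibyte bytes would order
// code units, not characters.
int collate(const std::string& a, const std::string& b) {
    return std::wcscoll(to_wide(a).c_str(), to_wide(b).c_str());
}
```

The multibyte form stays in files and on the wire; the wide form exists only transiently, inside the operations that actually need to see characters.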
Stephen M. Webb