This is the mail archive of the libstdc++@sourceware.cygnus.com mailing list for the libstdc++ project.



RE: How about basic_string<UTF-8> ?


On May 18, 1999 8:39 PM, Nathan Myers [SMTP:ncm@best.com] wrote:
> > Shiv writes:
> > How about having a char_traits<> specialisation for UTF-8 chars?! ...
> > ... BTW does the char_traits
> > spec have a req for the char to be an integral no of bytes? I think not.
> > In that case I don't think it would be a lot of work to implement a
> > UTF-8 specialisation.
> 
> As one of the primary architects of the Standard C++ Library's
> handling of large character sets, I can say definitively that
> UTF-8 extensions for basic_string<char> would be incompatible
> with the design of the library.
Great job!

> 
> This is not to say that using string to transport UTF-8 sequences
> would not work; string itself doesn't care what you put in it.
> However, any code that looked into the string would not see 
> characters, but only raw bytes.
> 
> The Standard-conforming approach to handling multibyte character 
> encodings is to convert between a fixed-width encoding for manipulation 
> and a variable-width encoding for I/O.  This means that the natural 
> way to handle UTF characters is in basic_string<wchar_t>, or wstring: 
> a "character" in a wstring _really_is_ a character in UTF.  

That will work if wchar_t really holds UCS-4 (i.e. 32-bit Unicode
characters). I think that is the case on some recent commercial OSes such as
HP-UX.
I don't know what Linux assumes for wchar_t: does anybody on this list know?

It will only partly work if wchar_t is UCS-2 (16-bit Unicode characters),
because there will be no support for the 'surrogates' special case (two
16-bit integers for one UTF-16 character). The surrogate extension mechanism
was added because even 64K characters are not enough to cover every need.
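
To make the surrogate case concrete, here is a minimal sketch of how two
16-bit units combine into one character. The function names are my own,
chosen for illustration; they come from no library. The ranges
(U+D800..U+DBFF for the first unit, U+DC00..U+DFFF for the second) are the
ones reserved by Unicode for this purpose.

```cpp
#include <cassert>
#include <cstdint>

// True if u is a high (leading) surrogate, U+D800..U+DBFF.
inline bool is_high_surrogate(std::uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

// True if u is a low (trailing) surrogate, U+DC00..U+DFFF.
inline bool is_low_surrogate(std::uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Combine a valid surrogate pair into a single code point in the range
// U+10000..U+10FFFF. The caller must pass a high unit followed by a low unit.
inline std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

Code that treats each 16-bit wchar_t as one character never performs this
step, so it sees such a character as two separate, meaningless units.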

But, IMHO, one issue is portability.
At the moment wchar_t is defined differently, and seems to mean different
things, on different platforms.
On Windows, wchar_t is a 16-bit unsigned integer.
On most Unixes, wchar_t is 32 bits, but not always UCS-4 (i.e. not always
Unicode).
Moreover, the C locale-related functions that work on wchar_t don't always
assume wchar_t is UCS-4.
They sometimes assume a multibyte character (as in the EUC-JP Japanese
locales) expanded to 32 bits.

So if we want Unicode in wchar_t, we will have to either implement each
locale-related wchar_t function from scratch or not use them. Am I wrong?
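
As a rough illustration of the portability question, the sketch below asks
the platform what its wchar_t can hold. The three function names are
hypothetical. WCHAR_MAX comes from <cwchar>; __STDC_ISO_10646__ is the macro
an implementation may define to promise that wchar_t values are ISO 10646
code points, and I am assuming here that its absence simply means "no
promise", not "definitely not Unicode".

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on this platform.
inline std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if one wchar_t is wide enough for any Unicode code point
// (up to U+10FFFF). This says nothing about what the values mean.
inline bool wchar_holds_any_code_point()
{
    return WCHAR_MAX >= 0x10FFFF;
}

// True only if the implementation advertises ISO 10646 values in wchar_t.
inline bool wchar_is_iso10646()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

A wide enough wchar_t with no such promise is exactly the situation described
above: 32 bits, but possibly an expanded multibyte encoding and not Unicode.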

> 
> The Standard place for code that understands UTF-8 is the codecvt<> 
> facet which performs the conversion to and from wide characters.
> Help on implementing this component for UTF encodings would be most 
> welcome.

Plenty of source code already exists for UTF-8 <=> UCS-4 (Unicode wchar_t)
conversions.
You can find some freely usable code in the expat XML parser
(http://www.jclark.com/xml/expat.html)
and on the Unicode FTP site, ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/
I recommend the latter; it is optimized and very clear.
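
For readers who just want to see the shape of such a conversion, here is a
minimal sketch of decoding one UTF-8 sequence into a 32-bit code point. It is
my own illustration, not the expat or CVTUTF code, and the name decode_utf8
is hypothetical. It assumes sequences of at most 4 bytes (code points up to
U+10FFFF) and rejects overlong forms and surrogate values.

```cpp
#include <cstddef>
#include <cstdint>

// Decode the UTF-8 sequence starting at s, with n bytes available.
// On success, stores the code point in cp and returns the number of bytes
// consumed (1 to 4). Returns 0 and leaves cp untouched if the input is
// empty, truncated, malformed, overlong, a surrogate, or above U+10FFFF.
inline std::size_t decode_utf8(const unsigned char* s, std::size_t n,
                               std::uint32_t& cp)
{
    if (n == 0) return 0;

    const unsigned char b0 = s[0];
    if (b0 < 0x80) {            // single byte: plain ASCII
        cp = b0;
        return 1;
    }

    std::size_t len;            // total length of the sequence
    std::uint32_t value;        // bits accumulated so far
    std::uint32_t min;          // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
    else return 0;              // stray continuation byte or invalid lead

    if (n < len) return 0;      // sequence is cut off

    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;   // not a continuation byte
        value = (value << 6) | (s[i] & 0x3Fu);
    }

    if (value < min) return 0;                         // overlong encoding
    if (value >= 0xD800 && value <= 0xDFFF) return 0;  // surrogate value
    if (value > 0x10FFFF) return 0;                    // beyond Unicode

    cp = value;
    return len;
}
```

A codecvt<> facet for UTF-8 would wrap a loop over something like this,
adding the bookkeeping for partial input at buffer boundaries.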

Christophe Pierret


