This is the mail archive of the libstdc++@sourceware.cygnus.com mailing list for the libstdc++ project.



Re: How about basic_string<UTF-8> ?


> Shiv writes:
> How about having a char_traits<> specialisation for UTF-8 chars?! ...
> ... BTW does the char_traits
> spec have a req for the char to be an integral no of bytes? I think not.
> In that case I don't think it would be a lot of work to implement a
> UTF-8 specialisation.

As one of the primary architects of the Standard C++ Library's
handling of large character sets, I can say definitively that
UTF-8 extensions for basic_string<char> would be incompatible
with the design of the library.

This is not to say that using string to transport UTF-8 sequences
would not work; string itself doesn't care what you put in it.
However, any code that looked into the string would not see 
characters, but only raw bytes.
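
To illustrate the point about raw bytes, here is a minimal sketch, assuming
well-formed UTF-8 input. The helper name count_code_points is hypothetical
and is not part of the library; it only shows that size() and operator[]
report bytes, so anything character-oriented has to be layered on top.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not a library function: counts code points by
// skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // every byte that is not a continuation byte starts a code point
        }
    }
    return n;
}

// Usage: the string holds "cafe" with an acute accent on the final letter,
// which UTF-8 encodes as the two bytes 0xC3 0xA9.
void example_bytes_vs_characters() {
    const std::string s = "caf\xC3\xA9";
    assert(s.size() == 5);                             // string counts bytes
    assert(count_code_points(s) == 4);                 // a reader counts 4 characters
    assert(static_cast<unsigned char>(s[3]) == 0xC3);  // s[3] is a lead byte, not a character
}
```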

The Standard-conforming approach to handling multibyte character 
encodings is to convert between a fixed-width encoding for manipulation 
and a variable-width encoding for I/O.  This means that the natural 
way to handle UTF characters is in basic_string<wchar_t>, or wstring: 
a "character" in a wstring _really_is_ a character in UTF.  

The Standard place for code that understands UTF-8 is the codecvt<> 
facet which performs the conversion to and from wide characters.
Help on implementing this component for UTF encodings would be most 
welcome.
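
As a rough sketch of the narrow-to-wide direction of that conversion, the
function below decodes UTF-8 bytes into fixed-width code points. It is not
a codecvt<> facet and not the library's implementation: decode_utf8 is a
hypothetical name, the input is assumed to be well-formed, and char32_t is
used in place of wchar_t only because the width of wchar_t varies between
platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into one fixed-width element per
// code point. Assumes well-formed input; it does not reject overlong
// forms, surrogates, or stray continuation bytes.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Usage: after decoding, one element really is one character.
void example_decode() {
    const std::u32string w = decode_utf8("caf\xC3\xA9");
    assert(w.size() == 4);
    assert(w[3] == 0xE9);  // U+00E9, e with acute accent
}
```

A real facet would also have to handle the wide-to-narrow direction, partial
sequences at buffer boundaries, and malformed input, none of which this
sketch attempts.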

Nathan Myers
ncm@cantrip.org

