This is the mail archive of the
libstdc++@sourceware.cygnus.com
mailing list for the libstdc++ project.
Re: How about basic_string<UTF-8> ?
- To: libstdc++@sourceware.cygnus.com
- Subject: Re: How about basic_string<UTF-8> ?
- From: Nathan Myers <ncm@best.com>
- Date: Tue, 18 May 1999 11:39:15 -0700 (PDT)
> Shiv writes:
> How about having a char_traits<> specialisation for UTF-8 chars?! ...
> ... BTW does the char_traits
> spec have a req for the char to be an integral no of bytes? I think not.
> In that case I don't think it would be a lot of work to implement a
> UTF-8 specialisation.
As one of the primary architects of the Standard C++ Library's
handling of large character sets, I can say definitively that
UTF-8 extensions for basic_string<char> would be incompatible
with the design of the library.
This is not to say that using string to transport UTF-8 sequences
would not work; string itself doesn't care what you put in it.
However, any code that looked into the string would not see
characters, but only raw bytes.
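
To illustrate the point above, here is a minimal sketch (not part of the
library; utf8_code_points is a hypothetical helper, and the input is assumed
to be well-formed UTF-8). It shows that size() and operator[] on a string
holding UTF-8 deal in bytes, not characters:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by skipping UTF-8
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return n;
}

// "naive" with U+00EF (i with diaeresis), which UTF-8 encodes as the
// two bytes C3 AF. The literal is split so the hex escapes stay isolated.
const std::string sample = "na" "\xC3\xAF" "ve";
```

For this sample, sample.size() reports the byte count (6), while the helper
counts 5 code points; sample[2] is the lone lead byte 0xC3, not a character.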
The Standard-conforming approach to handling multibyte character
encodings is to convert between a fixed-width encoding for manipulation
and a variable-width encoding for I/O. This means that the natural
way to handle UTF characters is in basic_string<wchar_t>, or wstring:
a "character" in a wstring _really_is_ a character in UTF.
The Standard place for code that understands UTF-8 is the codecvt<>
facet which performs the conversion to and from wide characters.
Help on implementing this component for UTF encodings would be most
welcome.
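
As a rough sketch of the byte-level work such a facet would have to do on
input, the function below decodes UTF-8 into a wstring. It is not a codecvt<>
implementation: utf8_to_wide is a hypothetical free function, it assumes
wchar_t is wide enough to hold every decoded code point, and it does not
reject overlong encodings or surrogate values. A real
codecvt<wchar_t, char, mbstate_t> would perform the same decoding
incrementally in its do_in() member, keeping partial-sequence state between
calls.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a complete UTF-8 byte string into wide
// characters, one wchar_t per code point.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;   // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low six payload bits
        }

        out.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return out;
}
```

After conversion, each element of the wstring is one character, so indexing
and size() behave as a caller would expect; the variable-width form is only
needed again when the text is written back out.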
Nathan Myers
ncm@cantrip.org