This is the mail archive of the libstdc++@sourceware.cygnus.com mailing list for the libstdc++ project.



RE: How about basic_string<UTF-8> ?


Hi,
First of all, thank you everybody for all the insights provided.

|UTF-8 extensions for basic_string<char> would be incompatible
|with the design of the library.

Okay, that clears things up wrt the Std.

|way to handle UTF characters is in basic_string<wchar_t>, or wstring: 
|a "character" in a wstring _really_is_ a character in UTF.  

This sounds sensible enough, as long as one can convert them back to raw
bytes (unsigned char) to send to code that expects them in that form.
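
To make the "convert back to raw bytes" step concrete, here is a minimal
sketch of what I have in mind. The helper name to_utf8 is hypothetical
(it is not part of libstdc++ or the Std), and it assumes that every
wchar_t in the wstring holds one complete Unicode code point, which may
not be true on platforms with a 16-bit wchar_t.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only.
// Assumption: each wchar_t holds one whole Unicode code point.
std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < ws.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(ws[i]);
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            assert(cp <= 0x10FFFF);  // precondition: a valid code point
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result is an ordinary std::string, so its data() can be handed to
any code that expects the raw byte form. A real solution would
presumably live in a codecvt<> facet rather than a free function.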

|The Standard place for code that understands UTF-8 is the codecvt<> 
|facet which performs the conversion to and from wide characters.

Information about locale, codecvt<> etc. is sketchy at best. Where does
one find more info about all this? Stroustrup 3e skips over this since it's
beyond the scope of the book. Other than the Std, is there any other source
of info for all this?

Some of my own thoughts after all the input follow:

1. Okay, so basic_string<UTF8> is a bad idea. So how about some other class,
   like the one Christophe was suggesting, which would have a lot of
   basic_string<>-like behaviour?

2. Somebody correct me if I am wrong here. The primary purpose of strings
   in C++ is to ease writing code which manipulates string-like data. Now
   aren't we missing the point if we don't have a class to handle UTF8
   strings in C++? Any class will do as long as it is highly usable.

3. A lot of the problems with UTF8 arise from the fact that a UTF8 char has
   a variable length. So how do the other multibyte charsets (like MBCS
   in Windows) handle this? Well, they have functions to do all operations,
   including something like *p++. So can't we encapsulate all that in a
   class, along with extra functions to give length, char_length etc.?

4. Isn't handling wide chars getting too messy with all this? There are so
   many formats, like UTF-8/16 and UCS-2/4. Something seems very wrong if
   we can't have classes in C++ that can handle all these and interconvert.
   Of course that seems to be like asking for the moon ;-) But I'm sure
   something can be done.

5. Lastly, we *ought* to have a good string class that allows one to write
   international strings easily and convert them to other formats (at least
   UTF8). If this just happens to be wstring with codecvt<> facets, then
   that is good enough.
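
To illustrate point 3, here is a rough sketch of the kind of wrapper
class I mean. Everything in it is hypothetical: the class name
utf8_string and the members byte_length, length, char_length and next
are made up for this mail and come from no library. It also assumes the
stored bytes are already well-formed UTF8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical class, for illustration only.
// Assumption: the bytes passed in are well-formed UTF-8.
class utf8_string {
public:
    explicit utf8_string(const std::string& bytes) : bytes_(bytes) {}

    // Storage size in bytes.
    std::size_t byte_length() const { return bytes_.size(); }

    // Number of characters: every byte that is not a continuation
    // byte (10xxxxxx) starts a new character.
    std::size_t length() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < bytes_.size(); ++i)
            if ((static_cast<unsigned char>(bytes_[i]) & 0xC0) != 0x80)
                ++n;
        return n;
    }

    // Number of bytes used by the character whose lead byte is at pos.
    std::size_t char_length(std::size_t pos) const {
        assert(pos < bytes_.size());
        unsigned char lead = static_cast<unsigned char>(bytes_[pos]);
        if (lead < 0x80) return 1;          // 0xxxxxxx
        if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
        if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
        return 4;                           // 11110xxx
    }

    // The equivalent of *p++: decode the character at pos, then
    // advance pos to the lead byte of the next character.
    unsigned long next(std::size_t& pos) const {
        std::size_t len = char_length(pos);
        unsigned char lead = static_cast<unsigned char>(bytes_[pos]);
        // Keep only the payload bits of the lead byte.
        unsigned long cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6)
               | (static_cast<unsigned char>(bytes_[pos + i]) & 0x3F);
        pos += len;
        return cp;
    }

private:
    std::string bytes_;
};
```

Note that length() has to walk the whole string, so it is linear rather
than constant time, and random access by character index would have the
same cost. That is probably part of why this does not fit the
basic_string<> design.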

I guess asking for a new basic_string<>-like class for UTF8 may be too
much, so we should at least have what Nathan is suggesting, i.e. a
codecvt<> for UTF8. What do you all think?
Thanks,
Shiv

