This is the mail archive of the libstdc++@sourceware.cygnus.com mailing list for the libstdc++ project.



Re: Unicode and C++



Nathan Myers <ncm@cantrip.org> writes: 
> Manipulating UTF-8 in memory is pathetic.  UTF-8 is compact and 
> convenient as a network and file format representation, but it sucks 
> rocks for string manipulations, or in general for in-memory operations.  
> Things that are naturally O(1) become O(n) for no reason better than 
> sheer obstinacy and stubbornness.
> 

Or in the GTK+ case, massive quantities of legacy code that have to
keep working. UTF-8 is pretty easy to port to; UCS-4 requires
duplicating the whole API, then porting all apps to it. Without the
nice C++ trick you've outlined here, it's also quite inefficient to
use UCS-4 internally but UTF-8 in the interfaces.
 
> Ideally, we would plan to add wide-character interfaces to the 
> GTK/GNOME components.  A new-generation component system does nobody 
> any favors by forcing them to stick with using 8-bit chars to hold 
> things that are intrinsically bigger.

Sadly (well, partially sadly), GTK+ isn't new-generation; millions of
lines of code already depend on it.

My Inti C++ wrapper is new-generation, however, so I can use your suggestion.
 
> For cases where you want an efficient addressable container object 
> (e.g. for operator[]()), you can make an object that keeps both 
> representations.  Flags indicate that the char[] or wchar_t[] form 
> has been invalidated, and must be (lazily) regenerated after mutative 
> operations on the other form.  Then conversions happen invisibly and 
> only as necessary.  
> 

Excellent, this is the perfect solution.

> The following is just a sketch.
> 
>   class Unicode_string
>   {
>     // constructors
>     explicit Unicode_string(char const* p)
>       : narrow(p), wide(), flags(narrow_ok) {}
>     explicit Unicode_string(std::string const& s)
>       : narrow(s), wide(), flags(narrow_ok) {}
>     explicit Unicode_string(std::wstring const& s)
>       : narrow(), wide(s), flags(wide_ok) {}
> 

If this string goes in libstdc++ as an extension, could it share the
refcounted guts of std::string and std::wstring to avoid copies for
these constructors (and for the conversion operators)?
(I don't even know if you are using refcounting in the latest lib, but
thought I'd ask.)
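
To make the quoted idea concrete, here is a minimal sketch of a string
that keeps both forms and lazily regenerates whichever one is stale.
It is an illustration only, not the class from the quoted message: the
names DualString, utf8_to_utf32, utf32_to_utf8, set and append_utf8 are
hypothetical, it uses char32_t / std::u32string (C++11) instead of
wchar_t / std::wstring so the wide form is UCS-4 on every platform, it
assumes well-formed UTF-8 input, and it does nothing about sharing
refcounted storage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Decode UTF-8 to UTF-32. Assumes well-formed input (no validation).
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes implied by the lead byte.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        char32_t cp = (extra == 0) ? b : (b & (0x3F >> extra));
        ++i;
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Encode UTF-32 to UTF-8. Assumes valid code points.
inline std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical dual-representation string: each form carries a
// validity flag and is rebuilt only when it is next asked for.
class DualString {
public:
    explicit DualString(std::string utf8)
        : narrow_(std::move(utf8)), wide_(), narrow_ok_(true), wide_ok_(false) {}
    explicit DualString(std::u32string utf32)
        : narrow_(), wide_(std::move(utf32)), narrow_ok_(false), wide_ok_(true) {}

    const std::string& utf8() const { sync_narrow(); return narrow_; }
    const std::u32string& utf32() const { sync_wide(); return wide_; }

    // Size and indexing are in code points; O(1) once the wide form is valid.
    std::size_t size() const { return utf32().size(); }
    char32_t operator[](std::size_t i) const { return utf32()[i]; }

    // Mutating one form invalidates the other.
    void set(std::size_t i, char32_t cp) {
        sync_wide();
        wide_[i] = cp;
        narrow_ok_ = false;
    }
    void append_utf8(const std::string& s) {
        sync_narrow();
        narrow_ += s;
        wide_ok_ = false;
    }

private:
    void sync_narrow() const {
        if (narrow_ok_) return;
        assert(wide_ok_);  // invariant: at least one form is always valid
        narrow_ = utf32_to_utf8(wide_);
        narrow_ok_ = true;
    }
    void sync_wide() const {
        if (wide_ok_) return;
        assert(narrow_ok_);
        wide_ = utf8_to_utf32(narrow_);
        wide_ok_ = true;
    }

    mutable std::string narrow_;
    mutable std::u32string wide_;
    mutable bool narrow_ok_;
    mutable bool wide_ok_;
};
```

The caches are mutable so that const accessors can refresh them. That
keeps the conversions invisible to callers, but it also means a const
DualString is not safe to read from several threads without extra
locking.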

Havoc


