This is the mail archive of the libstdc++@sourceware.cygnus.com mailing list for the libstdc++ project.


RE: How about basic_string<UTF-8> ?

To: "Lib3 (E-mail)" <libstdc++@sourceware.cygnus.com>
Subject: RE: How about basic_string<UTF-8> ?
From: Christophe PIERRET <cpierret@businessobjects.com>
Date: Tue, 18 May 1999 18:23:22 +0200
Cc: 'Shiv Shankar Ramakrishnan' <Shiv@pspl.co.in>

On May 18, 1999 5:57 PM, Edwards, Phil [SMTP:pedwards@ball.com] wrote:
> 
> + basic_string<T>::operator +=(T c);
> + This operation will only work for ascii chars which are one 
> + byte long in UTF-8.
> 
> Why?  The very first paragraph of the strings clause states:
> 
> #   This clause describes components for manipulating sequences of
> #   "characters," where characters may be of any POD (3.9) type. In this
> #   clause such types are called charlike types, and objects of char
> #   like types are called charlike objects or simply "characters." 
> 
> As long as whatever you pick for T (when instantiating basic_string) is a
> POD type, then op+= is defined to work.

There is simply no correct POD type for UTF-8, since a UTF-8 character
varies in length (in bytes).
If you use something like a POD sized for the longest UTF-8 sequence, shorter
characters have to be padded, so c_str() no longer gives you a UTF-8 string ...
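
To make the varying length concrete, here is a minimal sketch of how the lead
byte determines the length of a sequence. It is only an illustration: the names
utf8_sequence_length and utf8_length_examples are made up, nothing like them is
part of the standard library, and the sketch assumes the current definition of
UTF-8, in which a sequence is at most four bytes long.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a library function): returns how many bytes the
// UTF-8 sequence starting with `lead` occupies, or 0 if `lead` cannot start
// a sequence (a continuation byte or an invalid lead byte).
// Assumption: sequences are at most four bytes long.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}

// Usage example: one element type T would have to cover all of these sizes.
inline void utf8_length_examples() {
    assert(utf8_sequence_length(0x41) == 1);  // 'A'
    assert(utf8_sequence_length(0xC3) == 2);  // lead byte of U+00FC
    assert(utf8_sequence_length(0xE2) == 3);  // lead byte of U+20AC
    assert(utf8_sequence_length(0xBC) == 0);  // continuation byte
}
```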

> 
> 
> + basic_string<T>::reference basic_string<T>::operator[](size_type pos);
> + This operation is meaningless for anything but a pure ascii string ...
> 
> Why?
I was speaking about the UTF-8 case, and assuming the only basic_string<>
instantiation that makes sense for UTF-8: one over a POD type that is a byte.
With that instantiation, if a character is two (or more) bytes long,
operator[] will never give you the entire character.
ASCII chars work because they have the amazing property of being one byte
long in the UTF-8 encoding.
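
A small sketch of the problem, using std::string as the byte-based
instantiation. The helper names utf8_count_code_points and glueck_utf8 are
invented for this example, and the counting function assumes the string
already holds well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters by counting every byte that
// is not a continuation byte (10xxxxxx).
// Assumption: `s` holds well-formed UTF-8; no validation is done.
inline std::size_t utf8_count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// "Glück" in UTF-8: the u-umlaut (U+00FC) is the two bytes 0xC3 0xBC.
// The literal is split so "ck" is not read as part of the hex escape.
inline std::string glueck_utf8() {
    return std::string("Gl\xC3\xBC" "ck");
}

// Usage example: size() and operator[] see bytes, not characters.
inline void utf8_indexing_example() {
    std::string s = glueck_utf8();
    assert(s.size() == 6);                              // six bytes
    assert(utf8_count_code_points(s) == 5);             // five characters
    assert(static_cast<unsigned char>(s[2]) == 0xC3);   // only half of the umlaut
}
```

Here s[2] and s[3] are each a unit of storage, and neither one alone is the
character the user thinks of as the third letter.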

I don't mean it is impossible to use basic_string<> to operate on UTF-8; I
just mean it is dangerous and bug-prone.

> 
> + You can get any byte in a character and it would be a unit 
> + of storage not necessarily a character.
> 
> basic_string<T>::reference is of type T&, whatever that may mean.  It does
> not have to be a single byte in size.

No, you're right, but since no POD can represent a UTF-8 character (with
its varying length), you have to use byte-based storage such as a
basic_string of bytes.
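
With byte-based storage, appending one character means appending one to four
bytes, which is why operator+=(T c) on its own only covers ASCII. A rough
sketch of what an append could look like follows. The name utf8_append is
hypothetical, std::string stands in for the "basic_string of bytes", and the
sketch assumes the current four-byte limit of UTF-8. It does not reject
surrogate values (U+D800 to U+DFFF), which a real implementation should.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point `cp` to `out`.
// Returns false (and appends nothing) if `cp` is above U+10FFFF.
// Assumption: surrogate code points are not checked in this sketch.
inline bool utf8_append(std::string& out, unsigned long cp) {
    if (cp < 0x80) {                       // 1 byte: same as operator+=(char)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Usage example: one appended character grows the string by two bytes.
inline void utf8_append_example() {
    std::string s("Gl");
    bool ok = utf8_append(s, 0xFC);        // U+00FC, u-umlaut
    assert(ok);
    assert(s.size() == 4);
    assert(s == std::string("Gl\xC3\xBC"));
}
```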

> 
> 
> + Finally, adding support for a UTF-8 string is far beyond the 
> + scope of a
> + standard C++ library.
> 
> We are in agreement there.  But there's no reason why it couldn't be done
> as an extension, or even (shameless plug) a HOWTO.

Because of intellectual property laws, all I can do is try to share some
experiences and thoughts on the matter.  (With no time and no budget, of
course!)

> 
> 
> Luck++;
> Phil

++Glück (in German)
Chris
