This is the mail archive of the libstdc++@sourceware.cygnus.com mailing list for the libstdc++ project.

RE: How about basic_string<UTF-8> ?

To: 'Shiv Shankar Ramakrishnan' <Shiv@pspl.co.in>,libstdc++ <libstdc++@sourceware.cygnus.com>
Subject: RE: How about basic_string<UTF-8> ?
From: Christophe PIERRET <cpierret@businessobjects.com>
Date: Tue, 18 May 1999 16:35:25 +0200

On May 18, 1999 3:48 PM, Shiv Shankar Ramakrishnan [SMTP:Shiv@pspl.co.in]
wrote:
> Hi,
> How about having a char_traits<> specialisation for UTF-8 chars?! I am
> not really sure but I guess Linux uses the UTF-8 version of Unicode for
> its internationalisation. This would be a great help to do development
> with strings for UTF-8 data. As it is the whole LDAP (Light weight
> directory access protocol RFC 2251) world uses UTF-8 for its data.
> All of us would love to have a UTF-8 version of string. If somebody
> gives me some pointers I am willing to do this. BTW does the char_traits
> spec have a req for the char to be an integral no of bytes? I think not.

If you understand 'char' as a unit of storage, that is true.
But it is very dangerous to manipulate a UTF-8 string as a raw byte string.

> In that case I don't think it would be a lot of work to implement a
> UTF-8 specialisation.
> Thanks,
> Shiv
Let's take a basic_example<> :)

basic_string<T>::operator +=(T c);
This operation only works for ASCII characters, which are one byte long in
UTF-8.
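
To illustrate why a single-unit operator+= is not enough, here is a minimal
sketch of what appending one character to a byte-based string involves. The
helper name append_utf8 is hypothetical (it is not part of any library), and
the sketch assumes the input is a valid code point no larger than 0x10FFFF.

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to a byte string,
// encoded as UTF-8. Assumes cp is a valid code point (<= 0x10FFFF).
inline void append_utf8(std::string& s, unsigned long cp) {
    if (cp < 0x80) {
        // ASCII: the only case where "s += c" with a single unit is enough.
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: lead byte 110xxxxx, then one continuation byte 10xxxxxx.
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: lead byte 1110xxxx, then two continuation bytes.
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: lead byte 11110xxx, then three continuation bytes.
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Appending U+0041 grows the string by one byte, U+00E9 by two and U+20AC by
three, so "append a character" cannot be expressed as "append one T".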

basic_string<T>::reference basic_string<T>::operator[](size_type pos);
This operation is meaningless for anything but a pure ASCII string ...
It can return any byte within a character, which is a unit of storage and not
necessarily a character.
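
As a small sketch of the problem, the following checks whether a byte index
falls on a character boundary or in the middle of a multi-byte sequence. Both
function names are hypothetical, and the sketch assumes the string holds
well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
inline bool is_utf8_continuation(char byte) {
    return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
}

// Hypothetical helper: true if pos is the first byte of a character
// (or one past the end), false if pos points into the middle of one.
inline bool is_char_boundary(const std::string& s, std::size_t pos) {
    return pos >= s.size() || !is_utf8_continuation(s[pos]);
}
```

For the two-byte encoding of U+00E9 followed by 'a', index 0 and index 2 are
boundaries, while s[1] is only the tail of the first character.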

To treat UTF-8 correctly, you also need to distinguish between length expressed
in number of characters and length expressed in number of bytes; char_traits<>
just doesn't allow you to do this ...
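
The two lengths can be compared with a short sketch. The function name
utf8_char_count is hypothetical; it assumes well-formed UTF-8 and counts code
points by skipping continuation bytes (10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of characters (code points) in a UTF-8 byte
// string. Every byte that is not a continuation byte starts a new character.
inline std::size_t utf8_char_count(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes of "caf" followed by U+00E9, size() reports 5 storage units
while utf8_char_count reports 4 characters.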

Stricto sensu, you can't have a 'REAL' UTF-8 string that is a basic_string<>,
because of the distinction between characters (at the abstract level) and your
unit of storage, which is some kind of byte (char, unsigned char or signed char).
If you implement a basic_string<UTF8> (UTF8 being a typedef for 'unsigned
char'), what you get is operations on units of storage and not on characters;
that is, you can easily break characters into meaningless pieces.

Doing it properly is a lot more work, especially if you want versatility
between a UTF-16 string and a UTF-8 string.
(Iterators on a UTF-8 string won't be especially efficient ...)

The only encoding of Unicode you can use safely with a basic_string<>
interface is UCS-4 (32 bits)...
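
A brief sketch of the 32-bit case, assuming a C++11 compiler (std::u32string
and char32_t did not exist when this was written; a basic_string<> over any
32-bit unit type plays the same role). The function name make_cafe is
hypothetical.

```cpp
#include <string>

// With a 32-bit unit, one element of the string is one code point, so
// operator+= and operator[] both operate on whole characters.
inline std::u32string make_cafe() {
    std::u32string s = U"caf";
    s += U'\u00E9';  // one character, one storage unit
    return s;
}
```

Here size() is 4 and make_cafe()[3] is the code point U+00E9 itself, not a
fragment of it.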

I use my own basic_string<>-like and char_traits<>-like classes for UTF-16
and UTF-8 strings.
If you want to do it quickly, you can use basic_string<> as an underlying
implementation class, but you'll have to ensure consistency in operations on
real-world characters (surrogates, character boundary handling, ...).
You'll also have to extend your char_traits<>-like classes to support the
distinction between unit of storage and character
(for example: length() and char_length() members and the like).
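
A minimal sketch of that approach follows. It is not my actual class: the
class name utf8_string and the member char_offset are hypothetical, the two
lengths are placed directly on the wrapper for brevity, and the constructor
assumes (without checking) that it is given well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper that uses std::string purely as byte storage and
// exposes both a storage-unit length and a character length.
class utf8_string {
public:
    utf8_string() {}
    // Assumption: bytes is already well-formed UTF-8; no validation is done.
    explicit utf8_string(const std::string& bytes) : bytes_(bytes) {}

    // Length in units of storage (bytes).
    std::size_t length() const { return bytes_.size(); }

    // Length in characters (code points): count non-continuation bytes.
    std::size_t char_length() const {
        std::size_t count = 0;
        for (std::size_t i = 0; i < bytes_.size(); ++i) {
            if (!is_continuation(bytes_[i])) ++count;
        }
        return count;
    }

    // Byte offset at which the character with the given index starts,
    // or length() if there is no such character. This is a linear scan.
    std::size_t char_offset(std::size_t index) const {
        std::size_t seen = 0;
        for (std::size_t i = 0; i < bytes_.size(); ++i) {
            if (!is_continuation(bytes_[i])) {
                if (seen == index) return i;
                ++seen;
            }
        }
        return bytes_.size();
    }

    const std::string& bytes() const { return bytes_; }

private:
    // A continuation byte has the bit pattern 10xxxxxx.
    static bool is_continuation(char byte) {
        return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
    }

    std::string bytes_;
};
```

For the bytes of U+00E9, 't', U+00E9, length() is 5 and char_length() is 3.
Finding the start of a character costs a scan from the beginning, which is
one reason character-level access on UTF-8 is not especially efficient.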

Finally, adding support for a UTF-8 string is far beyond the scope of a
standard C++ library.
(basic_string<unsigned char> is the closest approximation the std lib can
offer.)
The only area of interest for UTF-8 is probably locale facets (codecvt<>?)...

Cordially,
Christophe Pierret
