This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Proposal for 2 Byte Unicode implementation in gcc and glibc


Wilhelm Nuesser writes:

> One simple example: for a typical database of about 100 GB, as used in
> medium-sized companies, we find a ratio of about 70 percent strings to
> 30 percent data. The transition to 2 byte Unicode would increase the
> disk space to (2*70 + 30) % = 170 %. If we change to 4 byte Unicode the
> same database would increase to 310 %.

Application writers distinguish between the external representation of
strings (how they are stored on disk) and the internal representation
(how they are stored in memory most of the time).

About the external representation:

* No one uses UCS-4/UTF-32. It's just too wasteful.

* Many Windows applications use UCS-2 or UTF-16.

* Many Unix applications use UTF-8.

* The particular choice for your applications is up to you. Support
  for all of them is available in glibc-2.1.92, through iconv
  (explicit conversion) or fopen/fgetwc/fputwc (implicit conversion).
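
  For illustration, here is a minimal sketch of the explicit-conversion
  route through iconv; the buffer size is arbitrary, and the input bytes
  are written as UTF-8 escape sequences so that no source file encoding
  is assumed:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Convert a UTF-8 string to UTF-16LE using glibc's iconv. */
        iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
        if (cd == (iconv_t) -1) {
            perror("iconv_open");
            return 1;
        }

        char in[] = "Gr\xc3\xbc\xc3\x9f" "e";  /* "Gruesse" with umlaut and sharp s */
        char out[64];                          /* large enough for this example */
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out);

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
            perror("iconv");
            return 1;
        }
        printf("%zu UTF-8 bytes became %zu UTF-16 bytes\n",
               strlen(in), sizeof(out) - outleft);
        iconv_close(cd);
        return 0;
    }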

About the internal representation:

* Many applications use UTF-8 as internal representation, because it
  does not waste a lot of memory for American and European languages.

* For some complicated tasks, like string pattern matching, temporary
  conversion to UCS-4 is performed, using mbsnrtowcs or equivalent.

* For some simpler tasks, like determining the width of a string,
  the conversion to UCS-4 is often performed on the fly, using
  mbrtowc, with no need for memory allocation (see the sketch after
  this list).

* The ISO C 99 standard and its glibc-2.2 implementation offer their
  entire printf/scanf/IO facilities in both the multibyte (possibly
  UTF-8) and wide (UCS-4 on glibc) flavours.

* Again, the choice is up to you. If you absolutely want the third
  flavour (UTF-16 as in-memory representation), libraries like ICU
  give it to you.
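
To make the on-the-fly conversion concrete, here is a minimal sketch
that computes the display width of a string with mbrtowc and wcwidth
(string_width is a made-up helper name, not a glibc function):

    #define _XOPEN_SOURCE 700   /* for wcwidth on glibc */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Convert one multibyte character at a time with mbrtowc and sum
       the column widths; no heap allocation is needed. */
    static int string_width(const char *s)
    {
        mbstate_t state;
        memset(&state, 0, sizeof state);
        size_t len = strlen(s);
        int width = 0;

        while (len > 0) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, s, len, &state);
            if (n == (size_t) -1 || n == (size_t) -2)
                return -1;          /* invalid or incomplete sequence */
            if (n == 0)
                break;              /* embedded null character */
            int w = wcwidth(wc);
            if (w > 0)
                width += w;
            s += n;
            len -= n;
        }
        return width;
    }

    int main(void)
    {
        setlocale(LC_ALL, "");      /* pick up the locale's encoding */
        printf("width = %d\n", string_width("hello"));
        return 0;
    }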

> These are reasons to use UTF-16: 
>  
>     1.Performance
>  
>       The UTF-16 representation of textual data needs only half the
>       amount of memory that a 32-bit representation would need, provided
>       that surrogate pairs occur only rarely, which will be the
>       case.

Given that most of the world's textual data is encoded in ISO-8859-*
or KOI8-R, encoding it with UTF-8 saves even more memory.

>     2.Portability 
>  
>       Software that uses wchar_t has restricted portability since
>       wchar_t sometimes has 32 bits, but sometimes only 16 bits. A
>       dedicated type for Unicode with platform-independent length
>       makes it possible to write portable software.

Writing portable programs means recognizing what is implementation
dependent and what is not. Yes, sizeof(wchar_t) is implementation
dependent.

If you don't like that, you are free to use a middleware library (like
ICU, again) which shields you from the operating system's types.
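
The point is easy to check; a tiny program shows the variation:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Implementation-defined: typically 4 on glibc/Linux and 2 on
           Windows compilers. Portable code must assume neither. */
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
        return 0;
    }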

>     6.Operations and representation of character strings 
>       
>       Although UTF-32 makes some operations on characters easier
>       (e.g. indexing into strings) this implementation leads to a great
>       overhead in other areas (see searching, collating, displaying etc.
>       where the whole string is involved).

In any of these areas (searching, collating, displaying) you can
afford to temporarily convert from UTF-8 or UTF-16 to UCS-4, because
the actual work involved (canonical [de]composition, treatment of
combining characters, reordering of vowels, etc.) far exceeds the
conversion cost.
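
As a sketch of that pattern, assuming glibc where wchar_t is UCS-4
(error handling and buffer management are simplified for the example):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");

        const char *mb = "some multibyte text";
        mbstate_t state;
        memset(&state, 0, sizeof state);

        /* First call measures: a null destination stores nothing. */
        const char *src = mb;
        size_t n = mbsrtowcs(NULL, &src, 0, &state);
        if (n == (size_t) -1) {
            perror("mbsrtowcs");
            return 1;
        }

        wchar_t *wide = malloc((n + 1) * sizeof *wide);
        if (wide == NULL)
            return 1;
        src = mb;
        memset(&state, 0, sizeof state);
        mbsrtowcs(wide, &src, n + 1, &state);

        /* ... do the expensive work on the UCS-4 string here,
           e.g. wcscoll() for collation ... */
        printf("widened to %zu wide characters\n", n);

        free(wide);
        return 0;
    }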

> For a number of languages, the UTF-8 representation saves some
> storage when compared with UTF-16, but for Asian characters UTF-8
> requires 50% more storage than UTF-16.

Yes, it does. And for English and German, UTF-16 requires 100% more
storage than UTF-8.

> We do not consider UTF-8 as advantageous for text representation in
> the memory. It may be well suited for files where access is
> sequential but in general it is no universal solution.

Whether the access is sequential or random is irrelevant here. When
doing random access into a UTF-16 encoded string, a program must not
process the second half of a surrogate pair before the first half, and
likewise it normally must not process a combining character before its
preceding base character. Therefore, whether in a UTF-32, UTF-16 or
UTF-8 world, random access into strings is done via substrings
(ranges of indices, not single indices), and then it no longer matters
whether the substrings are delimited by two "uint32_t *" or
two "uint16_t *" or two "uint8_t *".

>    2.String and character literals 
>  
>       For utf16_t literals, we suggest the prefix u (similar to the
>       prefix L for the type wchar_t):
>  
>          utf16_t s[] = u"someText"; 
>          utf16_t c = u's'; 
>  
>       For utf32_t, we suggest the prefix U. This is similar to the
>       notation for universal character names in the C++ Standard: \u is
>       followed by four hexadecimal digits and \U is followed by eight
>       hexadecimal digits.

The need for the language extension you propose here (namely, being
able to view and edit source code in non-Unicode text editors) is
already fulfilled by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
feature. That wchar_t is not guaranteed to represent Unicode is
irrelevant, because such programs will work in a given locale only,
anyway. For writing international software, I don't recommend putting
foreign strings in the code. Put them into message catalogs and use
gettext().
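
A minimal sketch of the message-catalog approach; "myapp" and the
locale directory are made-up values, and the translations themselves
come from .po files compiled with msgfmt:

    #include <libintl.h>
    #include <locale.h>
    #include <stdio.h>

    #define _(msgid) gettext(msgid)

    int main(void)
    {
        /* The source contains only the English msgid; the translated
           string is looked up at run time in the catalog matching the
           user's locale. */
        setlocale(LC_ALL, "");
        bindtextdomain("myapp", "/usr/share/locale");
        textdomain("myapp");

        printf("%s\n", _("Hello, world!"));
        return 0;
    }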

                          Bruno
