This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.

Re: UTF-8, UTF-16 and UTF-32

From: Andrew Haley <aph at redhat dot com>
To: Dallas Clarke <DClarke at unwired dot com dot au>
Cc: gcc-help at gcc dot gnu dot org
Date: Thu, 21 Aug 2008 10:28:50 +0100
Subject: Re: UTF-8, UTF-16 and UTF-32
References: <001e01c90348$71a299a0$0100a8c0@testserver>

Dallas Clarke wrote:

> Now I have had the time to pull myself off the ceiling, I realise the
> problem is that Unix/GCC is supporting both UTF-8 and UTF-32, while
> Windows is supporting UTF-8 and UTF-16. And the solution is for both
> Unix and Windows to support all three Unicode formats.
> 
> I have had to spend the last several days totally writing from scratch
> the UTF-16 string functions, and realise that with a bit of common sense
> every thing can work out okay. Hopefully quick action to move wchar_t to
> 2 bytes and create another type for 4 byte strings, we can see a lot of
> problems solved. Maybe have UTF-16 strings with L"My String" and UTF-32
> with LL"My String" notations.

Changing wchar_t would break the ABI.  It isn't going to happen.
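To make the ABI point concrete, here is a minimal sketch in C. The struct and
function names are hypothetical, invented only for illustration. The width of
wchar_t (typically 4 bytes with GCC on GNU/Linux, 2 bytes on Windows) is
compiled into every structure layout and every L"..." literal, so changing it
would silently break already-built libraries and their callers. The u"..."
and U"..." prefixes with char16_t and char32_t are C11 features, newer than
this message, and are shown only as one way separate 16-bit and 32-bit string
types can sit alongside wchar_t without redefining it.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */
#include <wchar.h>

/* Hypothetical record, as a library might declare it in a public header. */
struct label {
    wchar_t text[8];
    int id;
};

/* Byte offset of `id`. This value is baked into every compiled caller,
   and it depends directly on sizeof(wchar_t). */
size_t label_id_offset(void) { return offsetof(struct label, id); }

/* Storage used by the same text under each literal prefix,
   including the terminating null element. */
size_t wide_literal_bytes(void)  { return sizeof(L"My String"); }
size_t utf16_literal_bytes(void) { return sizeof(u"My String"); }
size_t utf32_literal_bytes(void) { return sizeof(U"My String"); }
```

If wchar_t changed from 4 bytes to 2, label_id_offset() would return a
different value in newly compiled code than in existing binaries, which is
the kind of mismatch an ABI break means.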

> I hope your steering committee can see that there will be lots of UTF-16
> text files out there, with a lot of code required to be written to
> process those files and while UTF-8 will not support many none Latin
> based languages, UTF-32 will not support many none Human base languages
> - i.e. no signal system is fault free.

I don't think that such a change can be decreed by the GCC SC.

I don't understand your claim that "UTF-8 will not support many none Latin
based languages".  UTF-8 <http://tools.ietf.org/html/rfc3629> can encode
every Unicode character from U+0000 to U+10FFFF, using one to four bytes
each.  While programs use a variety of internal representations of
characters, successful transmission of data between machines requires a
common interchange format, and UTF-8 is that format.
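As an illustration of that range, the following is a minimal sketch of a
UTF-8 encoder following the bit layout in RFC 3629. The function name
utf8_encode is hypothetical, not part of any standard library. It rejects
the surrogate range U+D800..U+DFFF, which RFC 3629 prohibits in UTF-8, and
anything above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (RFC 3629 layout).
   Writes 1 to 4 bytes into out and returns the number written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, the non-Latin character U+4E2D encodes to the three bytes
E4 B8 AD, and the last code point, U+10FFFF, encodes to F4 8F BF BF.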

Andrew.

