This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.


Re: UTF-8, UTF-16 and UTF-32

From: Eljay Love-Jensen <eljay at adobe dot com>
To: Dallas Clarke <DClarke at unwired dot com dot au>, GCC-help <gcc-help at gcc dot gnu dot org>
Date: Sat, 23 Aug 2008 07:08:05 -0500
Subject: Re: UTF-8, UTF-16 and UTF-32

Hi Dallas,

The changes you propose, in my opinion, are noble and worthy.

Unfortunately, neither C (ISO 9899) nor C++ (ISO 14882) can incorporate
those changes.  Making those changes as extensions to C or C++ would create
a variant language that is almost C and almost C++, which would indubitably
cause more issues in the long run.

Also, GCC cannot mandate those ABI changes, since GCC complies with the
platform-required ABI, not vice versa.

I have a for-instance... GCC had several C++ extensions that I thought were
great, I used them a lot, and I tended to call my not-quite-C++ code "G++"
(informally).  Even though the extensions were cool, sensible, and
useful, they were an enormous impediment to portability.  I no longer do
that.  [Those in the GCC community who have also done this are either
laughing or shuddering, or both.]

I had an opportunity to speak with Bjarne Stroustrup about all sorts of
issues with C++, as I saw them.  He stopped me short and said (paraphrased),
"If you don't like C++, you are free to write your own compiler.  I did."

Unicode and/or ISO 10646 were not on my radar at that time.  Had they been,
I probably would have brought that up too, since I am a Unicode fanboy.

What you are proposing is not C, and is not C++.  The FSF controls neither
ISO 9899 nor ISO 14882.  GCC does not drive the platform ABI.

HOWEVER, you are at liberty to write your own language.  I tried, and I
discovered that writing a good, fleshed-out, general purpose programming
language is very, very hard.  (I was using the GCC back-end, so all I needed
to do was write the front-end for my ultimate programming language.)

FORTUNATELY, there is a programming language that is much like C++, which
has the Unicode support you are looking for, and has a GCC front-end.  The
language is the D Programming Language <http://www.digitalmars.com/d/>.  It
is available now.  D 1.0 is supported by the gdc project
<http://dgcc.sourceforge.net/>, and has been used in commercial software.
Digital Mars, the progenitor of the D Programming Language, supplies its own
dmd compiler for Windows and Linux.

There's also Java, which has excellent Unicode support, and supports Unicode
source code as well as Unicode strings at runtime.

Alternatively, embrace ICU <http://www.icu-project.org/> for C (which works
in C++ too) to work with Unicode strings.  But ICU only handles strings at
runtime; it does not give you Unicode source code.

> ...the Chinese can write their source code in visible Mandarin in UTF-16 or
> UTF-32...

I think you misunderstand what UTF-16 and UTF-32 are.

The visible Mandarin source code would be in Unicode.

UTF-8, UTF-16, and UTF-32 are encoding forms of Unicode: different ways of
serializing the same code points.  You don't "write in UTF-16" or "write in
UTF-32".  Mandarin can be encoded in UTF-8 just fine; there is no prohibition
against it.  And for a source code file, any one of the three UTF-8/16/32
encodings is as good as another.
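
To make that concrete, here is a small sketch (in Python rather than C or D,
purely for brevity; the variable names are my own and nothing here is
GCC-specific).  It takes two Mandarin characters, serializes them in each of
the three encodings, and checks that all three decode back to the identical
Unicode text.

```python
# Two Mandarin characters, U+4E2D and U+6587, written as escapes so the
# example does not depend on how this file itself is encoded.
text = "\u4e2d\u6587"

# Three different byte serializations of the same two code points.
# The "-be" codecs are big-endian and do not emit a byte-order mark.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

# Show the raw bytes of each serialization in hex.
print("UTF-8 :", utf8.hex(" "))
print("UTF-16:", utf16.hex(" "))
print("UTF-32:", utf32.hex(" "))

# Decoding any of them recovers exactly the same Unicode text.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-be") == text
assert utf32.decode("utf-32-be") == text
```

The bytes differ; the text does not.  That is the sense in which the choice
of encoding is a storage detail and not the language you "write in".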

To better understand this, get the D Programming Language.  You usually work
with Unicode characters, not with UTF-8, UTF-16, or UTF-32 code units.  You
are thinking at the wrong meta-level.
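
A rough illustration of the two levels, again sketched in Python and not in
D (Python strings happen to be indexed by code point, which is an assumption
of this example and not a statement about D's string types).  The text below
has two characters, one of them outside the Basic Multilingual Plane, and
the number of code units needed depends entirely on the encoding chosen.

```python
# U+4E2D lies in the Basic Multilingual Plane; U+20000 lies outside it.
text = "\u4e2d\U00020000"

# Character level: how many code points the text contains.
code_points = len(text)

# Encoding level: how many code units each encoding form needs.
utf8_units = len(text.encode("utf-8"))            # 8-bit units
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit units
utf32_units = len(text.encode("utf-32-le")) // 4  # 32-bit units

print(code_points, utf8_units, utf16_units, utf32_units)
```

Code that reasons about the first number is working with characters; code
that reasons about the other three is working with an encoding.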

Sincerely,
--Eljay

Follow-Ups:
- Re: UTF-8, UTF-16 and UTF-32
  - From: Dallas Clarke

References:
- Re: UTF-8, UTF-16 and UTF-32
  - From: Dallas Clarke
