This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



Re: UTF-8, UTF-16 and UTF-32


On Sat, Aug 23, 2008 at 5:40 PM, Dallas Clarke <DClarke@unwired.com.au> wrote:
> I wont bother repeating myself, it not my responsibility to cure your dogma,
> it just the end of me using GCC. I am sure that many other developers will
> run into the same problem and choose the same solution.
>

I think you're failing to convince most people due to the fact that
many of your arguments definitely require repeating and further
discussion.

You're obviously dealing with portability issues - both at the
compiler level and ABI level (and there are others I guess depending
on your needs).

If I understand your two key issues correctly, they are:

1.  You want source code that has Unicode support.
2.  You want to be able to process Unicode in C++ and its runtime libraries.

They are different issues.  I think you understand that, but some of
your replies have both solutions lumped together.

I would like to respond to #2.

Your initial email was confusing and not the authoritative one you
might have thought.  You made arguments against UTF-8 and UTF-32 which
others here don't understand, and your response seems to have been
simply to restate that you want UTF-16 support.

Can you really not represent everything you want in UTF-8?  That seems
unlikely considering it's meant to represent every Unicode character.
Your comment on UTF-32 was odd to say the least.  Sure, if we ever have
a textual representation that contains a quadrillion characters then we
will have to redesign how we encode it (exaggerated).  UTF-16 also
requires a multi-word sequence (a surrogate pair) for characters that
don't fit in 16 bits, so it's nothing special except for the fact that
it is used as you said.
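
To make the multi-unit point concrete, here is a rough sketch of
encoding a single code point both ways.  It assumes a C++11 compiler
(for char32_t and std::u16string), and the function names are my own,
not from any library.  It does not validate its input (surrogate values
and anything above U+10FFFF are not rejected).

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-8: 1 to 4 bytes depending on its value.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16: 1 unit, or 2 (a surrogate pair)
// for anything at or above U+10000.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

For U+20AC (the euro sign) this gives three UTF-8 bytes and one UTF-16
unit; for U+1F600 it gives four UTF-8 bytes and two UTF-16 units.  So
both encodings are variable-length, which is all I was getting at.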

Now, as far as your problems above, you should look at encoding as a
design issue and not a compiler issue as far as C++ goes.  The
compiler implements wchar_t in a way that represents all of the
characters as required - as MSVC and GCC are not developed in tandem,
they obviously came up with different requirements at different times.
If you think of it at just the compiler level, you're setting your
design up for failure such that it won't be portable (not only to
different systems but to upgrades of existing compilers and language
specs).

What I mean by that is encoding should definitely be handled in a
layer above the system you are working on.  Even if both compilers
implemented the same wchar_t, it doesn't mean that every API you use
will use that wchar_t.  So, what you need to do is find a way to
represent your data and then map it to the system, API, etc. that
you're using.  You never know what display or renderer you'll need to
use or what system you need to interface with.
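
As a rough illustration of that kind of layer, the sketch below keeps
text as plain code points internally and only maps it to wchar_t at the
boundary.  It assumes a C++11 compiler, and to_platform_wide is a
hypothetical name of mine, not an existing API.  It also assumes that a
16-bit wchar_t means UTF-16 and a wider one means UTF-32.

```cpp
#include <cassert>
#include <string>

// Hypothetical boundary function: convert internal code points to
// whatever wchar_t happens to be on the platform being compiled for.
std::wstring to_platform_wide(const std::u32string& text) {
    std::wstring out;
    for (char32_t cp : text) {
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            // Fits in a single wchar_t unit.
            out += static_cast<wchar_t>(cp);
        } else {
            // Narrow wchar_t: emit a UTF-16 surrogate pair.
            char32_t v = cp - 0x10000;
            out += static_cast<wchar_t>(0xD800 | (v >> 10));
            out += static_cast<wchar_t>(0xDC00 | (v & 0x3FF));
        }
    }
    return out;
}
```

The rest of the program never needs to know how wide wchar_t is; only
this one function does, and a similar function would sit in front of
each file format, database, or API with its own encoding.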

I have a couple comments on your gcc modification solution.

1.  Modifying wchar_t to be 2 bytes and then making L create 2-byte
UTF-16 constants means that GCC users could no longer rely on constant
lengths like before.  And if it is as easy as you indicated, that is
also a sign that it's probably something that should only be
touched carefully.  Any code relying on the current GCC implementation
would be broken.

2.  Creating a new type long wchar_t as a solution to compatibility?
You're just asking for the same issue.
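
To show what I mean in point 1 about constant lengths, here is a small
sketch.  The names are mine, and it assumes the compiler encodes wide
literals as UTF-16 when wchar_t is 16 bits and as UTF-32 otherwise.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>

// A wide literal holding one character outside the 16-bit range (U+1F600).
const wchar_t sample[] = L"\U0001F600";

// Number of wchar_t units the compiler actually emitted for it.
std::size_t sample_units() {
    return std::wcslen(sample);
}

// What we expect under the assumption above: 1 unit for a 32-bit
// wchar_t, 2 units (a surrogate pair) for a 16-bit one.
std::size_t expected_units() {
    return (sizeof(wchar_t) * CHAR_BIT >= 32) ? 1 : 2;
}
```

With GCC's usual 4-byte wchar_t on Linux the literal is one unit long;
built with -fshort-wchar, or with a compiler whose wchar_t is 2 bytes,
the same source line yields two.  Code that assumes one unit per
character stops working the moment the width changes.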

You mentioned needing to read stored data and presumably write it back.
I saw a mention of a text file and a mention of a database.

UTF-8 seems exceptionally up to the task for encoding your data!  How
can you know for certain that all of your input will be in the same
format?  How will having 2-byte wchar_t in GCC solve all of your
problems?  GCC only controls types, not the storage or implementation
of any library or OS.

I think you have to expect to write an encoder for your data and to
provide a layer around your unique systems (data files, databases,
constants, OS).  Linux and Windows themselves certainly aren't going to
be completely compatible.  You could make implementations portable but
never fully compatible!
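
For the reading side, a minimal sketch of such a converter might look
like the following.  It assumes a C++11 compiler and UTF-8 input;
decode_utf8 is my own illustrative name, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into code points.  Returns false on malformed input.
bool decode_utf8(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte

        if (in.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        static const char32_t min_for_len[] = {0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        out += cp;
        i += extra + 1;
    }
    return true;
}
```

Note that it reports bad input instead of guessing, which matters when
you can't be sure every file or database column is in the format you
expect.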

Just my thoughts after reading through this.  I think several people
here would be interested in discussing solutions with you that make
sense at all levels.

(A quick note about your issue #1: I think it would be very confusing
for source file encoding to be based on what a user typed.  It should
be constant or configured in a more visible way.)

corey

