This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: Query on UTF-32 encodings for letters

From: Robert Dewar <dewar at adacore dot com>
To: Robert Dewar <dewar at adacore dot com>
Cc: Paul Koning <pkoning at equallogic dot com>, joseph at codesourcery dot com,gcc at gcc dot gnu dot org
Date: Mon, 17 Jan 2005 14:13:48 -0500
Subject: Re: Query on UTF-32 encodings for letters
References: <41E3E28D.6050506@adacore.com> <Pine.LNX.4.61.0501161942070.29730@digraph.polyomino.org.uk> <41EACFCA.7070506@adacore.com> <16875.56569.286000.776285@gargle.gargle.HOWL> <41EC0798.5020303@adacore.com> <16876.2932.32855.8813@gargle.gargle.HOWL> <41EC0D78.50201@adacore.com>

Robert Dewar wrote:

(Trying again; the previous message got sent prematurely.)

Paul Koning wrote:

> But that is nowhere near sufficient.  The issue is that case folding
> rules are different for different languages/locales that use the SAME
> character set.  For example, there are a whole bunch of different
> folding rules for Latin-1.


Well in practice the folding rules for Latin-1 have been part of the
standard for ten years, so they are not about to change.

It would be interesting to know an example of what you state above.
Certainly people have been using Latin-1 to write Ada in countries
all over the world, and no one has ever found the folding rules
for identifiers to be in any way inconvenient.

There was a point early in the discussion when JDI wanted upper
case E and lower case E-acute to match in identifiers (many French
folks have the illusion that upper case letters do not have accents;
this comes from typewriter days). However, this kind of matching is
very definitely language dependent. An interesting test is whether
the letters can cross in a crossword puzzle: in French crossword
puzzles E-acute and E can cross, but A and A-with-circle in Swedish
do not cross, since they are quite different letters.
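
To make the E / E-acute point concrete, here is a minimal Python sketch
of the two kinds of matching. The function names are made up for this
illustration, and stripping accents via Unicode decomposition is only
one assumed way to get the "French" behaviour; it is not the rule that
Ada or 10646 prescribes.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def case_only_equal(a: str, b: str) -> bool:
    """Match letters that differ only in case; accents stay significant."""
    return a.casefold() == b.casefold()


def accent_blind_equal(a: str, b: str) -> bool:
    """Hypothetical 'crossword' matching: ignore both case and accents."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


E_ACUTE_UPPER = "\u00c9"  # E-acute, upper case
E_ACUTE_LOWER = "\u00e9"  # e-acute, lower case
A_RING_LOWER = "\u00e5"   # a-with-circle, lower case

# Case-only matching keeps E and e-acute distinct.
print(case_only_equal(E_ACUTE_UPPER, E_ACUTE_LOWER))
print(case_only_equal("E", E_ACUTE_LOWER))

# Accent-blind matching gives the French result, but it also makes
# A match a-with-circle, which is wrong for Swedish.
print(accent_blind_equal("E", E_ACUTE_LOWER))
print(accent_blind_equal("A", A_RING_LOWER))
```

A single accent-blind rule cannot be right for both French and Swedish
readers, which is why that kind of matching has to be tied to a
language.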

The decision in Ada is that you do not want the meaning of a program
or its legality to change in a locale-dependent way. This is really
a fundamental starting point and I don't think there is anyone from
any country that would think otherwise.
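
As a rough sketch of what locale-independent identifier matching looks
like in a compiler-like setting, consider the Python fragment below.
The SymbolTable class is hypothetical, and Python's str.casefold (full
Unicode case folding, which never consults the process locale) is
assumed here as a stand-in; it is not necessarily the folding rule the
Ada standard adopts.

```python
class SymbolTable:
    """Toy identifier table keyed by a locale-independent folded name."""

    def __init__(self) -> None:
        self._entries: dict[str, object] = {}

    @staticmethod
    def fold(identifier: str) -> str:
        # str.casefold depends only on the Unicode data built into
        # Python, so the same source text folds the same way everywhere.
        return identifier.casefold()

    def declare(self, identifier: str, value: object) -> None:
        key = self.fold(identifier)
        if key in self._entries:
            raise ValueError(f"duplicate declaration of {identifier!r}")
        self._entries[key] = value

    def lookup(self, identifier: str) -> object:
        return self._entries[self.fold(identifier)]


table = SymbolTable()
table.declare("\u00c9t\u00e9", 1)        # identifier spelled with E-acute
print(table.lookup("\u00c9T\u00c9"))     # same identifier in upper case
```

Because the folding is fixed, whether two spellings name the same
entity (and hence whether a second declaration is an error) cannot
change when the program is compiled in another country.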

Note that this is a radically different
issue from folding at run-time in a manner that makes sense to an
application program.

> If 10646 defines a single set of rules, then it's part of the problem,
> not part of the solution.

Well the 10646 definition provides a framework from which an acceptable
locale-independent set of folding rules can be obtained. Note that acceptable
here means acceptable to at least the ISO P-members. Indeed when it comes
to such issues in the Ada standard, this is an area where the non-English
speaking member countries take the lead.

Mind you, my own feeling would have been to abandon case-insensitive
matching for non-Latin1 letters, but that *was* considered to be
an unacceptably Anglo-centric point of view, and the Japanese in
particular were insistent on this point.

