This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: Query on UTF-32 encodings for letters

From: Robert Dewar <dewar at adacore dot com>
To: "Joseph S. Myers" <joseph at codesourcery dot com>
Cc: gcc at gcc dot gnu dot org
Date: Sun, 16 Jan 2005 15:34:18 -0500
Subject: Re: Query on UTF-32 encodings for letters
References: <41E3E28D.6050506@adacore.com> <Pine.LNX.4.61.0501161942070.29730@digraph.polyomino.org.uk>

Joseph S. Myers wrote:

On Tue, 11 Jan 2005, Robert Dewar wrote:

Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers
and proper upper/lower case equivalence.

All this information is obtainable from the 10646 standard,
but it is non-trivial to generate the predicate Is_Letter
and the function To_Lower.

Proper case folding and caseless matching are locale-dependent.


That's not true for the Ada 2005 rules, which are locale independent
and driven only by the 10646 database.

Case conversion can also depend on context in a word as well as on locale. In Unicode there is titlecase as well as uppercase and lowercase.

Title case is allowed in Ada 2005 identifiers.

The full documentation of what the Ada 2005 AI requires can be found at

www.ada-auth.org/cgi-bin/cvsweb.cgi/AIs/AI-00285.TXT?rev=1.22

I presume there is in fact a more precise specification, with appropriate normative references, of what exactly is required and whether there is to be locale-dependence, at compile time or at runtime.

Indeed, the quoted AI is the precise specification.

Although the Unicode Character Database includes various tables for case mapping, including context and locale dependent mapping, I'm not sure whether these are normative or informative. Section 4.2 of the Unicode Standard version 4.0 refers to them as normative, while section 5.18 says that case itself is normative but the mappings are informative; however, the whole of chapter 5 is not normative.


Well, the Ada rules as stated are indeed normative and are based on the
Unicode categorization. But Ada does not follow all the Unicode
recommendations. In particular, it does not mandate Normalization
Form KC, and instead follows the C# style of only rigorously
defining the effect of programs which are already in this
normalization form. Furthermore, Ada decided not to use
ISO/IEC TR 10176, which would be the assumed approach. The
reasons for this are discussed in the AI.
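
To make "already in this normalization form" concrete, here is a minimal
Python sketch. It is not taken from the AI: it assumes Python's standard
unicodedata module as a stand-in for the Unicode database, and the helper
name is_nfkc is hypothetical.

```python
import unicodedata

def is_nfkc(identifier: str) -> bool:
    """Return True if the text is unchanged by NFKC normalization."""
    # Hypothetical helper: normalizing text that is already in NFKC is a
    # no-op, so comparing against the normalized form is a sufficient check.
    return unicodedata.normalize("NFKC", identifier) == identifier

# The second name starts with the "fi" ligature U+FB01, which NFKC rewrites.
for name in ["Counter", "\ufb01le"]:
    print(repr(name), is_nfkc(name))
```

Under the approach described above, a compiler would not have to normalize
its input; the language would simply define behaviour only for source text
that passes a check of this kind.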

Anyway, it seems not too hard to write specific Is_Letter and
Fold_To_Upper_Case following the rules in this AI.
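
As a rough illustration of what such routines look like when driven only
by character categories, here is a Python sketch. It is not the Ada
implementation and not the exact rule from the AI. It assumes that
"letter" means the general categories Lu, Ll, Lt, Lm, Lo and Nl, uses
Python's unicodedata module in place of the 10646 database, and
approximates a one-to-one upper case fold with str.upper(), leaving a
character alone when the full mapping expands to more than one code
point. The names is_letter, fold_to_upper_case and same_identifier are
hypothetical.

```python
import unicodedata

# Assumed set of general categories treated as letters (check against the AI).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def is_letter(ch: str) -> bool:
    """Locale-independent letter test based only on the general category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def fold_to_upper_case(ch: str) -> str:
    """Fold one code point to upper case without consulting any locale."""
    upper = ch.upper()
    # str.upper() applies the full mapping (e.g. U+00DF expands to "SS");
    # keep the original character when the result is not a single code point.
    return upper if len(upper) == 1 else ch

def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after folding every character to upper case."""
    return [fold_to_upper_case(c) for c in a] == [fold_to_upper_case(c) for c in b]

print(is_letter("\U00010428"))            # a Deseret letter outside the BMP
print(same_identifier("Total", "TOTAL"))
```

Because the fold here is only an approximation of a simple one-to-one
mapping, a real implementation would more likely generate its tables
directly from the character database.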

At this stage, I have pretty much concluded that I should spin my own
version of these routines to exactly match the Ada spec.

Thanks Joseph for your comments!

(this character stuff is a bottomless pit :-)

Follow-Ups:
- Re: Query on UTF-32 encodings for letters
  - From: Paul Koning

References:
- Query on UTF-32 encodings for letters
  - From: Robert Dewar
- Re: Query on UTF-32 encodings for letters
  - From: Joseph S. Myers
