This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: Query on UTF-32 encodings for letters


Paul Koning wrote:

> Then take i, which upcases to I with dot.  Turkish has i with and
> without dot, and the dot is preserved when you change case (in either
> direction).

Yes, and that's fine: both lower case i with dot and lower case i without dot fold to upper case capital I (without dot), so all three are equivalent in identifiers.

There is no upper case I with dot, so I have no idea what you mean by
saying the dot is preserved. The three characters in question are:

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;
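To make the folding concrete, here is a minimal Python sketch that reads the three records exactly as quoted above and folds characters through the simple uppercase field (the 13th semicolon-separated field of a UnicodeData.txt record). The names `simple_upper_map` and `fold` are hypothetical, and this only illustrates the one-to-one mapping idea, not how any particular compiler implements it.

```python
# The three records as quoted in this message.
RECORDS = """\
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;
"""

def simple_upper_map(text):
    """Build {code point: simple uppercase code point} from UnicodeData-style lines."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        upper = fields[12]  # simple uppercase mapping; empty means "maps to itself"
        mapping[code] = int(upper, 16) if upper else code
    return mapping

UPPER = simple_upper_map(RECORDS)

def fold(name):
    """Fold a string to upper case; characters without a record pass through unchanged."""
    return "".join(chr(UPPER.get(ord(c), ord(c))) for c in name)

# Dotted i, dotless i, and capital I all fold to the same character.
print(fold("i") == fold("\u0131") == fold("I") == "I")
```

With this table, an identifier spelled with any of the three characters folds to the same string, which is the equivalence described above.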

> Would you map eszett (in German) to ss?  Or to sz?  Or neither?  Modern
> usage does the former; 1930-ish usage the latter.

The specific decision for Ada (all documented in the AI) is to do nothing special for eszett, so the answer is neither. Quoting from the discussion in the AI:

We notice that there are cases not covered by this simple correspondence.
For example, German "SS" corresponds to two lowercase sequences.  One
is the string "ss", and the other is the es-zett character.  We feel that
such complicated cases should be untouched in this time frame, waiting for
the future standardization of appropriate ISO/IEC standards or technical
reports.
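The "untouched" rule can be sketched as follows, assuming Python's built-in `str.upper()`, which applies full Unicode case mappings and therefore turns the es-zett character into the two-character string "SS". The helpers `simple_upper` and `fold_identifier` are hypothetical names; they approximate a one-to-one-only folding by leaving a character alone whenever its uppercase form is not a single character. This is an illustration of the idea, not the wording of the AI.

```python
def simple_upper(ch):
    """Upper-case one character, but only when the result is a single character."""
    up = ch.upper()
    return up if len(up) == 1 else ch  # multi-character results are left untouched

def fold_identifier(name):
    """Fold an identifier character by character using the one-to-one rule."""
    return "".join(simple_upper(c) for c in name)

# Full case mapping expands es-zett to two letters ...
print("\u00df".upper() == "SS")
# ... so under the one-to-one rule it is left alone, and the two spellings
# remain distinct identifiers.
print(fold_identifier("stra\u00dfe") == "STRA\u00dfE")
print(fold_identifier("stra\u00dfe") != fold_identifier("strasse"))
```

Under this rule an identifier containing es-zett is equivalent neither to the "ss" spelling nor to an "sz" spelling, matching the "neither" answer given above.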




