This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Query on UTF-32 encodings for letters
Paul Koning wrote:
> Then take i, which upcases to I with dot. Turkish has i with and
> without dot, and the dot is preserved when you change case (in either
> direction).
Yes, and that's fine: both lower case i with dot and lower case i
without dot fold to upper case as capital I (without dot), so all three
are equivalent in identifiers.
There is no upper case I with dot, so I have no idea what you mean by
saying the dot is preserved. The three characters in question are:
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;
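
As a rough illustration of the folding described above, here is a
minimal Python sketch. It assumes that Python's locale-independent
str.upper() is an acceptable stand-in for the upper case mapping column
in the data lines above; the helper name upper_fold is hypothetical and
is not taken from the AI or from GCC.

```python
import unicodedata

# The three characters listed above: capital I, small i, small dotless i.
CHARS = ["\u0049", "\u0069", "\u0131"]


def upper_fold(text: str) -> str:
    """Fold text to upper case (assumption: str.upper() models the mapping)."""
    return text.upper()


for ch in CHARS:
    folded = upper_fold(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")

# Two spellings that differ only in dotted vs. dotless i compare equal
# once both have been folded to upper case.
print(upper_fold("f\u0131le") == upper_fold("file"))
```

Under this assumption, identifiers that differ only in which of the
three characters they use would be treated as the same identifier.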
> Would you map eszett (in German) to ss? Or to sz? Or neither? Modern
> usage does the former; 1930-ish usage the latter.
The specific decision for Ada (all documented in the AI) is not to do
anything special for eszett, so the answer is neither. Quoting from the
discussion in the AI:
We notice that there are cases not covered by this simple correspondence.
For example, German "SS" corresponds to two lowercase sequences. One
is the string "ss", and the other is the es-zett character. We feel that
such complicated cases should be untouched in this time frame, waiting for
the future standardization of appropriate ISO/IEC standards or technical
reports.
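
To make the "neither" answer concrete, here is a small Python sketch of
a one-to-one upper case fold that leaves such complicated cases
untouched. It is only an approximation under stated assumptions, not the
rule as worded in the AI: the helper simple_upper is hypothetical, and it
treats "the upper case form is not a single character" as the test for a
complicated case.

```python
def simple_upper(text: str) -> str:
    """Upper-case text one character at a time.

    Assumption for illustration: a character whose upper case form is
    not exactly one character (such as U+00DF, whose full upper case
    form in Python is "SS") is left unchanged.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


with_eszett = simple_upper("stra\u00dfe")  # the eszett character is kept
with_ss = simple_upper("strasse")          # ordinary letters are folded

# Under this fold the two spellings stay distinct.
print(with_eszett == with_ss)
```

With a fold of this kind the es-zett character is equivalent neither to
"ss" nor to "sz"; it is simply a distinct character in an identifier.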