Universal Character Names, v2

Martin v. Löwis martin@v.loewis.de
Sun Dec 1 16:25:00 GMT 2002


Zack Weinberg <zack@codesourcery.com> writes:

> For rationale, I should say that I operate under the assumption
> that text editors cannot be trusted to do anything right.  

I can agree with the general feeling, but not with the specific
expressions thereof.

> I assume that we will see source files with unnormalized identifiers

Can you give examples of such editors? This would *only* be possible if
the file encoding can /express/ unnormalized identifiers. For that,
the file encoding needs to be UTF-8 (for all other encodings, the
iconv codec would be responsible, not the editor, and I'm not aware of
an iconv codec that produces non-NFC output).

For UTF-8, it is hard to imagine that an editor would save a file in
unnormalized form, since characters entered at the keyboard are
typically normalized. So the user would need to perform specific
keystrokes to meet your expectations.

> ; with inconsistent encoding for the same identifier

Again, which editor, and how could this possibly happen?


> ; with \u escapes used in one place and raw UTF8 in another place;

Very likely, but I can't see a problem here.

> 1. Extended identifiers written using \u escapes should be treated
> identically to extended identifiers written using a binary encoding of
> Unicode (UTF8, UTF16, etc), modulo the fact that we may not support
> binary encodings yet.

That is reasonable. It also means that there is no problem writing the
same identifier in different encodings at different times (treating
UCNs just as another way of encoding characters).
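To illustrate what "just another encoding" means here, a minimal Python
sketch follows. It is not cpplib code; the names decode_ucns and
same_identifier are made up for this example. It assumes the only escapes
of interest are \uXXXX and \UXXXXXXXX, and it does not check whether the
named code point is permitted in an identifier.

```python
import re

# Hypothetical helper names; this only mimics the idea, not cpplib's lexer.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_ucns(spelling: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes by the characters they name."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _UCN.sub(repl, spelling)

def same_identifier(a: str, b: str) -> bool:
    """Compare two spellings after mapping UCNs to characters."""
    return decode_ucns(a) == decode_ucns(b)

# One spelling uses a UCN, the other the character itself.
ucn_spelling = "caf\\u00e9"
raw_spelling = "caf\u00e9"
identical = same_identifier(ucn_spelling, raw_spelling)
```

With this view, mixing the two spellings in one translation unit is
harmless, since both map to the same sequence of code points.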

> 2. Visually identical identifiers should be treated as the same
> identifier.

That is a long-running Unicode debate, and there is no way to meet the
requirement.

For example, LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A, and
GREEK CAPITAL LETTER ALPHA all might look the same if you happen to
use the right (or wrong) font. 

There are many other examples of characters that look the same, but
won't be folded into one another under NFC. In particular, many of the
compatibility characters look *identical* to their compatibility
decomposition, but NFC does not apply compatibility decompositions. For
example, ROMAN NUMERAL ONE often looks identical to LATIN CAPITAL LETTER I.
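This can be checked with Python's unicodedata module. The sketch below
uses whatever Unicode version the interpreter ships, which is an
assumption on my part and not the version discussed in this thread; the
variable names are only for illustration.

```python
import unicodedata

# LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA
lookalikes = ["\u0041", "\u0410", "\u0391"]
after_nfc = {unicodedata.normalize("NFC", c) for c in lookalikes}
distinct_after_nfc = len(after_nfc)  # NFC keeps them apart

# ROMAN NUMERAL ONE has only a compatibility decomposition.
roman_one = "\u2160"
nfc_roman = unicodedata.normalize("NFC", roman_one)    # unchanged
nfkc_roman = unicodedata.normalize("NFKC", roman_one)  # folded to "I"
```

Only the compatibility forms (NFKC, NFKD) fold ROMAN NUMERAL ONE into the
Latin letter, and no normalization form merges the three capital A's.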

> To continue the above example, U+00C4 (LATIN CAPITAL LETTER A WITH
> DIAERESIS) is expected to be displayed using exactly the same visual
> representation as U+0041 U+0308.  

In many GUI systems, they will look slightly different if put next to
each other, since the machinery drawing the combining letter often
gives slightly different results.
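For reference, this is the one case where NFC does help. A small Python
sketch (the function name same_under_nfc is hypothetical, and the result
depends on the Unicode data bundled with the interpreter):

```python
import unicodedata

def same_under_nfc(a: str, b: str) -> bool:
    """True if both spellings have the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00C4"       # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

raw_equal = precomposed == decomposed              # code point sequences differ
nfc_equal = same_under_nfc(precomposed, decomposed)  # NFC composes the pair
```

So the two spellings differ as code point sequences and only compare
equal once both have been normalized.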

> A user may have no way -- not even an inconvenient way -- to
> distinguish them.  Treating them as distinct will only cause GCC to
> reject programs that are apparently entirely correct; or worse, to
> silently miscompile them.
>
> This is why I insist on normalization.

In general, normalization does not help here - the problem remains
even if normalization is applied.

> 3. ISO 10646 (Unicode) is updated more frequently than ISO 9899 (C)
> and ISO 14882 (C++).  It is reasonable to expect that future revisions
> of the latter two standards will augment the lists of acceptable code
> points as further identifier characters are added to Unicode.  

It is possible. I question whether it is reasonable: extensions to
Unicode focus more and more on rarely-used, scholarly, and historic
scripts, as well as various (non-letter) symbols. I don't think it is
reasonable to expect that somebody wants to use these anytime soon to
denote things in a computer program.

It's not as if there is heavy demand for the feature in the first
place, and I'm certain that anybody who ever asked for non-ASCII
identifiers in GCC in the past would be happy with the set offered by
C99.

> As a convenience to our users, we should accept all the plausible
> identifier characters in the latest revision of Unicode, not just
> whatever revision was current the last time C or C++ was revised.

I disagree. Users will not find that convenient, as it costs them
portability.

> I'm willing to be flexible on this one.  The ideal situation in my
> view would be to ship the current version of UnicodeData.txt with each
> GCC release 

In Debian, people just noticed a legal problem with that: You cannot
redistribute UnicodeData.txt under the terms of the GPL, since you are
not permitted to make modifications to that file.

> Hardwiring the codepoints and the normalization map from the current
> version of Unicode into each release of GCC would also be
> acceptable.

As a practical issue: What is the difference between the characters
hardwired into cpplib with my patch, and the characters allowed in
UAX#15? I believe the differences are *really* small, if there are any
differences at all.

> > 1. It is underspecified, as UAX#15 leaves a number of alternatives for
> >    language designers:
> >    a) which Unicode version?
> >    b) which normalization form?
> 
> The most current as of any given release of GCC, and NFC.  This should
> naturally be documented.

There are more options:

# Normally the formatting codes should be filtered out before storing
# or comparing identifiers.

Should we filter them out or not?
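If the answer were yes, the filtering could look roughly like the
following Python sketch. It assumes that "formatting codes" means the
characters of general category Cf, which is my reading and not something
settled in this thread; strip_format_chars is a made-up name.

```python
import unicodedata

def strip_format_chars(identifier: str) -> str:
    """Drop characters of general category Cf (format) from a spelling."""
    return "".join(c for c in identifier if unicodedata.category(c) != "Cf")

# U+200D ZERO WIDTH JOINER is a format character.
with_joiner = "ab\u200dcd"
filtered = strip_format_chars(with_joiner)
```

Whether two such spellings should then name the same identifier is
exactly the choice that would need to be documented.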

What are the UCNs allowed in a pp-number?

> > 3. It restricts the languages, by disallowing identifiers that are
> >    allowed in the language definition.
> 
> It shouldn't.  Examples?

The following characters are allowed in C++98, but not in UAX#15:

U+0384 GREEK TONOS
U+05F3 HEBREW PUNCTUATION GERESH
U+05F4 HEBREW PUNCTUATION GERSHAYIM
U+0EAF LAO ELLIPSIS
[this is an incomplete list]

The following characters are allowed in C99, but not in UAX#15:

U+06D4 Po ARABIC FULL STOP
U+0E4F Po THAI CHARACTER FONGMAN
U+0E5A Po THAI CHARACTER ANGKHANKHU
U+309B Sk KATAKANA-HIRAGANA VOICED SOUND MARK
U+0F2A No TIBETAN DIGIT HALF ONE
U+0F2B No TIBETAN DIGIT HALF TWO
U+0F2C No TIBETAN DIGIT HALF THREE
U+0F2D No TIBETAN DIGIT HALF FOUR
U+0F2E No TIBETAN DIGIT HALF FIVE
U+0F2F No TIBETAN DIGIT HALF SIX
U+0F30 No TIBETAN DIGIT HALF SEVEN
U+0F31 No TIBETAN DIGIT HALF EIGHT
U+0F32 No TIBETAN DIGIT HALF NINE
U+00B7 Po MIDDLE DOT
U+2118 So SCRIPT CAPITAL P
U+212E So ESTIMATED SYMBOL
[This list should be complete; the second column is the category] 
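The category column can be reproduced with a few lines of Python. Note
that unicodedata reports the categories of the Unicode version bundled
with the interpreter, so individual entries may differ from the list
above, which reflects the data available when this was written. The
name category_report is only for illustration.

```python
import unicodedata

# Code points from the C99 list above.
c99_only = [
    0x06D4, 0x0E4F, 0x0E5A, 0x309B,
    *range(0x0F2A, 0x0F33),  # TIBETAN DIGIT HALF ONE .. HALF NINE
    0x00B7, 0x2118, 0x212E,
]

def category_report(code_points):
    """Map each code point to (general category, character name)."""
    return {
        cp: (unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
        for cp in code_points
    }

report = category_report(c99_only)
for cp, (category, name) in report.items():
    print(f"U+{cp:04X} {category} {name}")
```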

> > 4. It modifies the languages, by treating identifiers as equal which
> >    are not to be treated equal in the language definition.
> 
> This is deliberate; see above.  Or are you aware of examples where NFC
> merges identifiers that are not visually identical?

To my knowledge, this does not happen. However, code that relies on
this normalization won't be portable across compilers. Is that of no
concern to you at all?

Regards,
Martin


