Universal Character Names, v2

Joseph S. Myers jsm28@cam.ac.uk
Sun Dec 1 16:32:00 GMT 2002


On Sun, 1 Dec 2002, Zack Weinberg wrote:

> So, first off, a meta-issue:  I do not think any support for extended
> identifiers should appear in GNU C/C++/ObjC until this issue is
> resolved.  Once we have something out there in the wild, users will
> come to rely on its behavior, and we don't want to change it from
> under them.

Users may rely on particular files being *accepted* - they're less likely
to rely on them being rejected.  Implementing something strict (needed
anyway for -pedantic), with an option to relax later, should be safe.

> For rationale, I should say that I operate under the assumption
> that text editors cannot be trusted to do anything right.  I assume
> that we will see source files with unnormalized identifiers; with
> inconsistent encoding for the same identifier; with \u escapes used in
> one place and raw UTF8 in another place; and so on.  I do not think it
> is appropriate for us to punish users for bugs in software they may
> not control, especially since the presence of a bug may be debatable.
> For instance, I can see someone deliberately designing a text editor
> to leave existing \u escapes alone, so as not to introduce unnecessary
> deltas in a version-control system, but write out modified lines in
> UTF-8.

I'll repeat that any editor generating UCNs is C- or C++-aware, and is
manifestly broken if it doesn't produce NFC sequences, using only UCNs
permitted in identifiers when the characters appear in identifiers.
Supporting such brokenness would make no more sense than supporting a text
editor that creates random syntax errors in some way (e.g. quietly
inserting hard line breaks every 80 characters).  I've seen no evidence of
such brokenness in text editors, or any reason to suppose there will be any.

> To continue the above example, U+00C4 (LATIN CAPITAL LETTER A WITH
> DIAERESIS) is expected to be displayed using exactly the same visual
> representation as U+0041 U+0308.  A user may have no way -- not even
> an inconvenient way -- to distinguish them.  Treating them as distinct
> will only cause GCC to reject programs that are apparently entirely
> correct; or worse, to silently miscompile them.
> 
> This is why I insist on normalization.

(Again,) users will inevitably have to deal with normalized and
unnormalized sequences looking the same, in filesystems which will just
use byte-sequences.  GCC can at least give an error, if a combining
character (not allowed in identifiers) is encountered, suggesting that the
input might not be NFC-normalized (with an appropriate index entry for the
term in the manual).
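
To make the suggested diagnostic concrete, here is a minimal sketch in
Python (not GCC code).  The function name check_identifier and the wording
of the message are hypothetical; only the standard unicodedata module is
assumed.  It flags characters with a nonzero canonical combining class,
which is a simplification: a real check would consult the table of
characters permitted in identifiers.

```python
import unicodedata

def check_identifier(name):
    """Return a diagnostic for the first combining character in name, or None.

    Hypothetical helper: it only looks at the canonical combining class,
    not at any standard's list of permitted identifier characters.
    """
    for index, ch in enumerate(name):
        # A nonzero canonical combining class marks a combining character.
        if unicodedata.combining(ch) != 0:
            return ("combining character U+%04X at position %d in identifier; "
                    "input might not be NFC-normalized" % (ord(ch), index))
    return None

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A, COMBINING DIAERESIS

# The precomposed form passes; the decomposed form gets a diagnostic.
assert check_identifier(precomposed) is None
assert check_identifier(decomposed) is not None
# NFC maps the decomposed spelling onto the precomposed one.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

The point of the sketch is that the user gets an error naming the likely
cause, rather than two identifiers that look the same silently being
treated as different or as equal.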

If there are any instances where multiple UCN sequences permitted by C99
or C++98 normalize to the same NFC sequence, then normalizing will also
silently miscompile valid programs.  Such programs should only occur in
testsuites, but that still means control under -std rather than -pedantic,
and the fewer arbitrary variations there are controlled by flag_iso, the
better.  (C99 is very clear that distinct UCN sequences count as distinct,
complete with how many characters they may be counted as for the purposes
of the minimum length of identifiers that must be distinguished.)
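
Whether any such instances exist can be checked mechanically.  The
following is a rough sketch in Python, assuming the list of permitted code
points is available as a set of integers.  The function name
nfc_collisions is hypothetical, and the sample set is made up for
illustration; it is not the C99 or C++98 list.  It only compares single
characters, not longer sequences.

```python
import unicodedata
from collections import defaultdict

def nfc_collisions(code_points):
    """Group code points by NFC form; return the groups with several members."""
    groups = defaultdict(list)
    for cp in sorted(code_points):
        groups[unicodedata.normalize("NFC", chr(cp))].append(cp)
    return {form: cps for form, cps in groups.items() if len(cps) > 1}

# Hypothetical sample, NOT taken from C99 or C++98:
# U+0041 LATIN CAPITAL LETTER A
# U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
# U+212B ANGSTROM SIGN (canonically equivalent to U+00C5)
sample = {0x0041, 0x00C5, 0x212B}

collisions = nfc_collisions(sample)
for form, cps in collisions.items():
    print("U+%04X <- %s" % (ord(form), ", ".join("U+%04X" % cp for cp in cps)))
```

Running the same function over the real table from a given standard would
show whether normalizing could merge identifiers that the standard
requires to be distinct.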

> 3. ISO 10646 (Unicode) is updated more frequently than ISO 9899 (C)
> and ISO 14882 (C++).  It is reasonable to expect that future revisions
> of the latter two standards will augment the lists of acceptable code
> points as further identifier characters are added to Unicode.  As a
> convenience to our users, we should accept all the plausible
> identifier characters in the latest revision of Unicode, not just
> whatever revision was current the last time C or C++ was revised.

ISO 10646 is updated frequently.  Unicode is updated frequently.  They are
not the same standard.  If a list of code points is taken from a third
document, that should rather be ISO/IEC TR 10176, the document used in
C99, which is also updated from time to time (the current draft for the
4th edition apparently being
<http://std.dkuug.dk/JTC1/SC22/WG20/docs/n970-tr10176-2002.pdf>), and the
main table should be used (in the spirit of the choices made by C99 and
C++98), not the supplementary table requiring normalization.  This draft
points out that there are cases of different (normalized) characters that
look the same, e.g. LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA and
CYRILLIC CAPITAL LETTER A.  Users may be confused by such cases, but no
reasonable normalization can solve that problem.
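
As a quick illustration of that last point, here is a small Python check
using the standard unicodedata module (the helper name distinct_after is
hypothetical).  It applies each of the four Unicode normalization forms to
the three characters named above and counts how many distinct results
remain.

```python
import unicodedata

# The three look-alike characters named in the draft.
lookalikes = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "GREEK CAPITAL LETTER ALPHA": "\u0391",
    "CYRILLIC CAPITAL LETTER A": "\u0410",
}

def distinct_after(form):
    """Count distinct strings left after normalizing each look-alike."""
    return len({unicodedata.normalize(form, ch) for ch in lookalikes.values()})

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # None of the standard forms merges these three characters.
    assert distinct_after(form) == 3
```

So even the compatibility forms leave the three characters distinct;
telling them apart is a matter for fonts and tools, not for normalization.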

-- 
Joseph S. Myers
jsm28@cam.ac.uk


