9449 – UCNs not recognized in identifiers (c++/c99)

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	preprocessor
Version:	unknown

Importance:	P3 enhancement
Target Milestone:	5.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	rejects-valid

Depends on:
Blocks:	16989

Reported:	2003-01-27 14:56 UTC by Richard Earnshaw
Modified:	2014-11-06 21:09 UTC
CC List:	8 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2005-12-15 04:48:35

Description Zack Weinberg 2003-01-27 11:59:14 UTC

From: Zack Weinberg <zack@codesourcery.com>
To: Richard.Earnshaw@arm.com
Cc: gcc-bugs@gcc.gnu.org, rearnsha@arm.com,  sdouglas@arm.com, 
 gcc-gnats@gcc.gnu.org
Subject: Re: preprocessor/9449: UCNs recognized in identifiers (c++/c99)
Date: Mon, 27 Jan 2003 11:59:14 -0800

 Richard Earnshaw <rearnsha@arm.com> writes:
 
 > http://www.codesourcery.com/lists?2:mss:1481:danfdfbkjoaahbcmmeam
 
 Thanks, that's helpful.
 
 > Why have you changed the class to Change Request?  Rejects legal is
 > a far more accurate description of this.
 
 Because it's a case of "sorry, this feature is not implemented" and I
 don't have time to do it anytime soon, nor do I plan to start an
 implementation until there's agreement on the semantics.
 
 zw

Comment 1 Richard Earnshaw 2003-01-27 14:56:00 UTC

(I've marked this as "preprocessor" since that's used for lexing both C99 and C++, but this is probably more complicated than that...)

The following is legal in both c99 and C++, but is rejected by both:

int x\u0394;

Note that this isn't just an issue of parsing the input.  The correct translation to the target symbol is also required (this may be governed by machine conventions and/or object file or linker restrictions).

Release:
unknown

Environment:
Any

How-To-Repeat:
Compile the example above with -std=c99 (for C) or with the C++ compiler.

Comment 2 Zack Weinberg 2003-01-27 16:51:04 UTC

Responsible-Changed-From-To: unassigned->zack
Responsible-Changed-Why: I'll take responsibility for this, but since the situation is that this feature has not yet been implemented, and there are still open questions about how to do it, I am deprioritizing it.

Comment 3 Richard Earnshaw 2003-01-27 17:09:12 UTC

From: Richard Earnshaw <rearnsha@arm.com>
To: zack@gcc.gnu.org, gcc-bugs@gcc.gnu.org, gcc-prs@gcc.gnu.org,
        nobody@gcc.gnu.org, rearnsha@arm.com, sdouglas@arm.com,
        gcc-gnats@gcc.gnu.org
Cc: Richard.Earnshaw@arm.com
Subject: Re: preprocessor/9449: UCNs recognized in identifiers (c++/c99) 
Date: Mon, 27 Jan 2003 17:09:12 +0000

 The following link may give some useful ideas.
 
 http://www.codesourcery.com/lists?2:mss:1481:danfdfbkjoaahbcmmeam
 
 R.
 
 PS.  Why have you changed the class to Change Request?  Rejects legal is a 
 far more accurate description of this.

Comment 4 Zack Weinberg 2004-12-16 02:16:17 UTC

Subject: Re: gcc and UCN in identifiers: bug PR 9449

Al Simons <al.simons@hp.com> writes:

> Hi, Zack.
>
> I'm looking into adding UCN support for identifiers into the HP
> C/C++ compiler, and wondered if there is any new status on your
> implementation / design?  We'd like to do things the same way if at
> all possible.

I don't intend to implement this feature until the C committee, the
C++ committee, and the Unicode committee all agree on which Unicode
character sequences are legitimate in identifiers and what sort of
canonicalization is to be performed.  As long as there is no
agreement, implementation of this feature risks indeterminacy in
shared library ABIs.

Suppose that the identifier "get_length_in_Ångstroms" is part of a
shared library's public interface.  The Å might be U+212B, U+00C5, or
U+0041 U+030A.  Suppose further that the person who implemented the
shared library used a text editor that generates NFD, so the library
header reads U+0041 U+030A.  But their compiler normalizes to NFC on
input, so the name in the shared library's symbol table reads U+00C5.
Now someone comes along with a compiler that does no normalization
whatsoever and tries to use the library.  They're going to get a link
error and they're not going to know why.  Worse, if someone recompiles
the library with a compiler that chose to normalize to NFD, its ABI
silently changes.
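
The mismatch described above can be sketched in C. This is only an illustration: toy_nfc is a hypothetical stand-in that knows just the two rules needed for the A-ring example (a real NFC implementation needs the full Unicode tables), and the function names and the MAX_ID bound are assumptions, not anything from GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ID 64   /* arbitrary bound for this sketch */

/* Toy stand-in for NFC: it knows only the two rules needed for the
   A-ring example.  Real normalisation needs the full Unicode tables.
   out must have room for n code points; returns the output length. */
size_t toy_nfc(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t i = 0, m = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        if (in[i] == 0x212B) {                  /* ANGSTROM SIGN */
            out[m++] = 0x00C5;
            i += 1;
        } else if (in[i] == 0x0041 && i + 1 < n && in[i + 1] == 0x030A) {
            out[m++] = 0x00C5;                  /* A + COMBINING RING ABOVE */
            i += 2;
        } else {
            out[m++] = in[i++];
        }
    }
    return m;
}

/* Exact comparison, analogous to a linker matching symbol names
   without any knowledge of Unicode. */
int same_exact(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    return na == nb && memcmp(a, b, na * sizeof *a) == 0;
}

/* Comparison after the toy normalisation above. */
int same_after_toy_nfc(const uint32_t *a, size_t na,
                       const uint32_t *b, size_t nb)
{
    uint32_t bufa[MAX_ID], bufb[MAX_ID];
    size_t la, lb;

    assert(na <= MAX_ID && nb <= MAX_ID);
    la = toy_nfc(a, na, bufa);
    lb = toy_nfc(b, nb, bufb);
    return same_exact(bufa, la, bufb, lb);
}
```

With the three spellings of the letter as input, same_exact reports them all as different, while same_after_toy_nfc reports them as the same. Two toolchains that disagree on whether to normalise would therefore disagree on the symbol name.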

Joseph Myers insists that this situation cannot arise, because
C99/C++'s lists of valid Unicode code points in identifiers exclude
all combining forms.  But if I enforce those rules users will hate the
compiler, because their text editors will generate what looks like
perfectly fine text and then the compiler will barf on it.  And I am
not prepared to trust that every editor on the planet will adhere to
C99/C++'s rules.  And even if I were, we'd still have the problem of
the C99 and C++ lists not being identical.

> There is a link in the bug report that appears to be broken; any
> chance you can hook it back up?
>
> <http://www.codesourcery.com/lists?2:mss:1481:danfdfbkjoaahbcmmeam>

My best guess is that this is now
<http://www.codesourcery.com/archives/cxx-abi-dev/msg00676.html>.
This is mostly about how to mangle non-ASCII characters in identifiers
to get them past limited linkers, and doesn't offer any help with the
problems I described above.

zw

Comment 5 Gabriel Dos Reis 2004-12-16 02:36:59 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

"zack at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| Joseph Myers insists that this situation cannot arise, because
| C99/C++'s lists of valid Unicode code points in identifiers exclude
| all combining forms.  But if I enforce those rules users will hate the
| compiler, because their text editors will generate what looks like
| perfectly fine text and then the compiler will barf on it.  And I am

I don't see how that would be different from current situation with

   bool est_il_ingénieur(const Employé&);

The compiler will barf and eventually users will learn feeding the
compiler with proper character sets.

| not prepared to trust that every editor on the planet will adhere to
| C99/C++'s rules.

Maybe, but I think that is irrelevant and beside the point. 

-- Gaby

Comment 6 jsm-csl@polyomino.org.uk 2004-12-16 02:54:40 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Thu, 16 Dec 2004, zack at codesourcery dot com wrote:

> Joseph Myers insists that this situation cannot arise, because
> C99/C++'s lists of valid Unicode code points in identifiers exclude
> all combining forms.  But if I enforce those rules users will hate the

(That is, that they exclude combining forms for languages where the 
precomposed forms are made available, so reducing the uniqueness issues.  
Given that, for example, the definition of NFC has itself since been found 
to be defective <http://www.unicode.org/review/pr-29.html>, albeit for 
examples that cannot occur in real languages, this is not a theorem about 
what might be done with general combinations of the characters listed as 
valid.)

And also that:

* The combining rules are not part of what C99 or C++ normatively 
reference.

* Characters looking identical can occur without the combining characters.  
For this reason - distinguishing U+0041 LATIN CAPITAL LETTER A, U+0391 
GREEK CAPITAL LETTER ALPHA, U+0410 CYRILLIC CAPITAL LETTER A, for example 
- I think compiler diagnostics (and probably linker diagnostics too) 
should either default to showing \u or \U sequences rather than raw 
identifiers, or at least have an option so to do.
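
A minimal sketch of the diagnostic idea above, assuming the identifier is already available as an array of code points. The function name identifier_to_ucns and its interface are hypothetical and are not a GCC interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes the identifier cps[0..n-1] to out as plain ASCII, spelling
   every code point above U+007F as a UCN.  Returns 0 on success and
   -1 if out is too small.  Hypothetical helper, not a GCC interface. */
int identifier_to_ucns(const uint32_t *cps, size_t n, char *out, size_t outsz)
{
    size_t used = 0, i;

    assert(cps != NULL && out != NULL);
    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        unsigned long cp = cps[i];
        int w;

        if (cp < 0x80)
            w = snprintf(out + used, outsz - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            w = snprintf(out + used, outsz - used, "\\u%04lX", cp);
        else
            w = snprintf(out + used, outsz - used, "\\U%08lX", cp);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;      /* output would be truncated */
        used += (size_t)w;
    }
    return 0;
}
```

Given U+0041, U+0391 and U+0410 in turn, this produces A, \u0391 and \u0410, so the three look-alike capitals can be told apart in a message.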

(Previous threads on gcc-patches and gcc, Oct-Nov 2002.)

> compiler, because their text editors will generate what looks like
> perfectly fine text and then the compiler will barf on it.  And I am
> not prepared to trust that every editor on the planet will adhere to
> C99/C++'s rules.  And even if I were, we'd still have the problem of
> the C99 and C++ lists not being identical.

I do not expect such user complaints simply because I don't expect users 
to be widely trying to use extended characters (with or without UCNs) in 
identifiers within the next several years.  (Extended characters in 
strings and comments are another matter, but don't cause such problems.)  
I'd say implement the rules if someone wishes to do so - complete with the 
previous and following oddities - then try to get things cleaned up for 
the next major revisions of C and C++.

Oddities:

1. Lexing UCNs in identifiers can require up to nine characters 
backtracking:

a\U000000Cz

is three preprocessing tokens {a}{\}{U000000Cz}.

2. (A separate general UCN issue, nothing to do with their use in 
identifiers so in no way required for implementing them in identifiers.)

C++, but not C, converts all extended characters in the source file to 
UCNs in phase 1, so stringising "$" generates different results in C and 
C++ <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01523.html>.  (Doing 
this efficiently does mean only making this UTF-8 -> UCN conversion if the 
file contains extended characters, ideally only if it contains them 
outside comments.)
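
The backtracking in oddity 1 can be sketched as follows. This is an illustration only: lex_identifier and ucn_length are hypothetical names, the sketch works on a NUL-terminated ASCII string, and it does not check the per-standard lists of permitted characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the number of characters a UCN occupies at s (6 for \uXXXX,
   10 for \UXXXXXXXX), or 0 if s does not start a well-formed UCN. */
static size_t ucn_length(const char *s)
{
    size_t digits, i;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        digits = 4;
    else if (s[1] == 'U')
        digits = 8;
    else
        return 0;
    for (i = 0; i < digits; i++)
        if (!isxdigit((unsigned char)s[2 + i]))
            return 0;   /* the caller backtracks to the backslash */
    return 2 + digits;
}

/* Length of the identifier starting at s, or 0 if none starts there.
   Hypothetical helper: the permitted-character lists are not checked. */
size_t lex_identifier(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (;;) {
        unsigned char c = (unsigned char)s[n];
        size_t u;

        if (isalpha(c) || c == '_' || (n > 0 && isdigit(c)))
            n++;
        else if ((u = ucn_length(s + n)) != 0)
            n += u;
        else
            return n;
    }
}
```

For the input a\U000000Cz the scan reads the backslash, the U and seven hex digits before it meets z, then gives all nine characters back, so the identifier is just a. The remaining text lexes as \ followed by the identifier U000000Cz.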

Comment 7 Zack Weinberg 2004-12-16 03:03:49 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)


Because of the ABI implications, I consider it completely unacceptable
to implement this feature according to the letter of C99 or C++98. Ever.  
Once the "oddities" are resolved at the standards level, I will
consider supporting the feature as resolved, but *only* as resolved.

zw

Comment 8 jsm-csl@polyomino.org.uk 2004-12-16 12:33:18 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Thu, 16 Dec 2004, zack at codesourcery dot com wrote:

> Because of the ABI implications, I consider it completely unacceptable

Which ABI implications?

(a) It isn't explicitly stated that different UCNs designating the same 
character are equivalent to each other (and to that character) in 
identifiers, but I don't think there's any real doubt that they are meant 
to be equivalent.

(b) There is no normalisation, but I'm confident that the answer from WG14 
if this is queried would be that the standard is correct and by design it 
normatively references ISO 10646 (not Unicode) which doesn't include the 
normalisation definitions of UAX 15 and implementation of the standard is 
not meant to involve large external tables.  If there are cases of 
ambiguity a -Wnfc option (default on) to warn for identifiers not in NFC 
(or indeed -Wnfkc, default on, for identifiers not in NFKC) would draw 
users' attention to doubtful identifiers.  (TR 10176 expressly notes the 
problems of ambiguity of appearance of entirely different characters even 
without combining characters, says that language standards need not 
provide for normalisation if they allow combining characters, and excludes 
most combining characters where precombined characters are available for 
the specific purpose of avoiding alternate representations of 
identifiers.)

(c) Though we could do what we want with extended characters (as opposed 
to UCNs) in source files in phase 1, it seems safest to err on the side of 
rejecting all extended characters that wouldn't be accepted as UCNs, 
rather than e.g. applying NFC, to avoid giving identifiers with such 
characters a meaning which might then need to be preserved in future.

(d) There are genuine ABI issues with how extended characters are 
represented in object files, but I think those need to be resolved by 
selecting between UTF-8 and mangling (default UTF-8) based on target 
configurations rather than on the capabilities of the assembler and linker 
in use, and by getting an explicit statement about encoding put in the ELF 
specification.

Comment 9 jsm-csl@polyomino.org.uk 2004-12-16 14:23:19 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On examination of the different lists in C and C++, I'd add the use of 
UCNs accepted in one language only to the cases that should receive a 
default warning (identifiers not in NFKC receiving such a warning as 
well).  Most extreme, but assuring any case which might have a 
compatibility problem is warned for, would be also to warn for any 
character with a canonical or compatibility decomposition and for any 
character with nonzero combining class.

This area is just one of many where C and C++ give programmers more than 
enough rope to hang themselves.  In general we give due warning in such 
cases then let the programmers go ahead if they really want to.

Comment 10 jsm-csl@polyomino.org.uk 2004-12-16 23:04:54 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

The following example illustrates the problems with lack of normalisation.  
(I still expect WG14 and WG21 to consider the lack of normalisation to be 
both the current meaning of the standards and their correct meaning in 
context, though future revisions might change the exact lists of 
characters, but this is an appropriate example to present to them and 
shows why diagnostics would be needed for various cases.)

\u05e9\u05bc\u05c1
\u05e9\u05c1\u05bc
are valid identifiers in C99 but not C++ while
\ufb2c
is a valid identifier in C++ but not in C99.

In Unicode, the three are canonically equivalent, the first being both NFC 
and NFD.

05BC HEBREW POINT DAGESH OR MAPIQ (combining class 21)
05C1 HEBREW POINT SHIN DOT (combining class 24)
05E9 HEBREW LETTER SHIN (combining class 0)
FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT (combining class 0)

(U+FB2C is excluded from the compositions allowed in NFC, hence the 
decomposed form being NFC.)
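
As a rough sketch of why the first two spellings are canonically equivalent, the canonical reordering step can be written out using only the combining classes listed above. The function names are hypothetical, every unlisted code point is treated as class 0 (which is not true in general), and the decomposition of U+FB2C is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combining classes for the code points listed above.  Every other
   code point is treated as class 0, which is NOT true in general. */
static int combining_class(uint32_t cp)
{
    switch (cp) {
    case 0x05BC: return 21;  /* HEBREW POINT DAGESH OR MAPIQ */
    case 0x05C1: return 24;  /* HEBREW POINT SHIN DOT */
    default:     return 0;   /* includes 0x05E9 and 0xFB2C */
    }
}

/* Canonical reordering: repeatedly swap adjacent combining marks that
   are out of combining-class order.  Class-0 characters never move. */
void canonical_reorder(uint32_t *cps, size_t n)
{
    int swapped;
    size_t i;

    assert(cps != NULL || n == 0);
    do {
        swapped = 0;
        for (i = 1; i < n; i++) {
            int prev = combining_class(cps[i - 1]);
            int cur = combining_class(cps[i]);

            if (cur != 0 && prev > cur) {
                uint32_t tmp = cps[i - 1];
                cps[i - 1] = cps[i];
                cps[i] = tmp;
                swapped = 1;
            }
        }
    } while (swapped);
}
```

Applied to 05E9 05C1 05BC this yields 05E9 05BC 05C1, the first spelling, which is already in order and is left unchanged.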

So with current C and C++ standards users cannot portably link some 
pointed Hebrew identifiers between the two languages; it would be 
advisable for them to avoid such identifiers.  Warning for any use of the 
characters permitted by C++ but not C seems appropriate in the expectation 
that such characters will cease to be permitted in future, regardless of 
any other changes there may be.  Making the C++ extern "C" \ufb2c into 
something else would seem to me to be the road to madness, though we could 
see how other implementations of the C++ ABI interpret it as regards 
identifiers with UCNs.

Comment 11 Zack Weinberg 2005-01-07 07:10:16 UTC

Joseph - I never properly answered your question in comment #7, although
arguably the answer is already in comment #4.

I should mention I take as a basic premise that without exception, a sequence of
UCNs and a sequence of extended-source-character-set characters (which both
encode the same sequence of ISO10646 code points) should be treated identically.
Therefore, I'm going to talk exclusively about code points below.

The scenario that causes ABI breakage is as follows:

1) A shared library author gives an exported interface function a name
containing, for instance, U+212B ANGSTROM SIGN.

2) This is compiled with a compiler that, hewing to the letter of the standard, 
does not perform any normalization.  The shared library's symbol table therefore
also contains U+212B.  That code point is now part of the library ABI.

3) A program that uses this library is compiled with the same compiler; it
expects a symbol containing U+212B.

4) Later, someone recompiles the library with a compiler that applies NFC to all
identifiers.  The library now exports a symbol containing U+00C5 LATIN CAPITAL
LETTER A WITH RING ABOVE.  The program compiled in step 3 breaks.

An obvious rebuttal to this is that the compiler used in step 4 is broken.  As
you say, the C standard references ISO10646 not Unicode and the concept of
normalization does not exist in ISO10646, and this could be taken to imply that
no normalization shall occur.  However, there is no unambiguous statement to
that effect in the standard, and there is strong quality-of-implementation
pressure in the opposite direction.  Put aside the standard for a moment: are
users going to like a compiler that insists that "Å" (U+00C5) and "Å" (U+212B)
are not the same character?  [It happens that on my screen those are ever so
slightly different, but that's just luck - and X11 will only let me type U+00C5;
I resorted to hex-editing to get the other.]

Furthermore, I can easily imagine someone writing a Unicode-aware text editor
and thinking it's a good idea to convert every file to NFC when saved.  Making
some unrelated change to the file defining the symbol with U+212B in it, with
this editor, would trigger the exact same ABI break that the hypothetical
normalizing compiler would.  This possibility means that a WG14/21
no-normalization mandate would NOT prevent silent ABI breakage.  And the
existence of this possibility increases the QoI pressure for a compiler to do
normalization, as a defensive measure against such external changes.  You could
argue that this is just another way for C programmers to shoot themselves in the
foot, but I don't think the myriad ways that already exist are a reason to add more.

For these reasons I see no safe way to implement extended identifiers except to
persuade both WG14 and WG21 to mandate use of UAX#15 annex 7, instead of the
existing lists of allowed characters.  I'm willing to consider other
normalization schemas and sets of allowed characters (as long as C and C++ are
consistent with each other) but not plans which don't include normalization.  To
address the concern about requiring huge tables, perhaps the standards could say
that it is implementation-defined whether extended characters are allowed at all.

Comment 12 jsm-csl@polyomino.org.uk 2005-01-07 10:27:52 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Fri, 7 Jan 2005, zack at gcc dot gnu dot org wrote:

> An obvious rebuttal to this is that the compiler used in step 4 is broken.  As
> you say, the C standard references ISO10646 not Unicode and the concept of
> normalization does not exist in ISO10646, and this could be taken to imply that
> no normalization shall occur.  However, there is no unambiguous statement to
> that effect in the standard, and there is strong quality-of-implementation

I think the relevant text is that treating identifiers as sequences of 
characters, with UCNs denoting single characters.

I've had no on-list response yet to the query about this I sent to the 
WG14 reflector on Tuesday (reflector message 10698), with the HEBREW 
LETTER SHIN WITH DAGESH AND SHIN DOT examples.

> pressure in the opposite direction.  Put aside the standard for a moment: are
> users going to like a compiler that insists that "Å" (U+00C5) and "Å" (U+212B)
> are not the same character?  [It happens that on my screen those are ever so
> slightly different, but that's just luck - and X11 will only let me type U+00C5;
> I resorted to hex-editing to get the other.]

The question of appearance is the same as that for U+0041 LATIN CAPITAL 
LETTER A, U+0391 GREEK CAPITAL LETTER ALPHA, U+0410 CYRILLIC CAPITAL 
LETTER A.  Will users like such a compiler less than one which doesn't 
allow them to use their native language in identifiers at all?

> normalization, as a defensive measure against such external changes.  
> You could argue that this is just another way for C programmers to shoot 
> themselves in the foot, but I don't think the myriad ways that already 
> exist are a reason to add more.

(It's WG14 and WG21 that added the new way, not us.  And it may be that if 
they are to become convinced there is any mistake then they must see real 
world problems arising with real implementations of the existing 
standards, rather than hypothetical problems.  Mistakes were made in C99 
of adding features in general without adequate implementation experience; 
changing them without experience showing what is a genuine problem could 
be seen as another such mistake to avoid.)

I could believe there could be a case for -fextended-identifiers required 
to enable UCNs in identifiers until there is more experience, with 
documentation along the lines of that formerly associated with -pedantic 
"This option is not intended to be useful; ...".

Comment 13 Gabriel Dos Reis 2005-01-07 14:27:37 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

"joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| I've had no on-list response yet to the query about this I sent to the 
| WG14 reflector on Tuesday (reflector message 10698), with the HEBREW 
| LETTER SHIN WITH DAGESH AND SHIN DOT examples.

Since this issue contains a compatibility fragment and affects both C
and C++, it occurs to me that you should resend a copy of your message
to the C/C++ compatibility reflector (reaching both WG14 and WG21).  I
highly encourage you to do that.  The address is c++std-compat at accu
dot org.  It would be wrong to let the issue be debated by WG14 only
without WG21 knowing. 

-- Gaby

Comment 14 jsm-csl@polyomino.org.uk 2005-01-07 15:01:56 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Fri, 7 Jan 2005, gdr at integrable-solutions dot net wrote:

> 
> ------- Additional Comments From gdr at integrable-solutions dot net  2005-01-07 14:27 -------
> Subject: Re:  UCNs not recognized in identifiers (c++/c99)
> 
> "joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:
> 
> | I've had no on-list response yet to the query about this I sent to the 
> | WG14 reflector on Tuesday (reflector message 10698), with the HEBREW 
> | LETTER SHIN WITH DAGESH AND SHIN DOT examples.
> 
> Since this issue contains a compatibility fragment and affects both C
> and C++, it occurs to me that you should resend a copy of your message
> to the C/C++ compatibility reflector (reaching both WG14 and WG21).  I
> highly encourage you to do that.  The address is c++std-compat at accu
> dot org.  It would be wrong to let the issue be debated by WG14 only
> without WG21 knowing. 

I've now sent it to c++std-compat (having checked that the C++ list of 
characters also includes combining characters in more than one combining 
class so the same issues can arise there at least in principle, whether or 
not they can arise with realistic natural language identifiers).

Comment 15 Gabriel Dos Reis 2005-01-07 15:39:41 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

"joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| I've now sent it to c++std-compat (having checked that the C++ list of 
| characters also includes combining characters in more than one combining 
| class so the same issues can arise there at least in principle, whether or 
| not they can arise with realistic natural language identifiers).

Thanks a lot!

-- Gaby

Comment 16 Geoff Keating 2005-01-08 02:20:29 UTC

So, just to be clear on this, the translation unit:

const char * \u00c5 = "a-ring";
float \u212b = 1e-10;

1. Is a valid translation unit in C99?
2. Invokes undefined behaviour?
3. Requires a diagnostic?

Logically it can only be one of the three.  I think the standard is pretty clear that it's (1); 6.4.2.1 
paragraph 6, "Any identifiers that differ in a significant character are different identifiers."  The standard 
therefore prohibits a compiler converting unicode sequences specified with \u to NFC (or any other 
normal form).

Comment 17 jsm-csl@polyomino.org.uk 2005-01-08 04:11:40 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

Doug Gwyn has now said

  It was certainly the original intent of C99 that identifiers would match
  only if encoded identically.  It would probably be wise for any importing
  process to apply "canonicalization" to source code before it reaches the
  compiler.

and Henry Spencer has said

  The approach I ended up using in a non-C project was to say that (a) all
  occurrences of an identifier must be encoded identically, and (b) it is
  forbidden for two different identifiers to have the same normalized form
  (for a suitable definition of normalization).

Comment 18 Gabriel Dos Reis 2005-01-08 04:45:19 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

"joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| Subject: Re:  UCNs not recognized in identifiers
|  (c++/c99)
| 
| Doug Gwyn has now said
| 
|   It was certainly the original intent of C99 that identifiers would match
|   only if encoded identically.  It would probably be wise for any importing
|   process to apply "canonicalization" to source code before it reaches the
|   compiler.
| 
| and Henry Spencer has said
| 
|   The approach I ended up using in a non-C project was to say that (a) all
|   occurrences of an identifier must be encoded identically, and (b) it is
|   forbidden for two different identifiers to have the same normalized form
|   (for a suitable definition of normalization).

Joseph --

  You said you resent your message to c++std-compat.  I don't believe
it ever appeared on that list.  Please, could you double-check?

-- Gaby

Comment 19 jsm-csl@polyomino.org.uk 2005-01-08 05:32:43 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Sat, 8 Jan 2005, gdr at integrable-solutions dot net wrote:

> Joseph --
> 
>   You said you resent your message to c++std-compat.  I don't believe
> it ever appeared on that list.  Please, could you double-check?

I sent it to c++std-compat (and have had no bounce).  If it hasn't 
appeared within the next week then I'll investigate further.

Comment 20 Gabriel Dos Reis 2005-01-09 03:20:45 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

"joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| ------- Additional Comments From joseph at codesourcery dot com  2005-01-08 05:32 -------
| Subject: Re:  UCNs not recognized in identifiers
|  (c++/c99)
| 
| On Sat, 8 Jan 2005, gdr at integrable-solutions dot net wrote:
| 
| > Joseph --
| > 
| >   You said you resent your message to c++std-compat.  I don't believe
| > it ever appeared on that list.  Please, could you double-check?
| 
| I sent it to c++std-compat (and have had no bounce).  If it hasn't 
| appeared within the next week then I'll investigate further.

Your message is now available on c++std-compat.  Tom Plum kindly
forwarded Doug Gwyn's reply.

Thanks!

-- Gaby

Comment 21 Joseph S. Myers 2005-02-21 14:15:07 UTC

The following checklist for implementation of extended identifiers has
been discussed with and prioritised by Zack.  No doubt Neil will point
out if there are any missing technical points.

External specifications
=======================

Reasonable efforts should be made to get specifications of handling of
extended identifiers (that UCNs and other non-ASCII characters in
identifiers are encoded in UTF-8, at least on platforms using ASCII in
the symbol names in the first place) into the following
specifications.  Actually succeeding in doing so is not a blocker for
getting an implementation into GCC.

* ELF:
<http://www.thescogroup.com/developers/gabi/latest/ch4.symtab.html>,
where it says "External C symbols have the same names in C and object
files' symbol tables.".  I have attempted to get such wording in, the
last version proposed being:

  Unless the operating system ABI specifies otherwise, it is
  recommended that characters in external C symbols, including
  characters outside the basic source character set whether or not
  designated in source files by universal character names, are encoded
  in UTF-8 in object files' symbol tables.

and discussions being with ia64-abi@unix-os.sc.intel.com.

* C++ ABI: <http://www.codesourcery.com/cxx-abi/abi.html>.  The
appropriate form would be to add a statement that once the ABI has
constructed a C symbol name which may contain UCNs, such name should
be encoded according to the underlying C ABI, following
<http://www.codesourcery.com/cxx-abi/cxx-closed.html#F8>.
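
To make the proposed UTF-8 symbol encoding concrete, here is a small sketch of encoding one code point. The name utf8_encode is hypothetical; the encoding itself is standard UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes cp as UTF-8 into out and returns the number of bytes written
   (1 to 4), or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this scheme the identifier x\u0394 from the original report would appear in a symbol table as the bytes 78 CE 94, while U+00C5 and U+212B encode as C3 85 and E2 84 AB respectively, which is why unnormalised spellings give distinct symbols.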

The following specification already includes all the required text,
and GCC should implement it before a release is made supporting
extended identifiers:

* DWARF3: the DW_AT_use_UTF8 attribute should be set on the
compilation unit entry for each compilation unit with any UTF-8
identifiers (including ones such as structure element names which
appear in debug information but not otherwise in external
identifiers).  It may in fact be harmless to set it unconditionally.

GCC implementation issues
=========================

The following specific issues should be dealt with in the GCC
implementation.  Everything implemented needs appropriate tests in the
testsuite to cover it, for both C and C++.

(a) Probably implemented already; if not, should be done before
feature is turned on by default in mainline:

* The precise sets of characters permitted in identifiers in each
standard (C99 and C++03) should be followed.

* A UCN is equivalent to the character it denotes.  This should be
implemented initially for the case of $, but if we start accepting
other extended characters then it should be implemented for them as
well.

* The \U and \u UCNs for the same character, and UCNs differing in
upper or lower case for hex digits, are equivalent.

* The greedy algorithm applies for lexing UCNs: for example,
a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
shouldn't get a diagnostic on lexing, presuming macros are defined
such that the eventual token sequence is valid).

* The spelling of UCNs is preserved for the # and ## operators.

* UCNs must not be accepted in identifiers or preprocessing numbers in
strict C90 mode: what in C99 would be an identifier with a UCN in C90
is multiple preprocessing tokens and if the identifier fragments are
defined appropriately as macros this could occur in a valid C90
program.

* I think the only reasonable interpretation of the lexing rules in
the context of forbidden characters is that first identifiers are
lexed (allowing any UCNs) then bad characters yield an error (rather
than stopping the identifier before the bad character and treating it
as not a UCN).

* These rules apply to identifiers as preprocessing tokens at any
time, including before concatenation.  So it is not the case in C99
that splitting an identifier anywhere yields two valid preprocessing
tokens: the second half could begin with a UCN for a digit and not be
a valid identifier.  (Invalid identifiers in C99 don't require
diagnostics, but I don't think we want to use this laxity.)
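
The equivalence of \u and \U spellings, and of upper and lower case hex digits, from list (a) can be sketched by decoding each spelling to its code point. The names ucn_value and ucn_equivalent are hypothetical and this is not how cpplib is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes the UCN at s into *cp.  Returns the number of characters
   consumed (6 or 10), or 0 if s does not start a well-formed UCN. */
size_t ucn_value(const char *s, uint32_t *cp)
{
    size_t digits, i;
    uint32_t v = 0;

    assert(s != NULL && cp != NULL);
    if (s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
        return 0;
    digits = (s[1] == 'u') ? 4 : 8;
    for (i = 0; i < digits; i++) {
        int h = hex_value(s[2 + i]);

        if (h < 0)
            return 0;
        v = (v << 4) | (uint32_t)h;
    }
    *cp = v;
    return 2 + digits;
}

/* Two UCN spellings are equivalent when both are well formed and
   decode to the same code point. */
int ucn_equivalent(const char *a, const char *b)
{
    uint32_t ca, cb;

    return ucn_value(a, &ca) != 0 && ucn_value(b, &cb) != 0 && ca == cb;
}
```

An implementation would compare identifiers by decoded value like this while still keeping the original spelling around, since the spelling has to be preserved for the # and ## operators.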

(b) Not done and needs to happen before the feature is turned on by
default in mainline:

* The GCC testsuite should include a test that the same UCN links
between C and an extern "C" C++ identifier.

* There should be a warning by default for all identifiers (as
preprocessing tokens at any stage, e.g. including both before and
after concatenation) not in NFKC, which may be disabled by -Wno-nfkc.

* Preprocessing numbers can contain UCNs (and extended characters such
as $ considered equivalent to them).

(c) Should happen before a release is made containing this feature:

* All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
should be audited to determine what sort of identifier is appropriate
in each case.  All places where an identifier may appear in a
diagnostic must handle extended identifiers appropriately; if the
locale cannot handle all characters in the identifier, UCNs need to be
used in diagnostic output.  The %E diagnostic format could be made to
do this, but there are many places using %s / %qs for diagnostics
which need fixing.

* Testcases in the GCC testsuite should include all contexts of
identifiers such as macro names, external linkage, internal linkage
and no linkage.  There should be tests for debug information
generation for such cases.  It would be desirable, though not required
if the necessary support isn't already in GDB, to add corresponding
tests to the GDB testsuite and make sure extended identifiers can be
used with GDB, with both DWARF3 and stabs.

* C99 does not permit UCNs for digits at the start of identifiers, but
does permit them elsewhere in identifiers, while C++ does not have
such a restriction.  The restriction in C99 and its absence in C++
should be tested.

* If platforms with limited assemblers or linkers or debug formats
come up, it would be desirable to be able to use names with internal
or no linkage containing extended characters on those platforms, with
appropriate mangling, even if defining an ABI with mangling for
external names is felt inappropriate.

* The C++ requirement that extended source characters (including '$')
are translated to UCNs in translation phase 1 needs implementing.

Comment 22 Geoff Keating 2005-02-21 19:32:23 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

On 21/02/2005, at 6:15 AM, jsm28 at gcc dot gnu dot org wrote:

>
> ------- Additional Comments From jsm28 at gcc dot gnu dot org  
> 2005-02-21 14:15 -------
> The following checklist for implementation of extended identifiers has
> been discussed with and prioritised by Zack.  No doubt Neil will point
> out if there are any missing technical points.

Although I agree that these are all (except the below) nice things to 
have, I don't think I agree that they are all preconditions to having 
any part of an implementation.  For instance, an implementation that 
said sorry() when using # on an identifier from a UCN would still be 
more useful than the complete lack of implementation we have now.

> * These rules apply to identifiers as preprocessing tokens at any
> time, including before concatenation.  So it is not the case in C99
> that splitting an identifier anywhere yields two valid preprocessing
> tokens: the second half could begin with a UCN for a digit and not be
> a valid identifier.  (Invalid identifiers in C99 don't require
> diagnostics, but I don't think we want to use this laxity.)

The second half would be a pp-number, instead.  It is always true that 
splitting an identifier between characters yields two valid 
preprocessing tokens.

> * All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
> should be audited to determine what sort of identifier is appropriate
> in each case.

I don't understand this sentence.  What different sorts of identifiers 
are there, and how could they be appropriate or not appropriate?

Comment 23 Geoff Keating 2005-02-21 19:32:24 UTC

Created attachment 8243 [details]
smime.p7s

Comment 24 jsm-csl@polyomino.org.uk 2005-02-21 19:47:24 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Mon, 21 Feb 2005, geoffk at geoffk dot org wrote:

> > * These rules apply to identifiers as preprocessing tokens at any
> > time, including before concatenation.  So it is not the case in C99
> > that splitting an identifier anywhere yields two valid preprocessing
> > tokens: the second half could begin with a UCN for a digit and not be
> > a valid identifier.  (Invalid identifiers in C99 don't require
> > diagnostics, but I don't think we want to use this laxity.)
> 
> The second half would be a pp-number, instead.  It is always true that 
> splitting an identifier between characters yields two valid 
> preprocessing tokens.

It would not be a pp-number, as a UCN for a digit is still an 
identifier-nondigit rather than a digit in terms of the syntax and 
pp-numbers can't start with identifier-nondigits.
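
A minimal sketch of this distinction in C, for illustration only. It is not cpplib's lexer: the names parse_ucn, ucn_is_digit and classify_tail are hypothetical, the digit table holds only the two ranges mentioned in this thread (0660-0669 and 0e50-0e59) rather than the full C99 Annex D list, and any complete UCN is treated as an identifier-nondigit without checking that it is permitted in identifiers at all.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of hex digit C, which the caller has checked with isxdigit.  */
static unsigned long
hex_value (unsigned char c)
{
  return (unsigned long) (isdigit (c) ? c - '0' : tolower (c) - 'a' + 10);
}

/* Hypothetical helper: parse a complete UCN (\uXXXX or \UXXXXXXXX) at S.
   Returns the number of characters consumed and stores the code point
   in *CP, or returns 0 if S does not start with a complete UCN.  */
static size_t
parse_ucn (const char *s, unsigned long *cp)
{
  assert (s != NULL && cp != NULL);
  if (s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
    return 0;
  size_t ndigits = (s[1] == 'u') ? 4 : 8;
  unsigned long value = 0;
  for (size_t i = 0; i < ndigits; i++)
    {
      unsigned char c = (unsigned char) s[2 + i];
      if (!isxdigit (c))        /* also stops at the terminating NUL */
        return 0;
      value = value * 16 + hex_value (c);
    }
  *cp = value;
  return 2 + ndigits;
}

/* Illustrative subset of the C99 Annex D "Digits" ranges.  */
static int
ucn_is_digit (unsigned long cp)
{
  return (cp >= 0x0660 && cp <= 0x0669) || (cp >= 0x0E50 && cp <= 0x0E59);
}

enum tail_kind
{
  TAIL_IDENTIFIER,      /* starts a valid identifier */
  TAIL_BAD_IDENTIFIER,  /* lexes as an identifier, but starts with a digit UCN */
  TAIL_PP_NUMBER,       /* starts a pp-number */
  TAIL_OTHER            /* neither */
};

/* Classify how the text S would begin a preprocessing token in C99.  */
enum tail_kind
classify_tail (const char *s)
{
  unsigned long cp;
  unsigned char c0 = (unsigned char) s[0];

  /* A pp-number starts with a digit, or with '.' followed by a digit.  */
  if (isdigit (c0) || (c0 == '.' && isdigit ((unsigned char) s[1])))
    return TAIL_PP_NUMBER;
  if (isalpha (c0) || c0 == '_')
    return TAIL_IDENTIFIER;
  /* A UCN is an identifier-nondigit in the grammar even when it
     designates a digit, so the token is still an identifier.  */
  if (parse_ucn (s, &cp) != 0)
    return ucn_is_digit (cp) ? TAIL_BAD_IDENTIFIER : TAIL_IDENTIFIER;
  return TAIL_OTHER;
}
```

Under these assumptions, classify_tail applied to the tail \u0660b reports TAIL_BAD_IDENTIFIER rather than TAIL_PP_NUMBER, which is the case being argued about here.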

> > * All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
> > should be audited to determine what sort of identifier is appropriate
> > in each case.
> 
> I don't understand this sentence.  What different sorts of identifiers 
> are there, and how could they be appropriate or not appropriate?

Identifiers found in input, with input spelling.  (Input includes -D and 
-U options on the command line - in principle the command line should be 
interpreted in the user's locale by default just like source files.)

UTF-8 (or, I suppose, UTF-EBCDIC) internally encoded identifiers.

Identifiers in mangled form in any case where they are mangled for output.

Identifiers in diagnostics (possibly including cases where bits of a 
diagnostic get built up with sprintf), which need converting to the user's 
locale for display or to be displayed using UCNs.

I don't know if collect2 might also need to know something about extended 
identifiers.

The aim is that every datastructure with an identifier should have the 
encoding (input, internal, output, diagnostic) well-defined and 
conversions between these should be handled properly.

Comment 25 Zack Weinberg 2005-02-21 20:14:55 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

"geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes:

> Although I agree that these are all (except the below) nice things to 
> have, I don't think I agree that they are all preconditions to having 
> any part of an implementation.  For instance, an implementation that 
> said sorry() when using # on an identifier from a UCN would still be 
> more useful than the complete lack of implementation we have now.

In my book, a complete lack of implementation of this particular
feature is better than an incomplete one.  This is because I see the
vast majority of the work required to do a complete implementation as
being due-diligence tasks needed to ensure that the feature cannot
crash the compiler, cause wrong code generation, or introduce
compatibility problems, and as long as someone is going to do all that
work, why shouldn't they do the rest of the job as long as they're in
there?

> The second half would be a pp-number, instead.  It is always true that
> splitting an identifier between characters yields two valid
> preprocessing tokens.

Joseph has mostly explained this, but I should add that what you get
if you split, say, "a\u0660b", between the "a" and the backslash is
two identifiers, the second of which's "initial character is a
universal character name designating a digit", which violates a
shall-clause in a semantics paragraph, and therefore provokes
undefined behavior. (C99 6.4.2.1p3.)

Standing policy is that all cases which provoke undefined behavior
inside the preprocessor, except already-documented GNU extensions,
shall produce hard errors.  I am tempted to make a partial exception
in this case in the interest of better compatibility with C++.  Almost
all of the UCNs in the "digits" block of C99 annex D are completely
excluded from C++98 annex E - so "a\u0660b" for instance is an invalid
identifier, and we never get as far as wondering what happens if we
split it before the backslash.  However, the range 0e50-0e59 is in the
"Thai" range of C++98/E, but *both* the "Thai" and the "Digits" ranges
of C99/D.  It would be sensible, IMO, to resolve the error in C99/D by
removing 0e50-0e59 from the "Digits" range, thus permitting those
characters to begin identifiers in both C and C++.  [Note that
currently ucnid.tab takes the opposite position.]

zw

Comment 26 Geoff Keating 2005-02-21 20:15:48 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)


On 21/02/2005, at 11:47 AM, joseph at codesourcery dot com wrote:

>
> ------- Additional Comments From joseph at codesourcery dot com  
> 2005-02-21 19:47 -------
> Subject: Re:  UCNs not recognized in identifiers
>  (c++/c99)
>
> On Mon, 21 Feb 2005, geoffk at geoffk dot org wrote:
>
>>> * These rules apply to identifiers as preprocessing tokens at any
>>> time, including before concatenation.  So it is not the case in C99
>>> that splitting an identifier anywhere yields two valid preprocessing
>>> tokens: the second half could begin with a UCN for a digit and not be
>>> a valid identifier.  (Invalid identifiers in C99 don't require
>>> diagnostics, but I don't think we want to use this laxity.)
>>
>> The second half would be a pp-number, instead.  It is always true that
>> splitting an identifier between characters yields two valid
>> preprocessing tokens.
>
> It would not be a pp-number, as a UCN for a digit is still an
> identifier-nondigit rather than a digit in terms of the syntax and
> pp-numbers can't start with identifier-nondigits.

That's a defect in the standard, the tail of an identifier is supposed 
to be either an identifier or a pp-number, that's why pp-number exists.

>>> * All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
>>> should be audited to determine what sort of identifier is appropriate
>>> in each case.
>>
>> I don't understand this sentence.  What different sorts of identifiers
>> are there, and how could they be appropriate or not appropriate?
>
> Identifiers found in input, with input spelling.  (Input includes -D and
> -U options on the command line - in principle the command line should be
> interpreted in the user's locale by default just like source files.)
>
> UTF-8 (or, I suppose, UTF-EBCDIC) internally encoded identifiers.
>
> Identifiers in mangled form in any case where they are mangled for output.
>
> Identifiers in diagnostics (possibly including cases where bits of a
> diagnostic get built up with sprintf), which need converting to the user's
> locale for display or to be displayed using UCNs.
>
> I don't know if collect2 might also need to know something about extended
> identifiers.
>
> The aim is that every datastructure with an identifier should have the
> encoding (input, internal, output, diagnostic) well-defined and
> conversions between these should be handled properly.

My suggestion is that this can be simplified as follows:

- a CPP token is in the input form.  An identifier outside cpp is in 
'internal' form.
- DECL_ASSEMBLER_NAME is in 'output' form.
- The 'diagnostic' form is created from the 'internal' form based 
solely on the locale, at the time that a diagnostic is printed.

Comment 27 Geoff Keating 2005-02-21 20:15:51 UTC

Created attachment 8244 [details]
smime.p7s

Comment 28 Zack Weinberg 2005-02-21 20:23:18 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

"geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes:

>>> The second half would be a pp-number, instead.  It is always true that
>>> splitting an identifier between characters yields two valid
>>> preprocessing tokens.
>>
>> It would not be a pp-number, as a UCN for a digit is still an
>> identifier-nondigit rather than a digit in terms of the syntax and
>> pp-numbers can't start with identifier-nondigits.
>
> That's a defect in the standard, the tail of an identifier is supposed 
> to be either an identifier or a pp-number, that's why pp-number exists.

Arguably yes.  *shrug* You perhaps begin to see why I did not want
this feature implemented?  Or at least why I want it done with great
caution and consideration of all these corner cases?

Does your opinion of this particular corner case change in view of C++
not permitting most of the "digit" UCNs in identifiers at all?

zw

Comment 29 Geoff Keating 2005-02-21 20:26:37 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)


On 21/02/2005, at 12:15 PM, zack at codesourcery dot com wrote:

>
> ------- Additional Comments From zack at codesourcery dot com  
> 2005-02-21 20:14 -------
> Subject: Re:  UCNs not recognized in identifiers
>  (c++/c99)
>
> "geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes:
>
>> Although I agree that these are all (except the below) nice things to
>> have, I don't think I agree that they are all preconditions to having
>> any part of an implementation.  For instance, an implementation that
>> said sorry() when using # on an identifier from a UCN would still be
>> more useful than the complete lack of implementation we have now.
>
> In my book, a complete lack of implementation of this particular
> feature is better than an incomplete one.  This is because I see the
> vast majority of the work required to do a complete implementation as
> being due-diligence tasks needed to ensure that the feature cannot
> crash the compiler, cause wrong code generation, or introduce
> compatibility problems, and as long as someone is going to do all that
> work, why shouldn't they do the rest of the job as long as they're in
> there?

I think we are just going to have to agree to disagree on this.  I 
don't think your approach will lead to the best possible GCC.

>> The second half would be a pp-number, instead.  It is always true that
>> splitting an identifier between characters yields two valid
>> preprocessing tokens.
>
> Joseph has mostly explained this, but I should add that what you get
> if you split, say, "a\u0660b", between the "a" and the backslash is
> two identifiers, the second of which's "initial character is a
> universal character name designating a digit", which violates a
> shall-clause in a semantics paragraph, and therefore provokes
> undefined behavior. (C99 6.4.2.1p3.)

A shall-clause in a semantics paragraph requires a diagnostic, C99 
5.1.1.3.

> Standing policy is that all cases which provoke undefined behavior
> inside the preprocessor, except already-documented GNU extensions,
> shall produce hard errors.  I am tempted to make a partial exception
> in this case in the interest of better compatibility with C++.  Almost
> all of the UCNs in the "digits" block of C99 annex D are completely
> excluded from C++98 annex E - so "a\u0660b" for instance is an invalid
> identifier, and we never get as far as wondering what happens if we
> split it before the backslash.  However, the range 0e50-0e59 is in the
> "Thai" range of C++98/E, but *both* the "Thai" and the "Digits" ranges
> of C99/D.  It would be sensible, IMO, to resolve the error in C99/D by
> removing 0e50-0e59 from the "Digits" range, thus permitting those
> characters to begin identifiers in both C and C++.  [Note that
> currently ucnid.tab takes the opposite position.]

This would make the compiler non-conforming.

Comment 30 Geoff Keating 2005-02-21 20:26:40 UTC

Created attachment 8245 [details]
smime.p7s

Comment 31 Zack Weinberg 2005-02-21 20:54:02 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

"geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes:

>>> The second half would be a pp-number, instead.  It is always true that
>>> splitting an identifier between characters yields two valid
>>> preprocessing tokens.
>>
>> Joseph has mostly explained this, but I should add that what you get
>> if you split, say, "a\u0660b", between the "a" and the backslash is
>> two identifiers, the second of which's "initial character is a
>> universal character name designating a digit", which violates a
>> shall-clause in a semantics paragraph, and therefore provokes
>> undefined behavior. (C99 6.4.2.1p3.)
>
> A shall-clause in a semantics paragraph requires a diagnostic, C99 
> 5.1.1.3.

Um, no, 5.1.1.3 does not say that.  It says a diagnostic is required
for a violation of any "syntax rule or constraint"; shall-clauses in
semantics paragraphs are neither.  Constraints only appear in
constraints paragraphs.  See 4p2 for the meaning of shall-clauses
outside constraints paragraphs.

zw

Comment 32 Neil Booth 2005-02-21 23:00:52 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

jsm28 at gcc dot gnu dot org wrote:-

> * The greedy algorithm applies for lexing UCNs: for example,
> a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
> shouldn't get a diagnostic on lexing, presuming macros are defined
> such that the eventual token sequence is valid).

I'm not sure I agree with this: it would seem to be unnecessary
extra work; further I suspect the user would benefit from it being
pointed out he entered an ill-formed UCN rather than something random
from the front end complaining about an unexpected backslash.

The only case where you wouldn't get a syntax error from the
front end, or an invalid escape in a literal, is with -E.  I'm
not sure lexing to the letter of the standard is worthwhile in
this case, as the standard doesn't discuss -E.

If you have an example where a compiled program is acceptable
with multiple lexing tokens then I would agree with you.

> * The spelling of UCNs is preserved for the # and ## operators.

This is very hard with CPP's current implementation - it assumes
it can deduce the spelling of an identifier from its hash table
entry.  IMO the proper way to fix this to use a different approach
entirely, rather than kludge it in the existing implementation
(which would bloat some common datastructures) but that's some work.

> * I think the only reasonable interpretation of the lexing rules in
> the context of forbidden characters is that first identifiers are
> lexed (allowing any UCNs) then bad characters yield an error (rather
> than stopping the identifier before the bad character and treating it
> as not a UCN).

Agreed - as I say above I don't see why this shouldn't apply for
partial UCNs too, even with -E.

The rest seems reasonable.

Neil.

Comment 33 jsm-csl@polyomino.org.uk 2005-02-22 02:13:57 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Mon, 21 Feb 2005, zack at codesourcery dot com wrote:

> Standing policy is that all cases which provoke undefined behavior
> inside the preprocessor, except already-documented GNU extensions,
> shall produce hard errors.  I am tempted to make a partial exception

Which policy (cf. bug 14634) I agree with.  However, I don't think there 
should be any exception made.  The standards (C99 and C++03) are 
implementable as-is.  They have oddities; some of these may be suitable 
for submission as DRs, and if the committees fix them in a TC rather than 
a major new standard revision then we no longer need implement those 
oddities, but for now the standard says what it says.  The headings in C99 
Annex D are except for "Digits" irrelevant to the normative requirements; 
anything in "Digits" is a UCN for a digit, whether or not it appears 
elsewhere.  (C++03 corrected the typo in C++98 which was noted in C++ DR 
131.)  The C++ standard's heading "CJK Unified Ideographs" lists ranges 
which also include various presentation forms such as one of the Hebrew 
characters previously discussed, but these are genuine ranges of letters 
clearly deliberately included; just the heading is wrong.

Comment 34 jsm-csl@polyomino.org.uk 2005-02-22 02:22:48 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Mon, 21 Feb 2005, neil at daikokuya dot co dot uk wrote:

> jsm28 at gcc dot gnu dot org wrote:-
> 
> > * The greedy algorithm applies for lexing UCNs: for example,
> > a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
> > shouldn't get a diagnostic on lexing, presuming macros are defined
> > such that the eventual token sequence is valid).
> 
> I'm not sure I agree with this: it would seem to be unnecessary
> extra work; further I suspect the user would benefit from it being
> pointed out he entered an ill-formed UCN rather than something random
> from the front end complaining about an unexpected backslash.
> 
> The only case where you wouldn't get a syntax error from the
> front end, or an invalid escape in a literal, is with -E.  I'm
> not sure lexing to the letter of the standard is worthwhile in
> this case, as the standard doesn't discuss -E.
> 
> If you have an example where a compiled program is acceptable
> with multiple lexing tokens then I would agree with you.

#define a b(
#define b(x) q
int a\U0000000z );

Greedy lexing is the standard as applied for other token types.  I don't 
think a difference here makes sense.  _cpp_valid_ucn would need changing 
so it doesn't give an error for incomplete UCNs in identifiers but instead 
returns quietly.
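
A rough sketch of what "returns quietly" could look like, assuming a toy grammar with only identifiers and one-character tokens (no pp-numbers, strings or multi-character punctuators). The function names are hypothetical and this is not the actual _cpp_valid_ucn interface; the point is only that an incomplete UCN ends the identifier without a diagnostic, so the backslash becomes a token of its own.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of a complete UCN at S, or 0 (quietly, with no diagnostic)
   if S does not start with \u plus 4 or \U plus 8 hex digits.  */
static size_t
ucn_length (const char *s)
{
  if (s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
    return 0;
  size_t ndigits = (s[1] == 'u') ? 4 : 8;
  for (size_t i = 0; i < ndigits; i++)
    if (!isxdigit ((unsigned char) s[2 + i]))
      return 0;
  return 2 + ndigits;
}

/* Greedy length of the identifier starting at S, or 0 if there is none.
   Complete UCNs are absorbed; an incomplete one stops the identifier.  */
static size_t
identifier_length (const char *s)
{
  size_t len = 0;
  for (;;)
    {
      unsigned char c = (unsigned char) s[len];
      size_t n;
      if (isalpha (c) || c == '_' || (len > 0 && isdigit (c)))
        len++;
      else if ((n = ucn_length (s + len)) != 0)
        len += n;
      else
        return len;
    }
}

/* Count preprocessing tokens in S under the toy grammar: identifiers,
   plus every other non-space character as a one-character token.  */
size_t
count_pp_tokens (const char *s)
{
  assert (s != NULL);
  size_t count = 0;
  while (*s != '\0')
    {
      if (isspace ((unsigned char) *s))
        {
          s++;
          continue;
        }
      size_t n = identifier_length (s);
      s += (n != 0) ? n : 1;
      count++;
    }
  return count;
}
```

With this sketch, the text a\U0000000z counts as three tokens, {a}{\}{U0000000z}, while a\U000000c1 counts as one.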

Comment 35 jsm-csl@polyomino.org.uk 2005-02-22 02:28:54 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Mon, 21 Feb 2005, geoffk at geoffk dot org wrote:

> My suggestion is that this can be simplified as follows:
> 
> - a CPP token is in the input form.  An identifier outside cpp is in 
> 'internal' form.
> - DECL_ASSEMBLER_NAME is in 'output' form.
> - The 'diagnostic' form is created from the 'internal' form based 
> solely on the locale, at the time that a diagnostic is printed.

Fine.  Now, at present the conversions between these forms are trivial.  
So the audit required is of everywhere there is an assignment / copy / 
input / output between different forms to ensure that the appropriate 
conversions are applied instead of a straight copy as at present.  For 
example, all the places printing IDENTIFIER_POINTER (id) with %qs become 
no longer valid, as IDENTIFIER_POINTER is in the internal form and %qs 
simply prints a string; %E may print an identifier as such, converting to 
the output form, but everywhere using %qs or some other output notation 
other than %E on an identifier needs checking and fixing.
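
As an illustration of one such conversion, here is a minimal sketch that takes an identifier in the internal (UTF-8) form and spells every non-ASCII character as a UCN, which is the fallback wanted when the locale cannot display the character. The function names are hypothetical and this is not GCC's diagnostic code; it assumes well-formed UTF-8 input and does not consult the locale at all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at S (assumed well-formed) into *CP and
   return its length in bytes.  */
static size_t
decode_utf8 (const unsigned char *s, unsigned long *cp)
{
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  if ((s[0] & 0xE0) == 0xC0)
    {
      *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
      return 2;
    }
  if ((s[0] & 0xF0) == 0xE0)
    {
      *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
            | (s[2] & 0x3FUL);
      return 3;
    }
  *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
        | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3FUL);
  return 4;
}

/* Copy ID (internal form, UTF-8) into OUT with every non-ASCII
   character spelled as \uXXXX or \UXXXXXXXX.  Returns 0 on success,
   or -1 if OUT (of OUTSIZE bytes) is too small.  */
int
identifier_to_ucns (const char *id, char *out, size_t outsize)
{
  assert (id != NULL && out != NULL);
  const unsigned char *s = (const unsigned char *) id;
  const unsigned char *end = s + strlen (id);
  size_t used = 0;

  if (outsize == 0)
    return -1;
  while (s < end)
    {
      unsigned long cp;
      s += decode_utf8 (s, &cp);
      size_t need = cp < 0x80 ? 1 : (cp <= 0xFFFF ? 6 : 10);
      if (used + need + 1 > outsize)
        return -1;
      if (cp < 0x80)
        out[used] = (char) cp;
      else if (cp <= 0xFFFF)
        snprintf (out + used, need + 1, "\\u%04lx", cp);
      else
        snprintf (out + used, need + 1, "\\U%08lx", cp);
      used += need;
    }
  out[used] = '\0';
  return 0;
}
```

A straight copy with %qs would print the raw UTF-8 bytes instead; the sketch shows the kind of step that has to sit between the internal form and the diagnostic form.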

Comment 36 Geoff Keating 2005-02-22 09:23:16 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

On 21/02/2005, at 6:13 PM, joseph at codesourcery dot com wrote:

> The standards (C99 and C++03) are implementable as-is.  They have 
> oddities; some of these may be suitable for submission as DRs, and if 
> the committees fix them in a TC rather than a major new standard 
> revision then we no longer need implement those oddities, but for now 
> the standard says what it says.

I agree with this point.  We should implement the standard first, and 
then see if any parts of it are particularly troublesome for actual 
use.

Comment 37 Geoff Keating 2005-02-22 09:23:16 UTC

Created attachment 8253 [details]
smime.p7s

Comment 38 jsm-csl@polyomino.org.uk 2005-02-22 11:51:43 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Tue, 22 Feb 2005, geoffk at geoffk dot org wrote:

> > The standards (C99 and C++03) are implementable as-is.  They have 
> > oddities; some of these may be suitable for submission as DRs, and if 
> > the committees fix them in a TC rather than a major new standard 
> > revision then we no longer need implement those oddities, but for now 
> > the standard says what it says.
> 
> I agree with this point.  We should implement the standard first, and 
> then see if any parts of it are particularly troublesome for actual 
> use.

Which is a key reason why a long list of every technical point we could 
think of is on this checklist: if, as Zack suggests, there are going to be 
serious ABI problems with this feature in the long run, evidence of 
problems can only be provided to WG14 and WG21 on the basis of real 
experience with implementations that attempt to do a good job of 
implementing the current standard requirements, not on the basis of bad or 
partial implementations or implementations not implementing some 
particular requirement because of an advance decision that you don't like 
that bit of the standard or don't think it important.

> ------- Additional Comments From geoffk at geoffk dot org  2005-02-22 09:23 -------
> Created an attachment (id=8253)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8253&action=view)

All your messages to this bug appear to be creating an attachment for some 
reason, and none of them seem to be appearing on gcc-bugs.

Comment 39 Joseph S. Myers 2005-03-12 11:14:28 UTC

Another reason why spelling needs preserving (in addition to implementing #
correctly) is for the constraints on duplicate macro definitions.

#define foo \u00c1
#define foo \u00C1

is invalid (different spelling in replacement), as is

#define bar(\u00c1)
#define bar(\u00C1)

(different spelling of parameter names).  However,

#define \u00c1 foo
#define \u00C1 foo

is valid, since the spelling of the macro *name* doesn't need to be the same.

It is true that we don't get the constraints on duplicate macro definitions
right in all cases at present (bug 20078), but since spelling of identifiers
needs preserving anyway for the # operator this seems no reason not to get
this case right (with testcases, of course).
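
The rule can be sketched in C as follows. This is a simplified illustration, not cpplib code: the function names are hypothetical, only object-like macros are modelled, whitespace normalisation of the replacement list is ignored, and a basic character is not distinguished from a UCN designating it (which would not be a valid UCN in any case).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parse a complete UCN at S; return its length and store the code
   point in *CP, or return 0 if S does not start with a complete UCN.  */
static size_t
parse_ucn (const char *s, unsigned long *cp)
{
  if (s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
    return 0;
  size_t ndigits = (s[1] == 'u') ? 4 : 8;
  unsigned long value = 0;
  for (size_t i = 0; i < ndigits; i++)
    {
      unsigned char c = (unsigned char) s[2 + i];
      if (!isxdigit (c))
        return 0;
      value = value * 16
              + (unsigned long) (isdigit (c) ? c - '0'
                                             : tolower (c) - 'a' + 10);
    }
  *cp = value;
  return 2 + ndigits;
}

/* Nonzero if A and B name the same identifier, treating different
   spellings of one UCN (\u00c1, \u00C1, \U000000c1) as equal.  */
static int
same_identifier (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      unsigned long ca, cb;
      size_t na = parse_ucn (a, &ca);
      size_t nb = parse_ucn (b, &cb);
      if (na == 0)
        {
          ca = (unsigned char) *a;
          na = 1;
        }
      if (nb == 0)
        {
          cb = (unsigned char) *b;
          nb = 1;
        }
      if (ca != cb)
        return 0;
      a += na;
      b += nb;
    }
  return *a == '\0' && *b == '\0';
}

/* Nonzero if the second definition is an acceptable redefinition of
   the first: the macro names must be the same identifier, but the
   replacement lists must have the same spelling.  */
int
redefinition_ok (const char *old_name, const char *old_replacement,
                 const char *new_name, const char *new_replacement)
{
  assert (old_name != NULL && old_replacement != NULL);
  assert (new_name != NULL && new_replacement != NULL);
  return same_identifier (old_name, new_name)
         && strcmp (old_replacement, new_replacement) == 0;
}
```

Under this sketch the first example above is rejected (the replacement lists differ in spelling) and the third is accepted (only the spelling of the macro name differs).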

Comment 40 Andrew Pinski 2005-07-05 02:14:25 UTC

Unassigning from Zack since he is now gone from GCC development.

Comment 41 Geoff Keating 2005-09-15 22:34:44 UTC

(In reply to comment #39)
> Another reason why spelling needs preserving (in addition to implementing #
> correctly) is for the constraints on duplicate macro definitions.
> 
> #define foo \u00c1
> #define foo \u00C1
> 
> is invalid (different spelling in replacement), as is

We discussed this on the list and decided that this was probably a defect in the C standard, since the 
Rationale says that the kind of implementation we have now is supposed to be permitted, and jsm said 
he'd file a DR.  How's that going?

Comment 42 Neil Booth 2005-09-15 22:53:20 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

geoffk at gcc dot gnu dot org wrote:-

> 
> ------- Additional Comments From geoffk at gcc dot gnu dot org  2005-09-15 22:34 -------
> (In reply to comment #39)
> > Another reason why spelling needs preserving (in addition to implementing #
> > correctly) is for the constraints on duplicate macro definitions.
> > 
> > #define foo \u00c1
> > #define foo \u00C1
> > 
> > is invalid (different spelling in replacement), as is
> 
> We discussed this on the list and decided that this was probably a defect in the C standard, since the 
> Rationale says that the kind of implementation we have now is supposed to be permitted, and jsm said 
> he'd file a DR.  How's that going?

I very much doubt this is a defect.  Just because it doesn't fit your
implementation...

Neil.

Comment 43 jsm-csl@polyomino.org.uk 2005-09-15 22:53:36 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Thu, 15 Sep 2005, geoffk at gcc dot gnu dot org wrote:

> ------- Additional Comments From geoffk at gcc dot gnu dot org  2005-09-15 22:34 -------
> (In reply to comment #39)
> > Another reason why spelling needs preserving (in addition to implementing #
> > correctly) is for the constraints on duplicate macro definitions.
> > 
> > #define foo \u00c1
> > #define foo \u00C1
> > 
> > is invalid (different spelling in replacement), as is
> 
> We discussed this on the list and decided that this was probably a defect in the C standard, since the 
> Rationale says that the kind of implementation we have now is supposed to be permitted, and jsm said 
> he'd file a DR.  How's that going?

I don't believe I said I'd file a DR unless I saw a defect.  There is no 
defect because models A or C need to be implemented by an 
implementation-defined mapping (documented as such; we don't even document 
the removal of trailing whitespace from lines; of course anything 
replacing UCNs with the characters they designate only in certain places 
is a pain to document because it doesn't fit in with the C model of phases 
of translation).  Doug Gwyn's reading in reflector message 10751,

  Yes, "spelling" is meant in terms of the source code characters.
  The idea is to permit simple strcmp-like checking by the preprocessor.

seems fine to me - implementations permitting the above in the input 
source must end up with the source looking different from the above after 
phase 1.

Comment 44 Neil Booth 2005-09-15 22:58:30 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

joseph at codesourcery dot com wrote:-

> I don't believe I said I'd file a DR unless I saw a defect.  There is no 
> defect because models A or C need to be implemented by an 
> implementation-defined mapping (documented as such; we don't even document 
> the removal of trailing whitespace from lines; of course anything 
> replacing UCNs with the characters they designate only in certain places 
> is a pain to document because it doesn't fit in with the C model of phases 
> of translation).  Doug Gwyn's reading in reflector message 10751,
> 
>   Yes, "spelling" is meant in terms of the source code characters.
>   The idea is to permit simple strcmp-like checking by the preprocessor.

I think this is what we will need to do to fix the # and ## and spacing
bugs in macro replacements too - base the decision upon a memcmp or
strcmp.

Neil.

Comment 45 jsm-csl@polyomino.org.uk 2005-09-15 23:37:11 UTC

Subject: Re:  UCNs not recognized in identifiers
 (c++/c99)

On Thu, 15 Sep 2005, neil at daikokuya dot co dot uk wrote:

> >   Yes, "spelling" is meant in terms of the source code characters.
> >   The idea is to permit simple strcmp-like checking by the preprocessor.
> 
> I think this is what we will need to do to fix the # and ## and spacing
> bugs in macro replacements too - base the decision upon a memcmp or
> strcmp.

Note that comparing macro replacements by strings means you can no longer 
fake a version of UCN model C (don't really rewrite UCNs in phase 1 but 
convert identifier spellings to UTF-8 on lexing identifiers) as now, 
because then the conversion to canonical form would be visible in the 
results of stringising them but differently spelt macro definitions would 
still show up as different.  You'd need either to convert identifiers 
before producing the string form of macro replacements, or (my preference) 
work out how to preserve different spellings of preprocessing tokens 
representing the same identifier (so as to get the results of stringising 
right).  (Comparing with strings may still be useful in order to fix the 
other bugs you mention.)

Comment 46 Geoff Keating 2005-09-16 00:01:54 UTC

Subject: Re:  UCNs not recognized in identifiers (c++/c99)

On 15/09/2005, at 3:53 PM, joseph at codesourcery dot com wrote:

>   Yes, "spelling" is meant in terms of the source code characters.
>   The idea is to permit simple strcmp-like checking by the preprocessor.

Good, so that answers that question.

You raise a good point about GCC not having documentation for phase  
1.  I don't have time to write all of it, but I think I can write the  
last part, about UCNs, so maybe together we can get it all done.  My  
proposed wording is:

@cite{The mapping between physical source file multibyte characters
and the source character set in translation phase 1 (C90 and C99  
5.1.1.2).}

[CR/NL/CR-NL are turned into EOL markers, spaces are deleted between  
backslash and the end of a line, it's converted to UTF-8 using iconv  
based on -finput-charset---and what else?]

Then, any character sequence which would form a UCN in an identifier  
in phase 3 of translation is converted into the corresponding UTF-8  
sequence.  Any backslash-newline combinations in the UCN are  
preserved and placed after the UTF-8 sequence.

[note that there's no way for a user to tell whether a backslash- 
newline combination is placed before, in the middle of, or after, the  
UTF-8 sequence.]

...

@cite{Which additional multibyte characters may appear in identifiers
and their correspondence to universal character names (C99 6.4.2).}

UTF-8 character sequences may appear in identifiers, and they  
correspond to the UCN that specifies that character.  A UTF-8  
sequence may appear only if the UCN that it corresponds to would be  
permitted in the identifier at that point.  At present, only those  
UTF-8 sequences which were produced by the mapping from UCNs to UTF-8  
sequences in translation phase 1 are permitted, but this is likely to  
change in the future.
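
A minimal sketch of the UCN-to-UTF-8 mapping described in the proposed wording, for illustration only. The function names are hypothetical, backslash-newline handling is left out, and no check is made that a UCN is actually permitted in an identifier; incomplete UCNs are copied through unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parse a complete UCN at S; return its length and store the code
   point in *CP, or return 0 if S does not start with a complete UCN.  */
static size_t
parse_ucn (const char *s, unsigned long *cp)
{
  if (s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
    return 0;
  size_t ndigits = (s[1] == 'u') ? 4 : 8;
  unsigned long value = 0;
  for (size_t i = 0; i < ndigits; i++)
    {
      unsigned char c = (unsigned char) s[2 + i];
      if (!isxdigit (c))
        return 0;
      value = value * 16
              + (unsigned long) (isdigit (c) ? c - '0'
                                             : tolower (c) - 'a' + 10);
    }
  *cp = value;
  return 2 + ndigits;
}

/* Encode CP (at most 0x10FFFF) as UTF-8 into BUF, which must hold
   4 bytes.  Returns the number of bytes written.  */
static size_t
encode_utf8 (unsigned long cp, unsigned char *buf)
{
  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  buf[0] = (unsigned char) (0xF0 | (cp >> 18));
  buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Copy IN to OUT, replacing each complete UCN by its UTF-8 sequence.
   Returns 0 on success, -1 if OUT is too small or a UCN is out of range.  */
int
ucns_to_utf8 (const char *in, char *out, size_t outsize)
{
  assert (in != NULL && out != NULL);
  size_t used = 0;

  if (outsize == 0)
    return -1;
  while (*in != '\0')
    {
      unsigned char bytes[4];
      unsigned long cp;
      size_t len, n = parse_ucn (in, &cp);
      if (n != 0)
        {
          if (cp > 0x10FFFF)
            return -1;
          len = encode_utf8 (cp, bytes);
        }
      else
        {
          bytes[0] = (unsigned char) *in;   /* ordinary character */
          len = 1;
          n = 1;
        }
      if (used + len + 1 > outsize)
        return -1;
      memcpy (out + used, bytes, len);
      used += len;
      in += n;
    }
  out[used] = '\0';
  return 0;
}
```

Note that \u00c1 and \u00C1 both map to the same two bytes, so the original spelling is lost after this step; that is the property behind the spelling-preservation concerns raised earlier in this bug.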

Comment 47 Joseph S. Myers 2014-11-05 16:19:42 UTC

Author: jsm28
Date: Wed Nov  5 16:19:10 2014
New Revision: 217144

URL: https://gcc.gnu.org/viewcvs?rev=217144&root=gcc&view=rev
Log:
Enable -fextended-identifiers by default.

As proposed at <https://gcc.gnu.org/ml/gcc/2014-11/msg00014.html>,
this patch enables -fextended-identifiers by default for all standard
versions including this feature (all C++ versions, C99 and above for
C, but not C90 / C94 / gnu89 / preprocessing assembler).  It adds a
couple of tests for areas where I previously noted testsuite coverage
for extended identifiers was lacking, removes -fextended-identifiers
from existing tests, adds -g to various such tests to verify that
extended identifiers don't break debug info generation and removes the
test that was only there to verify that the feature was off by
default.

The current state of the feature may not correspond exactly to any
particular checklist from 2004/5 (see bug 9449) of what was wanted
before enabling the feature by default, but I don't think it's any
worse than plenty of other features supported by default before every
corner case is fully functional, and think problems can readily be
fixed incrementally.

The following aspects of extended identifiers could still do with more
work (and should be straightforward):

* C -aux-info (output should use UCNs).

* ObjC -gen-decls (output should use UCNs; associated diagnostics from
  the ObjC front end should use extended characters or UCNs as
  appropriate to the locale, via using %qE or identifier_to_locale).

* Use DW_AT_use_UTF8 in DWARF-3 debug info for compilation units built
  with extended identifiers enabled (or unconditionally).

* cpplib diagnostics (outputting characters or UCNs as appropriate
  depending on the locale, as done for identifiers in non-cpplib
  diagnostics).

* C++ test for UCN linking with C and extern "C".

* Check GDB support / file issues for support if needed.

* Actual UTF-8 in identifiers (?).  (Be careful about not affecting
  performance for the normal fast path of lexing identifiers, if
  possible.)
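
One way to picture the fast-path concern in the last item above is the
following C sketch.  It is an assumption about how such a split could
look, not cpplib's lexer: scan_ident_fast is a hypothetical name, and
the idea is simply that plain ASCII identifier characters are consumed
in a tight loop, with a backslash (possible UCN) or a byte >= 0x80
(possible UTF-8) handing over to a slower path.

```c
#include <stddef.h>

/* Hypothetical sketch, not cpplib code.  Returns the length of the
   leading run of ASCII identifier characters in the NUL-terminated
   string S.  Sets *NEEDS_SLOW_PATH to 1 if the scan stopped at a
   backslash or at a byte >= 0x80, and to 0 otherwise.  */
size_t
scan_ident_fast (const char *s, int *needs_slow_path)
{
  size_t i = 0;

  for (;;)
    {
      unsigned char c = (unsigned char) s[i];

      if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
          || (c >= '0' && c <= '9') || c == '_')
        i++;
      else
        {
          *needs_slow_path = (c == '\\' || c >= 0x80);
          return i;
        }
    }
}
```

With this shape, identifiers made only of ASCII characters never reach
the UCN or UTF-8 handling code.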

The following may be trickier:

* cpplib spelling preservation (required to diagnose macro
  redefinition with different spellings of the same identifier in the
  definition or argument names; different spellings of the name of the
  macro itself are OK, however; also required for correct handling of
  multiple stringizing in C++); correct output for -d (UCNs), DWARF
  debug info for macros (UCNs), PCH and PCH tests.  (Spelling
  preservation is the issue that needs fixing to remove references to
  corner cases in the documentation of -std=c99 and -std=c11 and in
  c99status.html.)  The idea would be to add a second pointer to
  cpp_identifier that stores the original spelling (whether for
  extended identifiers only, or for all identifiers); this does not
  enlarge cpp_token because the resulting larger cpp_identifier
  structure is no bigger than cpp_string.

* C++ translation of extended characters (including $@` and various
  control characters) to UCNs in phase 1 (note that diagnostics are
  thus needed, except in C++11, for control characters in strings /
  character constants, as those UCNs are invalid); a likely
  implementation approach is to do the translation when identifiers /
  strings / character constants are lexed, together with errors for
  stray $@` / control characters in the program, as these are not
  valid UCNs in identifiers ($ only if not accepted in identifiers);
  note that this translation should not take place inside raw string
  literals.

Bootstrapped with no regressions on x86_64-unknown-linux-gnu.

libcpp:
	PR preprocessor/9449
	* init.c (lang_defaults): Enable extended identifiers for C++ and
	C99-based standards.

gcc:
	PR preprocessor/9449
	* doc/cpp.texi (Character sets, Tokenization)
	(Implementation-defined behavior): Don't refer to UCNs in
	identifiers requiring -fextended-identifiers.
	* doc/cppopts.texi (-fextended-identifiers): Document as enabled
	by default for C99 and later and C++.
	* doc/invoke.texi (-std=c99, -std=c11): Don't refer to extended
	identifiers needing -fextended-identifiers.

gcc/testsuite:
	PR preprocessor/9449
	* lib/target-supports.exp (check_effective_target_ucn_nocache):
	Don't use -fextended-identifiers.
	* c-c++-common/cpp/normalize-3.c, c-c++-common/cpp/ucnid-2011-1.c,
	g++.dg/cpp/ucn-1.C, g++.dg/cpp/ucnid-1.C, g++.dg/other/ucnid-1.C,
	gcc.dg/cpp/normalize-1.c, gcc.dg/cpp/normalize-2.c,
	gcc.dg/cpp/normalize-4.c: Don't use -fextended-identifiers.
	* gcc.dg/cpp/ucnid-1.c: Don't use -fextended-identifiers.  Use
	-g3.
	* gcc.dg/cpp/ucnid-10.c, gcc.dg/cpp/ucnid-2.c,
	gcc.dg/cpp/ucnid-3.c, gcc.dg/cpp/ucnid-4.c, gcc.dg/cpp/ucnid-5.c,
	gcc.dg/cpp/ucnid-7.c, gcc.dg/cpp/ucnid-9.c,
	gcc.dg/cpp/warn-normalized-1.c, gcc.dg/cpp/warn-normalized-2.c,
	gcc.dg/cpp/warn-normalized-3.c: Don't use -fextended-identifiers.
	* gcc.dg/ucnid-1.c, gcc.dg/ucnid-2.c, gcc.dg/ucnid-3.c,
	gcc.dg/ucnid-4.c, gcc.dg/ucnid-5.c, gcc.dg/ucnid-6.c: Don't use
	-fextended-identifiers.  Use -g.
	* gcc.dg/ucnid-7.c, gcc.dg/ucnid-8.c: Don't use
	-fextended-identifiers.
	* gcc.dg/ucnid-9.c: Don't use -fextended-identifiers.  Use -g.
	* gcc.dg/ucnid-10.c: Don't use -fextended-identifiers.
	* gcc.dg/ucnid-11.c, gcc.dg/ucnid-12.c: Don't use
	-fextended-identifiers.  Use -g.
	* gcc.dg/ucnid-13.c: Don't use -fextended-identifiers.
	* gcc.dg/cpp/ucnid-8.c: Remove test.
	* gcc.dg/cpp/ucnid-10.c, gcc.dg/ucnid-14.c: New tests.

Added:
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-10.c
    trunk/gcc/testsuite/gcc.dg/ucnid-14.c
Removed:
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-8.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/cpp.texi
    trunk/gcc/doc/cppopts.texi
    trunk/gcc/doc/invoke.texi
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/cpp/normalize-3.c
    trunk/gcc/testsuite/c-c++-common/cpp/ucnid-2011-1.c
    trunk/gcc/testsuite/g++.dg/cpp/ucn-1.C
    trunk/gcc/testsuite/g++.dg/cpp/ucnid-1.C
    trunk/gcc/testsuite/g++.dg/other/ucnid-1.C
    trunk/gcc/testsuite/gcc.dg/cpp/normalize-1.c
    trunk/gcc/testsuite/gcc.dg/cpp/normalize-2.c
    trunk/gcc/testsuite/gcc.dg/cpp/normalize-4.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-1.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-2.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-3.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-4.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-5.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-7.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-9.c
    trunk/gcc/testsuite/gcc.dg/cpp/warn-normalized-1.c
    trunk/gcc/testsuite/gcc.dg/cpp/warn-normalized-2.c
    trunk/gcc/testsuite/gcc.dg/cpp/warn-normalized-3.c
    trunk/gcc/testsuite/gcc.dg/ucnid-1.c
    trunk/gcc/testsuite/gcc.dg/ucnid-10.c
    trunk/gcc/testsuite/gcc.dg/ucnid-11.c
    trunk/gcc/testsuite/gcc.dg/ucnid-12.c
    trunk/gcc/testsuite/gcc.dg/ucnid-13.c
    trunk/gcc/testsuite/gcc.dg/ucnid-2.c
    trunk/gcc/testsuite/gcc.dg/ucnid-3.c
    trunk/gcc/testsuite/gcc.dg/ucnid-4.c
    trunk/gcc/testsuite/gcc.dg/ucnid-5.c
    trunk/gcc/testsuite/gcc.dg/ucnid-6.c
    trunk/gcc/testsuite/gcc.dg/ucnid-7.c
    trunk/gcc/testsuite/gcc.dg/ucnid-8.c
    trunk/gcc/testsuite/gcc.dg/ucnid-9.c
    trunk/gcc/testsuite/lib/target-supports.exp
    trunk/libcpp/ChangeLog
    trunk/libcpp/init.c

Comment 48 Joseph S. Myers 2014-11-05 16:23:28 UTC

Enabled by default for relevant standards for GCC 5.
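
A minimal sketch of what this means in practice, assuming GCC 5 or later
in a C99-or-later mode (with earlier releases -fextended-identifiers had
to be given explicitly).  The function and parameter names are made up
for illustration.

```c
/* \u00e9 names U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which C99 and
   later permit in identifiers.  Both names below are illustrative.  */
int
caf\u00e9_total (int \u00e9l\u00e9ments)
{
  return \u00e9l\u00e9ments + 1;
}
```

A caller refers to the function by the same UCN spelling, for example
caf\u00e9_total (2).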

Comment 49 Joseph S. Myers 2014-11-06 21:09:25 UTC

Author: jsm28
Date: Thu Nov  6 21:08:52 2014
New Revision: 217202

URL: https://gcc.gnu.org/viewcvs?rev=217202&root=gcc&view=rev
Log:
Preserve original spellings of extended identifiers.

This patch makes cpplib track the original spellings of extended
identifiers, as well as the canonical UTF-8 version, in order to
follow standard semantics properly without needing a convoluted and
undocumented canonicalization in translation phase 1 (see bug 9449
comments 39-46 regarding such a canonicalization).

The spelling is tracked in cpp_identifier and cpp_macro_arg without
making cpp_token any larger.  The original spelling is used for checks
of duplicate macro definitions, stringizing (see the C++ tests added;
this case is only an issue for C++ not C because C makes it
implementation-defined whether a \ is inserted before the \ of a UCN
in a string or character constant when stringizing, while C++ does
not), pasting (relevant when the result is then stringized for C++)
and when macro definitions are output as text (e.g. for -d options).

Once a macro has been defined, only the original spelling of the
argument names needs keeping in the argument list.  While it is being
defined, however, both spellings are needed: the original one for
subsequent saving for checks of duplicate macro definitions, and the
canonical one which is the node marked specially to generate macro
argument tokens rather than normal identifier tokens.  The buffer that
is used to save the original values of the identifier tokens is
changed so that it stores both those original values and a pointer to
the canonical hash nodes, so that those canonical nodes can be found
when their values need restoring after the macro definition has been
parsed.
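
The two-spelling idea can be sketched in C as follows.  This is a
simplified model under stated assumptions, not the real libcpp data
structures: struct ident_sketch and both functions are hypothetical
names.  It shows why \u00c1 and \u00C1 name the same identifier (same
canonical UTF-8) while still counting as different spellings for the
duplicate-macro-definition check.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of an identifier token, not libcpp's layout.  */
struct ident_sketch
{
  const char *canonical;  /* Canonical UTF-8 form, used for lookup.  */
  const char *spelling;   /* Text exactly as written in the source.  */
};

/* True if A and B denote the same identifier.  */
bool
same_identifier (const struct ident_sketch *a, const struct ident_sketch *b)
{
  return strcmp (a->canonical, b->canonical) == 0;
}

/* True if A and B are also spelled identically, which is what a
   macro redefinition check would need to compare.  */
bool
same_spelling (const struct ident_sketch *a, const struct ident_sketch *b)
{
  return same_identifier (a, b) && strcmp (a->spelling, b->spelling) == 0;
}
```

Two tokens written as \u00c1 and \u00C1 would share the canonical bytes
0xC3 0x81, so same_identifier returns true and same_spelling returns
false.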

I believe this covers the known standards issues in extended
identifiers support (the remaining unimplemented C99 areas in GCC all
being floating-point-related), except for C++ translation of extended
characters to UCNs in phase 1 (which I have no plans to work on).
There are however probably issues left with handling of extended
identifiers in other places, as listed in
<https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00337.html> (those
issues are generally the sort of thing that could be addressed as bugs
outside development stage 1).  (The bulk of the potential issues Zack
was concerned about in 2003-5, that resulted in extended identifiers
being disabled in the absence of -fextended-identifiers, were
effectively eliminated by the audit and fixes I did in 2009, however;
that todo list reflects what was left over after that audit.)

Bootstrapped with no regressions on x86_64-unknown-linux-gnu.

libcpp:
	* include/cpp-id-data.h (struct cpp_macro): Update comment
	regarding parameters.
	* include/cpplib.h (struct cpp_macro_arg, struct cpp_identifier):
	Add spelling fields.
	(struct cpp_token): Update comment on macro_arg.
	* internal.h (_cpp_save_parameter): Add extra argument.
	(_cpp_spell_ident_ucns): New declaration.
	* lex.c (lex_identifier): Add SPELLING argument.  Set *SPELLING to
	original spelling of identifier.
	(_cpp_lex_direct): Update calls to lex_identifier.
	(_cpp_spell_ident_ucns): New function, factored out of
	cpp_spell_token.
	(cpp_spell_token): Adjust FORSTRING argument semantics to return
	original spelling of identifiers.  Use _cpp_spell_ident_ucns in
	!FORSTRING case.
	(_cpp_equiv_tokens): Check spellings of identifiers and macro
	arguments are identical.
	* macro.c (macro_arg_saved_data): New structure.
	(paste_tokens): Use original spellings of identifiers from
	cpp_spell_token.
	(_cpp_save_parameter): Add argument SPELLING.  Save both canonical
	node and its value.
	(parse_params): Update calls to _cpp_save_parameter.
	(lex_expansion_token): Save spelling of macro argument tokens.
	(_cpp_create_definition): Extract canonical node from saved data.
	(cpp_macro_definition): Use UCNs in spelling of macro name.  Use
	original spellings of macro argument tokens and identifiers.
	* traditional.c (scan_parameters): Update call to
	_cpp_save_parameter.

gcc:
	* doc/invoke.texi (-std=c99, -std=c11): Don't refer to corner
	cases of extended identifiers.

gcc/testsuite:
	* g++.dg/cpp/ucnid-2.C, g++.dg/cpp/ucnid-3.C,
	gcc.dg/cpp/ucnid-11.c, gcc.dg/cpp/ucnid-12.c,
	gcc.dg/cpp/ucnid-13.c, gcc.dg/cpp/ucnid-14.c,
	gcc.dg/cpp/ucnid-15.c: New tests.

Added:
    trunk/gcc/testsuite/g++.dg/cpp/ucnid-2.C
    trunk/gcc/testsuite/g++.dg/cpp/ucnid-3.C
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-11.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-12.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-13.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-14.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-15.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/testsuite/ChangeLog
    trunk/libcpp/ChangeLog
    trunk/libcpp/include/cpp-id-data.h
    trunk/libcpp/include/cpplib.h
    trunk/libcpp/internal.h
    trunk/libcpp/lex.c
    trunk/libcpp/macro.c
    trunk/libcpp/traditional.c