From: Zack Weinberg <zack@codesourcery.com> To: Richard.Earnshaw@arm.com Cc: gcc-bugs@gcc.gnu.org, rearnsha@arm.com, sdouglas@arm.com, gcc-gnats@gcc.gnu.org Subject: Re: preprocessor/9449: UCNs recognized in identifiers (c++/c99) Date: Mon, 27 Jan 2003 11:59:14 -0800 Richard Earnshaw <rearnsha@arm.com> writes: > http://www.codesourcery.com/lists?2:mss:1481:danfdfbkjoaahbcmmeam Thanks, that's helpful. > Why have you changed the class to Change Request? Rejects legal is > a far more accurate description of this. Because it's a case of "sorry, this feature is not implemented" and I don't have time to do it anytime soon, nor do I plan to start an implementation until there's agreement on the semantics. zw
(I've marked this as "preprocessor" since that's used for lexing both C99 and C++, but this is probably more complicated than that...)

The following is legal in both C99 and C++, but is rejected by both:

    int x\u0394;

Note that this isn't just an issue of parsing the input. The correct translation to the target symbol is also required (this may be governed by machine conventions and/or object file or linker restrictions).

Release: unknown
Environment: Any
How-To-Repeat: compile the example above with -std=c99 (for C) or the C++ compiler.
Responsible-Changed-From-To: unassigned->zack Responsible-Changed-Why: I'll take responsibility for this, but since the situation is that this feature has not yet been implemented, and there are still open questions about how to do it, I am deprioritizing it.
From: Richard Earnshaw <rearnsha@arm.com> To: zack@gcc.gnu.org, gcc-bugs@gcc.gnu.org, gcc-prs@gcc.gnu.org, nobody@gcc.gnu.org, rearnsha@arm.com, sdouglas@arm.com, gcc-gnats@gcc.gnu.org Cc: Richard.Earnshaw@arm.com Subject: Re: preprocessor/9449: UCNs recognized in identifiers (c++/c99) Date: Mon, 27 Jan 2003 17:09:12 +0000 The following link may give some useful ideas. http://www.codesourcery.com/lists?2:mss:1481:danfdfbkjoaahbcmmeam R. PS. Why have you changed the class to Change Request? Rejects legal is a far more accurate description of this.
Subject: Re: gcc and UCN in identifiers: bug PR 9449 Al Simons <al.simons@hp.com> writes: > Hi, Zack. > > I'm looking into adding UCN support for identifiers into the HP > C/C++ compiler, and wondered if there is any new status on your > implementation / design? We'd like to do things the same way if at > all possible. I don't intend to implement this feature until the C committee, the C++ committee, and the Unicode committee all agree on which Unicode character sequences are legitimate in identifiers and what sort of canonicalization is to be performed. As long as there is no agreement, implementation of this feature risks indeterminacy in shared library ABIs. Suppose that the identifier "get_length_in_Ångstroms" is part of a shared library's public interface. The Å might be U+212B, U+00C5, or U+0041 U+030A. Suppose further that the person who implemented the shared library used a text editor that generates NFD, so the library header reads U+0041 U+030A. But their compiler normalizes to NFC on input, so the name in the shared library's symbol table reads U+00C5. Now someone comes along with a compiler that does no normalization whatsoever and tries to use the library. They're going to get a link error and they're not going to know why. Worse, if someone recompiles the library with a compiler that chose to normalize to NFD, its ABI silently changes. Joseph Myers insists that this situation cannot arise, because C99/C++'s lists of valid Unicode code points in identifiers exclude all combining forms. But if I enforce those rules users will hate the compiler, because their text editors will generate what looks like perfectly fine text and then the compiler will barf on it. And I am not prepared to trust that every editor on the planet will adhere to C99/C++'s rules. And even if I were, we'd still have the problem of the C99 and C++ lists not being identical. > There is a link in the bug report that appears to be broken; any > chance you can hook it back up? 
>
> <http://www.codesourcery.com/lists?2:mss:1481:danfdfbkjoaahbcmmeam>

My best guess is that this is now <http://www.codesourcery.com/archives/cxx-abi-dev/msg00676.html>. This is mostly about how to mangle non-ASCII characters in identifiers to get them past limited linkers, and doesn't offer any help with the problems I described above.

zw
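The three possible spellings of the Å in the shared-library example above can be inspected directly. The following is only a sketch using Python's standard `unicodedata` module; the identifier `get_length_in_Ångstroms` comes from the message, everything else (variable names, output layout) is made up for illustration.

```python
import unicodedata

# Three spellings of what renders as "Å", as listed in the message above.
spellings = {
    "U+212B ANGSTROM SIGN": "\u212b",
    "U+00C5 precomposed": "\u00c5",
    "U+0041 U+030A decomposed": "A\u030a",
}

def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

for label, text in spellings.items():
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    print(f"{label:26} utf8={text.encode('utf-8').hex()} "
          f"NFC={code_points(nfc)} NFD={code_points(nfd)}")

# Symbol names as a non-normalizing compiler would emit them (raw UTF-8),
# versus what a compiler normalizing to NFC would emit.
raw_names = {("get_length_in_" + s + "ngstroms").encode("utf-8")
             for s in spellings.values()}
nfc_names = {unicodedata.normalize("NFC", "get_length_in_" + s + "ngstroms")
             for s in spellings.values()}
```

Without normalization the three spellings yield three distinct symbol names; after NFC they collapse to one, which is the mismatch the message is worried about.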
Subject: Re: UCNs not recognized in identifiers (c++/c99) "zack at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes: | Joseph Myers insists that this situation cannot arise, because | C99/C++'s lists of valid Unicode code points in identifiers exclude | all combining forms. But if I enforce those rules users will hate the | compiler, because their text editors will generate what looks like | perfectly fine text and then the compiler will barf on it. And I am I don't see how that would be different from current situation with bool est_il_ingénieur(const Employé&); The compiler will barf and eventually users will learn feeding the compiler with proper character sets. | not prepared to trust that every editor on the planet will adhere to | C99/C++'s rules. Maybe, but I think that is irrelevant and beside the point. -- Gaby
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Thu, 16 Dec 2004, zack at codesourcery dot com wrote: > Joseph Myers insists that this situation cannot arise, because > C99/C++'s lists of valid Unicode code points in identifiers exclude > all combining forms. But if I enforce those rules users will hate the (That is, that they exclude combining forms for languages where the precomposed forms are made available, so reducing the uniqueness issues. Given that, for example, the definition of NFC has itself since been found to be defective <http://www.unicode.org/review/pr-29.html>, albeit for examples that cannot occur in real languages, this is not a theorem about what might be done with general combinations of the characters listed as valid.) And also that: * The combining rules are not part of what C99 or C++ normatively reference. * Characters looking identical can occur without the combining characters. For this reason - distinguishing U+0041 LATIN CAPITAL LETTER A, U+0391 GREEK CAPITAL LETTER ALPHA, U+0410 CYRILLIC CAPITAL LETTER A, for example - I think compiler diagnostics (and probably linker diagnostics too) should either default to showing \u or \U sequences rather than raw identifiers, or at least have an option so to do. (Previous threads on gcc-patches and gcc, Oct-Nov 2002.) > compiler, because their text editors will generate what looks like > perfectly fine text and then the compiler will barf on it. And I am > not prepared to trust that every editor on the planet will adhere to > C99/C++'s rules. And even if I were, we'd still have the problem of > the C99 and C++ lists not being identical. I do not expect such user complaints simply because I don't expect users to be widely trying to use extended characters (with or without UCNs) in identifiers within the next several years. (Extended characters in strings and comments are another matter, but don't cause such problems.) 
I'd say implement the rules if someone wishes to do so - complete with the previous and following oddities - then try to get things cleaned up for the next major revisions of C and C++.

Oddities:

1. Lexing UCNs in identifiers can require up to nine characters backtracking: a\U000000Cz is three preprocessing tokens {a}{\}{U000000Cz}.

2. (A separate general UCN issue, nothing to do with their use in identifiers so in no way required for implementing them in identifiers.) C++, but not C, converts all extended characters in the source file to UCNs in phase 1, so stringising "$" generates different results in C and C++ <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01523.html>. (Doing this efficiently does mean only making this UTF-8 -> UCN conversion if the file contains extended characters, ideally only if it contains them outside comments.)
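Oddity 1 can be reproduced with a toy tokenizer. This is a rough sketch, not GCC's lexer: it only knows identifiers (ASCII letters, digits, underscore, and well-formed UCNs) and lumps everything else into single-character tokens; `pp_tokens` is a hypothetical name. The regex engine's backtracking plays the role of the nine-character lookahead (`\U` plus seven hex digits) before the UCN is abandoned.

```python
import re

# A UCN is \u plus exactly 4 hex digits, or \U plus exactly 8 hex digits.
UCN = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"

TOKEN = re.compile(
    # identifier: a nondigit (or UCN) followed by nondigits, digits, or UCNs
    rf"(?P<identifier>(?:[A-Za-z_]|{UCN})(?:[A-Za-z0-9_]|{UCN})*)"
    # anything else that is not whitespace becomes a one-character token
    r"|(?P<other>\S)"
)

def pp_tokens(text):
    """Split text into toy preprocessing tokens (identifiers and single chars)."""
    return [m.group(0) for m in TOKEN.finditer(text)]

# Seven hex digits then 'z': the UCN is malformed, so the identifier ends at 'a'.
print(pp_tokens(r"a\U000000Cz"))
# Eight hex digits: the UCN is well formed and stays inside one identifier.
print(pp_tokens(r"a\U000000C5z"))
```

A real preprocessor also has to handle pp-numbers, strings, and the per-standard lists of permitted characters, none of which this sketch attempts.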
Subject: Re: UCNs not recognized in identifiers (c++/c99) Because of the ABI implications, I consider it completely unacceptable to implement this feature according to the letter of C99 or C++98. Ever. Once the "oddities" are resolved at the standards level, I will consider supporting the feature as resolved, but *only* as resolved. zw
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Thu, 16 Dec 2004, zack at codesourcery dot com wrote: > Because of the ABI implications, I consider it completely unacceptable Which ABI implications? (a) It isn't explicitly stated that different UCNs designating the same character are equivalent to each other (and to that character) in identifiers, but I don't think there's any real doubt that they are meant to be equivalent. (b) There is no normalisation, but I'm confident that the answer from WG14 if this is queried would be that the standard is correct and by design it normatively references ISO 10646 (not Unicode) which doesn't include the normalisation definitions of UAX 15 and implementation of the standard is not meant to involve large external tables. If there are cases of ambiguity a -Wnfc option (default on) to warn for identifiers not in NFC (or indeed -Wnfkc, default on, for identifiers not in NFKC) would draw users' attention to doubtful identifiers. (TR 10176 expressly notes the problems of ambiguity of appearance of entirely different characters even without combining characters, says that language standards need not provide for normalisation if they allow combining characters, and excludes most combining characters where precombined characters are available for the specific purpose of avoiding alternate representations of identifiers.) (c) Though we could do what we want with extended characters (as opposed to UCNs) in source files in phase 1, it seems safest to err on the side of rejecting all extended characters that wouldn't be accepted as UCNs, rather than e.g. applying NFC, to avoid giving identifiers with such characters a meaning which might then need to be preserved in future. 
(d) There are genuine ABI issues with how extended characters are represented in object files, but I think those need to be resolved by selecting between UTF-8 and mangling (default UTF-8) based on target configurations rather than on the capabilities of the assembler and linker in use, and by getting an explicit statement about encoding put in the ELF specification.
Subject: Re: UCNs not recognized in identifiers (c++/c99) On examination of the different lists in C and C++, I'd add the use of UCNs accepted in one language only to the cases that should receive a default warning (identifiers not in NFKC receiving such a warning as well). Most extreme, but assuring any case which might have a compatibility problem is warned for, would be also to warn for any character with a canonical or compatibility decomposition and for any character with nonzero combining class. This area is just one of many where C and C++ give programmers more than enough rope to hang themselves. In general we give due warning in such cases then let the programmers go ahead if they really want to.
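The warning policy proposed in the last two messages can be sketched as follows. This is only an illustration in Python with the standard `unicodedata` module; `identifier_warnings` is a hypothetical helper, and the check for characters accepted by only one of C99 and C++ is omitted because it needs the standards' own character lists.

```python
import unicodedata

def identifier_warnings(name):
    """Return the reasons an identifier might deserve a warning.

    Covers the normal-form checks (the proposed -Wnfc / -Wnfkc) and the
    "most extreme" per-character checks; the C-only / C++-only character
    lists are not modelled here.
    """
    warnings = []
    if unicodedata.normalize("NFC", name) != name:
        warnings.append("not in NFC")
    if unicodedata.normalize("NFKC", name) != name:
        warnings.append("not in NFKC")
    for ch in name:
        # Non-empty for canonical and compatibility decompositions alike.
        if unicodedata.decomposition(ch):
            warnings.append(f"U+{ord(ch):04X} has a decomposition")
        if unicodedata.combining(ch) != 0:
            warnings.append(f"U+{ord(ch):04X} has nonzero combining class")
    return warnings

print(identifier_warnings("length"))    # plain ASCII
print(identifier_warnings("\u212b"))    # ANGSTROM SIGN
print(identifier_warnings("\u00c5"))    # A WITH RING ABOVE, already NFC
```

Note how strict the per-character checks are: U+00C5 is already in NFC and NFKC, yet it is still flagged because it has a canonical decomposition.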
Subject: Re: UCNs not recognized in identifiers (c++/c99)

The following example illustrates the problems with lack of normalisation. (I still expect WG14 and WG21 to consider the lack of normalisation to be both the current meaning of the standards and their correct meaning in context, though future revisions might change the exact lists of characters, but this is an appropriate example to present to them and shows why diagnostics would be needed for various cases.)

    \u05e9\u05bc\u05c1
    \u05e9\u05c1\u05bc

are valid identifiers in C99 but not C++ while

    \ufb2c

is a valid identifier in C++ but not in C99. In Unicode, the three are canonically equivalent, the first being both NFC and NFD.

    05BC HEBREW POINT DAGESH OR MAPIQ (combining class 21)
    05C1 HEBREW POINT SHIN DOT (combining class 24)
    05E9 HEBREW LETTER SHIN (combining class 0)
    FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT (combining class 0)

(U+FB2C is excluded from the compositions allowed in NFC, hence the decomposed form being NFC.)

So with current C and C++ standards users cannot portably link some pointed Hebrew identifiers between the two languages; it would be advisable for them to avoid such identifiers. Warning for any use of the characters permitted by C++ but not C seems appropriate in the expectation that such characters will cease to be permitted in future, regardless of any other changes there may be. Making the C++ extern "C" \ufb2c into something else would seem to me to be the road to madness, though we could see how other implementations of the C++ ABI interpret it as regards identifiers with UCNs.
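The claims about the Hebrew example can be checked against the Unicode data shipped with Python. This is just a verification sketch using the standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

# The three spellings from the message above.
forms = ["\u05e9\u05bc\u05c1", "\u05e9\u05c1\u05bc", "\ufb2c"]

# Print the code point table given in the message.
for ch in "\u05bc\u05c1\u05e9\ufb2c":
    print(f"{ord(ch):04X} {unicodedata.name(ch)} "
          f"(combining class {unicodedata.combining(ch)})")

# Canonically equivalent strings have identical NFC and identical NFD forms.
nfc_forms = {unicodedata.normalize("NFC", s) for s in forms}
nfd_forms = {unicodedata.normalize("NFD", s) for s in forms}
print(len(nfc_forms), len(nfd_forms))
```

All three spellings normalize to the first one under both NFC and NFD, while a compiler comparing code points exactly sees three different identifiers.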
Joseph - I never properly answered your question in comment #7, although arguably the answer is already in comment #4.

I should mention I take as a basic premise that without exception, a sequence of UCNs and a sequence of extended-source-character-set characters (which both encode the same sequence of ISO10646 code points) should be treated identically. Therefore, I'm going to talk exclusively about code points below.

The scenario that causes ABI breakage is as follows:

1) A shared library author gives an exported interface function a name containing, for instance, U+212B ANGSTROM SIGN.

2) This is compiled with a compiler that, hewing to the letter of the standard, does not perform any normalization. The shared library's symbol table therefore also contains U+212B. That code point is now part of the library ABI.

3) A program that uses this library is compiled with the same compiler; it expects a symbol containing U+212B.

4) Later, someone recompiles the library with a compiler that applies NFC to all identifiers. The library now exports a symbol containing U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. The program compiled in step 3 breaks.

An obvious rebuttal to this is that the compiler used in step 4 is broken. As you say, the C standard references ISO10646 not Unicode and the concept of normalization does not exist in ISO10646, and this could be taken to imply that no normalization shall occur. However, there is no unambiguous statement to that effect in the standard, and there is strong quality-of-implementation pressure in the opposite direction. Put aside the standard for a moment: are users going to like a compiler that insists that "Å" (U+00C5) and "Å" (U+212B) are not the same character? [It happens that on my screen those are ever so slightly different, but that's just luck - and X11 will only let me type U+00C5; I resorted to hex-editing to get the other.]
Furthermore, I can easily imagine someone writing a Unicode-aware text editor and thinking it's a good idea to convert every file to NFC when saved. Making some unrelated change to the file defining the symbol with U+212B in it, with this editor, would trigger the exact same ABI break that the hypothetical normalizing compiler would. This possibility means that a WG14/21 no-normalization mandate would NOT prevent silent ABI breakage. And the existence of this possibility increases the QoI pressure for a compiler to do normalization, as a defensive measure against such external changes. You could argue that this is just another way for C programmers to shoot themselves in the foot, but I don't think the myriad ways that already exist are a reason to add more. For these reasons I see no safe way to implement extended identifiers except to persuade both WG14 and WG21 to mandate use of UAX#15 annex 7, instead of the existing lists of allowed characters. I'm willing to consider other normalization schemas and sets of allowed characters (as long as C and C++ are consistent with each other) but not plans which don't include normalization. To address the concern about requiring huge tables, perhaps the standards could say that it is implementation-defined whether extended characters are allowed at all.
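The four-step breakage scenario in this comment can be simulated in a few lines. This is a toy model and not how any real toolchain works: `compile_exports` is a hypothetical stand-in for "compile and look at the symbol table", symbols are assumed to be stored as UTF-8, and the function name is invented.

```python
import unicodedata

def compile_exports(source_names, normalize=None):
    """Return the symbol names (as UTF-8 bytes) a hypothetical compiler emits.

    normalize=None models a compiler that follows the letter of the standard;
    normalize="NFC" models one that normalizes every identifier.
    """
    symbols = set()
    for name in source_names:
        if normalize:
            name = unicodedata.normalize(normalize, name)
        symbols.add(name.encode("utf-8"))
    return symbols

# Step 1: the exported name contains U+212B ANGSTROM SIGN.
header_name = "length_in_\u212bngstroms"

library_v1 = compile_exports([header_name])           # step 2: no normalization
program_needs = compile_exports([header_name])        # step 3: same compiler
library_v2 = compile_exports([header_name], "NFC")    # step 4: NFC compiler

unresolved_v1 = program_needs - library_v1
unresolved_v2 = program_needs - library_v2
print("unresolved against v1:", unresolved_v1)
print("unresolved against v2:", unresolved_v2)
```

The same model covers the NFC-on-save editor: rewriting the header to U+00C5 before a non-normalizing compile produces exactly the `library_v2` symbol set.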
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Fri, 7 Jan 2005, zack at gcc dot gnu dot org wrote: > An obvious rebuttal to this is that the compiler used in step 4 is broken. As > you say, the C standard references ISO10646 not Unicode and the concept of > normalization does not exist in ISO10646, and this could be taken to imply that > no normalization shall occur. However, there is no unambiguous statement to > that effect in the standard, and there is strong quality-of-implementation I think the relevant text is that treating identifiers as sequences of characters and UCNs denoting single characters. I've had no on-list response yet to the query about this I sent to the WG14 reflector on Tuesday (reflector message 10698), with the HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT examples. > pressure in the opposite direction. Put aside the standard for a moment: are > users going to like a compiler that insists that "Å" (U+00C5) and "Å" (U+212B) > are not the same character? [It happens that on my screen those are ever so > slightly different, but that's just luck - and X11 will only let me type U+00C5; > I resorted to hex-editing to get the other.] The question of appearance is the same as that for U+0041 LATIN CAPITAL LETTER A, U+0391 GREEK CAPITAL LETTER ALPHA, U+0410 CYRILLIC CAPITAL LETTER A. Will users like such a compiler less than one which doesn't allow them to use their native language in identifiers at all? > normalization, as a defensive measure against such external changes. > You could argue that this is just another way for C programmers to shoot > themselves in the foot, but I don't think the myriad ways that already > exist are a reason to add more. (It's WG14 and WG21 that added the new way, not us. And it may be that if they are to become convinced there is any mistake then they must see real world problems arising with real implementations of the existing standards, rather than hypothetical problems. 
Mistakes were made in C99 of adding features in general without adequate implementation experience; changing them without experience showing what is a genuine problem could be seen as another such mistake to avoid.) I could believe there could be a case for -fextended-identifiers required to enable UCNs in identifiers until there is more experience, with documentation along the lines of that formerly associated with -pedantic "This option is not intended to be useful; ...".
Subject: Re: UCNs not recognized in identifiers (c++/c99)

"joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| I've had no on-list response yet to the query about this I sent to the
| WG14 reflector on Tuesday (reflector message 10698), with the HEBREW
| LETTER SHIN WITH DAGESH AND SHIN DOT examples.

Since this issue contains a compatibility fragment and affects both C and C++, it occurs to me that you should resend a copy of your message to the C/C++ compatibility reflector (reaching both WG14 and WG21). I highly encourage you to do that. The address is c++std-compat at accu dot org. It would be wrong to let the issue be debated by WG14 only without WG21 knowing.

-- Gaby
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Fri, 7 Jan 2005, gdr at integrable-solutions dot net wrote: > > ------- Additional Comments From gdr at integrable-solutions dot net 2005-01-07 14:27 ------- > Subject: Re: UCNs not recognized in identifiers (c++/c99) > > "joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes: > > | I've had no on-list response yet to the query about this I sent to the > | WG14 reflector on Tuesday (reflector message 10698), with the HEBREW > | LETTER SHIN WITH DAGESH AND SHIN DOT examples. > > Since this issue contains a compatibility fragment and affects both C > and C++, it occurs to me that you should resend a copy of your message > to the C/C++ compatibility reflector (reaching both WG14 and WG21). I > highly encourage you to do that. The address is c++std-compat at accu > dot org. It would be wrong to let the issue debated by WG14 only > without WG21 knowing. I've now sent it to c++std-compat (having checked that the C++ list of characters also includes combining characters in more than one combining class so the same issues can arise there at least in principle, whether or not they can arise with realistic natural language identifiers).
Subject: Re: UCNs not recognized in identifiers (c++/c99) "joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes: | I've now sent it to c++std-compat (having checked that the C++ list of | characters also includes combining characters in more than one combining | class so the same issues can arise there at least in principle, whether or | not they can arise with realistic natural language identifiers). Thanks a lot! -- Gaby
So, just to be clear on this, the translation unit:

    const char * \u00c5 = "a-ring";
    float \u212b = 1e-10;

1. Is a valid translation unit in C99?
2. Invokes undefined behaviour?
3. Requires a diagnostic?

Logically it can only be one of the three. I think the standard is pretty clear that it's (1); 6.4.2.1 paragraph 6, "Any identifiers that differ in a significant character are different identifiers."

The standard therefore prohibits a compiler converting Unicode sequences specified with \u to NFC (or any other normal form).
Subject: Re: UCNs not recognized in identifiers (c++/c99) Doug Gwyn has now said It was certainly the original intent of C99 that identifiers would match only if encoded identically. It would probably be wise for any importing process to apply "canonicalization" to source code before it reaches the compiler. and Henry Spencer has said The approach I ended up using in a non-C project was to say that (a) all occurrences of an identifier must be encoded identically, and (b) it is forbidden for two different identifiers to have the same normalized form (for a suitable definition of normalization).
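Henry Spencer's two rules amount to one check: group the spellings seen in a program by their normalized form and reject any group with more than one distinct spelling. A minimal sketch follows, assuming NFC as the "suitable definition of normalization"; `normalization_clashes` is a hypothetical name.

```python
import unicodedata
from collections import defaultdict

def normalization_clashes(identifiers, form="NFC"):
    """Map each normalized form to its spellings, where there is more than one.

    A non-empty result means either one identifier was spelled two ways
    (rule a) or two distinct identifiers share a normal form (rule b); the
    two cases look the same to a tool that only sees code points.
    """
    groups = defaultdict(set)
    for ident in identifiers:
        groups[unicodedata.normalize(form, ident)].add(ident)
    return {key: sorted(spellings)
            for key, spellings in groups.items()
            if len(spellings) > 1}

# The two identifiers from the translation unit discussed earlier clash.
print(normalization_clashes(["\u00c5", "\u212b", "x"]))
# Repeating one spelling consistently is fine.
print(normalization_clashes(["\u00c5", "\u00c5", "x"]))
```

This keeps the compiler's matching rule exact (identifiers match only if encoded identically, as Doug Gwyn describes) while still diagnosing the confusable cases.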
Subject: Re: UCNs not recognized in identifiers (c++/c99) "joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes: | Subject: Re: UCNs not recognized in identifiers | (c++/c99) | | Doug Gwyn has now said | | It was certainly the original intent of C99 that identifiers would match | only if encoded identically. It would probably be wise for any importing | process to apply "canonicalization" to source code before it reaches the | compiler. | | and Henry Spencer has said | | The approach I ended up using in a non-C project was to say that (a) all | occurrences of an identifier must be encoded identically, and (b) it is | forbidden for two different identifiers to have the same normalized form | (for a suitable definition of normalization). Joseph -- You said you resent your message to c++std-compat. I don't believe it ever appeared on that list. Please, could you double-check? -- Gaby
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Sat, 8 Jan 2005, gdr at integrable-solutions dot net wrote: > Joseph -- > > You said you resent your message to c++std-compat. I don't believe > it ever appeared on that list. Please, could you double-check? I sent it to c++std-compat (and have had no bounce). If it hasn't appeared within the next week then I'll investigate further.
Subject: Re: UCNs not recognized in identifiers (c++/c99)

"joseph at codesourcery dot com" <gcc-bugzilla@gcc.gnu.org> writes:

| ------- Additional Comments From joseph at codesourcery dot com 2005-01-08 05:32 -------
| Subject: Re: UCNs not recognized in identifiers
| (c++/c99)
|
| On Sat, 8 Jan 2005, gdr at integrable-solutions dot net wrote:
|
| > Joseph --
| >
| > You said you resent your message to c++std-compat. I don't believe
| > it ever appeared on that list. Please, could you double-check?
|
| I sent it to c++std-compat (and have had no bounce). If it hasn't
| appeared within the next week then I'll investigate further.

Your message is now available on c++std-compat. Tom Plum kindly forwarded Doug Gwyn's reply.

Thanks!

-- Gaby
The following checklist for implementation of extended identifiers has been discussed with and prioritised by Zack. No doubt Neil will point out if there are any missing technical points.

External specifications
=======================

Reasonable efforts should be made to get specifications of handling of extended identifiers (that UCNs and other non-ASCII characters in identifiers are encoded in UTF-8, at least on platforms using ASCII in the symbol names in the first place) into the following specifications. Actually succeeding in doing so is not a blocker for getting an implementation into GCC.

* ELF: <http://www.thescogroup.com/developers/gabi/latest/ch4.symtab.html>, where it says "External C symbols have the same names in C and object files' symbol tables.". I have attempted to get such wording in, the last version proposed being:

    Unless the operating system ABI specifies otherwise, it is recommended that characters in external C symbols, including characters outside the basic source character set whether or not designated in source files by universal character names, are encoded in UTF-8 in object files' symbol tables.

  and discussions being with ia64-abi@unix-os.sc.intel.com.

* C++ ABI: <http://www.codesourcery.com/cxx-abi/abi.html>. The appropriate form would be to add a statement that once the ABI has constructed a C symbol name which may contain UCNs, such name should be encoded according to the underlying C ABI, following <http://www.codesourcery.com/cxx-abi/cxx-closed.html#F8>.

The following specification already includes all the required text, and GCC should implement it before a release is made supporting extended identifiers:

* DWARF3: the DW_AT_use_UTF8 attribute should be set on the compilation unit entry for each compilation unit with any UTF-8 identifiers (including ones such as structure element names which appear in debug information but not otherwise in external identifiers). It may in fact be harmless to set it unconditionally.
GCC implementation issues
=========================

The following specific issues should be dealt with in the GCC implementation. Everything implemented needs appropriate tests in the testsuite to cover it, for both C and C++.

(a) Probably implemented already; if not, should be done before feature is turned on by default in mainline:

* The precise sets of characters permitted in identifiers in each standard (C99 and C++03) should be followed.

* A UCN is equivalent to the character it denotes. This should be implemented initially for the case of $, but if we start accepting other extended characters then it should be implemented for them as well.

* The \U and \u UCNs for the same character, and UCNs differing in upper or lower case for hex digits, are equivalent.

* The greedy algorithm applies for lexing UCNs: for example, a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and shouldn't get a diagnostic on lexing, presuming macros are defined such that the eventual token sequence is valid).

* The spelling of UCNs is preserved for the # and ## operators.

* UCNs must not be accepted in identifiers or preprocessing numbers in strict C90 mode: what in C99 would be an identifier with a UCN in C90 is multiple preprocessing tokens and if the identifier fragments are defined appropriately as macros this could occur in a valid C90 program.

* I think the only reasonable interpretation of the lexing rules in the context of forbidden characters is that first identifiers are lexed (allowing any UCNs) then bad characters yield an error (rather than stopping the identifier before the bad character and treating it as not a UCN).

* These rules apply to identifiers as preprocessing tokens at any time, including before concatenation. So it is not the case in C99 that splitting an identifier anywhere yields two valid preprocessing tokens: the second half could begin with a UCN for a digit and not be a valid identifier.
  (Invalid identifiers in C99 don't require diagnostics, but I don't think we want to use this laxity.)

(b) Not done and needs to happen before the feature is turned on by default in mainline:

* The GCC testsuite should include a test that the same UCN links between C and an extern "C" C++ identifier.

* There should be a warning by default for all identifiers (as preprocessing tokens at any stage, e.g. including both before and after concatenation) not in NFKC, which may be disabled by -Wno-nfkc.

* Preprocessing numbers can contain UCNs (and extended characters such as $ considered equivalent to them).

(c) Should happen before a release is made containing this feature:

* All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler should be audited to determine what sort of identifier is appropriate in each case. All places where an identifier may appear in a diagnostic must handle extended identifiers appropriately; if the locale cannot handle all characters in the identifier, UCNs need to be used in diagnostic output. The %E diagnostic format could be made to do this, but there are many places using %s / %qs for diagnostics which need fixing.

* Testcases in the GCC testsuite should include all contexts of identifiers such as macro names, external linkage, internal linkage and no linkage. There should be tests for debug information generation for such cases. It would be desirable, though not required if the necessary support isn't already in GDB, to add corresponding tests to the GDB testsuite and make sure extended identifiers can be used with GDB, with both DWARF3 and stabs.

* C99 does not permit UCNs for digits at the start of identifiers, but does permit them elsewhere in identifiers, while C++ does not have such a restriction. The restriction in C99 and its absence in C++ should be tested.
* If platforms with limited assemblers or linkers or debug formats come up, it would be desirable to be able to use names with internal or no linkage containing extended characters on those platforms, with appropriate mangling, even if defining an ABI with mangling for external names is felt inappropriate.

* The C++ requirement that extended source characters (including '$') are translated to UCNs in translation phase 1 needs implementing.
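Two checklist items, the equivalence of `\u` / `\U` / hex-case variants and the UTF-8 encoding of symbols, can be illustrated together. This is a sketch only: `decode_ucns` and `symbol_bytes` are hypothetical names, and no check is made that the code point is one the standards actually permit in identifiers.

```python
import re

# \u with exactly 4 hex digits, or \U with exactly 8 hex digits.
UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_ucns(spelling):
    """Replace each UCN in a source spelling with the character it denotes.

    chr() raises ValueError for values above U+10FFFF; a real implementation
    would diagnose those (and surrogates) instead.
    """
    return UCN.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), spelling)

def symbol_bytes(spelling):
    """UTF-8 bytes that would go in the symbol table under the proposed wording."""
    return decode_ucns(spelling).encode("utf-8")

# Four source spellings of the same identifier: three UCN variants and the
# extended character itself.
variants = [r"x\u00c5", r"x\u00C5", r"x\U000000C5", "x\u00c5"]
for v in variants:
    print(repr(v), "->", symbol_bytes(v).hex())
```

The original spelling still has to be kept alongside the decoded form, since the checklist also requires that `#` and `##` preserve it.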
Subject: Re: UCNs not recognized in identifiers (c++/c99)

On 21/02/2005, at 6:15 AM, jsm28 at gcc dot gnu dot org wrote:

>
> ------- Additional Comments From jsm28 at gcc dot gnu dot org
> 2005-02-21 14:15 -------
> The following checklist for implementation of extended identifiers has
> been discussed with and prioritised by Zack. No doubt Neil will point
> out if there are any missing technical points.

Although I agree that these are all (except the below) nice things to have, I don't think I agree that they are all preconditions to having any part of an implementation. For instance, an implementation that said sorry() when using # on an identifier from a UCN would still be more useful than the complete lack of implementation we have now.

> * These rules apply to identifiers as preprocessing tokens at any
> time, including before concatenation. So it is not the case in C99
> that splitting an identifier anywhere yields two valid preprocessing
> tokens: the second half could begin with a UCN for a digit and not be
> a valid identifier. (Invalid identifiers in C99 don't require
> diagnostics, but I don't think we want to use this laxity.)

The second half would be a pp-number, instead. It is always true that splitting an identifier between characters yields two valid preprocessing tokens.

> * All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
> should be audited to determine what sort of identifier is appropriate
> in each case.

I don't understand this sentence. What different sorts of identifiers are there, and how could they be appropriate or not appropriate?
Subject: Re: UCNs not recognized in identifiers (c++/c99)

On Mon, 21 Feb 2005, geoffk at geoffk dot org wrote:

> > * These rules apply to identifiers as preprocessing tokens at any
> > time, including before concatenation. So it is not the case in C99
> > that splitting an identifier anywhere yields two valid preprocessing
> > tokens: the second half could begin with a UCN for a digit and not be
> > a valid identifier. (Invalid identifiers in C99 don't require
> > diagnostics, but I don't think we want to use this laxity.)
>
> The second half would be a pp-number, instead. It is always true that
> splitting an identifier between characters yields two valid
> preprocessing tokens.

It would not be a pp-number, as a UCN for a digit is still an identifier-nondigit rather than a digit in terms of the syntax and pp-numbers can't start with identifier-nondigits.

> > * All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
> > should be audited to determine what sort of identifier is appropriate
> > in each case.
>
> I don't understand this sentence. What different sorts of identifiers
> are there, and how could they be appropriate or not appropriate?

Identifiers found in input, with input spelling. (Input includes -D and -U options on the command line - in principle the command line should be interpreted in the user's locale by default just like source files.)

UTF-8 (or, I suppose, UTF-EBCDIC) internally encoded identifiers.

Identifiers in mangled form in any case where they are mangled for output.

Identifiers in diagnostics (possibly including cases where bits of a diagnostic get built up with sprintf), which need converting to the user's locale for display or to be displayed using UCNs.

I don't know if collect2 might also need to know something about extended identifiers.

The aim is that every datastructure with an identifier should have the encoding (input, internal, output, diagnostic) well-defined and conversions between these should be handled properly.
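The "diagnostic" encoding described above might look roughly like this: show the identifier as-is when the output encoding can represent it, and fall back to UCNs otherwise. This is a sketch under stated assumptions; `ucn_spelling` and `for_diagnostic` are hypothetical names, and the `encoding` parameter stands in for whatever the user's locale provides.

```python
def ucn_spelling(identifier):
    """Spell every non-ASCII character of an identifier as a \\u or \\U UCN."""
    parts = []
    for ch in identifier:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")
        else:
            parts.append(f"\\U{cp:08X}")
    return "".join(parts)

def for_diagnostic(identifier, encoding="ascii"):
    """Return the identifier unchanged if `encoding` can represent it, else UCNs."""
    try:
        identifier.encode(encoding)
        return identifier
    except UnicodeEncodeError:
        return ucn_spelling(identifier)

# Latin A, Greek Alpha, and Cyrillic A look alike but get distinct spellings.
for name in ["A", "\u0391", "\u0410"]:
    print(ucn_spelling(name))
```

Defaulting to the UCN form, as suggested earlier in the thread, would also make look-alike characters distinguishable in error messages even when the locale could display them.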
Subject: Re: UCNs not recognized in identifiers (c++/c99) "geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes: > Although I agree that these are all (except the below) nice things to > have, I don't think I agree that they are all preconditions to having > any part of an implementation. For instance, an implementation that > said sorry() when using # on an identifier from a UCN would still be > more useful than the complete lack of implementation we have now. In my book, a complete lack of implementation of this particular feature is better than an incomplete one. This is because I see the vast majority of the work required to do a complete implementation as being due-diligence tasks needed to ensure that the feature cannot crash the compiler, cause wrong code generation, or introduce compatibility problems, and as long as someone is going to do all that work, why shouldn't they do the rest of the job as long as they're in there? > The second half would a pp-number, instead. It is always true that > splitting an identifier between characters yields two valid > preprocessing tokens. Joseph has mostly explained this, but I should add that what you get if you split, say, "a\u0660b", between the "a" and the backslash is two identifiers, the second of which's "initial character is a universal character name designating a digit", which violates a shall-clause in a semantics paragraph, and therefore provokes undefined behavior. (C99 6.4.2.1p3.) Standing policy is that all cases which provoke undefined behavior inside the preprocessor, except already-documented GNU extensions, shall produce hard errors. I am tempted to make a partial exception in this case in the interest of better compatibility with C++. Almost all of the UCNs in the "digits" block of C99 annex D are completely excluded from C++98 annex E - so "a\u0660b" for instance is an invalid identifier, and we never get as far as wondering what happens if we split it before the backslash. 
However, the range 0e50-0e59 is in the "Thai" range of C++98/E, but *both* the "Thai" and the "Digits" ranges of C99/D. It would be sensible, IMO, to resolve the error in C99/D by removing 0e50-0e59 from the "Digits" range, thus permitting those characters to begin identifiers in both C and C++. [Note that currently ucnid.tab takes the opposite position.] zw
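The rule being argued over here (C99 6.4.2.1p3: the initial character of an identifier shall not be a UCN designating a digit) can be sketched in C roughly as below. This is only an illustration under stated assumptions: the table is deliberately partial and lists just the two "Digits" ranges mentioned in this thread, and the type and function names are hypothetical, not taken from cpplib or ucnid.tab.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical range record; not the layout cpplib uses. */
struct ucn_range { unsigned long lo, hi; };

/* PARTIAL table: only the C99 Annex D "Digits" ranges named in this
   thread (0660 from the "a\u0660b" example, and 0e50-0e59).  */
static const struct ucn_range digit_ranges[] = {
    { 0x0660, 0x0669 },
    { 0x0E50, 0x0E59 },
};

/* True if CP falls in one of the digit ranges listed above.  */
static bool ucn_designates_digit(unsigned long cp)
{
    size_t i;
    for (i = 0; i < sizeof digit_ranges / sizeof digit_ranges[0]; i++)
        if (cp >= digit_ranges[i].lo && cp <= digit_ranges[i].hi)
            return true;
    return false;
}

/* Assumes CP is already known to be permitted somewhere in an
   identifier; only the "not a digit in first position" rule is
   checked here.  */
static bool ucn_may_begin_identifier(unsigned long cp)
{
    return !ucn_designates_digit(cp);
}
```

Under the proposal in the message above, the 0e50-0e59 entry would simply be dropped from the table; under the position taken later in the thread, it stays.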
Subject: Re: UCNs not recognized in identifiers (c++/c99) On 21/02/2005, at 11:47 AM, joseph at codesourcery dot com wrote: > > ------- Additional Comments From joseph at codesourcery dot com > 2005-02-21 19:47 ------- > Subject: Re: UCNs not recognized in identifiers > (c++/c99) > > On Mon, 21 Feb 2005, geoffk at geoffk dot org wrote: > >>> * These rules apply to identifiers as preprocessing tokens at any >>> time, including before concatenation. So it is not the case in C99 >>> that splitting an identifier anywhere yields two valid preprocessing >>> tokens: the second half could begin with a UCN for a digit and not be >>> a valid identifier. (Invalid identifiers in C99 don't require >>> diagnostics, but I don't think we want to use this laxity.) >> >> The second half would a pp-number, instead. It is always true that >> splitting an identifier between characters yields two valid >> preprocessing tokens. > > It would not be a pp-number, as a UCN for a digit is still an > identifier-nondigit rather than a digit in terms of the syntax and > pp-numbers can't start with identifiers-nondigits. That's a defect in the standard, the tail of an identifier is supposed to be either an identifier or a pp-number, that's why pp-number exists. >>> * All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler >>> should be audited to determine what sort of identifier is appropriate >>> in each case. >> >> I don't understand this sentence. What different sorts of identifiers >> are there, and how could they be appropriate or not appropriate? > > Identifiers found in input, with input spelling. (Input includes -D > and > -U options on the command line - in principle the command line should > be > interpreted in the user's locale by default just like source files.) > > UTF-8 (or, I suppose, UTF-EBCDIC) internally encoded identifiers. > > Identifiers in mangled form in any case where they are mangled for > output. 
> > Identifiers in diagnostics (possibly including cases where bits of a > diagnostic get built up with sprintf), which need converting to the > user's > locale for display or to be displayed using UCNs. > > I don't know if collect2 might also need to know something about > extended > identifiers. > > The aim is that every datastructure with an identifier should have the > encoding (input, internal, output, diagnostic) well-defined and > conversions between these should be handled properly. My suggestion is that this can be simplified as follows: - a CPP token is in the input form. An identifier outside cpp is in 'internal' form. - DECL_ASSEMBLER_NAME is in 'output' form. - The 'diagnostic' form is created from the 'internal' form based solely on the locale, at the time that a diagnostic is printed.
Created attachment 8244 [details] smime.p7s
Subject: Re: UCNs not recognized in identifiers (c++/c99) "geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes: >>> The second half would a pp-number, instead. It is always true that >>> splitting an identifier between characters yields two valid >>> preprocessing tokens. >> >> It would not be a pp-number, as a UCN for a digit is still an >> identifier-nondigit rather than a digit in terms of the syntax and >> pp-numbers can't start with identifiers-nondigits. > > That's a defect in the standard, the tail of an identifier is supposed > to be either an identifier or a pp-number, that's why pp-number exists. Arguably yes. *shrug* You perhaps begin to see why I did not want this feature implemented? Or at least why I want it done with great caution and consideration of all these corner cases? Does your opinion of this particular corner case change in view of C++ not permitting most of the "digit" UCNs in identifiers at all? zw
Subject: Re: UCNs not recognized in identifiers (c++/c99) On 21/02/2005, at 12:15 PM, zack at codesourcery dot com wrote: > > ------- Additional Comments From zack at codesourcery dot com > 2005-02-21 20:14 ------- > Subject: Re: UCNs not recognized in identifiers > (c++/c99) > > "geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes: > >> Although I agree that these are all (except the below) nice things to >> have, I don't think I agree that they are all preconditions to having >> any part of an implementation. For instance, an implementation that >> said sorry() when using # on an identifier from a UCN would still be >> more useful than the complete lack of implementation we have now. > > In my book, a complete lack of implementation of this particular > feature is better than an incomplete one. This is because I see the > vast majority of the work required to do a complete implementation as > being due-diligence tasks needed to ensure that the feature cannot > crash the compiler, cause wrong code generation, or introduce > compatibility problems, and as long as someone is going to do all that > work, why shouldn't they do the rest of the job as long as they're in > there? I think we are just going to have to agree to disagree on this. I don't think your approach will lead to the best possible GCC. >> The second half would a pp-number, instead. It is always true that >> splitting an identifier between characters yields two valid >> preprocessing tokens. > > Joseph has mostly explained this, but I should add that what you get > if you split, say, "a\u0660b", between the "a" and the backslash is > two identifiers, the second of which's "initial character is a > universal character name designating a digit", which violates a > shall-clause in a semantics paragraph, and therefore provokes > undefined behavior. (C99 6.4.2.1p3.) A shall-clause in a semantics paragraph requires a diagnostic, C99 5.1.1.3. 
> Standing policy is that all cases which provoke undefined behavior > inside the preprocessor, except already-documented GNU extensions, > shall produce hard errors. I am tempted to make a partial exception > in this case in the interest of better compatibility with C++. Almost > all of the UCNs in the "digits" block of C99 annex D are completely > excluded from C++98 annex E - so "a\u0660b" for instance is an invalid > identifier, and we never get as far as wondering what happens if we > split it before the backslash. However, the range 0e50-0e59 is in the > "Thai" range of C++98/E, but *both* the "Thai" and the "Digits" ranges > of C99/D. It would be sensible, IMO, to resolve the error in C99/D by > removing 0e50-0e59 from the "Digits" range, thus permitting those > characters to begin identifiers in both C and C++. [Note that > currently ucnid.tab takes the opposite position.] This would make the compiler non-conforming.
Created attachment 8245 [details] smime.p7s
Subject: Re: UCNs not recognized in identifiers (c++/c99) "geoffk at geoffk dot org" <gcc-bugzilla@gcc.gnu.org> writes: >>> The second half would a pp-number, instead. It is always true that >>> splitting an identifier between characters yields two valid >>> preprocessing tokens. >> >> Joseph has mostly explained this, but I should add that what you get >> if you split, say, "a\u0660b", between the "a" and the backslash is >> two identifiers, the second of which's "initial character is a >> universal character name designating a digit", which violates a >> shall-clause in a semantics paragraph, and therefore provokes >> undefined behavior. (C99 6.4.2.1p3.) > > A shall-clause in a semantics paragraph requires a diagnostic, C99 > 5.1.1.3. Um, no, 5.1.1.3 does not say that. It says a diagnostic is required for a violation of any "syntax rule or constraint"; shall-clauses in semantics paragraphs are neither. Constraints only appear in constraints paragraphs. See 4p2 for the meaning of shall-clauses outside constraints paragraphs. zw
Subject: Re: UCNs not recognized in identifiers (c++/c99) jsm28 at gcc dot gnu dot org wrote:- > * The greedy algorithm applies for lexing UCNs: for example, > a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and > shouldn't get a diagnostic on lexing, presuming macros are defined > such that the eventual token sequence is valid). I'm not sure I agree with this: it would seem to be unnecessary extra work; further I suspect the user would benefit from it being pointed out he entered an ill-formed UCN rather than something random from the front end complaining about an unexpected backslash. The only case where you wouldn't get a syntax error from the front end, or an invalid escape in a literal, is with -E. I'm not sure lexing to the letter of the standard is worthwhile in this case, as the standard doesn't discuss -E. If you have an example where a compiled program is acceptable with multiple lexing tokens then I would agree with you. > * The spelling of UCNs is preserved for the # and ## operators. This is very hard with CPP's current implementation - it assumes it can deduce the spelling of an identifier from its hash table entry. IMO the proper way to fix this to use a different approach entirely, rather than kludge it in the existing implementation (which would bloat some common datastructures) but that's some work. > * I think the only reasonable interpretation of the lexing rules in > the context of forbidden characters is that first identifiers are > lexed (allowing any UCNs) then bad characters yield an error (rather > than stopping the identifier before the bad character and treating it > as not a UCN). Agreed - as I say above I don't see why this shouldn't apply for partial UCNs too, even with -E. The rest seems reasonable. Neil.
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Mon, 21 Feb 2005, zack at codesourcery dot com wrote: > Standing policy is that all cases which provoke undefined behavior > inside the preprocessor, except already-documented GNU extensions, > shall produce hard errors. I am tempted to make a partial exception Which policy (cf. bug 14634) I agree with. However, I don't think there should be any exception made. The standards (C99 and C++03) are implementable as-is. They have oddities; some of these may be suitable for submission as DRs, and if the committees fix them in a TC rather than a major new standard revision then we no longer need implement those oddities, but for now the standard says what it says. The headings in C99 Annex D are, except for "Digits", irrelevant to the normative requirements; anything in "Digits" is a UCN for a digit, whether or not it appears elsewhere. (C++03 corrected the typo in C++98 which was noted in C++ DR 131.) The C++ standard's heading "CJK Unified Ideographs" lists ranges which also include various presentation forms such as one of the Hebrew characters previously discussed, but these are genuine ranges of letters clearly deliberately included; just the heading is wrong.
Subject: Re: UCNs not recognized in identifiers (c++/c99)

On Mon, 21 Feb 2005, neil at daikokuya dot co dot uk wrote:

> jsm28 at gcc dot gnu dot org wrote:-
>
> > * The greedy algorithm applies for lexing UCNs: for example,
> > a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
> > shouldn't get a diagnostic on lexing, presuming macros are defined
> > such that the eventual token sequence is valid).
>
> I'm not sure I agree with this: it would seem to be unnecessary
> extra work; further I suspect the user would benefit from it being
> pointed out he entered an ill-formed UCN rather than something random
> from the front end complaining about an unexpected backslash.
>
> The only case where you wouldn't get a syntax error from the
> front end, or an invalid escape in a literal, is with -E. I'm
> not sure lexing to the letter of the standard is worthwhile in
> this case, as the standard doesn't discuss -E.
>
> If you have an example where a compiled program is acceptable
> with multiple lexing tokens then I would agree with you.

#define a b(
#define b(x) q
int a\U0000000z );

Greedy lexing is the standard as applied for other token types. I don't think a difference here makes sense. _cpp_valid_ucn would need changing so it doesn't give an error for incomplete UCNs in identifiers but instead returns quietly.
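The "returns quietly" behaviour asked for in the message above could look something like the following sketch. It is not the real _cpp_valid_ucn; match_ucn is a hypothetical helper that only checks the shape of a UCN (\u plus four hex digits, or \U plus eight) and reports a non-match by returning 0 rather than issuing a diagnostic, so that the lexer can end the identifier there and lex the backslash as a separate token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not cpplib's _cpp_valid_ucn).  Returns the
   number of characters making up a complete UCN at the start of S,
   or 0 if S does not start with a complete UCN.  No diagnostic is
   issued in the 0 case.  Range checks on the value are out of scope
   for this sketch.  */
static size_t match_ucn(const char *s)
{
    size_t want, i;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        want = 4;
    else if (s[1] == 'U')
        want = 8;
    else
        return 0;

    /* A terminating NUL is not a hex digit, so this never reads past
       the end of the string.  */
    for (i = 0; i < want; i++)
        if (!isxdigit((unsigned char) s[2 + i]))
            return 0;

    return 2 + want;
}
```

With this shape, the text `\U0000000z` from the example is simply not a UCN, so `a` ends the identifier and the remaining characters are lexed as further tokens.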
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Mon, 21 Feb 2005, geoffk at geoffk dot org wrote: > My suggestion is that this can be simplified as follows: > > - a CPP token is in the input form. An identifier outside cpp is in > 'internal' form. > - DECL_ASSEMBLER_NAME is in 'output' form. > - The 'diagnostic' form is created from the 'internal' form based > solely on the locale, at the time that a diagnostic is printed. Fine. Now, at present the conversions between these forms are trivial. So the audit required is of everywhere there is an assignment / copy / input / output between different forms to ensure that the appropriate conversions are applied instead of a straight copy as at present. For example, all the places printing IDENTIFIER_POINTER (id) with %qs become no longer valid, as IDENTIFIER_POINTER is in the internal form and %qs simply prints a string; %E may print an identifier as such, converting to the output form, but everywhere using %qs or some other output notation other than %E on an identifier needs checking and fixing.
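As an illustration of one such conversion, here is a minimal sketch of turning an identifier in the internal (UTF-8) form into an ASCII-only spelling that uses UCNs, as might be wanted for a diagnostic when the locale cannot show the characters. The function name and interface are hypothetical and are not GCC's; the sketch assumes well-formed UTF-8 input and does no locale handling at all.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write an ASCII rendering of the UTF-8
   identifier ID into OUT (size OUTSZ), spelling each non-ASCII
   character as \uXXXX or \UXXXXXXXX.  Returns the length written,
   or 0 on malformed input or insufficient space.  */
static size_t utf8_id_to_ucns(const char *id, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *) id;
    size_t used = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    while (*p) {
        unsigned long cp;
        int extra, n;

        /* Decode the lead byte.  */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;
        p++;

        /* Fold in the continuation bytes.  */
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p++ & 0x3F);
        }

        if (cp < 0x80)
            n = snprintf(out + used, outsz - used, "%c", (int) cp);
        else if (cp <= 0xFFFF)
            n = snprintf(out + used, outsz - used, "\\u%04lx", cp);
        else
            n = snprintf(out + used, outsz - used, "\\U%08lx", cp);

        if (n < 0 || (size_t) n >= outsz - used)
            return 0;
        used += (size_t) n;
    }
    return used;
}
```

The point of the audit described above is that a plain string copy of the internal form would skip exactly this kind of step.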
Subject: Re: UCNs not recognized in identifiers (c++/c99) On 21/02/2005, at 6:13 PM, joseph at codesourcery dot com wrote: > The standards (C99 and C++03) are implementable as-is. They have > oddities; some of these may be suitable for submission as DRs, and if > the committees fix them in a TC rather than a major new standard > revision then we no longer need implement those oddities, but for now > the standard says what it says. I agree with this point. We should implement the standard first, and then see if any parts of it are particularly troublesome for actual use.
Created attachment 8253 [details] smime.p7s
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Tue, 22 Feb 2005, geoffk at geoffk dot org wrote: > > The standards (C99 and C++03) are implementable as-is. They have > > oddities; some of these may be suitable for submission as DRs, and if > > the committees fix them in a TC rather than a major new standard > > revision then we no longer need implement those oddities, but for now > > the standard says what it says. > > I agree with this point. We should implement the standard first, and > then see if any parts of it are particularly troublesome for actual > use. Which is a key reason why a long list of every technical point we could think of is on this checklist: if there are as Zack suggests going to be serious ABI problems with this feature in the long run, evidence of problems can only be provided to WG14 and WG21 on the basis of real experience with implementations that attempt to do a good job of implementing the current standard requirements, not on the basis of bad or partial implementations or implementations not implementing some particular requirement because of an advance decision that you don't like that bit of the standard or don't think it important. > ------- Additional Comments From geoffk at geoffk dot org 2005-02-22 09:23 ------- > Created an attachment (id=8253) > --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8253&action=view) All your messages to this bug appear to be creating an attachment for some reason, and none of them seem to be appearing on gcc-bugs.
Another reason why spelling needs preserving (in addition to implementing # correctly) is for the constraints on duplicate macro definitions.

#define foo \u00c1
#define foo \u00C1

is invalid (different spelling in replacement), as is

#define bar(\u00c1)
#define bar(\u00C1)

(different spelling of parameter names). However,

#define \u00c1 foo
#define \u00C1 foo

is valid, since the spelling of the macro *name* doesn't need to be the same.

It is true that we don't get the constraints on duplicate macro definitions right in all cases at present (bug 20078), but since spelling of identifiers needs preserving anyway for the # operator this seems no reason not to get this case right (with testcases, of course).
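A rough sketch of the distinction drawn above: the macro name is compared in canonical form, while the replacement list is compared by its spelling as written. The struct and function here are hypothetical and heavily simplified (no parameters, no whitespace normalisation), so they should be read as an illustration of the rule, not of cpplib.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified macro record.  */
struct macro_def {
    const char *name_canonical;       /* macro name, UCNs converted to UTF-8 */
    const char *replacement_spelling; /* replacement list exactly as written */
};

/* True if NEW_DEF is an acceptable duplicate of OLD_DEF: same macro
   (by canonical name) and identically spelt replacement list.  */
static bool same_definition(const struct macro_def *old_def,
                            const struct macro_def *new_def)
{
    return strcmp(old_def->name_canonical, new_def->name_canonical) == 0
        && strcmp(old_def->replacement_spelling,
                  new_def->replacement_spelling) == 0;
}
```

With this rule, `#define foo \u00c1` followed by `#define foo \u00C1` is rejected, whereas `#define \u00c1 foo` followed by `#define \u00C1 foo` is accepted because both names canonicalise to the same UTF-8 sequence.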
Unassigning from Zack since he is now gone from GCC development.
(In reply to comment #39) > Another reason why spelling needs preserving (in addition to implementing # > correctly) is for the constraints on duplicate macro definitions. > > #define foo \u00c1 > #define foo \u00C1 > > is invalid (different spelling in replacement), as is We discussed this on the list and decided that this was probably a defect in the C standard, since the Rationale says that the kind of implementation we have now is supposed to be permitted, and jsm said he'd file a DR. How's that going?
Subject: Re: UCNs not recognized in identifiers (c++/c99) geoffk at gcc dot gnu dot org wrote:- > > ------- Additional Comments From geoffk at gcc dot gnu dot org 2005-09-15 22:34 ------- > (In reply to comment #39) > > Another reason why spelling needs preserving (in addition to implementing # > > correctly) is for the constraints on duplicate macro definitions. > > > > #define foo \u00c1 > > #define foo \u00C1 > > > > is invalid (different spelling in replacement), as is > > We discussed this on the list and decided that this was probably a defect in the C standard, since the > Rationale says that the kind of implementation we have now is supposed to be permitted, and jsm said > he'd file a DR. How's that going? I very much doubt this is a defect. Just because it doesn't fit your implementation... Neil.
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Thu, 15 Sep 2005, geoffk at gcc dot gnu dot org wrote: > ------- Additional Comments From geoffk at gcc dot gnu dot org 2005-09-15 22:34 ------- > (In reply to comment #39) > > Another reason why spelling needs preserving (in addition to implementing # > > correctly) is for the constraints on duplicate macro definitions. > > > > #define foo \u00c1 > > #define foo \u00C1 > > > > is invalid (different spelling in replacement), as is > > We discussed this on the list and decided that this was probably a defect in the C standard, since the > Rationale says that the kind of implementation we have now is supposed to be permitted, and jsm said > he'd file a DR. How's that going? I don't believe I said I'd file a DR unless I saw a defect. There is no defect because models A or C need to be implemented by an implementation-defined mapping (documented as such; we don't even document the removal of trailing whitespace from lines; of course anything replacing UCNs with the characters they designate only in certain places is a pain to document because it doesn't fit in with the C model of phases of translation). Doug Gwyn's reading in reflector message 10751, Yes, "spelling" is meant in terms of the source code characters. The idea is to permit simple strcmp-like checking by the preprocessor. seems fine to me - implementations permitting the above in the input source must end up with the source looking different from the above after phase 1.
Subject: Re: UCNs not recognized in identifiers (c++/c99) joseph at codesourcery dot com wrote:- > I don't believe I said I'd file a DR unless I saw a defect. There is no > defect because models A or C need to be implemented by an > implementation-defined mapping (documented as such; we don't even document > the removal of trailing whitespace from lines; of course anything > replacing UCNs with the characters they designate only in certain places > is a pain to document because it doesn't fit in with the C model of phases > of translation). Doug Gwyn's reading in reflector message 10751, > > Yes, "spelling" is meant in terms of the source code characters. > The idea is to permit simple strcmp-like checking by the preprocessor. I think this is what we will need to do to fix the # and ## and spacing bugs in macro replacements too - base the decision upon a memcmp or strcmp. Neil.
Subject: Re: UCNs not recognized in identifiers (c++/c99) On Thu, 15 Sep 2005, neil at daikokuya dot co dot uk wrote: > > Yes, "spelling" is meant in terms of the source code characters. > > The idea is to permit simple strcmp-like checking by the preprocessor. > > I think this is what we will need to do to fix the # and ## and spacing > bugs in macro replacements too - base the decision upon a memcmp or > strcmp. Note that comparing macro replacements by strings means you can no longer fake a version of UCN model C (don't really rewrite UCNs in phase 1 but convert identifier spellings to UTF-8 on lexing identifiers) as now, because then the conversion to canonical form would be visible in the results of stringising them but differently spelt macro definitions would still show up as different. You'd need either to convert identifiers before producing the string form of macro replacements, or (my preference) work out how to preserve different spellings of preprocessing tokens representing the same identifier (so as to get the results of stringising right). (Comparing with strings may still be useful in order to fix the other bugs you mention.)
Subject: Re: UCNs not recognized in identifiers (c++/c99)

On 15/09/2005, at 3:53 PM, joseph at codesourcery dot com wrote:

> Yes, "spelling" is meant in terms of the source code characters.
> The idea is to permit simple strcmp-like checking by the
> preprocessor.

Good, so that answers that question.

You raise a good point about GCC not having documentation for phase 1. I don't have time to write all of it, but I think I can write the last part, about UCNs, so maybe together we can get it all done. My proposed wording is:

@cite{The mapping between physical source file multibyte characters and the source character set in translation phase 1 (C90 and C99 5.1.1.2).}

[CR/NL/CR-NL are turned into EOL markers, spaces are deleted between backslash and the end of a line, it's converted to UTF-8 using iconv based on -finput-charset---and what else?]

Then, any character sequence which would form a UCN in an identifier in phase 3 of translation is converted into the corresponding UTF-8 sequence. Any backslash-newline combinations in the UCN are preserved and placed after the UTF-8 sequence.

[note that there's no way for a user to tell whether a backslash-newline combination is placed before, in the middle of, or after, the UTF-8 sequence.]

...

@cite{Which additional multibyte characters may appear in identifiers and their correspondence to universal character names (C99 6.4.2).}

UTF-8 character sequences may appear in identifiers, and they correspond to the UCN that specifies that character. A UTF-8 sequence may appear only if the UCN that it corresponds to would be permitted in the identifier at that point. At present, only those UTF-8 sequences which were produced by the mapping from UCNs to UTF-8 sequences in translation phase 1 are permitted, but this is likely to change in the future.
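For concreteness, the "corresponding UTF-8 sequence" for a UCN's value can be computed as in the sketch below. This is the standard UTF-8 encoding, written as a stand-alone hypothetical helper; it is not the function GCC uses, and it does not check whether the character is permitted in identifiers.

```c
#include <stddef.h>

/* Hypothetical helper: encode code point CP as UTF-8 into OUT.
   Returns the number of bytes written (1-4), or 0 if CP is a
   surrogate or lies beyond U+10FFFF.  */
static size_t encode_utf8(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Both `\u00c1` and `\u00C1` denote the value 0xC1 and so produce the same two bytes, which is why the original spelling is lost once this conversion has been applied.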
Author: jsm28
Date: Wed Nov 5 16:19:10 2014
New Revision: 217144

URL: https://gcc.gnu.org/viewcvs?rev=217144&root=gcc&view=rev
Log:
Enable -fextended-identifiers by default.

As proposed at <https://gcc.gnu.org/ml/gcc/2014-11/msg00014.html>, this patch enables -fextended-identifiers by default for all standard versions including this feature (all C++ versions, C99 and above for C, but not C90 / C94 / gnu89 / preprocessing assembler). It adds a couple of tests for areas where I previously noted testsuite coverage for extended identifiers was lacking, removes -fextended-identifiers from existing tests, adds -g to various such tests to verify that extended identifiers don't break debug info generation and removes the test that was only there to verify that the feature was off by default.

The current state of the feature may not correspond exactly to any particular checklist from 2004/5 (see bug 9449) of what was wanted before enabling the feature by default, but I don't think it's any worse than plenty of other features supported by default before every corner case is fully functional, and think problems can readily be fixed incrementally.

The following aspects of extended identifiers could still do with more work (and should be straightforward):

* C -aux-info (output should use UCNs).

* ObjC -gen-decls (output should use UCNs; associated diagnostics from the ObjC front end should use extended characters or UCNs as appropriate to the locale, via using %qE or identifier_to_locale).

* Use DW_AT_use_UTF8 in DWARF-3 debug info for compilation units built with extended identifiers enabled (or unconditionally).

* cpplib diagnostics (outputting characters or UCNs as appropriate depending on the locale, as done for identifiers in non-cpplib diagnostics).

* C++ test for UCN linking with C and extern "C".

* Check GDB support / file issues for support if needed.

* Actual UTF-8 in identifiers (?). (Be careful about not affecting performance for the normal fast path of lexing identifiers, if possible.)

The following may be trickier:

* cpplib spelling preservation (required to diagnose macro redefinition with different spellings of the same identifier in the definition or argument names; different spellings of the name of the macro itself are OK, however; also required for correct handling of multiple stringizing in C++); correct output for -d (UCNs), DWARF debug info for macros (UCNs), PCH and PCH tests. (Spelling preservation is the issue that needs fixing to remove references to corner cases in the documentation of -std=c99 and -std=c11 and in c99status.html.) The idea would be to add a second pointer to cpp_identifier that stores the original spelling (whether for extended identifiers only, or for all identifiers); this does not enlarge cpp_token because the resulting larger cpp_identifier structure is no bigger than cpp_string.

* C++ translation of extended characters (including $@` and various control characters) to UCNs in phase 1 (note diagnostics thus needed, but not for C++11, for control characters in strings / character constants as those UCNs invalid); a likely implementation approach is to do translation when identifiers / strings / character constants are lexed, together with errors for stray $@` / control characters in program as not being valid UCNs in identifiers ($ only if not accepted in identifiers); note that this translation should not take place inside raw string literals.

Bootstrapped with no regressions on x86_64-unknown-linux-gnu.

libcpp:
	PR preprocessor/9449
	* init.c (lang_defaults): Enable extended identifiers for C++ and C99-based standards.

gcc:
	PR preprocessor/9449
	* doc/cpp.texi (Character sets, Tokenization) (Implementation-defined behavior): Don't refer to UCNs in identifiers requiring -fextended-identifiers.
	* doc/cppopts.texi (-fextended-identifiers): Document as enabled by default for C99 and later and C++.
	* doc/invoke.texi (-std=c99, -std=c11): Don't refer to extended identifiers needing -fextended-identifiers.

gcc/testsuite:
	PR preprocessor/9449
	* lib/target-supports.exp (check_effective_target_ucn_nocache): Don't use -fextended-identifiers.
	* c-c++-common/cpp/normalize-3.c, c-c++-common/cpp/ucnid-2011-1.c, g++.dg/cpp/ucn-1.C, g++.dg/cpp/ucnid-1.C, g++.dg/other/ucnid-1.C, gcc.dg/cpp/normalize-1.c, gcc.dg/cpp/normalize-2.c, gcc.dg/cpp/normalize-4.c: Don't use -fextended-identifiers.
	* gcc.dg/cpp/ucnid-1.c: Don't use -fextended-identifiers. Use -g3.
	* gcc.dg/cpp/ucnid-10.c, gcc.dg/cpp/ucnid-2.c, gcc.dg/cpp/ucnid-3.c, gcc.dg/cpp/ucnid-4.c, gcc.dg/cpp/ucnid-5.c, gcc.dg/cpp/ucnid-7.c, gcc.dg/cpp/ucnid-9.c, gcc.dg/cpp/warn-normalized-1.c, gcc.dg/cpp/warn-normalized-2.c, gcc.dg/cpp/warn-normalized-3.c: Don't use -fextended-identifiers.
	* gcc.dg/ucnid-1.c, gcc.dg/ucnid-2.c, gcc.dg/ucnid-3.c, gcc.dg/ucnid-4.c, gcc.dg/ucnid-5.c, gcc.dg/ucnid-6.c: Don't use -fextended-identifiers. Use -g.
	* gcc.dg/ucnid-7.c, gcc.dg/ucnid-8.c: Don't use -fextended-identifiers.
	* gcc.dg/ucnid-9.c: Don't use -fextended-identifiers. Use -g.
	* gcc.dg/ucnid-10.c: Don't use -fextended-identifiers.
	* gcc.dg/ucnid-11.c, gcc.dg/ucnid-12.c: Don't use -fextended-identifiers. Use -g.
	* gcc.dg/ucnid-13.c: Don't use -fextended-identifiers.
	* gcc.dg/cpp/ucnid-8.c: Remove test.
	* gcc.dg/cpp/ucnid-10.c, gcc.dg/ucnid-14.c: New tests.
Added: trunk/gcc/testsuite/gcc.dg/cpp/ucnid-10.c trunk/gcc/testsuite/gcc.dg/ucnid-14.c Removed: trunk/gcc/testsuite/gcc.dg/cpp/ucnid-8.c Modified: trunk/gcc/ChangeLog trunk/gcc/doc/cpp.texi trunk/gcc/doc/cppopts.texi trunk/gcc/doc/invoke.texi trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/c-c++-common/cpp/normalize-3.c trunk/gcc/testsuite/c-c++-common/cpp/ucnid-2011-1.c trunk/gcc/testsuite/g++.dg/cpp/ucn-1.C trunk/gcc/testsuite/g++.dg/cpp/ucnid-1.C trunk/gcc/testsuite/g++.dg/other/ucnid-1.C trunk/gcc/testsuite/gcc.dg/cpp/normalize-1.c trunk/gcc/testsuite/gcc.dg/cpp/normalize-2.c trunk/gcc/testsuite/gcc.dg/cpp/normalize-4.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-1.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-2.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-3.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-4.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-5.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-7.c trunk/gcc/testsuite/gcc.dg/cpp/ucnid-9.c trunk/gcc/testsuite/gcc.dg/cpp/warn-normalized-1.c trunk/gcc/testsuite/gcc.dg/cpp/warn-normalized-2.c trunk/gcc/testsuite/gcc.dg/cpp/warn-normalized-3.c trunk/gcc/testsuite/gcc.dg/ucnid-1.c trunk/gcc/testsuite/gcc.dg/ucnid-10.c trunk/gcc/testsuite/gcc.dg/ucnid-11.c trunk/gcc/testsuite/gcc.dg/ucnid-12.c trunk/gcc/testsuite/gcc.dg/ucnid-13.c trunk/gcc/testsuite/gcc.dg/ucnid-2.c trunk/gcc/testsuite/gcc.dg/ucnid-3.c trunk/gcc/testsuite/gcc.dg/ucnid-4.c trunk/gcc/testsuite/gcc.dg/ucnid-5.c trunk/gcc/testsuite/gcc.dg/ucnid-6.c trunk/gcc/testsuite/gcc.dg/ucnid-7.c trunk/gcc/testsuite/gcc.dg/ucnid-8.c trunk/gcc/testsuite/gcc.dg/ucnid-9.c trunk/gcc/testsuite/lib/target-supports.exp trunk/libcpp/ChangeLog trunk/libcpp/init.c
Enabled by default for relevant standards for GCC 5.
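For illustration, a minimal sketch of the kind of code this enables without any extra option. It is not taken from the GCC testsuite; the function name bump_delta is hypothetical, and the variable is the x\u0394 identifier from the original report. It assumes a compiler that accepts UCNs in identifiers by default, as described above for GCC 5.

```c
#include <assert.h>

/* Extended identifier spelled with a universal character name (UCN).
   \u0394 is GREEK CAPITAL LETTER DELTA, as in the original report. */
static int x\u0394 = 0;

/* Hypothetical helper: adds STEP to the variable and returns its new value. */
static int bump_delta(int step)
{
    x\u0394 += step;
    return x\u0394;
}
```

Calling bump_delta from main (or any other function) is enough to check that the identifier is accepted by the lexer and handled through to the generated code.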
Author: jsm28
Date: Thu Nov 6 21:08:52 2014
New Revision: 217202

URL: https://gcc.gnu.org/viewcvs?rev=217202&root=gcc&view=rev
Log:
Preserve original spellings of extended identifiers.

This patch makes cpplib track the original spellings of extended identifiers, as well as the canonical UTF-8 version, in order to follow standard semantics properly without needing a convoluted and undocumented canonicalization in translation phase 1 (see bug 9449 comments 39-46 regarding such a canonicalization).

The spelling is tracked in cpp_identifier and cpp_macro_arg without making cpp_token any larger. The original spelling is used for checks of duplicate macro definitions, stringizing (see the C++ tests added; this case is only an issue for C++ not C because C makes it implementation-defined whether a \ is inserted before the \ of a UCN in a string or character constant when stringizing, while C++ does not), pasting (relevant when the result is then stringized for C++) and when macro definitions are output as text (e.g. for -d options).

Once a macro has been defined, only the original spelling of the argument names needs keeping in the argument list. While it is being defined, however, both spellings are needed: the original one for subsequent saving for checks of duplicate macro definitions, and the canonical one which is the node marked specially to generate macro argument tokens rather than normal identifier tokens. The buffer that is used to save the original values of the identifier tokens is changed so that it stores both those original values and a pointer to the canonical hash nodes, so that those canonical nodes can be found when their values need restoring after the macro definition has been parsed.

I believe this covers the known standards issues in extended identifiers support (the remaining unimplemented C99 areas in GCC all being floating-point-related), except for C++ translation of extended characters to UCNs in phase 1 (which I have no plans to work on).
There are however probably issues left with handling of extended identifiers in other places, as listed in <https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00337.html> (those issues are generally the sort of thing that could be addressed as bugs outside development stage 1). (The bulk of the potential issues Zack was concerned about in 2003-5, that resulted in extended identifiers being disabled in the absence of -fextended-identifiers, were effectively eliminated by the audit and fixes I did in 2009, however; that todo list reflects what was left over after that audit.)

Bootstrapped with no regressions on x86_64-unknown-linux-gnu.

libcpp:
* include/cpp-id-data.h (struct cpp_macro): Update comment regarding parameters.
* include/cpplib.h (struct cpp_macro_arg, struct cpp_identifier): Add spelling fields.
(struct cpp_token): Update comment on macro_arg.
* internal.h (_cpp_save_parameter): Add extra argument.
(_cpp_spell_ident_ucns): New declaration.
* lex.c (lex_identifier): Add SPELLING argument. Set *SPELLING to original spelling of identifier.
(_cpp_lex_direct): Update calls to lex_identifier.
(_cpp_spell_ident_ucns): New function, factored out of cpp_spell_token.
(cpp_spell_token): Adjust FORSTRING argument semantics to return original spelling of identifiers. Use _cpp_spell_ident_ucns in !FORSTRING case.
(_cpp_equiv_tokens): Check spellings of identifiers and macro arguments are identical.
* macro.c (macro_arg_saved_data): New structure.
(paste_tokens): Use original spellings of identifiers from cpp_spell_token.
(_cpp_save_parameter): Add argument SPELLING. Save both canonical node and its value.
(parse_params): Update calls to _cpp_save_parameter.
(lex_expansion_token): Save spelling of macro argument tokens.
(_cpp_create_definition): Extract canonical node from saved data.
(cpp_macro_definition): Use UCNs in spelling of macro name. Use original spellings of macro argument tokens and identifiers.
* traditional.c (scan_parameters): Update call to _cpp_save_parameter.

gcc:
* doc/invoke.texi (-std=c99, -std=c11): Don't refer to corner cases of extended identifiers.

gcc/testsuite:
* g++.dg/cpp/ucnid-2.C, g++.dg/cpp/ucnid-3.C, gcc.dg/cpp/ucnid-11.c, gcc.dg/cpp/ucnid-12.c, gcc.dg/cpp/ucnid-13.c, gcc.dg/cpp/ucnid-14.c, gcc.dg/cpp/ucnid-15.c: New tests.

Added:
    trunk/gcc/testsuite/g++.dg/cpp/ucnid-2.C
    trunk/gcc/testsuite/g++.dg/cpp/ucnid-3.C
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-11.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-12.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-13.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-14.c
    trunk/gcc/testsuite/gcc.dg/cpp/ucnid-15.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/testsuite/ChangeLog
    trunk/libcpp/ChangeLog
    trunk/libcpp/include/cpp-id-data.h
    trunk/libcpp/include/cpplib.h
    trunk/libcpp/internal.h
    trunk/libcpp/lex.c
    trunk/libcpp/macro.c
    trunk/libcpp/traditional.c
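
As a rough illustration of the stringizing case discussed in the log above, the following sketch stringizes an identifier written with a UCN and compares the result with a string literal that uses the same UCN. It is an illustrative example only, not one of the tests added by this revision; the macro STR and the function ucn_stringize_matches are hypothetical names. It assumes a compiler that accepts UCNs in identifiers.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper macro: turns its argument's spelling into a string. */
#define STR(x) #x

/* Returns 1 when stringizing the identifier \u00c1 (LATIN CAPITAL LETTER A
   WITH ACUTE) yields the same string as the literal "\u00c1", else 0. */
static int ucn_stringize_matches(void)
{
    return strcmp(STR(\u00c1), "\u00c1") == 0;
}
```

The comparison is meant to hold whether the preprocessor spells the stringized identifier as the UCN or as the extended character itself, since both denote the same character in the resulting string literal. The spelling differences that the patch addresses show up in cases this sketch does not cover, such as the C++ tests mentioned in the log and the checks of duplicate macro definitions.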