This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: UCNs-in-IDs patch

From: Geoff Keating <geoffk at geoffk dot org>
To: "Joseph S. Myers" <joseph at codesourcery dot com>
Cc: Neil Booth <neil at daikokuya dot co dot uk>, Dave Brolley <brolley at redhat dot com>, Per Bothner <per at bothner dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Zack Weinberg <zack at codesourcery dot com>
Date: Wed, 16 Mar 2005 18:14:45 -0800
Subject: Re: UCNs-in-IDs patch
References: <20050312104132.AB39E20B06E@geoffk5.apple.com> <Pine.LNX.4.61.0503121129150.3509@digraph.polyomino.org.uk> <20050312130239.GR897@duron.akihabara.co.uk> <42349112.60001@codesourcery.com> <Pine.LNX.4.61.0503132028080.17215@digraph.polyomino.org.uk> <Pine.LNX.4.61.0503151244290.26693@digraph.polyomino.org.uk> <m2y8cozmcq.fsf@greed.local> <Pine.LNX.4.61.0503160320390.8209@digraph.polyomino.org.uk>

On 15/03/2005, at 8:10 PM, Joseph S. Myers wrote:

On Tue, 15 Mar 2005, Geoffrey Keating wrote:

Could you quote a part of the standard which says that \u00c1 and
\u00C1 count as a "different expansion" (or, in standardese, that they
have "different spelling")?  I couldn't find any definition of the
word 'spelling' at all, but maybe I missed it.

Google's dictionary says that "spelling" means "the forming of words
with letters in an accepted order".  I would not consider \ to be a
letter, but "\u00c1" is a (string containing a) letter.

I consider it obvious that spelling in the standard refers to the sequence of source characters.

Consider

#define foo ba\
r
#define foo bar

Do these have 'different spellings'? cpplib doesn't think so. I don't think so. Yet they have a different sequence of source characters.
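
As a minimal sketch of the line-splicing case (the names STR, STR_ and foo_expansion are illustrative and do not come from the original message): backslash-newline pairs are deleted in translation phase 2, before tokenization, so both definitions should yield the single identifier token bar, and a conforming preprocessor is expected to accept the second definition as an identical redefinition.

```c
#include <string.h>

/* Two-level stringizing so that foo is macro-expanded before # is applied. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* First definition: the replacement list is split by a backslash-newline. */
#define foo ba\
r
/* Second definition: written on one line. After phase 2 the two replacement
   lists consist of the same single token, so this is a benign redefinition. */
#define foo bar

/* Returns the stringized expansion of foo. */
const char *foo_expansion(void)
{
    return STR(foo);
}
```

Calling foo_expansion() from any main and printing the result shows which spelling the preprocessor kept for the token.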

Likewise,

#define foo ??'
#define foo ^

So I don't think it's obvious at all, in fact I think it's obvious that it doesn't. The characters in the tokens are the same, even though the source code is not equivalent.
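
To make the trigraph case concrete, here is a small model of the trigraph replacement done in translation phase 1. It is only a sketch, not cpplib code; replace_trigraphs and the two lookup tables are hypothetical names introduced for illustration. After the replacement, the two definitions above consist of exactly the same characters.

```c
#include <stddef.h>
#include <string.h>

/* Third characters of the nine trigraphs and their replacements, in matching
   order. No trigraph is written out literally in this file, so it compiles
   the same way whether or not the compiler has trigraphs enabled. */
static const char tri_third[] = "=(/)'<!>-";
static const char tri_repl[]  = "#[\\]^{|}~";

/* Copy src to dst, replacing every trigraph by its single-character
   equivalent. dst must have room for strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    while (*src) {
        const char *hit = NULL;
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(tri_third, src[2]);
        if (hit) {
            *dst++ = tri_repl[hit - tri_third];
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Feeding it the first definition (with the second question mark written as an escape in the caller's string literal, so the caller's own source contains no trigraph) produces the text of the second definition.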

The fact that %: and # are different spellings (explicitly stated in 6.4.6) but otherwise equivalent is exactly the same as the fact that there are different spellings of what is otherwise the same identifier: multiple sequences of source characters that are differently stringized and different in macro expansions but otherwise are the same semantically (once converted from preprocessing tokens to tokens). If there is a definition in ISO/IEC 2382-1:1993, it would only be relevant if consistent with the references to spelling of non-alphanumeric tokens.

You assert that it is 'exactly the same', but you provide no evidence. I claim that it is not the same, not least because there is no explicit statement for UCNs comparable to the one in 6.4.6.
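
For comparison, the digraph case that 6.4.6 does spell out can be sketched as follows. The macro and function names are illustrative only. The digraph %: introduces a directive exactly as # does, yet the two are expected to stringize differently, which is the explicitly stated difference in spelling.

```c
/* Stringize the argument exactly as written, without macro expansion. */
#define STR(x) #x

/* The digraph %: behaves like # when it introduces a directive. */
%:define ANSWER 42

/* Returns the value defined through the digraph directive above. */
int answer(void)
{
    return ANSWER;
}

/* Spelling of the ordinary token. */
const char *hash_spelling(void)
{
    return STR(#);
}

/* Spelling of the digraph, which should be kept as written. */
const char *digraph_spelling(void)
{
    return STR(%:);
}
```

Whether the same reasoning carries over to two UCN spellings of one identifier is the point in dispute; this sketch only shows the case the standard states explicitly.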

Such questions are matters to raise with the WG14 reflector when found in the course of implementation before committing changes, *not* after doing the work, if you think there is doubt. If there is not a consensus on the reflector as to the clear meaning of the standard, they are matters for DRs.

Well, now we've found it in the course of implementation, and I don't intend to commit any changes to the behaviour for this case until it's been raised with WG14. How would I raise these questions with the WG14 reflector?

In any case, translation phase 1 begins with an implementation-defined mapping; and such mapping can choose to implement model A or C (but the implementation must specify it).
Since users can tell the difference between the three models only in
obscure corner cases, which the standard tried to make undefined
anyway, I think it's fine to say that we're doing model C.

I don't consider any model which doesn't allow all valid sequences of preprocessing token spellings to be a sensible model to choose. Models which don't permit all C programs could be done, but they aren't what we document and I don't believe they make sense.

Why? What user benefit does it provide?

I believe that to the user, all UCNs for the same character are indistinguishable (and are indistinguishable from the same character in the input charset, although that isn't implemented yet). I believe that users won't write UCNs by hand, they'll use some kind of editor, and for those users it'll be literally impossible to tell which UCN spelling is used. (That, after all, is the whole justification for the -Wnormalized flag.)

I know, from reading the Rationale, that the standards committee didn't think that any behaviour you could get from distinguishing different UCNs for the same character was useful.

Thus, I do think that C is a good model to choose.

Changes to the documented implementation-defined behavior need especially careful discussion, agreement and design in advance of implementation. Especially, models are not something to choose in the middle of implementation.

It was not my intent to choose this particular model, or any model; my intention was to get basic UCN support in, and do any work needed involving stringification later. It's clearly possible to implement any of these, and as you say above, there should be clarity on what the desired behaviour is before doing work on it.

What the standard "tries" to make undefined behavior seems irrelevant. If something is undefined, the established and previously discussed cpplib practice is that it is a hard error. If it is not, it must be handled as required by the standards. Imperfect implementations defeat the object of showing up the problems with this feature for future standard versions.

I agree, but if the committee specifically intended for particular implementations to be possible by interpreting the standard in a particular way, and that is what the Rationale says, and the wording of the standard does not contradict that, one cannot say that the standard should be fixed to permit those implementations; it already does.

However, if the standard is intended to permit certain implementations, and when those implementations are created it turns out that aspects of their behaviour (like being unable to write all possible C programs) are undesirable (even if just confusing), then you have achieved your goal of showing a problem with the feature for future standard revisions.

So, in order to achieve the object of showing the problems with this feature, we should pick a method C implementation, since that would show this problem.

[lex.phases] paragraph 1 says:

An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.

I believe this is specifically intended to allow implementations to
use UTF-8 (or other encoding) as an internal encoding for identifiers,
and so when [cpp.stringize] says "the original spelling" it means in
the internal encoding, not as the user wrote it.

We discussed this before - it seems to be a restatement of the as-if rule, nothing more <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01528.html>.

I believe if you read the rest of the standard correctly, it is a restatement of the as-if rule.

So, how do you justify that your reading of the rest of the standard is consistent with this sentence?
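
To illustrate what an internal UTF-8 encoding would mean for the two spellings discussed earlier, here is a rough sketch. It is not cpplib code, and parse_ucn and encode_utf8 are hypothetical helpers introduced for illustration. Under such a model, \u00c1 and \u00C1 reduce to the same code point and therefore the same bytes, so the case of the hex digits is no longer recoverable.

```c
#include <stddef.h>

/* Parse a "\uXXXX" or "\UXXXXXXXX" spelling into a code point.
   Returns 1 on success and 0 on malformed input. */
int parse_ucn(const char *ucn, unsigned long *cp)
{
    size_t digits, i;
    unsigned long value = 0;

    if (ucn[0] != '\\')
        return 0;
    if (ucn[1] == 'u')
        digits = 4;
    else if (ucn[1] == 'U')
        digits = 8;
    else
        return 0;

    for (i = 0; i < digits; i++) {
        char c = ucn[2 + i];
        unsigned v;
        if (c >= '0' && c <= '9')
            v = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f')
            v = (unsigned)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            v = (unsigned)(c - 'A') + 10;
        else
            return 0;   /* also catches a string that ends too early */
        value = value * 16 + v;
    }
    if (ucn[2 + digits] != '\0')
        return 0;
    *cp = value;
    return 1;
}

/* Encode a code point as UTF-8. Returns the number of bytes written,
   or 0 for surrogates and values above U+10FFFF. */
size_t encode_utf8(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Both spellings of U+00C1 come out as the two bytes 0xC3 0x81. An implementation that wants to reproduce the user's spelling when stringizing would have to record it separately.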

I thought you knew <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01509.html> that spellings should be preserved for a good implementation and must be preserved if you don't use dodgy phase 1 models.

Yes, at the time I thought that, and it may still be correct. Or not. You see, I am not sure that preserving spellings is a good implementation either. It seems like it should be unnecessary. So it might be possible that a good implementation requires a phase 1 model, which might or might not be dodgy; you'd have to try it and see.

Alternatively, phase 1 starts with the same mapping as for C, and so the comment from the C rationale applies for C++ too.

The comment from the C rationale does not apply for C++.

Why not?

I should point out that there is a mood on IRC that there was a serious
mistake in the way that the design of PCH was done in private without
sufficient public discussion of design approaches and agreement of the
right way, and that IMA had similar problems, and that this is another
instance - at least the third - of exactly the same problem when the
previous cases should have been learnt from.

Yes. I've learnt that I should stop trying to get public review of my significant patches before committing them, because it's mostly useless. I've learnt that there's no point arguing with people who do not accept reasoned argument. I've learnt that in the cases where public review would have been helpful, all the unhelpful comments are so annoying that I often miss the helpful ones. I've learnt that some people seem to think that it's their right to complain about design or implementation without being willing in any way to contribute. I've learnt that some people will create a "mood" which has no basis in reality. I've learnt that some people seem to think that someone who is willing to do 50% of the work should be willing to do 100% of the work, even when they themselves are unwilling to do any of the work (people plural, not just Zack). I've learnt that the best is not just the enemy of the good, but that it has sympathisers, and some of those think that even the best is not good enough. I've learned that there are useful features which are intentionally being left out of GCC, even though they would benefit users, simply because of some trivial difficulty which users will probably never encounter.

I am sad that I have learnt these things.

I've also learned that in each of the two cases above, and I expect in this case too, *I was right and those who objected were wrong*. I have a conclusive piece of evidence for this: no-one has yet come up with better implementations, so even if a replacement were developed tomorrow, users would still have been able to use these features for several years longer than if I had not written them.

What's more, I am starting to learn that I do not like contributing to GCC. I am starting to feel like every time I improve GCC, I get criticised for it. I did not have to try to make the implementation anywhere near as extensive as it is, and I did not have to try to get it into FSF GCC. I am very unhappy that my attempts to "play nice" have been met with complaints and criticism, because it means that I'm going to have to choose between trying to play nice and avoiding complaints. I am unhappy that I have had to work on this at all; it should have been resolved years ago. I am unhappy that when I tried to work on it, instead of being met with help and thanks for taking on the problem, I was met with rejection.

If you don't like how I did PCH, make a patch. If you don't like how I did IMA, make a patch. If you don't like how I did UCNs-in-IDs, make a patch. If you don't want to make a patch, stop complaining.

I am tired of this. Starting tomorrow, I will be taking a break from working on UCNs-in-IDs. I will commit no more than one further patch, depending on the consensus of the three cpplib maintainers, or the Steering Committee:

1. I put a call to 'cpp_error (CPP_DL_ERROR' in forms_identifier_p when _cpp_valid_ucn recognises a valid UCN, saying "sorry, UCNs in identifiers are not implemented". I remove all the testcases, since they'd all fail.
2. I back out all the patches I've committed. When gcc 4.1 comes around and UCNs-in-IDs still hasn't been implemented, I say "I told you so".
3. I leave the tree as-is.
4. I put a call to cpp_error saying "sorry, not implemented" when stringizing an identifier containing a UCN in C++.

If no consensus appears, I will choose option (3), leaving the tree as-is.


Follow-Ups:
- Re: UCNs-in-IDs patch
  - From: Joseph S. Myers
- Re: UCNs-in-IDs patch
  - From: Michael Matz
- Re: UCNs-in-IDs patch
  - From: David Edelsohn
- Re: UCNs-in-IDs patch
  - From: Zack Weinberg
- Procedural issues and consensus building (was Re: UCNs-in-IDspatch)
  - From: Zack Weinberg

References:
- UCNs-in-IDs patch
  - From: Geoffrey Keating
- Re: UCNs-in-IDs patch
  - From: Joseph S. Myers
- Re: UCNs-in-IDs patch
  - From: Neil Booth
- Re: UCNs-in-IDs patch
  - From: Mark Mitchell
- Re: UCNs-in-IDs patch
  - From: Joseph S. Myers
- Re: UCNs-in-IDs patch
  - From: Joseph S. Myers
- Re: UCNs-in-IDs patch
  - From: Geoffrey Keating
- Re: UCNs-in-IDs patch
  - From: Joseph S. Myers
