UCNs-in-IDs patch
Joseph S. Myers
joseph@codesourcery.com
Thu Mar 17 14:34:00 GMT 2005
On Wed, 16 Mar 2005, Geoff Keating wrote:
> Consider
>
> #define foo ba\
> r
> #define foo bar
>
> Do these have 'different spellings'? cpplib doesn't think so. I don't think
> so. Yet they have a different sequence of source characters.
Not in phase 4, which is the relevant phase here.
> Well, now we've found it in the course of implementation, and I don't intend
> to commit any changes to the behaviour for this case until it's been raised
> with WG14. How would I raise these questions with the WG14 reflector?
Email it to them (sc22wg14 at open-std.org), which I've now done. Just as
I did when there were doubts about the relevance of Unicode normalization.
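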
Just as I did when proposing the relevant changes to ELF and Dave Prosser
doubted that different UCNs for the same character were equivalent in
identifiers.
> > I don't consider any model which doesn't allow all valid sequences of
> > preprocessing token spellings to be a sensible model to choose. Models
> > which don't permit all C programs could be done, but they aren't what we
> > document and I don't believe they make sense.
>
> Why? What user benefit does it provide?
Being able to write strings containing \\u that does not start a UCN,
which a phase 1 model converting UCNs context-independently would break
(a context-dependent model would be possible, but excessively
complicated to specify); such breakage would be a serious quiet change
from C90. Because problems in the standard with such sequences of
spellings can only show up through experience with implementations that
allow them. And because, as noted in the Rationale, a model B
implementation - one keeping both UCNs and extended characters distinct
until as late as possible - is more in the spirit of C.
> > > [lex.phases] paragraph 1 says:
> > >
> > > An implementation may use any internal encoding, so long as an actual
> > > extended character encountered in the source file, and the same
> > > extended character expressed in the source file as a
> > > universal-character-name (i.e. using the \uXXXX notation), are handled
> > > equivalently.
> > >
> > > I believe this is specifically intended to allow implementations to
> > > use UTF-8 (or other encoding) as an internal encoding for identifiers,
> > > and so when [cpp.stringize] says "the original spelling" it means in
> > > the internal encoding, not as the user wrote it.
> >
> > We discussed this before - it seems to be a restatement of the as-if rule,
> > nothing more <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01528.html>.
>
> I believe if you read the rest of the standard correctly, it is a restatement
> of the as-if rule.
>
> So, how do you justify that your reading of the rest of the standard is
> consistent with this sentence?
Simple: it's an aside to implementors, reiterating that exactly what
internal encoding they use is beyond the scope of the standard, but that
$ must act exactly as if written as one of \u0024 or \U00000024, and
similarly for all other extended characters. It's still possible to
distinguish \u0024 from \U00000024, but in C++ it is also possible, in
certain circumstances, to tell which form was used for $ in a particular
case (and the UCN used may differ within the translation unit).
> > > Alternatively, phase 1 starts with the same mapping as for C, and so the
> > > comment from the C rationale applies for C++ too.
> >
> > The comment from the C rationale does not apply for C++.
>
> Why not?
Because C and C++ started diverging more than 20 years ago. They adopted
similar models for UCNs, which then began diverging immediately, before
the standards were released, with the divergence taking account of the
different cultures around the two languages. C++ chose model A, and the
C rationale comments reflect what was felt appropriate in the context
of C.
--
Joseph S. Myers http://www.srcf.ucam.org/~jsm28/gcc/
jsm@polyomino.org.uk (personal mail)
joseph@codesourcery.com (CodeSourcery mail)
jsm28@gcc.gnu.org (Bugzilla assignments and CCs)