UCNs-in-IDs patch
Joseph S. Myers
joseph@codesourcery.com
Thu Mar 17 14:34:00 GMT 2005
On Wed, 16 Mar 2005, Geoff Keating wrote:
> Consider
>
> #define foo ba\
> r
> #define foo bar
>
> Do these have 'different spellings'? cpplib doesn't think so. I don't think
> so. Yet they have a different sequence of source characters.
Not in phase 4, which is the relevant phase here.
> Well, now we've found it in the course of implementation, and I don't intend
> to commit any changes to the behaviour for this case until it's been raised
> with WG14. How would I raise these questions with the WG14 reflector?
Email it to them (sc22wg14 at open-std.org), which I've now done. Just as
I did when there were doubts about the relevance of Unicode normalization.
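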
Just as I did when proposing the relevant changes to ELF and Dave Prosser
doubted that different UCNs for the same character were equivalent in
identifiers.
> > I don't consider any model which doesn't allow all valid sequences of
> > preprocessing token spellings to be a sensible model to choose. Models
> > which don't permit all C programs could be done, but they aren't what we
> > document and I don't believe they make sense.
>
> Why? What user benefit does it provide?
Being able to write strings containing \\u that does not start a UCN,
which a phase 1 model converting UCNs context-independently would break
(a context-dependent model would be possible, but excessively
complicated to specify); such breakage would be a serious quiet change
from C90. Because problems in the standard with such sequences of
spellings can only show up through experience with implementations that
allow them. And because, as noted in the Rationale, a model B
implementation - one keeping both UCNs and extended characters distinct
until as late as possible - is more in the spirit of C.
> > > [lex.phases] paragraph 1 says:
> > >
> > > An implementation may use any internal encoding, so long as an actual
> > > extended character encountered in the source file, and the same
> > > extended character expressed in the source file as a
> > > universal-character-name (i.e. using the \uXXXX notation), are handled
> > > equivalently.
> > >
> > > I believe this is specifically intended to allow implementations to
> > > use UTF-8 (or other encoding) as an internal encoding for identifiers,
> > > and so when [cpp.stringize] says "the original spelling" it means in
> > > the internal encoding, not as the user wrote it.
> >
> > We discussed this before - it seems to be a restatement of the as-if rule,
> > nothing more <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01528.html>.
>
> I believe if you read the rest of the standard correctly, it is a restatement
> of the as-if rule.
>
> So, how do you justify that your reading of the rest of the standard is
> consistent with this sentence?
Simple: it's an aside to implementors, reiterating that exactly what
internal encoding they use is beyond the scope of the standard, but that
$ must act exactly as if written as one of \u0024 or \U00000024, and
similarly for all other extended characters. It's still possible to
distinguish \u0024 from \U00000024, but in C++ it is also possible, in
certain circumstances, to tell which form was used for $ in a particular
case (and the UCN used may differ within the translation unit).
> > > Alternatively, phase 1 starts with the same mapping as for C, and so the
> > > comment from the C rationale applies for C++ too.
> >
> > The comment from the C rationale does not apply for C++.
>
> Why not?
Because C and C++ started diverging more than 20 years ago. They adopted
similar models for UCNs, which then began diverging immediately, before
the standards were released, with the divergence taking account of the
different cultures around the two languages. C++ chose model A, and the
C rationale comments reflect what was felt appropriate in the context
of C.
--
Joseph S. Myers http://www.srcf.ucam.org/~jsm28/gcc/
jsm@polyomino.org.uk (personal mail)
joseph@codesourcery.com (CodeSourcery mail)
jsm28@gcc.gnu.org (Bugzilla assignments and CCs)