This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: cpplib: Preliminary implementation of UCNs

From: Zack Weinberg <zack at codesourcery dot com>
To: Neil Booth <neil at daikokuya dot co dot uk>
Cc: gcc-patches at gcc dot gnu dot org
Date: Sun, 20 Apr 2003 10:09:13 -0700
Subject: Re: cpplib: Preliminary implementation of UCNs
References: <20030419235913.GV23814@daikokuya.co.uk><87ademovxh.fsf@egil.codesourcery.com><20030420161940.GC23814@daikokuya.co.uk>

Neil Booth <neil at daikokuya dot co dot uk> writes:

> Zack Weinberg wrote:-
>
>> Neil Booth <neil at daikokuya dot co dot uk> writes:
>> 
>> ...
>> > I don't think it's a good idea to make \u00aa and \u00AA the same
>> > identifier, never mind making it the same as the character itself.
>> 
>> I disagree strongly - please give a rationale for your position.
>
> Hmmm, after reading the C++ phases of translation, it seems pretty
> clear that this is intended.  Sigh.  I'll try to figure out a way of
> handling this.

I have a vague idea how we might handle this.  It leverages what I
want to do to identifier->decl lookup, to get rid of all the special
cases.

Suppose that, instead of the grab-bag of pointers we have now, every
identifier has just one 'value' field, which is a linked list of
(context, decl) pairs.  CONTEXT is some sort of binding-level
structure, and DECL is a tree.  This is going to require a certain
amount of magic to cope with reserved words and macro definitions, but
the point is that nothing stops two identifier nodes from pointing to
the same list.  We may need a list header node; actually, that would
solve the magic problem as well.
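
A rough sketch of the layout described above, in C.  Every name here
(struct binding, struct binding_list, struct ident, push_binding,
lookup_decl) is hypothetical and not taken from GCC; the context and
decl fields are plain void pointers standing in for a binding-level
structure and a tree.  The header node is where the keyword and macro
flags are assumed to live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration only; not GCC's real structures. */
struct binding {
    void *context;          /* stand-in for a binding-level structure */
    void *decl;             /* stand-in for a tree */
    struct binding *next;   /* next (outer) binding for the same name */
};

/* List header: one per distinct meaning.  The flags are an assumed place
   to record the "magic" cases (reserved words, macro definitions). */
struct binding_list {
    struct binding *head;
    int is_keyword;
    int is_macro;
};

/* An identifier node has exactly one 'value' field.  Two nodes with
   different spellings may point at the same header. */
struct ident {
    const char *spelling;
    struct binding_list *value;
};

/* Push a (context, decl) pair on the front of the list.
   Returns the new pair, or NULL if allocation fails. */
static struct binding *push_binding(struct binding_list *list,
                                    void *context, void *decl)
{
    struct binding *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->context = context;
    b->decl = decl;
    b->next = list->head;
    list->head = b;
    return b;
}

/* Innermost decl currently bound to this identifier, or NULL if none. */
static void *lookup_decl(const struct ident *id)
{
    assert(id->value != NULL);
    return id->value->head != NULL ? id->value->head->decl : NULL;
}
```

With this shape, a binding pushed through one spelling is visible
through every other identifier node that shares the header, without
either node knowing about the other.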

So then, when we hit an identifier containing a UCN and/or an
extended character, we enter it into the symbol table under its
literal spelling, then we canonicalize it onto some preferred form,
look that up too, and set it to use the same binding list.
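
The lookup steps above might be sketched like this.  It is a toy: the
symbol table is a linked list, and the "preferred form" is simply
assumed to be the same spelling with the hex digits of each \uXXXX
uppercased (\UXXXXXXXX and raw extended characters are not handled).
All names (intern, canonicalize, enter_identifier) are made up for
illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct binding_list { void *head; };    /* hypothetical; contents elided */

struct ident {
    char *spelling;                     /* literal spelling, preserved */
    struct binding_list *value;         /* possibly shared binding list */
    struct ident *next;
};

static struct ident *table;             /* toy symbol table */

/* Find or create the node for exactly this spelling. */
static struct ident *intern(const char *spelling)
{
    struct ident *id;
    for (id = table; id != NULL; id = id->next)
        if (strcmp(id->spelling, spelling) == 0)
            return id;
    id = malloc(sizeof *id);
    if (id == NULL)
        return NULL;
    id->spelling = malloc(strlen(spelling) + 1);
    if (id->spelling == NULL) {
        free(id);
        return NULL;
    }
    strcpy(id->spelling, spelling);
    id->value = NULL;
    id->next = table;
    table = id;
    return id;
}

/* True if p starts with backslash, 'u', and four hex digits. */
static int is_ucn4(const char *p)
{
    return p[0] == '\\' && p[1] == 'u'
        && isxdigit((unsigned char)p[2]) && isxdigit((unsigned char)p[3])
        && isxdigit((unsigned char)p[4]) && isxdigit((unsigned char)p[5]);
}

/* Assumed canonical form: uppercase the hex digits of each \uXXXX.
   OUT must have room for strlen(in) + 1 bytes. */
static void canonicalize(const char *in, char *out)
{
    while (*in != '\0') {
        if (is_ucn4(in)) {
            int k;
            *out++ = *in++;             /* backslash */
            *out++ = *in++;             /* 'u' */
            for (k = 0; k < 4; k++)
                *out++ = (char)toupper((unsigned char)*in++);
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}

/* Enter SPELLING literally, then make it share the binding list of
   its canonical form.  Returns the literal node, or NULL on failure. */
static struct ident *enter_identifier(const char *spelling)
{
    struct ident *lit, *pref;
    char *canon;

    assert(spelling != NULL);
    lit = intern(spelling);
    if (lit == NULL)
        return NULL;
    if (lit->value != NULL)
        return lit;                     /* already linked up */
    canon = malloc(strlen(spelling) + 1);
    if (canon == NULL)
        return NULL;
    canonicalize(spelling, canon);
    pref = intern(canon);
    free(canon);
    if (pref == NULL)
        return NULL;
    if (pref->value == NULL) {
        pref->value = calloc(1, sizeof *pref->value);
        if (pref->value == NULL)
            return NULL;
    }
    lit->value = pref->value;           /* same list, distinct spelling */
    return lit;
}
```

Entering x\u00aay and x\u00AAy this way gives two distinct nodes, each
keeping its own spelling, whose value fields point at one list.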

> It seems sensible too, that should we decide we are going to allow
> identifiers with chars outside the basic charset, then the permitted
> ones should correspond exactly with the permitted UCNs, and that they
> should be coincident in the hash table, but somehow their original
> spellings are also preserved.  Agreed?

I'm not real happy about the permitted list -- see previous rant on
the subject -- but what I would like to do to it would only enlarge
it, which we can do without forward-compatibility problems, so I'm
okay with this as a first cut.

Actually, a check I'd like to do first: Joseph asserts that an
identifier containing only UCNs/extended characters from the permitted
set must be in canonical form already.  Can we prove that?

> What about CPP output for UCN and non-ASCII identifiers?
>
> 1) As UCNs
> 2) As UTF-8
> 3) A possible mixture
> 4) Don't care 1) or 2), whichever turns out to be easier

I prefer 2, as it means users of CPP output (textual and tokenized)
don't have to know about UCNs.  I'd also be fine with 1.
Inconsistency would be bad, I think.
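
For option 2, the core step is turning the code point named by a UCN
into UTF-8 bytes.  A minimal sketch follows; utf8_encode is a
hypothetical helper, not an existing cpplib function, and it rejects
surrogates and values above 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point CP as UTF-8 into BUF, which must hold at least
   4 bytes.  Returns the number of bytes written, or 0 if CP is a
   surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

So \u00aa and \u00AA would both come out as the two bytes 0xC2 0xAA,
which also gives consistent output for free.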

What are you doing with UCNs in string constants, vs extended
characters?

zw

Follow-Ups:
- Re: cpplib: Preliminary implementation of UCNs
  - From: Neil Booth

References:
- cpplib: Preliminary implementation of UCNs
  - From: Neil Booth
- Re: cpplib: Preliminary implementation of UCNs
  - From: Zack Weinberg
- Re: cpplib: Preliminary implementation of UCNs
  - From: Neil Booth
