This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: Implementing Universal Character Names in identifiers

From: loewis at informatik dot hu-berlin dot de (Martin v. Löwis)
To: Zack Weinberg <zack at codesourcery dot com>
Cc: gcc-patches at gcc dot gnu dot org
Date: 28 Oct 2002 09:53:35 +0100
Subject: Re: Implementing Universal Character Names in identifiers
References: <200210280715.g9S7FdI2003815@paros.informatik.hu-berlin.de> <20021028075111.GB1273@codesourcery.com>

Zack Weinberg <zack@codesourcery.com> writes:

> It would be worthwhile - as a separate patch, mind - to add support
> for extended characters written in bare UTF-8 in identifiers.  

I completely agree.

> My plan for general extended-character-encoding support is to
> convert to UTF-8 and process that representation; that plus iconv
> plus some glue and heuristics will get us most of the way there.

Notice that this might be difficult to incorporate into the
parser. Parsing extended characters will require maintenance of a
shift state (mbstate_t); iconv does not directly expose the
mbstate. So you have to carefully keep the mbstate_t and the iconv_t
synchronized.

Alternatively, you could even use iconv to split the input into
individual characters, and then perform parsing on the iconv result
(conversion to UTF-8 might be appropriate); but that would be a
significant change.
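
To make the second alternative concrete, here is a minimal sketch of
converting a whole input buffer to UTF-8 up front. The conversion
descriptor carries any shift state internally, so no separate
mbstate_t has to be kept in sync. The function name to_utf8 and its
buffer handling are hypothetical, not anything in cpplib; only the
POSIX iconv calls are real.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes of IN, encoded in
   FROM_CHARSET, to UTF-8 in OUT (capacity OUTCAP).  Returns the number
   of bytes written, or (size_t)-1 on any failure.  */
size_t to_utf8(const char *from_charset, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* A NULL input flushes any pending shift sequence.  */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

A real implementation would have to convert in chunks and handle
E2BIG and EINVAL at chunk boundaries; this sketch simply fails.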

> You want to look closely at what is currently done for UCNs in wide
> character constants and string literals.  I'm pretty sure it's wrong,
> and I would appreciate suggestions.

As for the preprocessor, it looks quite right to me; also, the output
is right, assuming gcc implies ISO 10646 for wchar_t on all platforms
(which is a sensible choice, and correct for GNU systems).

> I thought we had some sort of encoding schema for assemblers that
> don't support UTF-8?  How does this interact with the C++ ABI? 

The C++ ABI left this open; the current recommendation (which is not
normative) is to use UTF-8 unless something else is specified by the
vendor. Encoding schemes don't really work for C, and add complexity
for C++.

> We should normalize identifiers before entering them in the symbol
> table, and for output; otherwise there will be great confusion.
> That needs to happen as part of the initial patch.

I am now of the opinion that encoding schemes (other than UTF-8) should
not be used. Compatibility with Java might be an issue; it might be
necessary to special-case extern "Java" identifiers in the C++
front-end. I could add that to the patch - although I would prefer if
the Java API would change.

As for assemblers that don't allow UTF-8 in source code: I'd rather
disable the feature for those assemblers than try to find a
solution - this allows for compatibility should the vendor decide on
this matter later.

The tricky part is how to determine whether UTF-8 is supported in
assembler output: initially, I'd just assume that GNU as supports it,
and no other assembler does; this can then be extended as support on
other systems becomes possible.

The next question is where to block unacceptable identifiers: in
cpplib, or later? Later might be better since, at least for C++,
supporting this in Java identifiers might be desirable, plus you
could use it in macro names even if the assembler does not support
it.

If UTF-8 identifiers must be rejected (or converted) in the language
front-ends, how can I efficiently determine whether an identifier uses
UTF-8? Can I use deprecated_flag on IDENTIFIERs for that?
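
For reference, the test that such a flag would cache is cheap to
state: in UTF-8, every byte of a non-ASCII character has its high bit
set. A minimal sketch, assuming identifiers are stored as UTF-8 byte
strings with a known length (the function name is hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if the LEN-byte identifier NAME
   contains any byte outside ASCII, i.e. any UTF-8 encoded extended
   character.  */
bool ident_has_extended_chars(const char *name, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((unsigned char)name[i] & 0x80)
            return true;
    return false;
}
```

Scanning like this on every use is linear in the identifier length,
which is why a per-identifier flag set once at lexing time would be
preferable.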

Assuming this is all agreeable, I'll try to revise the patch
appropriately.

> (1) This routine belongs in libiberty, as part of the safe-ctype.h
> interface.

Really? The list of characters is quite specific to the language (and
perhaps even the language revision). I haven't even checked whether
the lists of acceptable characters are the same in C++98 and C99.

> (2) Isn't this comment now inaccurate?  You just did implement
>     extended characters in identifiers.

Yes, right :-(

> (3) The ranges need to be updated from the latest Unicode standard,
>     and the standard version noted in commentary.

No. They are mandated by the language specification. For C++, see
Annex E. For C99, see Annex D (unfortunately, I can't, since I don't
have the final copy of C99). C++ claims to have copied the table from
PDTR 10176, C from TR 10176.

*If* my C99 draft is accurate, then there are differences between
these two tables: e.g. in C99, U+00AA (FEMININE ORDINAL INDICATOR)
is acceptable in an identifier; in C++98, it is not.
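
One way to keep a single routine while still honoring per-language
differences is to tag each range with the languages that accept it.
The sketch below is illustrative only: the type, enum and function
names are hypothetical, the table is a two-entry excerpt, the U+00AA
entry relies on my C99 draft as stated above, and the U+00C0..U+00D6
entry is assumed to be common to both annexes.

```c
#include <stdbool.h>
#include <stddef.h>

enum { LANG_C99 = 1, LANG_CXX98 = 2 };   /* hypothetical language bits */

struct ucn_range {
    unsigned first, last;   /* inclusive code point range */
    unsigned langs;         /* languages accepting the range */
};

/* Illustrative excerpt, sorted by FIRST; not the full annex tables.  */
static const struct ucn_range ucn_ranges[] = {
    { 0x00AA, 0x00AA, LANG_C99 },               /* per the C99 draft */
    { 0x00C0, 0x00D6, LANG_C99 | LANG_CXX98 },  /* assumed in both */
};

/* Binary search for C; accept it only if LANG is among the languages
   that allow its range.  */
bool ucn_valid_in_identifier(unsigned c, unsigned lang)
{
    size_t lo = 0, hi = sizeof ucn_ranges / sizeof ucn_ranges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < ucn_ranges[mid].first)
            hi = mid;
        else if (c > ucn_ranges[mid].last)
            lo = mid + 1;
        else
            return (ucn_ranges[mid].langs & lang) != 0;
    }
    return false;
}
```

With this layout the lookup code could live in a shared place while
the language front-end only supplies its bit.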

> Due to the size of this routine, and the concerns with the rest of
> your change, please submit a patch that does just that, all by itself;
> that will get in easily, and then we can iterate on the rest of it.

I will do that, when the issue of per-language tables has been
settled.

> Don't use abort in cpplib; use cpp_error (pfile, DL_ICE, ...).
> Further, this can happen as a result of ill-formed user input, can't
> it?  Therefore this should be a plain error, not an ICE.

Right, will fix.

> >     /* Check for slow-path cases.  */
> >     if (*cur == '?' || *cur == '\\' || *cur == '$')
> > !     number->text = parse_slow (pfile, cur, 1 + leading_period,
> > !                                &number->len, &ignored);
> 
> I don't think the UTF8 flag should be ignored at this point.  Consider
> what happens if we get
> 
>   asdf ## 12\u03F8
> 
> -- that is valid, and needs to turn into a single CPP_NAME token with
> the UTF8 flag set.  It seems safe to me to carry around the UTF8 bit
> on all CPP_NUMBER tokens.  

Ah, right. I missed that nondigit includes universal-character-names.
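
For illustration, this is roughly what turning the \u03F8 in that
example into UTF-8 looks like. It is a sketch of the standard UTF-8
encoding, not the code in the patch; the function name is
hypothetical, and it assumes the modern four-byte limit (code points
up to U+10FFFF, surrogates rejected).

```c
#include <stddef.h>

/* Hypothetical helper: encode code point C as UTF-8 into BUF (at
   least 4 bytes).  Returns the number of bytes written, or 0 if C is
   a surrogate or beyond U+10FFFF.  */
size_t ucn_to_utf8(unsigned long c, unsigned char *buf)
{
    if (c < 0x80) {
        buf[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (c >> 6));
        buf[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (c < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (c >> 12));
        buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {
        buf[0] = (unsigned char)(0xF0 | (c >> 18));
        buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+03F8 this yields the two bytes 0xCF 0xB8, so the pasted token
would be spelled "asdf12" followed by those two bytes.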

> Naturally, cpp_classify_number should categorize such numbers as
> CPP_N_INVALID (allowing digits outside the basic source character
> set strikes me as a bad idea).

Please educate me: is this taking the target language into account? If
not, there is nothing wrong with that token, as a pp-token.

> Please find a more efficient way to accomplish this.  This code is
> already *the* bottleneck for textual preprocessing.  (For instance, if
> you implement support for raw UTF8 as input encoding, we can just
> splat out the identifier as is.)

Is that necessary? Few tokens will ever have the flag set, and the
only part where I added overhead is the test for the flag.

Regards,
Martin

Follow-Ups:
- Re: Implementing Universal Character Names in identifiers
  - From: Fergus Henderson
- Re: Implementing Universal Character Names in identifiers
  - From: Joseph S. Myers
- Re: Implementing Universal Character Names in identifiers
  - From: Zack Weinberg

References:
- Implementing Universal Character Names in identifiers
  - From: Martin v. Löwis
- Re: Implementing Universal Character Names in identifiers
  - From: Zack Weinberg
