This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [4.1] UCNs in identifiers


I'd like to point out that I am currently on vacation and do not have
either the time or the inclination to carry out a long debate.
However, next week I'll have an opportunity to talk to Joseph in
person, and hopefully some sort of sensible way forward can be found.

Also, I concur with Joseph that this patch would not be acceptable at
the present stage of the development cycle even if I didn't have
serious technical and philosophical objections.

Accordingly, this is going to be my last word on the subject.

Tangentially, please don't use the word "UCN" unless you really mean
to restrict the discussion to \u/\U escapes.  I don't see any good
reason to treat UCNs differently from (non-basic-)source-character-set
characters representing the same Unicode code points.

Geoffrey Keating wrote:
> 1. GCC's stated goal is to be standards-conformant.  The standard is 
> quite clear.  The fact that you do not like the consequences of what 
> the standard says is unfortunate, and I do encourage you to get the 
> standards fixed in the very obscure case of certain Hebrew characters, 
> but that's insufficient reason to not implement the standard.  We 
> implement the standard in cases even where the feature is completely 
> objectionable, like trigraphs, and this is not nearly that bad.

On my personal scale of objectionability, this particular mis-designed
feature is substantially worse than trigraphs, because of the ABI
issues.  (It should be said that I think trigraphs have gotten a bum
rap; as C design issues go, I'd put the extended-identifier problems
on the same level as the "what's an object?" and "which pointer types
can access this data?" messes.  I think it has about that potential to
break people's expectations and code.)

The problems are not limited to the "very obscure case of certain
Hebrew characters", and ...

> 2. I believe that the standards will never be "fixed" in the way you
> wish.  The standards intentionally excluded combining forms in order
> to prevent the problem you are describing.

... the standard's exclusion of some combining forms fails to prevent
the problem.  See the very long example involving ANGSTROM SIGN and
LATIN CAPITAL LETTER A WITH RING ABOVE that I added to PR 9449 last
night.
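
As a rough illustration of the kind of collision at issue (this is not
the PR 9449 example, which is not reproduced here), the following
Python sketch uses the standard unicodedata module to compare three
spellings of the same visible letter.  The variable names and the
utf8_hex helper are made up for this sketch; it assumes symbol names
are emitted as UTF-8, which is only one possible encoding choice.

```python
import unicodedata

# Three ways to write what a reader sees as the same letter.
ANGSTROM_SIGN = "\u212b"      # ANGSTROM SIGN
A_WITH_RING = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_COMBINING = "A\u030a"  # 'A' followed by COMBINING RING ABOVE


def utf8_hex(s):
    """Hypothetical helper: the UTF-8 bytes of s, as a hex string."""
    return s.encode("utf-8").hex()


for label, spelling in [
    ("ANGSTROM SIGN", ANGSTROM_SIGN),
    ("A WITH RING ABOVE", A_WITH_RING),
    ("A + COMBINING RING", A_PLUS_COMBINING),
]:
    nfc = unicodedata.normalize("NFC", spelling)
    # The raw bytes differ between spellings, so a linker comparing
    # symbol names byte-for-byte sees three distinct symbols, while
    # the NFC form is identical for all three.
    print(label, "raw:", utf8_hex(spelling), "NFC:", utf8_hex(nfc))
```

If a text editor or other tool silently normalizes a source file, a
declaration and its uses can end up with different byte sequences
without anyone having changed what the identifier looks like.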

> 3. You have not considered what could be done even if the worst-case
> scenario (which I don't think will ever happen) of a shared library,
> widely distributed, with an ABI that contains a UCN, and that UCN
> becomes prohibited in a later version of the standard.

I acknowledge that this scenario is unlikely, but there are much more
probable scenarios that lead to ABI breakage.  Again, see the example
I added to PR 9449 last night.

I'm not considering what can be done after the fact because I'm only
interested in solutions which rule out ABI problems ever arising in
the first place.

"Joseph S. Myers" <joseph@codesourcery.com> writes:
> I don't, however, like "appeal to authority" as a basis for such
> technical decisions.  When we choose not to implement the standard
> by default, the normal way is to have an option to enable
> conformance.

This is not a normal feature.  It needs to be treated the same way as
options that change the platform ABI - switches all its own, and a
default that is as safe as possible.

> I would also repeat, as I said in comment#12 in that PR, that
> changes to the standard are best based on implementation experience,
> and UCNs in identifiers cannot be shown to be a mistake without
> there being implementations of them in use and actual problems
> arising.

This is a good point.  Off the top of my head, I can think of two
options for gaining implementation experience with extended
identifiers that might not risk ABI problems for end users:

 - Only allow extended identifiers in declarations with no linkage.
   (We may have to do this anyway for the sake of limited assemblers
   and/or object file formats.)

 - Restrict the set of extended character sequences allowed in
   identifiers to those that are unchanged by NFC.  That means issuing
   *hard errors* for any identifier whose byte sequence would be
   changed by NFC.  Warnings are not good enough.  The restriction
   applies whether or not the sequence was written with UCNs.

   I'm not certain that this eliminates all the potential ABI-breaking
   cases, but if it doesn't, there's a problem with Unicode itself,
   not just with C/C++.
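
A minimal sketch of the second option, in Python rather than in the
compiler itself: reject any identifier that NFC would change.  The
names check_identifier and IdentifierNotNFCError are hypothetical and
exist only for this illustration; a real implementation would live in
the preprocessor and would first convert any UCNs to the characters
they denote, so that both spellings are checked the same way.

```python
import unicodedata


class IdentifierNotNFCError(ValueError):
    """Hypothetical hard error: the identifier is changed by NFC."""


def check_identifier(name):
    """Return name unchanged if NFC leaves it alone, else raise.

    Comparing code point sequences is equivalent to comparing the
    UTF-8 byte sequences, since UTF-8 encodes each code point in
    exactly one way.
    """
    normalized = unicodedata.normalize("NFC", name)
    if normalized != name:
        raise IdentifierNotNFCError(
            "identifier %r is not stable under NFC (would become %r)"
            % (name, normalized)
        )
    return name


# Usage: only the precomposed spelling is accepted.
for candidate in ["\u00c5ngstrom", "\u212bngstrom", "A\u030angstrom"]:
    try:
        check_identifier(candidate)
        print("accepted:", ascii(candidate))
    except IdentifierNotNFCError as err:
        print("rejected:", err)
```

Making the failure an exception rather than a warning mirrors the
"hard errors" requirement above: an identifier that is not already in
NFC never reaches the symbol table at all.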

I'd be happy to talk about these and other options - after I'm back
from vacation! - but only in the context of 4.1 and only if no code
gets written until everyone agrees on the semantics to be implemented.

zw

