This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: proposed Opengroup action for c99 command (XCU ERN 76)

From: Geoff Keating <geoffk at geoffk dot org>
To: Joseph S. Myers <joseph at codesourcery dot com>
Cc: Paul Eggert <eggert at CS dot UCLA dot EDU>,gcc at gcc dot gnu dot org
Date: Fri, 16 Sep 2005 11:29:01 -0700
Subject: Re: proposed Opengroup action for c99 command (XCU ERN 76)
References: <877jdjqsqc.fsf@penguin.cs.ucla.edu> <m264t1mj94.fsf@greed.local> <Pine.LNX.4.61.0509161151570.8721@digraph.polyomino.org.uk>

On 16/09/2005, at 5:12 AM, Joseph S. Myers wrote:

> On Fri, 16 Sep 2005, Geoffrey Keating wrote:
>> What this means in practice, I think, is that the structure that
>> represents a token, 'struct cpp_token', will grow from 16 bytes to 20
>> bytes, which makes it 2 cache lines rather than 1, with a consequent
>> increase in memory use and decrease in compiler performance.  It might
>> be that someone will think of some clever way to avoid this, but I
>> couldn't think of any that would be likely to be a win overall, since
>> a significant proportion of tokens are identifiers.  (I especially
>> didn't like the alternative that required a second hash lookup for
>> every identifier.)
>
> There are plenty of spare bits in cpp_token to flag extended identifiers and handle them specially (as a slow path, marked as such with __builtin_expect). There is one spare bit in the flags byte, two unused bytes after it, and a whole word that goes unused for identifiers, since identifiers use a cpp_hashnode * where strings and numbers use the larger struct cpp_string. That word could store a canonical form of the identifier, or the noncanonical spelling for the specific places that care about the original spelling.

Yes, I think this can be made to work efficiently.
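
As a rough illustration of the spare-bit idea discussed above, here is a minimal C sketch. It does not use the real libcpp definitions: `struct token`, `struct hashnode`, `struct str_val`, `struct id_val`, `TOK_EXTENDED_ID`, `UNLIKELY` and `token_spelling` are all hypothetical stand-ins, and the choice to keep the canonical form in the hash node and the original spelling in the otherwise unused word is only one of the two options mentioned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Branch hint: __builtin_expect is a GCC extension, so fall back to a
   plain test on other compilers. */
#if defined(__GNUC__)
# define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
# define UNLIKELY(x) (x)
#endif

/* Hypothetical flag: assumed to be a spare bit in the flags byte. */
#define TOK_EXTENDED_ID 0x80u

/* Stand-in for cpp_hashnode: here it only holds a canonical spelling. */
struct hashnode { const char *canonical; };

/* Stand-in for struct cpp_string (used by strings and numbers). */
struct str_val { unsigned int len; const unsigned char *text; };

/* Identifier payload: the node pointer plus the word that identifiers
   would otherwise leave unused. */
struct id_val { struct hashnode *node; const char *orig_spelling; };

struct token {
    unsigned int src_loc;
    unsigned char type;
    unsigned char flags;   /* followed by padding bytes */
    union { struct str_val str; struct id_val id; } val;
};

/* Return the spelling as written in the source. Ordinary identifiers
   take the fast path; extended identifiers take the hinted slow path. */
static const char *token_spelling(const struct token *tok)
{
    assert(tok != NULL && tok->val.id.node != NULL);
    if (UNLIKELY(tok->flags & TOK_EXTENDED_ID))
        return tok->val.id.orig_spelling;   /* slow path */
    return tok->val.id.node->canonical;     /* common case */
}
```

On typical ABIs `struct id_val` is no larger than `struct str_val`, so in this sketch the extra pointer does not grow the union.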

>> Rubbing salt in the wound, of course, is that for C the only difference between an (A) or (B) and a (C) implementation is that a (C) implementation is less expressive: there are some programs, all of which are erroneous and require a diagnostic, that can't be written. So you lose compiler performance just so users have another bullet to shoot their feet with.
>
> C++ requires (A)

This is true, but only in the sense that C requires (B). Either language can be supported by any of the three implementations with an appropriate phase 1 rule.

> Implementation of (A) could start with a conversion of the whole input to UCNs (a slow path, taken only if extended characters are present), or with a more efficient conversion that avoids the need to convert within comments.

Although UCNs would be the most convenient form for the preprocessor, the backend would like strings to be in UTF-8, to avoid the need for conversion when outputting names to the assembler.
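
The whole-input conversion to UCNs mentioned above might look roughly like the following sketch. It assumes the input is a NUL-terminated UTF-8 string, makes no attempt to skip comments, and does not reject overlong encodings or surrogates. The names `decode_utf8` and `utf8_to_ucn` are made up for illustration, and the lower-case hex output is an arbitrary choice.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at p. Stores the code point in *cp and
   returns the number of bytes consumed, or 0 if malformed. */
static size_t decode_utf8(const unsigned char *p, unsigned long *cp)
{
    size_t n, i;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; n = 4; }
    else return 0;
    for (i = 1; i < n; i++) {
        /* A continuation byte must look like 10xxxxxx; this test also
           stops at the NUL terminator of a truncated sequence. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (unsigned long)(p[i] & 0x3F);
    }
    return n;
}

/* Rewrite every non-ASCII character of `in` as \uXXXX or \UXXXXXXXX.
   Returns the output length, or (size_t)-1 on malformed input or if
   `out` is too small. */
static size_t utf8_to_ucn(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t len = strlen(in), used = 0, i;
    int extended = 0;

    for (i = 0; i < len; i++)
        if (p[i] & 0x80) { extended = 1; break; }

    if (!extended) {                /* fast path: pure ASCII, plain copy */
        if (len + 1 > outsz) return (size_t)-1;
        memcpy(out, in, len + 1);
        return len;
    }

    while (*p) {                    /* slow path: decode and re-spell */
        unsigned long cp;
        size_t n = decode_utf8(p, &cp);
        if (n == 0) return (size_t)-1;
        if (cp < 0x80) {
            if (outsz - used < 2) return (size_t)-1;
            out[used++] = (char)cp;
        } else {
            /* Conservatively require room for the longest form + NUL. */
            if (outsz - used < 11) return (size_t)-1;
            used += (size_t)sprintf(out + used,
                                    cp <= 0xFFFF ? "\\u%04lx" : "\\U%08lx",
                                    cp);
        }
        p += n;
    }
    out[used] = '\0';
    return used;
}
```

The initial scan is what keeps the common all-ASCII case cheap: the decoder only runs once a byte with the high bit set has been seen.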

> But if any normalisation of UCNs is documented for C++, it does need to be documented as a transformation of UCNs to other UCNs (not to UTF-8).

Yes; but this is not a difficult problem. For C++, you would just say (following my proposed wording) that after they're converted to UTF-8, they are converted back to some canonical form of UCN ('the version with the most lower-case characters', for instance). Then, when stringifying, you would convert UTF-8 characters in identifiers to that canonical UCN.
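
To make the "canonical form of UCN" idea concrete, here is a small sketch that maps any spelling of a UCN to a single one. The canonical form chosen here (`\u` with four lower-case hex digits when the value fits, otherwise `\U` with eight) is an assumption for illustration only, loosely following the "most lower-case characters" example. `parse_hex` and `canonicalize_ucns` are hypothetical names, and the sketch does not check that the code point is one the standards allow in identifiers.

```c
#include <stddef.h>
#include <stdio.h>

/* Parse exactly `digits` hex digits at s into *cp; return 1 on success. */
static int parse_hex(const char *s, int digits, unsigned long *cp)
{
    int i;
    *cp = 0;
    for (i = 0; i < digits; i++) {
        int c = (unsigned char)s[i], v;
        if (c >= '0' && c <= '9') v = c - '0';
        else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
        else return 0;              /* also stops at a NUL terminator */
        *cp = (*cp << 4) | (unsigned long)v;
    }
    return 1;
}

/* Copy `in` to `out`, rewriting each \uXXXX or \UXXXXXXXX in the assumed
   canonical spelling. Returns the output length, or (size_t)-1 if `out`
   is too small. Anything that is not a well-formed UCN is copied as-is. */
static size_t canonicalize_ucns(const char *in, char *out, size_t outsz)
{
    size_t used = 0;
    while (*in) {
        unsigned long cp;
        int digits = 0;
        if (in[0] == '\\' && in[1] == 'u') digits = 4;
        else if (in[0] == '\\' && in[1] == 'U') digits = 8;

        if (digits && parse_hex(in + 2, digits, &cp)) {
            /* Conservatively require room for the longest form + NUL. */
            if (outsz - used < 11) return (size_t)-1;
            used += (size_t)sprintf(out + used,
                                    cp <= 0xFFFF ? "\\u%04lx" : "\\U%08lx",
                                    cp);
            in += 2 + digits;
        } else {
            if (outsz - used < 2) return (size_t)-1;
            out[used++] = *in++;
        }
    }
    if (used >= outsz) return (size_t)-1;
    out[used] = '\0';
    return used;
}
```

With a rule like this, `\u00C1` and `\U000000c1` both end up as `\u00c1`, so stringifying an identifier gives the same result however the character was originally written.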


