This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: proposed Opengroup action for c99 command (XCU ERN 76)

From: "Joseph S. Myers" <joseph at codesourcery dot com>
To: Geoffrey Keating <geoffk at geoffk dot org>
Cc: Paul Eggert <eggert at CS dot UCLA dot EDU>, gcc at gcc dot gnu dot org
Date: Fri, 16 Sep 2005 12:12:34 +0000 (UTC)
Subject: Re: proposed Opengroup action for c99 command (XCU ERN 76)
References: <877jdjqsqc.fsf@penguin.cs.ucla.edu> <m264t1mj94.fsf@greed.local>

On Fri, 16 Sep 2005, Geoffrey Keating wrote:

> What this means in practise, I think, is that the structure that
> represents a token, 'struct cpp_token' will grow from 16 bytes to 20
> bytes, which makes it 2 cache lines rather than 1, and a subsequent
> memory use increase and compiler performance decrease.  It might be
> that someone will think of some clever way to avoid this, but I
> couldn't think of any that would be likely to be a win overall, since
> a significant proportion of tokens are identifiers.  (I especially
> didn't like the alternative that required a second hash lookup for
> every identifier.)

There are plenty of spare bits in cpp_token to flag extended identifiers 
and handle them specially (as a slow path, marked as such with 
__builtin_expect).  There is one spare bit in the flags byte, two unused 
bytes after it, and a whole word that goes unused for identifiers: 
identifiers use a cpp_hashnode *, whereas strings and numbers use a 
struct cpp_string, which is bigger.  That word could store a canonical 
form of the identifier, or the noncanonical spelling for the specific 
places which care about the original spelling.
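
To make the suggestion concrete, here is a rough sketch of the flag-plus-
slow-path idea.  Everything in it is an assumption for illustration: struct 
token, struct hashnode, TOK_EXTENDED_ID and token_spelling are hypothetical 
stand-ins, not libcpp's real definitions or layout.  Only __builtin_expect 
is the real GCC builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical spare bit in the flags byte (an assumption, not a real
   libcpp flag).  */
#define TOK_EXTENDED_ID 0x80u

/* Stand-in for cpp_hashnode: just the canonical identifier text.  */
struct hashnode
{
  const char *name;
};

/* Stand-in for cpp_token; the layout is illustrative only.  */
struct token
{
  uint32_t src_loc;
  uint8_t type;
  uint8_t flags;
  struct hashnode *node;        /* canonical identifier */
  const char *orig_spelling;    /* assumed use of the otherwise idle word */
};

/* Return the spelling as written in the source.  Ordinary identifiers
   take the fast path; only flagged extended identifiers consult the
   separately stored spelling.  */
static const char *
token_spelling (const struct token *tok)
{
  assert (tok != 0 && tok->node != 0);
  if (__builtin_expect ((tok->flags & TOK_EXTENDED_ID) != 0, 0))
    return tok->orig_spelling;  /* slow path, expected to be rare */
  return tok->node->name;
}
```

In this arrangement the common case pays for one flag test that the 
compiler is told to predict as false, and no second hash lookup is needed.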

> Adding salt to the wound, of course, is that for C the only difference
> between an (A) or (B) and a (C) implementation is that a (C)
> implementation is less expressive: there are some programs, all of
> which are erroneous and require a diagnostic, that can't be written.
> So you lose compiler performance just so users have another bullet
> to shoot their feet with.

C++ requires (A), and there are valid programs that can tell whether a 
normalisation of UCNs is part of the implementation-defined phase 1 
transformation.  Here is the example I gave in a previous discussion:

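#include <stdlib.h>
#include <string.h>
#define h(s) #s
#define str(s) h(s)
int
main()
{
  /* Each check is meant to pass only if the UCN keeps its original
     spelling through stringification; normalising the case of the hex
     digits in phase 1 would make one of them abort.  */
  if (strcmp(str(str(\u00c1)), "\"\\u00c1\"")) abort ();
  if (strcmp(str(str(\u00C1)), "\"\\u00C1\"")) abort ();
}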

An implementation of (A) could start by converting the whole input to UCNs 
(a slow path, taken only if extended characters are present), or use a 
more efficient conversion that avoids the need to convert within comments.  
But if any normalisation of UCNs is documented for C++, it does need to be 
documented in the form of transforming UCNs to other UCNs (not to UTF-8).
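
A minimal sketch of the simple "convert everything to UCNs" variant, 
assuming the input is UTF-8.  The names decode_utf8 and to_ucns are 
hypothetical, the decoder does not reject overlong forms or surrogates, 
and nothing here skips comments, so this is only the naive form of the 
idea.  The choice of upper-case hex digits is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at s, with n >= 1 bytes available.  Returns
   the number of bytes consumed, or 0 if the sequence is malformed.
   Simplified: overlong forms and surrogates are not rejected.  */
static size_t
decode_utf8 (const unsigned char *s, size_t n, unsigned long *cp)
{
  size_t len, i;
  unsigned long c;

  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
  else return 0;

  if (len > n)
    return 0;
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  *cp = c;
  return len;
}

/* Copy the NUL-terminated string in to out, rewriting every non-ASCII
   character as \uXXXX or \UXXXXXXXX.  Returns 0 on success, -1 on
   malformed input or if out (of size outsz) is too small.  */
static int
to_ucns (const char *in, char *out, size_t outsz)
{
  const unsigned char *s = (const unsigned char *) in;
  size_t n = strlen (in), i = 0, o = 0;

  assert (out != 0 && outsz > 0);
  while (i < n)
    {
      if (__builtin_expect (s[i] < 0x80, 1))
        {
          /* Fast path: basic characters are copied unchanged.  */
          if (o + 1 >= outsz)
            return -1;
          out[o++] = (char) s[i++];
        }
      else
        {
          /* Slow path: extended character, emit a UCN.  The space check
             is conservative: it always reserves room for the long form.  */
          unsigned long cp;
          size_t used = decode_utf8 (s + i, n - i, &cp);
          if (used == 0 || o + 10 >= outsz)
            return -1;
          o += (size_t) sprintf (out + o,
                                 cp > 0xFFFF ? "\\U%08lX" : "\\u%04lX", cp);
          i += used;
        }
    }
  out[o] = '\0';
  return 0;
}
```

For instance, the UTF-8 bytes C3 81 between "a" and "z" come out as 
a\u00C1z.  UCNs already present in the input pass through untouched, since 
they consist of basic characters.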

-- 
Joseph S. Myers               http://www.srcf.ucam.org/~jsm28/gcc/
    jsm@polyomino.org.uk (personal mail)
    joseph@codesourcery.com (CodeSourcery mail)
    jsm28@gcc.gnu.org (Bugzilla assignments and CCs)

