This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: proposed Opengroup action for c99 command (XCU ERN 76)
- From: "Joseph S. Myers" <joseph at codesourcery dot com>
- To: Geoffrey Keating <geoffk at geoffk dot org>
- Cc: Paul Eggert <eggert at CS dot UCLA dot EDU>, gcc at gcc dot gnu dot org
- Date: Fri, 16 Sep 2005 12:12:34 +0000 (UTC)
- Subject: Re: proposed Opengroup action for c99 command (XCU ERN 76)
- References: <877jdjqsqc.fsf@penguin.cs.ucla.edu> <m264t1mj94.fsf@greed.local>
On Fri, 16 Sep 2005, Geoffrey Keating wrote:
> What this means in practice, I think, is that the structure that
> represents a token, 'struct cpp_token' will grow from 16 bytes to 20
> bytes, which makes it 2 cache lines rather than 1, and a subsequent
> memory use increase and compiler performance decrease. It might be
> that someone will think of some clever way to avoid this, but I
> couldn't think of any that would be likely to be a win overall, since
> a significant proportion of tokens are identifiers. (I especially
> didn't like the alternative that required a second hash lookup for
> every identifier.)
There are plenty of spare bits in cpp_token to flag extended identifiers
and handle them specially (as a slow path, marked as such with
__builtin_expect). There is one spare bit in the flags byte and two unused
bytes after it. There is also a whole word that goes unused for
identifiers: identifiers use a cpp_hashnode *, where strings and numbers
use a struct cpp_string, which is bigger. That word could store a
canonical form of the identifier (or the noncanonical spelling, for the
specific places that care about the original spelling).
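To make the spare-bits idea concrete, here is a minimal sketch. It is not the real libcpp layout: struct token, struct hashnode, struct string_val, TOK_EXTENDED_ID, UNLIKELY and token_spelling are all hypothetical stand-ins, and the choice to keep the original spelling (rather than the canonical form) in the idle word is just one of the two options mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is NOT the actual cpp_token.  */
#define TOK_EXTENDED_ID 0x80u   /* assumed spare bit in the flags byte */

#if defined(__GNUC__)
# define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
# define UNLIKELY(x) (x)
#endif

struct hashnode { const char *name; };      /* stand-in for cpp_hashnode */
struct string_val                           /* stand-in for cpp_string */
{
  unsigned len;
  const char *text;
};

struct token
{
  unsigned src_loc;
  unsigned char type;
  unsigned char flags;
  union
  {
    struct string_val str;        /* strings and numbers need two words */
    struct
    {
      struct hashnode *node;      /* canonical identifier, one word */
      const char *orig_spelling;  /* reuses the otherwise idle word */
    } ident;
  } val;
};

/* Return the spelling to use where the original form matters.
   Ordinary identifiers take the fast path and never read the
   second word; only flagged tokens take the slow path.  */
const char *
token_spelling (const struct token *tok)
{
  assert (tok != NULL && tok->val.ident.node != NULL);
  if (UNLIKELY (tok->flags & TOK_EXTENDED_ID))
    return tok->val.ident.orig_spelling;    /* slow path */
  return tok->val.ident.node->name;         /* fast path */
}
```

On typical ABIs the identifier arm of the union is no larger than the string arm, so in this sketch the extra pointer does not grow the token.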
> Adding salt to the wound, of course, is that for C the only difference
> between an (A) or (B) and a (C) implementation is that a (C)
> implementation is less expressive: there are some programs, all of
> which are erroneous and require a diagnostic, that can't be written.
> So you lose compiler performance just so users have another bullet
> to shoot their feet with.
C++ requires (A), and there are valid C++ programs that can tell whether a
normalisation of UCNs is part of the implementation-defined phase 1
transformation. Here is the example I gave in a previous discussion:
#include <stdlib.h>
#include <string.h>

#define h(s) #s
#define str(s) h(s)

int
main()
{
  if (strcmp(str(str(\u00c1)), "\"\\u00c1\"")) abort ();
  if (strcmp(str(str(\u00C1)), "\"\\u00C1\"")) abort ();
}
An implementation of (A) could start with a conversion of the whole input
to UCNs (as a slow path, taken only if extended characters are present), or
with a more efficient conversion that avoids converting within comments.
But if any normalisation of UCNs is documented for C++, it needs to be
documented as transforming UCNs to other UCNs (not to UTF-8).
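As a rough illustration of the simple whole-input variant, the sketch below assumes the source is UTF-8 and rewrites each non-ASCII character as a UCN, with a fast path when the input is pure ASCII. The names has_extended and utf8_to_ucns are hypothetical, the uppercase hex spelling is an arbitrary choice, and the code does not skip comments or reject overlong encodings. UCNs already written in the source are plain ASCII and pass through untouched, so their original spelling is preserved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: nonzero if S contains any non-ASCII byte.  */
static int
has_extended (const char *s)
{
  for (; *s; s++)
    if ((unsigned char) *s >= 0x80)
      return 1;
  return 0;
}

/* Copy IN (assumed UTF-8) to OUT, spelling each non-ASCII character
   as \uXXXX or \UXXXXXXXX.  Returns the output length, or -1 on
   malformed input or if OUT (of size OUTSZ) is too small.  */
int
utf8_to_ucns (const char *in, char *out, size_t outsz)
{
  size_t n, i = 0, o = 0;

  assert (in != NULL && out != NULL);
  n = strlen (in);

  if (!has_extended (in))         /* fast path: nothing to convert */
    {
      if (n + 1 > outsz)
        return -1;
      memcpy (out, in, n + 1);
      return (int) n;
    }

  while (i < n)                   /* slow path */
    {
      unsigned char c = (unsigned char) in[i];
      unsigned long cp;
      int extra;

      if (c < 0x80)
        {
          if (o + 2 > outsz)      /* room for the byte and the NUL */
            return -1;
          out[o++] = (char) c;
          i++;
          continue;
        }
      if ((c & 0xE0) == 0xC0)      { cp = c & 0x1F; extra = 1; }
      else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
      else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
      else
        return -1;                /* stray continuation or bad lead byte */

      for (i++; extra > 0; extra--, i++)
        {
          unsigned char cc = (unsigned char) in[i];
          if ((cc & 0xC0) != 0x80)   /* also stops at the terminating NUL */
            return -1;
          cp = (cp << 6) | (cc & 0x3F);
        }

      if (o + 11 > outsz)         /* longest UCN is 10 chars, plus NUL */
        return -1;
      o += (size_t) sprintf (out + o,
                             cp <= 0xFFFF ? "\\u%04lX" : "\\U%08lX", cp);
    }

  out[o] = '\0';
  return (int) o;
}
```

For example, the UTF-8 bytes C3 81 (U+00C1) in an identifier would come out as \u00C1, while a source file that already spells it \u00c1 is left exactly as written.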
--
Joseph S. Myers http://www.srcf.ucam.org/~jsm28/gcc/
jsm@polyomino.org.uk (personal mail)
joseph@codesourcery.com (CodeSourcery mail)
jsm28@gcc.gnu.org (Bugzilla assignments and CCs)