This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: RFC/RFHelp: c-decl.c rewrite - almost but not quite


Robert Bowdidge <bowdidge@apple.com> writes:

> On Mar 17, 2004, at 2:12 AM, Zack Weinberg wrote:
>> Well, since this is (hopefully) the final data structure, I intended
>> that the comments in c-decl.c should explain it adequately.  Can you
>> look at those and tell me where they're inadequate?
>
> What I'm hoping for is a summary paragraph saying what these 4000
> lines of changes mean at a high level -- what were you changing, and
> why -- just so I have some hope of asking the right questions about
> the patch.

Okay.  I *was* hoping this was clear from the file itself ... at
least what the new structure is, and why its various features are
necessary.  I'll try to go over it here, and hopefully we can converge
on some better commentary.

> As far as I can tell, it appears that you've added the c_binding
> structure to the name lookup and scoping code.  So how do the
> c_binding structure, c_scope structure, and lookup code interact?
> From the bits of removed code, it looks like struct c_scope still
> exists, but that most of the lookup questions get done on the
> c_binding structure rather than c_scope structure, and this changed a
> lot of the commonly-called functions.   It also looks like several
> functions were renamed (pushlevel -> push_scope, poplevel->pop_scope).
> Any particular reason for this?  Are you trying to make the C and C++
> code more analogous?  Is struct c_binding intended to resemble struct
> cxx_binding?

First the metaobservations: I didn't code this with any reference to
the C++ front end, because C++ has different and much more complicated
rules.  Despite that, I may have ended up with a similar data
structure, but it wasn't intentional.  The nomenclature changes are
because I wanted to use the same terminology that the C standard does
(i.e. "scope" not "binding level"); also, it guaranteed that I had
flushed out all callers of the old pushlevel/poplevel.

The new data structure works like this.  In C, an identifier can have
meaning in three different namespaces simultaneously: the namespace of
symbols, the namespace of tags, and the namespace of labels.
(Ignoring reserved words and macros, which are handled elsewhere.)  So
there are three pointers in struct lang_identifier.  Formerly they
pointed to DECLs; now they point to c_binding structures.  This is the
core change.  The reason for it is that a DECL node has only one
TREE_CHAIN pointer.  So if you need to keep track of the visibility of
a DECL in several scopes simultaneously, which we do, you must either
copy the entire DECL or use a data structure external to the DECL to
do the job.  The old code took the copying approach; this was bad
because it violated the basic assumption made elsewhere in the
compiler that there is exactly one DECL node for each assembly-level
symbol.  Hence all the bugs.
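To make the shape concrete, here is a minimal sketch of the idea, not the
actual definitions in c-decl.c.  The field names prev, shadowed and contour
come from this message; `decl`, `symbol_binding`, `tag_binding`,
`label_binding` and both helper functions are assumed names for
illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a GCC DECL tree node. */
struct decl { const char *what; };

struct c_scope;

/* One binding of an identifier in one scope.  `decl` is an assumed name. */
struct c_binding {
  struct decl *decl;            /* the single shared DECL */
  struct c_binding *prev;       /* next binding in the same scope */
  struct c_binding *shadowed;   /* same identifier, outer scope */
  struct c_scope *contour;      /* scope that owns this binding */
};

/* Simplified identifier: one binding pointer per namespace.
   The three member names are assumptions. */
struct lang_identifier {
  const char *name;
  struct c_binding *symbol_binding;
  struct c_binding *tag_binding;
  struct c_binding *label_binding;
};

/* Count the namespaces in which ID currently has a meaning. */
static int namespaces_in_use(const struct lang_identifier *id)
{
  return (id->symbol_binding != NULL)
       + (id->tag_binding != NULL)
       + (id->label_binding != NULL);
}

/* Hypothetical demo: `x` is a variable, a struct tag and a label at
   once, as in:  struct x { int v; };  int x;  ...  x: ;  */
static int demo_three_namespaces(void)
{
  struct decl var = { "int x" }, tag = { "struct x" }, lab = { "x:" };
  struct c_binding bv = { &var, NULL, NULL, NULL };
  struct c_binding bt = { &tag, NULL, NULL, NULL };
  struct c_binding bl = { &lab, NULL, NULL, NULL };
  struct lang_identifier x = { "x", &bv, &bt, &bl };

  /* Each namespace leads to a different DECL. */
  assert(x.symbol_binding->decl != x.tag_binding->decl);
  return namespaces_in_use(&x);
}
```

The point of the indirection is that the DECL itself is never copied; only
the small c_binding records multiply.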

The core data structure can be thought of as a grid.  Consider the
following example:

extern void bar(void);

static void foo(void)
{
  int baz;
  ...
}

While parsing the remainder of foo()'s body, we'll have a combination
of c_scope and c_binding structures that are linked together like
this:

\
 \   foo   bar   baz
  +-----------------
  |   |     |      |
2 |...|.....|......*   # current_function_scope
  |   |     |
1 |---*-----*          # file_scope
  |         |
0 |~~~~~~~~~*          # external_scope

The rows are all the identifiers bound in a given scope - findable
by walking the ->bindings chain of a given c_scope structure.  (The
->prev field of struct c_binding is the one that carries this chain.)

The columns are all the meanings attributed to a given identifier; you
start down a column by applying one of the I_*_BINDING macros to an
IDENTIFIER_NODE, and then proceed down by following the ->shadowed
field of struct c_binding.  (In this example we're only looking at
I_SYMBOL_BINDING.)
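The grid can be rebuilt in a few lines, which may help in checking one's
reading of the diagram.  This is a sketch under simplifying assumptions:
`struct ident` stands in for an IDENTIFIER_NODE with only the symbol
namespace, `name` stands in for the ->id pointer, and `bind`,
`row_length`, `column_height` and `build_example` are hypothetical
helpers, not functions from c-decl.c.

```c
#include <assert.h>
#include <stddef.h>

struct c_scope;

struct c_binding {
  const char *name;             /* stand-in for ->id */
  struct c_binding *prev;       /* row link: next binding in this scope */
  struct c_binding *shadowed;   /* column link: same name, outer scope */
  struct c_scope *contour;      /* owning scope */
};

struct c_scope {
  int depth;
  struct c_binding *bindings;   /* head of the row */
};

struct ident {
  const char *name;
  struct c_binding *symbol_binding;   /* head of the column */
};

/* Bind ID in scope S using storage B.  This simple version only works
   when scopes are filled from the outermost one inwards. */
static void bind(struct ident *id, struct c_scope *s, struct c_binding *b)
{
  b->name = id->name;
  b->contour = s;
  b->prev = s->bindings;
  s->bindings = b;
  b->shadowed = id->symbol_binding;
  id->symbol_binding = b;
}

/* Walk a row: every identifier bound in scope S. */
static int row_length(const struct c_scope *s)
{
  int n = 0;
  const struct c_binding *b;
  for (b = s->bindings; b != NULL; b = b->prev)
    n++;
  return n;
}

/* Walk a column: every meaning of ID, innermost first. */
static int column_height(const struct ident *id)
{
  int n = 0;
  const struct c_binding *b;
  for (b = id->symbol_binding; b != NULL; b = b->shadowed)
    n++;
  return n;
}

/* Build the foo/bar/baz grid; report row lengths for scopes 0..2 and
   column heights for foo, bar, baz. */
static void build_example(int rows[3], int cols[3])
{
  struct c_scope external = { 0, NULL }, file = { 1, NULL },
                 function = { 2, NULL };
  struct ident foo = { "foo", NULL }, bar = { "bar", NULL },
               baz = { "baz", NULL };
  struct c_binding b[4];

  bind(&bar, &external, &b[0]);
  bind(&bar, &file, &b[1]);
  bind(&foo, &file, &b[2]);
  bind(&baz, &function, &b[3]);

  rows[0] = row_length(&external);
  rows[1] = row_length(&file);
  rows[2] = row_length(&function);
  cols[0] = column_height(&foo);
  cols[1] = column_height(&bar);
  cols[2] = column_height(&baz);
}
```

Four c_binding records cover the whole grid: one asterisk each for foo and
baz, two for bar.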

The asterisks are c_binding structures.  baz and foo are easy to
understand.  baz is an automatic variable in the outermost block
scope of foo, so it's got just one c_binding structure, which is
linked to current_function_scope.  foo is a static function, so again
it has just one c_binding structure, which is linked to file_scope.

bar is a little trickier.  It is an external reference.  That gets
*two* bindings: one in file_scope and one in external_scope.
external_scope doesn't correspond directly to any scope defined by the
C standard.  Its purpose is to hold on to all the external references
to the same object, so we can rely on them all getting the same DECL
node.  Bindings in external_scope are invisible except for
duplicate_decls checks.  This corresponds to the old C_DECL_INVISIBLE
bit, only it actually works.  There are several ways to get a binding
only in the external_scope:  the most obvious being something like

static void foo(void)
{
  extern int global;
}

Once parsing moves beyond foo, the only binding to global will be in
external_scope.  Also, many built-in functions (the ones with
user-namespace names, not the __builtin_* ones) start out this way.
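A slightly larger version of that example shows why the external_scope
binding has to outlive the block.  The function names here are made up
for illustration; the behaviour is plain standard C.

```c
#include <assert.h>

static void set_global(int v)
{
  extern int global;    /* visible only inside this block */
  global = v;
}

static int get_global(void)
{
  extern int global;    /* redeclared: the earlier one is out of scope */
  return global;
}

/* The definition.  All three declarations must name one object, which
   is what the single shared DECL guarantees. */
int global;

/* Write through one declaration, read back through another. */
static int round_trip(int v)
{
  set_global(v);
  int got = get_global();
  assert(got == v);
  return got;
}
```

Between the two functions the name global is not visible at file scope,
yet the second declaration still has to be checked against, and merged
with, the first.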

The ->contour pointer in struct c_binding exists because sometimes we
have a symbol and we need to look up its scope.  For instance, when
bindings are inserted in a scope other than the topmost, we have to
make sure that they are inserted in the proper place in the ->shadowed
chain; this is done by scanning for an appropriate value of
->contour->depth.  It might be possible to replace this with a flags
word, which might make the data structure more efficient.  It is an
invariant that, given a c_scope structure S, B->contour == S for all
c_binding structures B on the S->bindings chain.
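The scan on ->contour->depth might look roughly like the following.  This
is a sketch assuming the ->shadowed chain is kept ordered from innermost
to outermost scope; `bind_in_scope`, `scope_is_consistent` and the demo
function are hypothetical names, and the structures are cut down to the
fields under discussion.

```c
#include <assert.h>
#include <stddef.h>

struct c_scope;

struct c_binding {
  struct c_binding *prev;       /* next binding in the same scope */
  struct c_binding *shadowed;   /* same identifier, outer scope */
  struct c_scope *contour;      /* owning scope */
};

struct c_scope {
  int depth;
  struct c_binding *bindings;
};

/* Insert B, owned by scope S, into the column whose head is *HEAD,
   keeping the column sorted from largest depth to smallest. */
static void bind_in_scope(struct c_binding **head, struct c_scope *s,
                          struct c_binding *b)
{
  struct c_binding **here = head;

  /* Skip bindings that belong to scopes deeper than S. */
  while (*here != NULL && (*here)->contour->depth > s->depth)
    here = &(*here)->shadowed;

  b->shadowed = *here;
  *here = b;

  b->contour = s;
  b->prev = s->bindings;
  s->bindings = b;
}

/* Check the invariant: B->contour == S for every B on S->bindings. */
static int scope_is_consistent(const struct c_scope *s)
{
  const struct c_binding *b;
  for (b = s->bindings; b != NULL; b = b->prev)
    if (b->contour != s)
      return 0;
  return 1;
}

/* Bind in the function scope first, then slip a file-scope binding
   in underneath it. */
static int demo_insert_below_top(void)
{
  struct c_scope file = { 1, NULL }, function = { 2, NULL };
  struct c_binding inner = { NULL, NULL, NULL };
  struct c_binding outer = { NULL, NULL, NULL };
  struct c_binding *column = NULL;

  bind_in_scope(&column, &function, &inner);
  bind_in_scope(&column, &file, &outer);

  return column == &inner
         && inner.shadowed == &outer
         && outer.shadowed == NULL
         && scope_is_consistent(&file)
         && scope_is_consistent(&function);
}
```

The innermost binding stays at the head of the column, so ordinary lookup
is unaffected by the late insertion.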

The ->id pointer is probably unnecessary.  It points back to the
IDENTIFIER_NODE.  I think it's only used in contexts where we have the
IDENTIFIER_NODE anyway.

Now, parts of the language-independent compiler are expecting to find
out about the lexical scope structure of the program by looking at
BLOCK nodes which point to DECLs linked together by TREE_CHAIN.  It is
pop_scope's responsibility to produce these BLOCK nodes.  I decided it
would be simplest not to maintain the chains of DECLs during parsing.
This is why C now disables the pushlevel, poplevel, getdecls, set_block,
and clear_binding_stack hooks (set_block was already disabled, I just
killed the definition in c-decl.c in favor of the stub in langhooks.c).
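As a rough illustration of deferring the DECL chains until scope exit,
consider the sketch below.  It is a guess at the general shape only, not
the real pop_scope: `struct block`, `struct ident`, the `chain` field
(standing in for TREE_CHAIN), the `decl` field and both functions are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct c_binding;
struct c_scope;

struct ident { const char *name; struct c_binding *symbol_binding; };

struct decl {
  const char *name;
  struct decl *chain;           /* stand-in for TREE_CHAIN */
};

struct c_binding {
  struct decl *decl;            /* assumed field name */
  struct ident *id;             /* back pointer to the identifier */
  struct c_binding *prev;       /* next binding in the same scope */
  struct c_binding *shadowed;   /* same identifier, outer scope */
  struct c_scope *contour;      /* owning scope */
};

struct c_scope { int depth; struct c_binding *bindings; };

struct block { struct decl *vars; };   /* stand-in for a BLOCK node */

/* Tear down scope S: chain its DECLs into a block and uncover
   whatever each binding was shadowing. */
static struct block pop_scope_sketch(struct c_scope *s)
{
  struct block blk = { NULL };
  struct c_binding *b;

  for (b = s->bindings; b != NULL; b = b->prev)
    {
      /* The DECLs are linked together only now.  The row is newest
         first, so prepending restores declaration order. */
      b->decl->chain = blk.vars;
      blk.vars = b->decl;

      /* The binding being removed must be the visible one. */
      assert(b->id->symbol_binding == b);
      b->id->symbol_binding = b->shadowed;
    }
  s->bindings = NULL;
  return blk;
}

/* Two locals, a then b, in one function scope. */
static int demo_pop(void)
{
  struct c_scope fn = { 2, NULL };
  struct ident ia = { "a", NULL }, ib = { "b", NULL };
  struct decl da = { "a", NULL }, db = { "b", NULL };
  struct c_binding ba = { &da, &ia, NULL, NULL, &fn };
  struct c_binding bb = { &db, &ib, &ba, NULL, &fn };

  ia.symbol_binding = &ba;
  ib.symbol_binding = &bb;
  fn.bindings = &bb;

  struct block blk = pop_scope_sketch(&fn);

  return blk.vars == &da && da.chain == &db && db.chain == NULL
         && ia.symbol_binding == NULL && ib.symbol_binding == NULL;
}
```

During parsing nothing touches the chain field of a DECL; it is filled in
once, when the scope goes away.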

This change is probably why the Objective C front end is now broken,
but I have no idea why Java is broken - the changes ought not to have
affected Java at all.

It should furthermore be noted that calling lang_hooks.decls.pushdecl
from language-independent code is nothing but trouble.  I took one
instance of that out of coverage.c because it was being called after
file_scope was torn down, so it would crash.  It's my belief that
language-independent code has no business calling this hook, but I
could be persuaded otherwise.  (I didn't disable it because it is also
used in c-objc-common.c for legitimate purposes.)

> You also mention that "the new data structure is somewhat more memory
> intensive than strictly necessary."  Ok, which data structure do you
> mean (c_binding?), and why do you think it's more memory intensive?
> Why might this be bad?  Why was it necessary?

The combination of struct c_binding and IDENTIFIER_NODE is what I was
thinking of.  We have to allocate something like 40 bytes to represent
an identifier bound to a single DECL, which is by far the most common
case.  I have some ideas for how to improve this - e.g. using just a
single chain of bindings, which you would have to search for the one
you want - but I am not sure they're actually performance wins.

How's that?

zw

