This is the mail archive of the
mailing list for the GCC project.
Better memory management in cpplib
- To: Mark Mitchell <mark at codesourcery dot com>
- Subject: Better memory management in cpplib
- From: Neil Booth <neil at daikokuya dot demon dot co dot uk>
- Date: Wed, 29 Aug 2001 00:04:07 +0100
- Cc: gcc at gcc dot gnu dot org
[cc-ed to the list as others might find it vaguely interesting]
I'm working on a more intelligent way of handling memory in cpplib;
and am presently focussing on the memory used to store tokens. By
that I mean the cpp_token structure, and not the textual payload of
identifiers, numbers or strings, whose handling is good enough for
the moment.
I think this may have knock-on benefits in your C++ parser stuff.
My original motivation was wanting to get away from copying cpp_token
structures around. We currently expand function-like macros by making
a copy of the expansion, and inserting into it the pre-expanded
arguments. This is quite an expensive operation, particularly if
macros get heavily nested. I had two reasons for the copying approach
when I originally wrote it: one, it avoids the question of token
lifetimes, and two, the tokens are modifiable. This latter property
is (only) used by the stand-alone preprocessor to insert spaces with
the PREV_WHITE flag to get correct paste-avoidance and natural token
spacing on output.
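The paste-avoidance point can be seen in a toy spelling routine. This
is a hedged sketch, not cpplib's real code: ptok and spell_tokens are
invented names, with prev_white standing in for the PREV_WHITE flag.
Without a separating space, two adjacent '+' tokens would print as
"++" and read back as a different token:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, much-simplified token: just a spelling and a flag
   modelled on PREV_WHITE (a sketch, not cpplib's real cpp_token). */
typedef struct ptok {
    const char *text;
    int prev_white;   /* emit a space before this token? */
} ptok;

/* Write the tokens into buf, inserting a space wherever the flag is
   set.  buf must be large enough for the spelled output. */
static void spell_tokens(const ptok *toks, int n, char *buf)
{
    buf[0] = '\0';
    for (int i = 0; i < n; i++) {
        if (i > 0 && toks[i].prev_white)
            strcat(buf, " ");
        strcat(buf, toks[i].text);
    }
}
```

So the stand-alone preprocessor's trick is simply to set the flag on
the second token before spelling the line out.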
Anyway, an obvious improvement is to expand macros by treating
expansions as a list of pointers to tokens rather than a list of
tokens. Then all the copying and inserting involves only
pointers, and not whole structures. It would also cut down memory
usage. However, this is not without its costs: I need to put in place
an intelligent concept of token lifetime, so that tokens don't
disappear from underneath us, and to solve the token paste avoidance
issue some other way for stand-alone CPP. Curiously, I'm finding the
latter to be the somewhat harder problem :-(
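To make the saving concrete, here's a minimal sketch contrasting the
two schemes; the types and functions are invented stand-ins, not the
real cpp_token or expansion code (the real structure is larger still):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Invented stand-in for cpp_token. */
typedef struct sktok {
    int type;
    const char *text;
} sktok;

/* Current scheme (sketch): expanding a macro duplicates whole token
   structures into the expansion.  Returns the bytes copied. */
static size_t expand_by_copy(const sktok *body, size_t n, sktok *out)
{
    memcpy(out, body, n * sizeof *body);
    return n * sizeof *body;
}

/* Proposed scheme (sketch): the expansion is a list of pointers, so
   only pointer-sized cells are copied and the tokens never move. */
static size_t expand_by_pointer(const sktok *body, size_t n,
                                const sktok **out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = &body[i];
    return n * sizeof *out;
}
```

With nested macros the copy scheme pays the full structure size at
every level, whereas the pointer scheme pays one pointer per token.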
I've since realised that a better and more flexible approach to token
lifetimes could be far more valuable to the front ends, particularly
the C++ one, than to CPP itself.
On your parser branch I glanced at what you're doing for your
tentative parsing. It seemed a bit more complicated than I
thought was necessary; but more on that some other time. I think you
too are having to copy tokens around and keep track of their original
location, which is tedious and boring. Your job would be much easier
if you could just maintain a pointer to a token, and be assured that
the token it points to won't go away at some inconvenient moment.
With the line-map stuff I did, each token now contains its original
file, line and column number, so that's not a problem. You need more
information than just a token, like tentative diagnostics and pointers
to a "tree", but basically I think I'm right in saying that all bases
are covered by a pointer to a cpp_token and a bit of extra fluff.
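Here's a rough illustration of how tentative parsing gets cheap once
token lifetime is guaranteed. All the names (tok, parser,
tentative_accept) are hypothetical, not the real parser interface;
the point is that saving and restoring a position is just a pointer
copy:

```c
#include <assert.h>

/* Hypothetical token and parser types. */
typedef struct tok { int type; } tok;

typedef struct parser {
    const tok *const *cur;  /* position: a pointer into an array of
                               token pointers, valid for as long as
                               the tokens themselves are kept alive */
} parser;

/* Saving and restoring a position copies one pointer; no token
   structures are duplicated or tracked. */
static const tok *const *save_pos(const parser *p) { return p->cur; }
static void restore_pos(parser *p, const tok *const *pos) { p->cur = pos; }

/* Tentatively consume a token of the given type; back out cheaply on
   a mismatch. */
static int tentative_accept(parser *p, int type)
{
    const tok *const *saved = save_pos(p);
    if ((*p->cur)->type == type) {
        p->cur++;
        return 1;
    }
    restore_pos(p, saved);
    return 0;
}
```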
So, I thought you might be interested in what I'm planning to do; it
involves a fairly simple rule for token lifetime, and allows front
ends to temporarily extend token lifetimes if they see fit. I've
appended below some random things I jotted into a file when I was
collecting my thoughts on this; I'd like to know what you think.
There may be other uses of it too. I think the C++ front end wants to
do syntax checking of member functions only after parsing the whole
class (because it might not have seen all the member variables). It
would be possible to get cpplib to keep the tokens of the whole class
in memory, until you've parsed the whole thing, at which point you
could then go back to the member functions and parse those properly.
You wouldn't need to translate them into some kind of tree
representation until this point. I'm not sure whether keeping a list
of token pointers is more useful though.
I'm working on the lexer part of these changes; it's looking promising
so far. It adds some code to cpplib, but simplifies a bunch of issues
too (cpplib no longer needs its current kludge to handle its own token
lookahead, for example). Dunno when I'll have the lexer changes
deliverable; but I'm hoping within a week or two at most.
/* cpplib token memory management:
The preprocessor returns pointers to tokens. This then raises the
question of how memory is managed for those tokens; i.e. under what
circumstances can what the pointer points to change? This briefly
explains memory management within CPP.
o Tokens are lexed a logical line at a time. Lexing only stops at
"real" new lines or at EOF. It does not stop at escaped newlines
nor at newlines within C-style comments or multi-line strings.
o The lexer re-uses its token buffer when it lexes a new line, so
previous tokens are overwritten. Therefore, pointers to lexed
tokens remain good only until a subsequent line is lexed. This
fact is used by the preprocessor itself to jump around the tokens
within directives without worry, since directives can only ever
occupy a single line.
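The lifetime rule above can be demonstrated with a toy lexer (invented
code, nothing like cpplib's real one): tokens for each line go into
one shared buffer, so a pointer taken from one line silently sees the
next line's contents after re-lexing:

```c
#include <assert.h>
#include <string.h>

#define MAX_TOKS 8

typedef struct toy_tok { char text[16]; } toy_tok;

/* One shared buffer, reused for every line -- the source of the
   invalidation rule. */
static toy_tok line_buf[MAX_TOKS];

/* "Lex" a line: split on spaces into the shared buffer, returning the
   token count.  Token spellings are assumed to fit in 15 chars. */
static int toy_lex_line(const char *line)
{
    int n = 0;
    while (*line && n < MAX_TOKS) {
        int len = 0;
        while (line[len] && line[len] != ' ')
            len++;
        memcpy(line_buf[n].text, line, (size_t)len);
        line_buf[n].text[len] = '\0';
        n++;
        line += len;
        while (*line == ' ')
            line++;
    }
    return n;
}
```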
o However, as long as the flag keep_tokens is set, the lexer does
not overwrite previous token storage but lexes new lines into fresh
storage, thus preserving the tokens of the previous line(s). This
is necessary if any kind of look ahead is taking place. For
example, keep_tokens is set by the preprocessor when it sees the
name of a function-like macro. It sets this flag whilst it looks
ahead for a '(', and if it finds it, keeps it set through to the
matching ')'. That way the token representing the macro's name,
and any earlier tokens on the same line, are not lost if the lexer
needs to read in tokens from subsequent lines.
o Clients of cpplib can set this flag too. Since cpplib does not
recycle token memory whilst the flag is set, it should be used only
when necessary, to keep the compiler's memory footprint down, and
each call to set the flag must be matched by an eventual call to
clear it.
o One example of use is string literal concatenation: when the C
front end sees a string literal token, it needs to be sure it
remains valid whilst looking ahead for a possible subsequent string
literal token to concatenate it with. Another example is the C++
parser: it often wants to parse tentatively, and for efficiency
reasons doesn't want to have to worry about making its own copies
of tokens whilst doing so.
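One way to honour the matched set/clear discipline is to treat the
flag as a nesting count, sketched below. The names (toy_reader,
keep_tokens_set, keep_tokens_clear) are invented and the counting
behaviour is an assumption of this sketch, not necessarily what
cpplib will do; counting just lets overlapping lookaheads nest
safely:

```c
#include <assert.h>

/* Hypothetical reader with a keep_tokens nesting count. */
typedef struct toy_reader { int keep_tokens; } toy_reader;

static void keep_tokens_set(toy_reader *r)
{
    r->keep_tokens++;
}

static void keep_tokens_clear(toy_reader *r)
{
    assert(r->keep_tokens > 0);   /* every clear must match a set */
    r->keep_tokens--;
}

/* String-concatenation lookahead (sketch): hold tokens alive only
   for the duration of the peek at the next token. */
static int concat_lookahead(toy_reader *r, int next_is_string)
{
    keep_tokens_set(r);                 /* first literal must survive... */
    int concatenated = next_is_string;  /* ...while we peek ahead */
    keep_tokens_clear(r);               /* matched clear: recycling resumes */
    return concatenated;
}
```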
o Macros do not interfere with this technique. The tokens forming
the replacement list of a macro are kept in memory permanently (at
present #undef does not free them, though if in future it did free
them, it would be fine as long as they were freed only when
keep_tokens is cleared). Macro argument tokens must have come
either from other macro expansions, or from the currently lexed
line. This leaves tokens that are generated by macro expansion:
stringized tokens and pasted tokens. These are not a problem
either; the preprocessor arranges to allocate them from the same
storage as is used by the lexer when lexing a line, so they are
invalidated at the same time as the lexer's tokens.
o Note that the textual payloads of identifiers, numbers, character
constants and string literals are allocated from buffers whose
contents are never freed.
o Lexed lines, object-like macros, and funlike macros with zero
arguments, are stored as a list of tokens, possibly terminated with
a jump to a continuation buffer in case the original buffer was
not large enough. The expansions of function-like macros with more
than one token are stored as lists of pointers to tokens, after
argument pre-expansion and replacement. */