\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@ifinfo
@dircategory Programming
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@c @finalout
@title Cpplib Internals
@subtitle Last revised Jan 2001
@subtitle for GCC version 3.0
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@contents
@page

@node Top, Conventions,, (DIR)
@chapter Cpplib - the core of the GNU C Preprocessor

The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is
now implemented as a library, cpplib, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends. It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.

This library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents some of the internals of cpplib, and a few
tricky issues encountered. It also describes certain behaviour we would
like to preserve, such as the format and spacing of its output.

Identifiers, macro expansion, hash nodes, lexing.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Whitespace::          Input and output newlines and whitespace.
* Hash Nodes::          All identifiers are hashed.
* Macro Expansion::     Macro expansion algorithm.
* Files::               File handling.
* Index::               Index.
@end menu

@node Conventions, Lexer, Top, Top
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces - one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @samp{cpphash.h}. Functions and types exposed to external
clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}. For
historical reasons this is no longer quite true, but we should strive to
stick to it.

We are striving to reduce the information exposed in cpplib.h to the
bare minimum necessary, and then to keep it there. This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behaviour.

@node Lexer, Whitespace, Conventions, Top
@unnumbered The Lexer
@cindex lexer
@cindex tokens

The lexer is contained in the file @samp{cpplex.c}. We want to have a
lexer that is single-pass, for efficiency reasons. We would also like
the lexer to only step forwards through the input files, and not step
back. This will make future changes to support different character
sets, in particular state or shift-dependent ones, much easier.

This file also contains all information needed to spell a token, i.e.@: to
output it either in a diagnostic or to a preprocessed output file. This
information is not exported, but made available to clients through such
functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminate a comment. Moreover, you cannot be sure
there is just one - there might be an arbitrarily long sequence of them.

So the routine @samp{parse_identifier}, which lexes an identifier, cannot
assume that it can scan forwards until the first non-identifier
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline. The same
applies to the routine that handles numbers, @samp{parse_number}. If these
routines stumble upon a @samp{?} or @samp{\}, they call
@samp{skip_escaped_newlines} to skip over any potential escaped newlines
before checking whether they can finish.

Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort. These cases use the function @samp{get_effective_char},
which returns the first character after any intervening escaped newlines.
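
As a rough illustration of the idea, and not cpplib's actual code, the
following sketch skips any run of escaped newlines, in either the
backslash or the @samp{??/} trigraph form, before looking at the next
character. The names @samp{skip_escaped_newlines_sketch} and
@samp{is_plus_eq_sketch} are hypothetical, and the sketch assumes that
trigraphs are enabled and that a newline is a single @samp{\n}.

```c
#include <stddef.h>

/* Hypothetical helper, not cpplib's implementation: return the index of
   the first character at or after POS that is not part of an escaped
   newline.  A backslash, or the three characters '?' '?' '/', counts as
   an escape only when a newline follows immediately.  Any number of
   escaped newlines may appear back to back.  */
size_t skip_escaped_newlines_sketch(const char *buf, size_t len, size_t pos)
{
    for (;;) {
        size_t p = pos;

        if (p < len && buf[p] == '\\')
            p += 1;
        else if (p + 2 < len && buf[p] == '?' && buf[p + 1] == '?'
                 && buf[p + 2] == '/')
            p += 3;
        else
            return pos;

        if (p < len && buf[p] == '\n')
            pos = p + 1;        /* consumed one escaped newline, look again */
        else
            return pos;         /* lone backslash or trigraph: leave it */
    }
}

/* Return 1 if the text at POS spells "+=" once escaped newlines between
   the two characters are ignored, and 0 otherwise.  */
int is_plus_eq_sketch(const char *buf, size_t len, size_t pos)
{
    size_t next;

    if (pos >= len || buf[pos] != '+')
        return 0;
    next = skip_escaped_newlines_sketch(buf, len, pos + 1);
    return next < len && buf[next] == '=';
}
```

With this sketch, a @samp{+} followed by several escaped newlines and
then a @samp{=} is recognised as @samp{+=}, whereas a @samp{+} followed
by a backslash that is not at the end of a line is not.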

The lexer needs to keep track of the correct column position,
including counting tabs as specified by the @samp{-ftabstop=} option.
This should be done even within comments; C-style comments can appear in
the middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
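
Tab-aware column counting can be sketched as below. This is only an
illustration: the function name @samp{advance_column_sketch}, the
1-based column convention and the handling of a zero tab stop are
assumptions made for the example, not details taken from cpplib.

```c
/* Hypothetical helper: return the column reached after character C is
   read at 1-based column COL, with tab stops every TABSTOP columns.
   A tab moves to the column just after the next tab stop; every other
   character advances by one.  A TABSTOP of zero is treated as "tabs
   are one column wide", which is an assumption of this sketch.  */
unsigned int advance_column_sketch(unsigned int col, char c,
                                   unsigned int tabstop)
{
    if (c == '\t' && tabstop > 0)
        return col + tabstop - (col - 1) % tabstop;
    return col + 1;
}
```

Applying such a function to every character, including those inside
comments, keeps the column correct for whatever follows a comment on
the same line.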

Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@samp{parse_identifier}. In both cases, whether a diagnostic is needed
or not is dependent upon lexer state. For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @samp{parse_identifier} makes use of flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behaviour is whilst
parsing header names. Normally, a @samp{<} would be lexed as a single
token. After a @code{#include} directive, though, everything from the
@samp{<} as far as the nearest @samp{>} character should be lexed as a
single token. Note that
we don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.
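
The rule that the first terminator ends a header name, with no escape
mechanism, can be sketched as follows. The function name
@samp{header_name_length_sketch} is hypothetical, and stopping at the
end of the line is an assumption of the example rather than something
stated above.

```c
#include <stddef.h>

/* Hypothetical helper: S points just after "#include" and any
   whitespace.  Return the length of the header-name token starting at
   S, including both delimiters, or 0 if S does not start a terminated
   header name on this line.  A backslash has no special meaning here,
   so the first matching terminator always ends the name.  */
size_t header_name_length_sketch(const char *s)
{
    char terminator;
    size_t i;

    if (s[0] == '<')
        terminator = '>';
    else if (s[0] == '"')
        terminator = '"';
    else
        return 0;

    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == terminator)
            return i + 1;
    return 0;
}
```

In this sketch a quote preceded by a backslash still terminates a
quoted header name, mirroring the behaviour described above.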

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, @samp{::} is a single token in C++, but two
separate @samp{:} tokens, and almost certainly a syntax error, in C.
Such cases are handled in the main function @samp{_cpp_lex_token}, based
upon the flags set in the @samp{cpp_options} structure.

Note we have almost, but not quite, achieved the goal of not stepping
backwards in the input stream. Currently @samp{skip_escaped_newlines}
does step back, though with care it should be possible to adjust it so
that this does not happen. One tricky case arises when we meet a
trigraph while the command line option @samp{-trigraphs} is not in
force but @samp{-Wtrigraphs} is: we need to warn about it, but then
buffer it and continue to treat it as three separate characters.

@node Whitespace, Hash Nodes, Lexer, Top
@unnumbered Whitespace
@cindex whitespace
@cindex newlines
@cindex escaped newlines
@cindex paste avoidance
@cindex line numbers

The lexer has been written to treat each of @samp{\r}, @samp{\n},
@samp{\r\n} and @samp{\n\r} as a single new line indicator. This allows
it to transparently preprocess MS-DOS, Macintosh and Unix files without
their needing to pass through a special filter beforehand.
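
A minimal sketch of this newline rule, assuming a simple in-memory
buffer, is shown below. The function names are hypothetical and the
code is not taken from cpplib; it only illustrates how the four
spellings can be folded into one new line indicator.

```c
#include <stddef.h>

/* Hypothetical helper: if BUF[POS] starts a new line indicator, return
   how many characters it occupies (1 or 2); otherwise return 0.  A CR
   immediately followed by LF, or LF immediately followed by CR, is one
   indicator; two identical characters in a row are two indicators.  */
size_t newline_length_sketch(const char *buf, size_t len, size_t pos)
{
    char c;

    if (pos >= len)
        return 0;
    c = buf[pos];
    if (c != '\n' && c != '\r')
        return 0;
    if (pos + 1 < len) {
        char d = buf[pos + 1];
        if ((d == '\n' || d == '\r') && d != c)
            return 2;
    }
    return 1;
}

/* Count the new line indicators in a buffer using the rule above.  */
size_t count_newlines_sketch(const char *buf, size_t len)
{
    size_t pos = 0, lines = 0;

    while (pos < len) {
        size_t n = newline_length_sketch(buf, len, pos);
        if (n > 0) {
            lines++;
            pos += n;
        } else {
            pos++;
        }
    }
    return lines;
}
```

Doing the pairing greedily from left to right, as here, is a design
choice of the sketch; it means the same routine serves MS-DOS,
Macintosh and Unix line endings without a separate filter.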

We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only. The function @samp{handle_newline} takes care of all newline
characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @samp{handle_newline}
to handle the newlines themselves.

Another whitespace issue only concerns the stand-alone preprocessor: we
want to guarantee that re-reading the preprocessed output results in an
identical token stream. Without taking special measures, this might not
be the case because of macro substitution. We could simply insert a
space between adjacent tokens, but ideally we would like to keep this to
a minimum, both for aesthetic reasons and because it causes problems for
people who still try to abuse the preprocessor for things like Fortran
source and Makefiles.

The token structure contains a flags byte, and two flags are of interest
here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE}
indicates that the token was preceded by whitespace; if this is the case
we need not worry about it incorrectly pasting with its predecessor.
The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
indicates that paste avoidance by insertion of a space to the left of
the token may be necessary. Recursively, the first token of a macro
substitution, the first token after a macro substitution, the first
token of a substituted argument, and the first token after a substituted
argument are all flagged @samp{AVOID_LPASTE} by the macro expander.

If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
and the routine @samp{cpp_avoid_paste} determines that it might be
misinterpreted by the lexer if a space is not inserted between it and
the immediately preceding token, then stand-alone CPP's output routines
will insert a space between them. To avoid excessive spacing,
@samp{cpp_avoid_paste} tries hard to only request a space if one is
likely to be necessary, but for reasons of efficiency it is slightly
conservative and might recommend a space where one is not strictly
needed.
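
The decision described above can be sketched as follows. This is a
deliberately simplified illustration and not the logic of
@samp{cpp_avoid_paste}: it works on token spellings rather than token
types, the flag values are invented for the example, and the table of
character pairs covers only a handful of cases, so unlike the real
routine it is not guaranteed to be conservative.

```c
#include <ctype.h>
#include <string.h>

/* Flag values are assumptions of this sketch, not cpplib's.  */
enum { PREV_WHITE = 1 << 0, AVOID_LPASTE = 1 << 1 };

static int is_ident_char(char c)
{
    return isalnum((unsigned char) c) || c == '_';
}

/* Could LAST (final character of the previous token) and FIRST (first
   character of the next token) be read as one longer token if printed
   with nothing between them?  Illustrative and incomplete.  */
static int might_paste_sketch(char last, char first)
{
    if (is_ident_char(last) && is_ident_char(first))
        return 1;
    switch (last) {
    case '+': return first == '+' || first == '=';
    case '-': return first == '-' || first == '=' || first == '>';
    case '<': return first == '<' || first == '=';
    case '>': return first == '>' || first == '=';
    case '&': return first == '&' || first == '=';
    case '|': return first == '|' || first == '=';
    case '/': return first == '/' || first == '*' || first == '=';
    case '*': case '%': case '^': case '!': case '=':
        return first == '=';
    case '#': return first == '#';
    case ':': return first == ':';
    case '.': return first == '.' || isdigit((unsigned char) first);
    default:  return 0;
    }
}

/* Return 1 if a space should be printed between PREV and NEXT, where
   FLAGS are the flags of the NEXT token.  */
int needs_space_sketch(const char *prev, const char *next, unsigned int flags)
{
    if (flags & PREV_WHITE)
        return 0;               /* whitespace will be printed anyway */
    if (!(flags & AVOID_LPASTE))
        return 0;               /* not on a macro expansion boundary */
    if (prev[0] == '\0' || next[0] == '\0')
        return 0;
    return might_paste_sketch(prev[strlen(prev) - 1], next[0]);
}
```

The ordering of the checks follows the text: the flags are tested first
because they are cheap, and the character comparison is only reached for
tokens on a macro expansion boundary that had no preceding whitespace.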

Finally, the preprocessor takes great care to ensure it keeps track of
both the position of a token in the source file, for diagnostic
purposes, and where it should appear in the output file, because using
CPP for other languages like assembler requires this. The two positions
may differ for the following reasons:

@itemize @bullet
@item
Escaped newlines are deleted, so lines spliced in this way are joined to
form a single logical line.

@item
A macro expansion replaces the tokens that form its invocation, but any
newlines appearing in the macro's arguments are interpreted as a single
space, with the result that the macro's replacement appears in full on
the same line that the macro name appeared in the source file. This is
particularly important for stringification of arguments - newlines
embedded in the arguments must appear in the string as spaces.
@end itemize

The source file location is maintained in the @samp{lineno} member of the
@samp{cpp_buffer} structure, and the column number is inferred from the
current position in the buffer relative to the @samp{line_base} buffer
variable, which is updated with every newline whether escaped or not.

TODO: Finish this.

@node Hash Nodes, Macro Expansion, Whitespace, Top
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an "identifier", it generates a hash code for it
and stores it in the hash table. By "identifier" we mean tokens with
type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
well as keywords, directive names, macro names and so on. For example,
all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed
when lexed.

Each node in the hash table contains various information about the
identifier it represents, such as its length and type. At any one
time, each identifier falls into exactly one of three categories:

@itemize @bullet
@item Macros

These have been declared to be macros, either on the command line or
with @code{#define}. A few, such as @samp{__TIME__}, are builtins
entered in the hash table during initialisation. The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion. Builtin macros are flagged as special, and instead
contain an enum indicating which of the various builtin macros it is.

@item Assertions

Assertions are in a separate namespace to macros. To enforce this, cpp
actually prepends a @code{#} character before hashing and entering it in
the hash table. An assertion's node points to a chain of answers to
that assertion.

@item Void

Everything else falls into this category - an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.

When preprocessing C++, this category also includes the named operators,
such as @samp{xor}. In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently. This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}. Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize

Identical identifiers share the same hash node. Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information. For example, when
parsing a @code{#define} statement, CPP flags each argument's identifier
hash node with the index of that argument. This makes duplicated
argument checking an O(1) operation for each argument. Similarly, for
each identifier in the macro's expansion, lookup to see if it is an
argument, and which argument it is, is also an O(1) operation. Further,
each directive name, such as @samp{endif}, has an associated directive
enum stored in its hash node, so that directive lookup is also O(1).
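
The argument-index technique can be sketched with a toy hash table.
Everything here is hypothetical (the structure layout, the function
names, the fixed bucket count and the hash function); it is not cpplib's
hash node. In cpplib each token already carries its node pointer, so the
lookup done by @samp{intern_sketch} below would have happened at lexing
time.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

/* Toy hash node: one per distinct identifier spelling.  */
struct hash_node_sketch {
    char *name;
    unsigned int arg_index;     /* 0 = not a parameter, else 1-based index */
    struct hash_node_sketch *next;
};

static struct hash_node_sketch *buckets[NBUCKETS];

static unsigned int hash_name(const char *s)
{
    unsigned int h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

/* Return the unique node for NAME, creating it on first use.  */
struct hash_node_sketch *intern_sketch(const char *name)
{
    unsigned int b = hash_name(name);
    struct hash_node_sketch *n;
    size_t len;

    for (n = buckets[b]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;

    len = strlen(name) + 1;
    n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->name = malloc(len);
    if (n->name == NULL)
        abort();
    memcpy(n->name, name, len);
    n->arg_index = 0;
    n->next = buckets[b];
    buckets[b] = n;
    return n;
}

/* Remove the parameter marks from the first COUNT names.  */
void clear_params_sketch(const char *const *names, unsigned int count)
{
    unsigned int i;
    for (i = 0; i < count; i++)
        intern_sketch(names[i])->arg_index = 0;
}

/* Mark each parameter's node with its 1-based position.  Return 1 on
   success, or 0 (leaving no marks behind) if a name is duplicated.
   The duplicate test is a single field read on the node.  */
int mark_params_sketch(const char *const *names, unsigned int count)
{
    unsigned int i;
    for (i = 0; i < count; i++) {
        struct hash_node_sketch *n = intern_sketch(names[i]);
        if (n->arg_index != 0) {
            clear_params_sketch(names, i);
            return 0;
        }
        n->arg_index = i + 1;
    }
    return 1;
}
```

Once the parameters are marked, deciding whether an identifier in the
expansion is an argument, and which one, is a read of
@samp{arg_index} on its node. The marks must be cleared after the
definition has been parsed, since the nodes are shared by every use of
the same identifier.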

@node Macro Expansion, Files, Hash Nodes, Top
@unnumbered Macro Expansion Algorithm

@node Files, Index, Macro Expansion, Top
@unnumbered File Handling
@cindex files

Fairly obviously, the file handling code of cpplib resides in the file
@samp{cppfiles.c}. It takes care of the details of file searching,
opening, reading and caching, for both the main source file and all the
headers it recursively includes.

The basic strategy is to minimize the number of system calls. On many
systems, the basic @code{open ()} and @code{fstat ()} system calls can
be quite expensive. For every @code{#include}-d file, we need to try
all the directories in the search path until we find a match. Some
projects, such as glibc, pass twenty or thirty include paths on the
command line, so this can rapidly become time consuming.

For a header file we have not encountered before we have little choice
but to do this. However, it is often the case that the same headers are
repeatedly included, and in these cases we try to avoid repeating the
filesystem queries whilst searching for the correct file.

For each file we try to open, we store the constructed path in a splay
tree. This path first undergoes simplification by the function
@code{_cpp_simplify_pathname}. For example,
@samp{/usr/include/bits/../foo.h} is simplified to
@samp{/usr/include/foo.h} before we enter it in the splay tree and try
to @code{open ()} the file. CPP will then find subsequent uses of
@samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
save system calls.
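
The kind of textual simplification involved can be sketched as below.
This is not @code{_cpp_simplify_pathname}; the function name
@samp{simplify_path_sketch} is hypothetical, and the sketch handles only
absolute Unix-style paths, leaving anything else untouched. Like any
purely textual treatment of @samp{..}, it assumes that no symbolic
links are involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: simplify an absolute path in place by dropping
   "." components, collapsing repeated slashes, and removing each ".."
   together with the component before it.  Relative paths are left
   unchanged.  The result is never longer than the input.  */
void simplify_path_sketch(char *path)
{
    char *out = path;           /* next position to write */
    const char *in = path;      /* next position to read */

    if (path[0] != '/')
        return;

    while (*in != '\0') {
        const char *start;
        size_t n;

        while (*in == '/')      /* skip a run of slashes */
            in++;
        start = in;
        while (*in != '\0' && *in != '/')
            in++;
        n = (size_t) (in - start);

        if (n == 0)
            break;              /* only trailing slashes were left */
        if (n == 1 && start[0] == '.')
            continue;           /* "." changes nothing */
        if (n == 2 && start[0] == '.' && start[1] == '.') {
            /* Back up over the previous component, if there is one.  */
            while (out > path && *--out != '/')
                ;
            continue;
        }
        *out++ = '/';
        memmove(out, start, n);
        out += n;
    }
    if (out == path)
        *out++ = '/';           /* everything cancelled out: the root */
    *out = '\0';
}
```

Using the simplified spelling as the lookup key is what lets two
different spellings of the same file share one cache entry.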

Further, it is likely the file contents have also been cached, saving a
@code{read ()} system call. We don't bother caching the contents of
header files that are re-inclusion protected, and whose re-inclusion
macro is defined when we leave the header file for the first time. If
the host supports it, we try to map suitably large files into memory,
rather than reading them in directly.

The include paths are internally stored on a null-terminated
singly-linked list, starting with the @code{"header.h"} directory search
chain, which then links into the @code{<header.h>} directory chain.

Files included with the @code{<foo.h>} syntax start the lookup directly
in the second half of this chain. However, files included with the
@code{"foo.h"} syntax start at the beginning of the chain, but with one
extra directory prepended. This is the directory of the current file;
the one containing the @code{#include} directive. Prepending this
directory on a per-file basis is handled by the function
@code{search_from}.
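
The shape of this lookup can be sketched as follows. The structure,
the function names and the toy existence check are all hypothetical;
the sketch is not @code{search_from} and performs no real filesystem
access. It only shows the single chain with two entry points and the
extra directory tried first for the quoted form.

```c
#include <stddef.h>
#include <string.h>

/* One directory on the null-terminated, singly-linked search chain.  */
struct search_dir_sketch {
    const char *name;
    const struct search_dir_sketch *next;
};

typedef int (*exists_fn)(const char *dir, const char *file);

/* Return the directory in which FILE is found, or NULL.  The quote
   chain is assumed to link into the bracket chain, so walking from
   QUOTE_CHAIN visits both halves.  */
const char *find_header_dir_sketch(const char *file, int angle_brackets,
                                   const char *current_dir,
                                   const struct search_dir_sketch *quote_chain,
                                   const struct search_dir_sketch *bracket_chain,
                                   exists_fn exists)
{
    const struct search_dir_sketch *d;

    if (angle_brackets) {
        d = bracket_chain;      /* start in the second half of the chain */
    } else {
        /* The including file's own directory is tried first.  */
        if (current_dir != NULL && exists(current_dir, file))
            return current_dir;
        d = quote_chain;
    }
    for (; d != NULL; d = d->next)
        if (exists(d->name, file))
            return d->name;
    return NULL;
}

/* Toy stand-in for the filesystem, used only to exercise the sketch.  */
int toy_exists(const char *dir, const char *file)
{
    static const char *const files[][2] = {
        { "src", "foo.h" },
        { "/usr/include", "foo.h" },
        { "/usr/include", "stdio.h" }
    };
    size_t i;

    for (i = 0; i < sizeof files / sizeof files[0]; i++)
        if (strcmp(files[i][0], dir) == 0 && strcmp(files[i][1], file) == 0)
            return 1;
    return 0;
}
```

With the toy data above, a quoted include of @samp{foo.h} from a file
in @samp{src} resolves to @samp{src}, while the bracketed form of the
same name skips that directory and resolves to @samp{/usr/include}.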

Note that a header included with a directory component, such as
@code{#include "mydir/foo.h"} and opened as
@samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
the basename @samp{foo.h} as the current directory.

Enough information is stored in the splay tree that CPP can immediately
tell whether it can skip the header file because of the multiple include
optimisation, whether the file didn't exist or couldn't be opened for
some reason, or whether the header was flagged not to be re-used, as it
is with the obsolete @code{#import} directive.

For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
CPP offers the ability to treat various include file names as aliases
for the real header files with shorter names. The map from one to the
other is found in a special file called @samp{header.gcc}, stored in the
command line (or system) include directories to which the mapping
applies. This may be higher up the directory tree than the full path to
the file minus the base name.

@node Index,, Files, Top
@unnumbered Index
@printindex cp

@bye