Commit | Line | Data |
---|---|---|
6951bc4a NB |
1 | \input texinfo |
2 | @setfilename cppinternals.info | |
3 | @settitle The GNU C Preprocessor Internals | |
4 | ||
5 | @ifinfo | |
6 | @dircategory Programming | |
7 | @direntry | |
23de1fbf | 8 | * Cpplib: (cppinternals). Cpplib internals. |
6951bc4a NB |
9 | @end direntry |
10 | @end ifinfo | |
11 | ||
12 | @c @smallbook | |
13 | @c @cropmarks | |
14 | @c @finalout | |
15 | @setchapternewpage odd | |
16 | @ifinfo | |
17 | This file documents the internals of the GNU C Preprocessor. | |
18 | ||
23de1fbf | 19 | Copyright 2000, 2001 Free Software Foundation, Inc. |
6951bc4a NB |
20 | |
21 | Permission is granted to make and distribute verbatim copies of | |
22 | this manual provided the copyright notice and this permission notice | |
23 | are preserved on all copies. | |
24 | ||
25 | @ignore | |
26 | Permission is granted to process this file through Tex and print the | |
27 | results, provided the printed document carries copying permission | |
28 | notice identical to this one except for the removal of this paragraph | |
29 | (this paragraph not being relevant to the printed manual). | |
30 | ||
31 | @end ignore | |
32 | Permission is granted to copy and distribute modified versions of this | |
33 | manual under the conditions for verbatim copying, provided also that | |
34 | the entire resulting derived work is distributed under the terms of a | |
35 | permission notice identical to this one. | |
36 | ||
37 | Permission is granted to copy and distribute translations of this manual | |
38 | into another language, under the above conditions for modified versions. | |
39 | @end ifinfo | |
40 | ||
41 | @titlepage | |
42 | @c @finalout | |
43 | @title Cpplib Internals | |
23de1fbf | 44 | @subtitle Last revised Jan 2001 |
6951bc4a NB |
45 | @subtitle for GCC version 3.0 |
46 | @author Neil Booth | |
47 | @page | |
48 | @vskip 0pt plus 1filll | |
49 | @c man begin COPYRIGHT | |
23de1fbf | 50 | Copyright @copyright{} 2000, 2001 |
6951bc4a NB |
51 | Free Software Foundation, Inc. |
52 | ||
53 | Permission is granted to make and distribute verbatim copies of | |
54 | this manual provided the copyright notice and this permission notice | |
55 | are preserved on all copies. | |
56 | ||
57 | Permission is granted to copy and distribute modified versions of this | |
58 | manual under the conditions for verbatim copying, provided also that | |
59 | the entire resulting derived work is distributed under the terms of a | |
60 | permission notice identical to this one. | |
61 | ||
62 | Permission is granted to copy and distribute translations of this manual | |
63 | into another language, under the above conditions for modified versions. | |
64 | @c man end | |
65 | @end titlepage | |
1347cc4f | 66 | @contents |
6951bc4a NB |
67 | @page |
68 | ||
69 | @node Top, Conventions,, (DIR) | |
70 | @chapter Cpplib - the core of the GNU C Preprocessor | |
71 | ||
72 | The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is | |
73 | now implemented as a library, cpplib, so it can be easily shared between | |
74 | a stand-alone preprocessor, and a preprocessor integrated with the C, | |
2147b154 | 75 | C++ and Objective-C front ends. It is also available for use by other |
6951bc4a NB |
76 | programs, though this is not recommended as its exposed interface has |
77 | not yet reached a point of reasonable stability. | |
78 | ||
79 | This library has been written to be re-entrant, so that it can be used | |
80 | to preprocess many files simultaneously if necessary. It has also been | |
81 | written with the preprocessing token as the fundamental unit; the | |
82 | preprocessor in previous versions of GCC would operate on text strings | |
83 | as the fundamental unit. | |
84 | ||
85 | This brief manual documents some of the internals of cpplib, and a few | |
86 | tricky issues encountered. It also describes certain behaviour we would | |
87 | like to preserve, such as the format and spacing of its output. | |
88 | ||
89 | Identifiers, macro expansion, hash nodes, lexing. | |
90 | ||
91 | @menu | |
92 | * Conventions:: Conventions used in the code. | |
2147b154 | 93 | * Lexer:: The combined C, C++ and Objective-C Lexer. |
6951bc4a | 94 | * Whitespace:: Input and output newlines and whitespace. |
111e0469 NB |
95 | * Hash Nodes:: All identifiers are hashed. |
96 | * Macro Expansion:: Macro expansion algorithm. | |
97 | * Files:: File handling. | |
6951bc4a NB |
98 | * Index:: Index. |
99 | @end menu | |
100 | ||
101 | @node Conventions, Lexer, Top, Top | |
111e0469 | 102 | @unnumbered Conventions |
a867b80c NB |
103 | @cindex interface |
104 | @cindex header files | |
6951bc4a NB |
105 | |
106 | cpplib has two interfaces - one is exposed internally only, and the | |
107 | other is for both internal and external use. | |
108 | ||
109 | The convention is that functions and types that are exposed to multiple | |
110 | files internally are prefixed with @samp{_cpp_}, and are to be found in | |
111 | the file @samp{cpphash.h}. Functions and types exposed to external | |
a867b80c NB |
112 | clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}. For |
113 | historical reasons this is no longer quite true, but we should strive to | |
114 | stick to it. | |
6951bc4a NB |
115 | |
116 | We are striving to reduce the information exposed in cpplib.h to the | |
117 | bare minimum necessary, and then to keep it there. This makes clear | |
118 | exactly what external clients are entitled to assume, and allows us to | |
119 | change internals in the future without worrying whether library clients | |
120 | are perhaps relying on some kind of undocumented implementation-specific | |
121 | behaviour. | |
122 | ||
123 | @node Lexer, Whitespace, Conventions, Top | |
111e0469 | 124 | @unnumbered The Lexer |
a867b80c NB |
125 | @cindex lexer |
126 | @cindex tokens | |
6951bc4a NB |
127 | |
128 | The lexer is contained in the file @samp{cpplex.c}. We want to have a | |
129 | lexer that is single-pass, for efficiency reasons. We would also like | |
130 | the lexer to only step forwards through the input files, and not step | |
131 | back. This will make future changes to support different character | |
132 | sets, in particular state or shift-dependent ones, much easier. | |
133 | ||
e979f9e8 | 134 | This file also contains all information needed to spell a token, i.e.@: to |
6951bc4a NB |
135 | output it either in a diagnostic or to a preprocessed output file. This |
136 | information is not exported, but made available to clients through such | |
137 | functions as @samp{cpp_spell_token} and @samp{cpp_token_len}. | |
138 | ||
139 | The most painful aspect of lexing ISO-standard C and C++ is handling | |
140 | trigraphs and backslash-escaped newlines. Trigraphs are processed before | |
141 | any interpretation of the meaning of a character is made, and unfortunately | |
142 | there is a trigraph representation for a backslash, so it is possible for | |
143 | the trigraph @samp{??/} to introduce an escaped newline. | |
144 | ||
145 | Escaped newlines are tedious because theoretically they can occur | |
146 | anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token, | |
147 | within the characters of an identifier, and even between the @samp{*} | |
148 | and @samp{/} that terminates a comment. Moreover, you cannot be sure | |
149 | there is just one - there might be an arbitrarily long sequence of them. | |
150 | ||
151 | So the routine @samp{parse_identifier}, which lexes an identifier, cannot | |
152 | assume that it can scan forwards until the first non-identifier | |
153 | character and be done with it, because this could be the @samp{\} | |
154 | introducing an escaped newline, or the @samp{?} introducing the trigraph | |
155 | sequence that represents the @samp{\} of an escaped newline. Similarly | |
156 | for the routine that handles numbers, @samp{parse_number}. If these | |
157 | routines stumble upon a @samp{?} or @samp{\}, they call | |
158 | @samp{skip_escaped_newlines} to skip over any potential escaped newlines | |
159 | before checking whether they can finish. | |
160 | ||
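The skipping step described above can be sketched as follows. This is a minimal illustration, not the cpplib routine: the name @samp{skip_escaped_newlines_sketch} and its signature are hypothetical, only @samp{\n} is treated as a newline, and whitespace between the backslash and the newline is not handled.

```c
#include <stddef.h>

/* Hypothetical helper, not the cpplib routine: starting at P, skip any
   run of escaped newlines and return a pointer to the first character
   that is not part of one.  An escaped newline is taken to be a
   backslash, spelt either as a plain backslash or as the trigraph
   '?' '?' '/', followed immediately by '\n'.  END points one past the
   last valid byte.  */
const char *
skip_escaped_newlines_sketch (const char *p, const char *end)
{
  for (;;)
    {
      size_t n;

      if (end - p >= 2 && p[0] == '\\' && p[1] == '\n')
        n = 2;
      else if (end - p >= 4 && p[0] == '?' && p[1] == '?'
               && p[2] == '/' && p[3] == '\n')
        n = 4;
      else
        return p;   /* Not an escaped newline; the caller decides.  */

      p += n;       /* Another one may follow straight after.  */
    }
}
```

A caller lexing an identifier would invoke this on meeting a @samp{?} or @samp{\}, and then re-examine the character at the returned position before deciding that the identifier has ended.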
161 | Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply | |
162 | check for a @samp{=} after a @samp{+} character to determine whether it | |
163 | has a @samp{+=} token; it needs to be prepared for an escaped newline of | |
164 | some sort. These cases use the function @samp{get_effective_char}, | |
165 | which returns the first character after any intervening escaped newlines. | |
166 | ||
167 | The lexer needs to keep track of the correct column position, | |
168 | including counting tabs as specified by the @samp{-ftabstop=} option. | |
169 | This should be done even within comments; C-style comments can appear in | |
170 | the middle of a line, and we want to report diagnostics in the correct | |
171 | position for text appearing after the end of the comment. | |
172 | ||
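As an illustration of tab-aware column counting, here is a minimal sketch. The function name and parameters are hypothetical, columns are assumed to be 1-based, and @samp{tabstop} stands for the value given to @samp{-ftabstop=} and must be non-zero.

```c
/* Hypothetical helper: return the 1-based column of the character at
   BUF[POS], where LINE_START is the index of the first character of
   the line and TABSTOP is the distance between tab stops.  A tab
   advances the column to just past the next tab stop.  Comment text is
   counted like any other text, so columns after a comment stay right.  */
unsigned int
column_of_sketch (const char *buf, unsigned int line_start,
                  unsigned int pos, unsigned int tabstop)
{
  unsigned int col = 1;
  unsigned int i;

  for (i = line_start; i < pos; i++)
    {
      if (buf[i] == '\t')
        col += tabstop - (col - 1) % tabstop;
      else
        col++;
    }
  return col;
}
```

With a tab stop of 8, the character after a leading tab is in column 9, as is the character after @samp{ab} followed by a tab.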
173 | Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers, | |
174 | may be invalid and require a diagnostic. However, if they appear in a | |
175 | macro expansion we don't want to complain with each use of the macro. | |
176 | It is therefore best to catch them during the lexing stage, in | |
177 | @samp{parse_identifier}. In both cases, whether a diagnostic is needed | |
178 | or not is dependent upon lexer state. For example, we don't want to | |
179 | issue a diagnostic for re-poisoning a poisoned identifier, or for using | |
180 | @samp{__VA_ARGS__} in the expansion of a variable-argument macro. | |
181 | Therefore @samp{parse_identifier} makes use of flags to determine | |
182 | whether a diagnostic is appropriate. Since we change state on a | |
183 | per-token basis, and don't lex whole lines at a time, this is not a | |
184 | problem. | |
185 | ||
186 | Another place where state flags are used to change behaviour is whilst | |
187 | parsing header names. Normally, a @samp{<} would be lexed as a token on | |
1198142b | 188 | its own. After a @code{#include} directive, though, everything from the |
6951bc4a NB |
189 | @samp{<} up to the nearest @samp{>} should be lexed as a single token. Note that |
190 | we don't allow the terminators of header names to be escaped; the first | |
191 | @samp{"} or @samp{>} terminates the header name. | |
192 | ||
193 | Interpretation of some character sequences depends upon whether we are | |
2147b154 | 194 | lexing C, C++ or Objective-C, and on the revision of the standard in |
a867b80c NB |
195 | force. For example, @samp{::} is a single token in C++, but two |
196 | separate @samp{:} tokens, and almost certainly a syntax error, in C. | |
197 | Such cases are handled in the main function @samp{_cpp_lex_token}, based | |
198 | upon the flags set in the @samp{cpp_options} structure. | |
6951bc4a NB |
199 | |
200 | Note we have almost, but not quite, achieved the goal of not stepping | |
201 | backwards in the input stream. Currently @samp{skip_escaped_newlines} | |
202 | does step back, though with care it should be possible to adjust it so | |
203 | that this does not happen. One tricky case is meeting a trigraph when | |
204 | the command line option @samp{-trigraphs} is not in force but | |
205 | @samp{-Wtrigraphs} is: we need to warn about it, but then buffer it and | |
206 | continue to treat it as three separate characters. | |
207 | ||
111e0469 NB |
208 | @node Whitespace, Hash Nodes, Lexer, Top |
209 | @unnumbered Whitespace | |
a867b80c NB |
210 | @cindex whitespace |
211 | @cindex newlines | |
212 | @cindex escaped newlines | |
213 | @cindex paste avoidance | |
214 | @cindex line numbers | |
6951bc4a NB |
215 | |
216 | The lexer has been written to treat each of @samp{\r}, @samp{\n}, | |
217 | @samp{\r\n} and @samp{\n\r} as a single newline indicator. This allows | |
218 | it to transparently preprocess MS-DOS, Macintosh and Unix files without | |
219 | their needing to pass through a special filter beforehand. | |
220 | ||
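The rule for recognising a newline indicator can be sketched as below. Both function names are hypothetical and this is not the code of @samp{handle_newline}; it only shows one way to treat the four spellings uniformly, pairing a @samp{\r} or @samp{\n} with an immediately following character of the other kind.

```c
#include <stddef.h>

/* Hypothetical helper: P points at a '\r' or '\n' and END is one past
   the last valid byte.  Return how many bytes the newline indicator
   occupies: 2 for "\r\n" or "\n\r", otherwise 1.  */
size_t
newline_length_sketch (const char *p, const char *end)
{
  if (end - p >= 2
      && (p[1] == '\r' || p[1] == '\n')
      && p[1] != p[0])
    return 2;
  return 1;
}

/* Count the newline indicators in the LEN bytes at BUF.  */
unsigned int
count_newlines_sketch (const char *buf, size_t len)
{
  const char *p = buf, *end = buf + len;
  unsigned int lines = 0;

  while (p < end)
    {
      if (*p == '\r' || *p == '\n')
        {
          p += newline_length_sketch (p, end);
          lines++;
        }
      else
        p++;
    }
  return lines;
}
```

Under this rule two consecutive @samp{\n} characters are two newlines, whereas @samp{\n\r} is one.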
221 | We also decided to treat a backslash, either @samp{\} or the trigraph | |
111e0469 NB |
222 | @samp{??/}, separated from one of the above newline indicators by |
223 | non-comment whitespace only, as intending to escape the newline. It | |
224 | tends to be a typing mistake, and cannot reasonably be mistaken for | |
225 | anything else in any of the C-family grammars. Since handling it this | |
226 | way is not strictly conforming to the ISO standard, the library issues a | |
227 | warning wherever it encounters it. | |
228 | ||
229 | Handling newlines like this is made simpler by doing it in one place | |
6951bc4a | 230 | only. The function @samp{handle_newline} takes care of all newline |
111e0469 NB |
231 | characters, and @samp{skip_escaped_newlines} takes care of arbitrarily |
232 | long sequences of escaped newlines, deferring to @samp{handle_newline} | |
233 | to handle the newlines themselves. | |
234 | ||
a867b80c NB |
235 | Another whitespace issue only concerns the stand-alone preprocessor: we |
236 | want to guarantee that re-reading the preprocessed output results in an | |
237 | identical token stream. Without taking special measures, this might not | |
238 | be the case because of macro substitution. We could simply insert a | |
239 | space between adjacent tokens, but ideally we would like to keep this to | |
240 | a minimum, both for aesthetic reasons and because it causes problems for | |
241 | people who still try to abuse the preprocessor for things like Fortran | |
242 | source and Makefiles. | |
243 | ||
244 | The token structure contains a flags byte, and two flags are of interest | |
245 | here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE} | |
246 | indicates that the token was preceded by whitespace; if this is the case | |
247 | we need not worry about it incorrectly pasting with its predecessor. | |
248 | The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and | |
249 | indicates that paste avoidance by insertion of a space to the left of | |
250 | the token may be necessary. Recursively, the first token of a macro | |
251 | substitution, the first token after a macro substitution, the first | |
252 | token of a substituted argument, and the first token after a substituted | |
253 | argument are all flagged @samp{AVOID_LPASTE} by the macro expander. | |
254 | ||
255 | If a token flagged in this way does not have a @samp{PREV_WHITE} flag, | |
256 | and the routine @samp{cpp_avoid_paste} determines that it might be | |
257 | misinterpreted by the lexer if a space is not inserted between it and | |
258 | the immediately preceding token, then stand-alone CPP's output routines | |
259 | will insert a space between them. To avoid excessive spacing, | |
260 | @samp{cpp_avoid_paste} tries hard to only request a space if one is | |
261 | likely to be necessary, but for reasons of efficiency it is slightly | |
262 | conservative and might recommend a space where one is not strictly | |
263 | needed. | |
264 | ||
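The decision just described can be sketched as follows. The flag names come from the text, but their values, the token structure and all function names here are hypothetical. In particular @samp{might_paste_sketch} is a crude stand-in that only compares the two adjoining characters; the real check knows the token types and covers cases this one misses.

```c
#include <string.h>

/* Flag names are from the text; the values here are made up.  */
#define PREV_WHITE   (1 << 0)  /* Token was preceded by whitespace.  */
#define AVOID_LPASTE (1 << 1)  /* Set by the macro expander.  */

/* Hypothetical, much simplified token: a non-empty spelling and flags.  */
struct token_sketch
{
  const char *spelling;
  unsigned char flags;
};

static int
is_word_char (char c)
{
  return c == '_' || (c >= '0' && c <= '9')
         || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Crude stand-in for the paste check: report a possible paste when the
   adjoining characters are both word characters or both punctuation.  */
int
might_paste_sketch (const struct token_sketch *prev,
                    const struct token_sketch *cur)
{
  char last = prev->spelling[strlen (prev->spelling) - 1];
  char first = cur->spelling[0];

  return is_word_char (last) == is_word_char (first);
}

/* Nonzero if a paste-avoiding space should be printed before CUR.  */
int
needs_avoidance_space_sketch (const struct token_sketch *prev,
                              const struct token_sketch *cur)
{
  return (cur->flags & AVOID_LPASTE) != 0
         && (cur->flags & PREV_WHITE) == 0
         && might_paste_sketch (prev, cur);
}
```

For example, a @samp{+} that ends one macro expansion followed by a flagged @samp{+} would get a space, so that re-lexing the output does not produce @samp{++}.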
265 | Finally, the preprocessor takes great care to ensure it keeps track of | |
266 | both the position of a token in the source file, for diagnostic | |
267 | purposes, and where it should appear in the output file, because using | |
268 | CPP for other languages like assembler requires this. The two positions | |
269 | may differ for the following reasons: | |
270 | ||
271 | @itemize @bullet | |
272 | @item | |
273 | Escaped newlines are deleted, so lines spliced in this way are joined to | |
274 | form a single logical line. | |
275 | ||
276 | @item | |
277 | A macro expansion replaces the tokens that form its invocation, but any | |
278 | newlines appearing in the macro's arguments are interpreted as a single | |
279 | space, with the result that the macro's replacement appears in full on | |
280 | the same line that the macro name appeared in the source file. This is | |
281 | particularly important for stringification of arguments - newlines | |
282 | embedded in the arguments must appear in the string as spaces. | |
283 | @end itemize | |
284 | ||
285 | The source file location is maintained in the @var{lineno} member of the | |
286 | @var{cpp_buffer} structure, and the column number inferred from the | |
287 | current position in the buffer relative to the @var{line_base} buffer | |
288 | variable, which is updated with every newline whether escaped or not. | |
289 | ||
290 | TODO: Finish this. | |
291 | ||
111e0469 NB |
292 | @node Hash Nodes, Macro Expansion, Whitespace, Top |
293 | @unnumbered Hash Nodes | |
a867b80c NB |
294 | @cindex hash table |
295 | @cindex identifiers | |
296 | @cindex macros | |
297 | @cindex assertions | |
298 | @cindex named operators | |
111e0469 NB |
299 | |
300 | When cpplib encounters an "identifier", it generates a hash code for it | |
301 | and stores it in the hash table. By "identifier" we mean tokens with | |
302 | type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as | |
303 | well as keywords, directive names, macro names and so on. For example, | |
304 | all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed | |
305 | when lexed. | |
306 | ||
307 | Each node in the hash table contains various information about the | |
308 | identifier it represents, for example its length and type. At any one | |
309 | time, each identifier falls into exactly one of three categories: | |
310 | ||
311 | @itemize @bullet | |
312 | @item Macros | |
313 | ||
314 | These have been declared to be macros, either on the command line or | |
1198142b | 315 | with @code{#define}. A few, such as @samp{__TIME__}, are builtins |
111e0469 NB |
316 | entered in the hash table during initialisation. The hash node for a |
317 | normal macro points to a structure with more information about the | |
318 | macro, such as whether it is function-like, how many arguments it takes, | |
319 | and its expansion. The node for a builtin macro is flagged as special, and | |
320 | instead contains an enum indicating which of the various builtin macros it is. | |
321 | ||
322 | @item Assertions | |
323 | ||
324 | Assertions are in a separate namespace to macros. To enforce this, cpp | |
1198142b | 325 | actually prepends a @code{#} character to the name before hashing and entering it in |
111e0469 NB |
326 | the hash table. An assertion's node points to a chain of answers to |
327 | that assertion. | |
328 | ||
329 | @item Void | |
330 | ||
331 | Everything else falls into this category - an identifier that is not | |
332 | currently a macro, or a macro that has since been undefined with | |
1198142b | 333 | @code{#undef}. |
111e0469 NB |
334 | |
335 | When preprocessing C++, this category also includes the named operators, | |
336 | such as @samp{xor}. In expressions these behave like the operators they | |
337 | represent, but in contexts where the spelling of a token matters they | |
338 | are spelt differently. This spelling distinction is relevant when they | |
1198142b NB |
339 | are operands of the stringizing and pasting macro operators @code{#} and |
340 | @code{##}. Named operator hash nodes are flagged, both to catch the | |
111e0469 NB |
341 | spelling distinction and to prevent them from being defined as macros. |
342 | @end itemize | |
343 | ||
344 | The same identifiers share the same hash node. Since each identifier | |
345 | token, after lexing, contains a pointer to its hash node, this is used | |
346 | to provide rapid lookup of various information. For example, when | |
1198142b | 347 | parsing a @code{#define} directive, CPP flags each argument's identifier |
111e0469 NB |
348 | hash node with the index of that argument. This makes duplicated |
349 | argument checking an O(1) operation for each argument. Similarly, for | |
350 | each identifier in the macro's expansion, lookup to see if it is an | |
351 | argument, and which argument it is, is also an O(1) operation. Further, | |
352 | each directive name, such as @samp{endif}, has an associated directive | |
353 | enum stored in its hash node, so that directive lookup is also O(1). | |
354 | ||
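To illustrate why shared hash nodes make duplicate-argument checking cheap, here is a minimal sketch. The node layout, the tiny chained hash table and every name in it are hypothetical and far simpler than what cpplib uses; the point is only that once an identifier has been mapped to its unique node, the check is a single field test.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

/* Hypothetical node; a real one would hold much more.  */
struct node_sketch
{
  struct node_sketch *next;   /* Bucket chain.  */
  char *name;
  unsigned int arg_index;     /* 1-based argument index, 0 if none.  */
};

static struct node_sketch *buckets[NBUCKETS];

/* Return the unique node for NAME, creating it on first sight.
   Returns NULL only if memory runs out.  */
struct node_sketch *
intern_sketch (const char *name)
{
  unsigned int h = 0;
  const char *p;
  struct node_sketch *n;
  size_t len = strlen (name);

  for (p = name; *p; p++)
    h = h * 31 + (unsigned char) *p;
  h %= NBUCKETS;

  for (n = buckets[h]; n; n = n->next)
    if (strcmp (n->name, name) == 0)
      return n;

  n = malloc (sizeof *n);
  if (!n)
    return NULL;
  n->name = malloc (len + 1);
  if (!n->name)
    {
      free (n);
      return NULL;
    }
  memcpy (n->name, name, len + 1);
  n->arg_index = 0;
  n->next = buckets[h];
  buckets[h] = n;
  return n;
}

/* Record ARG as argument number INDEX (1-based) of the macro being
   defined.  Return 0 on success, -1 if it duplicates an earlier one.
   No search of the argument list is needed.  */
int
save_argument_sketch (struct node_sketch *arg, unsigned int index)
{
  if (arg->arg_index != 0)
    return -1;
  arg->arg_index = index;
  return 0;
}
```

In a scheme like this the indices would have to be cleared again once the definition has been parsed, so that the next @code{#define} starts with clean nodes.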
111e0469 NB |
355 | @node Macro Expansion, Files, Hash Nodes, Top |
356 | @unnumbered Macro Expansion Algorithm | |
111e0469 | 357 | |
a867b80c | 358 | @node Files, Index, Macro Expansion, Top |
111e0469 | 359 | @unnumbered File Handling |
1198142b NB |
360 | @cindex files |
361 | ||
362 | Fairly obviously, the file handling code of cpplib resides in the file | |
363 | @samp{cppfiles.c}. It takes care of the details of file searching, | |
364 | opening, reading and caching, for both the main source file and all the | |
365 | headers it recursively includes. | |
366 | ||
367 | The basic strategy is to minimize the number of system calls. On many | |
368 | systems, the basic @code{open ()} and @code{fstat ()} system calls can | |
369 | be quite expensive. For every @code{#include}-d file, we need to try | |
370 | all the directories in the search path until we find a match. Some | |
371 | projects, such as glibc, pass twenty or thirty include paths on the | |
372 | command line, so this can rapidly become time consuming. | |
373 | ||
374 | For a header file we have not encountered before we have little choice | |
375 | but to do this. However, it is often the case that the same headers are | |
376 | repeatedly included, and in these cases we try to avoid repeating the | |
377 | filesystem queries whilst searching for the correct file. | |
378 | ||
379 | For each file we try to open, we store the constructed path in a splay | |
380 | tree. This path first undergoes simplification by the function | |
381 | @code{_cpp_simplify_pathname}. For example, | |
382 | @samp{/usr/include/bits/../foo.h} is simplified to | |
383 | @samp{/usr/include/foo.h} before we enter it in the splay tree and try | |
384 | to @code{open ()} the file. CPP will then find subsequent uses of | |
385 | @samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and | |
386 | save system calls. | |
387 | ||
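The kind of simplification in the example above can be sketched as follows. This is not the code of @code{_cpp_simplify_pathname}; the function name is hypothetical, only absolute paths are handled, and the rewrite is purely textual, so symbolic links are not taken into account.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for path simplification.  Rewrites
   the absolute path PATH in place, dropping "." components and empty
   components, and cancelling each ".." against the component before it.
   Paths that do not start with '/' are left untouched.  */
void
simplify_path_sketch (char *path)
{
  char *out = path;        /* End of the simplified prefix.  */
  const char *in = path;   /* Next unread input character.  */

  if (*in != '/')
    return;

  while (*in)
    {
      const char *start;
      size_t len;

      while (*in == '/')   /* Skip one or more separators.  */
        in++;
      start = in;
      while (*in && *in != '/')
        in++;
      len = (size_t) (in - start);

      if (len == 0 || (len == 1 && start[0] == '.'))
        continue;          /* Nothing to keep.  */

      if (len == 2 && start[0] == '.' && start[1] == '.')
        {
          /* Back up over the previous component, if there is one.  */
          while (out > path && *--out != '/')
            ;
          continue;
        }

      *out++ = '/';
      memmove (out, start, len);
      out += len;
    }

  if (out == path)         /* Everything cancelled: the root remains.  */
    *out++ = '/';
  *out = '\0';
}
```

Because the output is never longer than the input, the rewrite can safely be done in the caller's buffer, and the result can then serve as the lookup key.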
388 | Further, it is likely the file contents have also been cached, saving a | |
389 | @code{read ()} system call. We don't bother caching the contents of | |
390 | header files that are re-inclusion protected, and whose re-inclusion | |
391 | macro is defined when we leave the header file for the first time. If | |
392 | the host supports it, we try to map suitably large files into memory, | |
393 | rather than reading them in directly. | |
394 | ||
c8a96070 | 395 | The include paths are internally stored on a null-terminated |
1198142b NB |
396 | singly-linked list, starting with the @code{"header.h"} directory search |
397 | chain, which then links into the @code{<header.h>} directory chain. | |
398 | ||
399 | Files included with the @code{<foo.h>} syntax start the lookup directly | |
400 | in the second half of this chain. However, files included with the | |
401 | @code{"foo.h"} syntax start at the beginning of the chain, but with one | |
402 | extra directory prepended. This is the directory of the current file; | |
403 | the one containing the @code{#include} directive. Prepending this | |
404 | directory on a per-file basis is handled by the function | |
405 | @code{search_from}. | |
406 | ||
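The chain layout and the choice of starting point can be sketched as below. The structure and function are hypothetical and are not the code of @code{search_from}; they only show how one null-terminated list can serve both include syntaxes.

```c
#include <stddef.h>

/* Hypothetical search-path entry.  */
struct dir_sketch
{
  struct dir_sketch *next;   /* NULL ends the chain.  */
  const char *name;
};

/* QUOTE_CHAIN is the head of the whole list; BRACKET_CHAIN points at
   the entry within it where <header.h> lookups begin.  CURRENT_DIR is
   a caller-supplied entry for the including file's directory.  Return
   the entry at which the search for one #include should start.  */
struct dir_sketch *
search_start_sketch (struct dir_sketch *quote_chain,
                     struct dir_sketch *bracket_chain,
                     struct dir_sketch *current_dir,
                     int angle_brackets)
{
  if (angle_brackets)
    return bracket_chain;

  /* Prepend the current file's directory to the full chain.  */
  current_dir->next = quote_chain;
  return current_dir;
}
```

Walking the @samp{next} pointers from the returned entry visits the directories in search order, and the walk ends at the same final entry whichever syntax was used.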
407 | Note that a header included with a directory component, such as | |
408 | @code{#include "mydir/foo.h"} and opened as | |
409 | @samp{/usr/local/include/mydir/foo.h}, will have the complete path minus | |
410 | the basename @samp{foo.h} as the current directory. | |
411 | ||
412 | Enough information is stored in the splay tree that CPP can immediately | |
413 | tell whether it can skip the header file because of the multiple include | |
414 | optimisation, whether the file didn't exist or couldn't be opened for | |
415 | some reason, or whether the header was flagged not to be re-used, as it | |
416 | is with the obsolete @code{#import} directive. | |
417 | ||
418 | For the benefit of MS-DOS filesystems with an 8.3 filename limitation, | |
419 | CPP offers the ability to treat various include file names as aliases | |
420 | for the real header files with shorter names. The map from one to the | |
421 | other is found in a special file called @samp{header.gcc}, stored in the | |
422 | command line (or system) include directories to which the mapping | |
423 | applies. This may be higher up the directory tree than the full path to | |
424 | the file minus the base name. | |
6951bc4a | 425 | |
a867b80c NB |
426 | @node Index,, Files, Top |
427 | @unnumbered Index | |
6951bc4a NB |
428 | @printindex cp |
429 | ||
6951bc4a | 430 | @bye |