From 6951bc4a54318800dfc949c3c0167e8a0857dc35 Mon Sep 17 00:00:00 2001 From: Neil Booth Date: Mon, 4 Dec 2000 07:34:21 +0000 Subject: [PATCH] * cppinternals.texi: New file. From-SVN: r37990 --- gcc/ChangeLog | 4 + gcc/cppinternals.texi | 225 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 229 insertions(+) create mode 100644 gcc/cppinternals.texi diff --git a/gcc/ChangeLog b/gcc/ChangeLog index f2ff46108eeb..aaa31987ef91 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,7 @@ +2000-12-04 Neil Booth + + * cppinternals.texi: New file. + 2000-12-04 Neil Booth * cppfiles.c (cpp_make_system_header): Take 2 booleans, diff --git a/gcc/cppinternals.texi b/gcc/cppinternals.texi new file mode 100644 index 000000000000..c1604e907f2b --- /dev/null +++ b/gcc/cppinternals.texi @@ -0,0 +1,225 @@ +\input texinfo +@setfilename cppinternals.info +@settitle The GNU C Preprocessor Internals + +@ifinfo +@dircategory Programming +@direntry +* Cpplib: Cpplib internals. +@end direntry +@end ifinfo + +@c @smallbook +@c @cropmarks +@c @finalout +@setchapternewpage odd +@ifinfo +This file documents the internals of the GNU C Preprocessor. + +Copyright 2000 Free Software Foundation, Inc. + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. + +@ignore +Permission is granted to process this file through Tex and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual). + +@end ignore +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided also that +the entire resulting derived work is distributed under the terms of a +permission notice identical to this one. 
+ +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions. +@end ifinfo + +@titlepage +@c @finalout +@title Cpplib Internals +@subtitle Last revised Dec 2000 +@subtitle for GCC version 3.0 +@author Neil Booth +@page +@vskip 0pt plus 1filll +@c man begin COPYRIGHT +Copyright @copyright{} 2000 +Free Software Foundation, Inc. + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. + +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided also that +the entire resulting derived work is distributed under the terms of a +permission notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions. +@c man end +@end titlepage +@page + +@node Top, Conventions,, (DIR) +@chapter Cpplib - the core of the GNU C Preprocessor + +The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is +now implemented as a library, cpplib, so it can be easily shared between +a stand-alone preprocessor, and a preprocessor integrated with the C, +C++ and Objective C front ends. It is also available for use by other +programs, though this is not recommended as its exposed interface has +not yet reached a point of reasonable stability. + +This library has been written to be re-entrant, so that it can be used +to preprocess many files simultaneously if necessary. It has also been +written with the preprocessing token as the fundamental unit; the +preprocessor in previous versions of GCC would operate on text strings +as the fundamental unit. + +This brief manual documents some of the internals of cpplib, and a few +tricky issues encountered. 
It also describes certain behaviour we would +like to preserve, such as the format and spacing of its output. + +Identifiers, macro expansion, hash nodes, lexing. + +@menu +* Conventions:: Conventions used in the code. +* Lexer:: The combined C, C++ and Objective C Lexer. +* Whitespace:: Input and output newlines and whitespace. +* Concept Index:: Index of concepts and terms. +* Index:: Index. +@end menu + +@node Conventions, Lexer, Top, Top + +cpplib has two interfaces - one is exposed internally only, and the +other is for both internal and external use. + +The convention is that functions and types that are exposed to multiple +files internally are prefixed with @samp{_cpp_}, and are to be found in +the file @samp{cpphash.h}. Functions and types exposed to external +clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}. + +We are striving to reduce the information exposed in cpplib.h to the +bare minimum necessary, and then to keep it there. This makes clear +exactly what external clients are entitled to assume, and allows us to +change internals in the future without worrying whether library clients +are perhaps relying on some kind of undocumented implementation-specific +behaviour. + +@node Lexer, Whitespace, Conventions, Top + +The lexer is contained in the file @samp{cpplex.c}. We want to have a +lexer that is single-pass, for efficiency reasons. We would also like +the lexer to only step forwards through the input files, and not step +back. This will make future changes to support different character +sets, in particular state or shift-dependent ones, much easier. + +This file also contains all information needed to spell a token, i.e. to +output it either in a diagnostic or to a preprocessed output file. This +information is not exported, but made available to clients through such +functions as @samp{cpp_spell_token} and @samp{cpp_token_len}. 
+ +The most painful aspect of lexing ISO-standard C and C++ is handling +trigraphs and backslash-escaped newlines. Trigraphs are processed before +any interpretation of the meaning of a character is made, and unfortunately +there is a trigraph representation for a backslash, so it is possible for +the trigraph @samp{??/} to introduce an escaped newline. + +Escaped newlines are tedious because theoretically they can occur +anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token, +within the characters of an identifier, and even between the @samp{*} +and @samp{/} that terminate a comment. Moreover, you cannot be sure +there is just one - there might be an arbitrarily long sequence of them. + +So the routine @samp{parse_identifier}, which lexes an identifier, cannot +assume that it can scan forwards until the first non-identifier +character and be done with it, because this could be the @samp{\} +introducing an escaped newline, or the @samp{?} introducing the trigraph +sequence that represents the @samp{\} of an escaped newline. The same +applies to the routine that handles numbers, @samp{parse_number}. If these +routines stumble upon a @samp{?} or @samp{\}, they call +@samp{skip_escaped_newlines} to skip over any potential escaped newlines +before checking whether they can finish. + +Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply +check for an @samp{=} after a @samp{+} character to determine whether it +has a @samp{+=} token; it needs to be prepared for an escaped newline of +some sort. These cases use the function @samp{get_effective_char}, +which returns the first character after any intervening escaped newlines. + +The lexer needs to keep track of the correct column position, +including counting tabs as specified by the @samp{-ftabstop=} option.
+This should be done even within comments; C-style comments can appear in +the middle of a line, and we want to report diagnostics in the correct +position for text appearing after the end of the comment. + +Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers, +may be invalid and require a diagnostic. However, if they appear in a +macro expansion we don't want to complain with each use of the macro. +It is therefore best to catch them during the lexing stage, in +@samp{parse_identifier}. In both cases, whether a diagnostic is needed +or not is dependent upon lexer state. For example, we don't want to +issue a diagnostic for re-poisoning a poisoned identifier, or for using +@samp{__VA_ARGS__} in the expansion of a variable-argument macro. +Therefore @samp{parse_identifier} makes use of flags to determine +whether a diagnostic is appropriate. Since we change state on a +per-token basis, and don't lex whole lines at a time, this is not a +problem. + +Another place where state flags are used to change behaviour is whilst +parsing header names. Normally, a @samp{<} would be lexed as a single +token on its own. After a @samp{#include} directive, though, the text +from the @samp{<} as far as the nearest @samp{>} character should be +lexed as a single token. Note that +we don't allow the terminators of header names to be escaped; the first +@samp{"} or @samp{>} terminates the header name. + +Interpretation of some character sequences depends upon whether we are +lexing C, C++ or Objective C, and on the revision of the standard in +force. For example, @samp{@@foo} is a single identifier token in +Objective C, but two separate tokens @samp{@@} and @samp{foo} in C or +C++. Such cases are handled in the main function @samp{_cpp_lex_token}, +based upon the flags set in the @samp{cpp_options} structure. + +Note we have almost, but not quite, achieved the goal of not stepping +backwards in the input stream.
Currently @samp{skip_escaped_newlines} +does step back, though with care it should be possible to adjust it so +that this does not happen. For example, one tricky case arises when we +meet a trigraph while the command line option @samp{-trigraphs} is not in +force but @samp{-Wtrigraphs} is: we need to warn about it, but then +buffer it and continue to treat it as 3 separate characters. + +@node Whitespace, Concept Index, Lexer, Top + +The lexer has been written to treat each of @samp{\r}, @samp{\n}, +@samp{\r\n} and @samp{\n\r} as a single newline indicator. This allows +it to transparently preprocess MS-DOS, Macintosh and Unix files without +their needing to pass through a special filter beforehand. + +We also decided to treat a backslash, either @samp{\} or the trigraph +@samp{??/}, separated from one of the above newline forms by whitespace +only (one or more space, tab, form-feed, vertical tab or NUL characters), +as an intended escaped newline. The library issues a diagnostic in this +case. + +Handling newlines in this way is made simpler by doing it in one place +only. The function @samp{handle_newline} takes care of all newline +characters, and @samp{skip_escaped_newlines} takes care of all escaping +of newlines, deferring to @samp{handle_newline} to handle the newlines +themselves. + +@node Concept Index, Index, Whitespace, Top +@unnumbered Concept Index +@printindex cp + +@node Index,, Concept Index, Top +@unnumbered Index of Directives, Macros and Options +@printindex fn + +@contents +@bye -- 2.43.5