Projects relating to cpplib
Note: this writeup represents state as of 2002.
cpplib has largely been completed, and is stable at this point.
For GCC versions 3.0 and later, it is linked into the C, C++ and
Objective C front ends. Most future work will relate to character set
issues, performance enhancements and improving cpplib as a stand-alone library.
Work recently completed
- Stand-alone CPP is dead. The compiler front end now handles
preprocessed output if necessary.
- As many built-in macros as possible have been moved to the front
ends, and out of SPECS and cpplib itself (some targets still
- CPP arithmetic is now done to the correct target precision, based
upon the selected language standard.
- The traditional preprocessor has been integrated into cpplib.
At present it is an output-only preprocessor, but it should be
fairly simple to modify cpplib so that traditional preprocessing
and then tokenization are performed in one invocation.
Greater Coordination with the Front Ends
The integrated preprocessor would benefit from greater integration
with the front ends. It still feels like it has been tacked on as an
afterthought, which is not entirely coincidental.
- Character sets that are strict supersets of ASCII are safe to
use, but extended characters cannot appear in identifiers. This
has to be coordinated with the C and C++ front ends. See character set issues, below.
- C99 universal character escapes (\uxxxx and \Uxxxxxxxx) are not
recognized in identifiers. Proper support has to be coordinated
with the front ends.
- Precompiled headers are commonly requested; this entails the
ability for cpp to dump out and reload all its internal state.
You can get some of this with the debug switches, but not all,
and not in a reloadable format. The front end must cooperate.
- Integration of diagnostic reporting. The front ends could use
extra information only available to the preprocessor, such as
column numbers and macros under expansion. The existing code
copies cpplib's internal state into the state used by
diagnostic.c, which is better than writing out and
processing linemarker commands, but still suboptimal.
- If YACC did not insist on assigning its own values for token
codes, there would be no need for a translation layer between
the codes returned by cpplib and the codes used by the parser.
Noises have been made about a recursive-descent parser that
could handle all of C, C++, Objective C; if this ever happens,
it should use cpplib's token codes.
- String concatenation should be handled in c-lex.c. Then the front ends
would not have to jump through hoops to remember to concatenate
strings, and we could simplify the parsers a little too.
Potential minor improvements
- The file-handling code allocates lots of items with xmalloc.
The rest of cpplib is now reasonably efficient in its use of
memory; minor improvements are certainly still possible.
- There might be room for further improvement of macro expansion
performance, although it is now pretty good. For example, we
currently pre-expand each argument (if necessary) into its own
buffer, replace the arguments in the replacement list with their
expansions, and then free up each buffer. It might be better to
simply expand the arguments into the final argument-replaced
expansion, saving one copy per argument and the need to free the
argument expansion buffers. It has the disadvantage that we
don't know the size we need to make the token buffer in advance
[equally, though, we don't know the size we need to make each
expanded argument buffer, either]. In view of this, a further
enhancement might then be to permit the list of token pointers
that represents the expansion to be made up of more than one
run. Then we would just need to append a new run, rather than
reallocating the expansion buffer if we overflow its initial allocation.
- It might be worth trying to optimize wrapper headers - files
containing only an #include of another file, so that they are
optimized out on reinclusion. This is more tricky than it may
sound - something with heuristics similar to the
multiple-include optimization is needed, that handles multiple
levels of wrapper headers.
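The multi-run idea from the macro expansion item above can be sketched as follows. This is only an illustration under stated assumptions, not cpplib's code: the names token_run, expansion, and expansion_push are hypothetical, and the token type is left opaque. When the current run is full, a new run is appended instead of reallocating and copying the existing pointers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for cpplib's token type; only pointers are stored. */
typedef struct token token;

/* One contiguous run of token pointers; runs are chained in order. */
typedef struct token_run {
    struct token_run *next;
    size_t used;            /* slots filled so far */
    size_t cap;             /* total slots in this run */
    const token **slots;
} token_run;

/* An expansion is a chain of runs; new tokens go onto the tail run. */
typedef struct expansion {
    token_run *head;
    token_run *tail;
    size_t run_cap;         /* capacity given to each newly created run */
} expansion;

static token_run *new_run(size_t cap)
{
    token_run *r = malloc(sizeof *r);
    if (!r)
        return NULL;
    r->slots = malloc(cap * sizeof *r->slots);
    if (!r->slots) {
        free(r);
        return NULL;
    }
    r->next = NULL;
    r->used = 0;
    r->cap = cap;
    return r;
}

/* Append one token pointer. Existing runs are never moved or copied.
   Returns 0 on success, -1 on allocation failure. */
static int expansion_push(expansion *e, const token *t)
{
    if (!e->tail || e->tail->used == e->tail->cap) {
        token_run *r = new_run(e->run_cap);
        if (!r)
            return -1;
        if (e->tail)
            e->tail->next = r;
        else
            e->head = r;
        e->tail = r;
    }
    e->tail->slots[e->tail->used++] = t;
    return 0;
}

/* Total number of token pointers across all runs. */
static size_t expansion_length(const expansion *e)
{
    size_t n = 0;
    for (const token_run *r = e->head; r; r = r->next)
        n += r->used;
    return n;
}

/* Number of runs currently in the chain. */
static size_t expansion_count_runs(const expansion *e)
{
    size_t n = 0;
    for (const token_run *r = e->head; r; r = r->next)
        n++;
    return n;
}

/* Release every run and reset the expansion to empty. */
static void expansion_free(expansion *e)
{
    token_run *r = e->head;
    while (r) {
        token_run *next = r->next;
        free(r->slots);
        free(r);
        r = next;
    }
    e->head = e->tail = NULL;
}
```

The trade-off is that consumers must walk the chain of runs rather than index a single array, in exchange for never copying the pointers that were already stored.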
Character set issues
Proper non-ASCII character handling is a hard problem. Users want
to be able to write comments and strings in their native language.
They want the strings to come out in their native language and not
gibberish after translation to object code. Some users also want to
use their own alphabet for identifiers in their code. There is no
one-to-one or many-to-one map between languages and character set
encodings. The subset of ASCII that is included in most modern day
character sets does not include all the punctuation C uses; some of
the missing punctuation may be present but at a different place than
where it is in ASCII. The subset described in ISO646 may not be the
smallest subset out there.
At the present time, GCC supports the use of any encoding for
source code, as long as it is a strict superset of 7-bit ASCII. By
this I mean that all printable (including whitespace) ASCII
characters, when they appear as single bytes in a file, stand only for
themselves, no matter what the context is. This is true of ISO8859.x,
KOI8-R, and UTF8. It is not true of Shift JIS and some other popular
Asian character sets. If they are used, GCC may silently mangle the
input file. The only known specific example is that a Shift JIS
multibyte character ending with 0x5C will be mistaken for a line
continuation if it occurs at the end of a line. 0x5C is "\" in ASCII.
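The Shift JIS problem can be illustrated with a small sketch. It is not cpplib code: the function names are hypothetical, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption covering common Shift JIS variants. A byte-wise check sees a trailing 0x5C and reports a line continuation, while an encoding-aware scan recognizes that the byte is the second half of a two-byte character.

```c
#include <stddef.h>

/* Byte-wise check: any trailing 0x5C byte looks like a line continuation. */
static int naive_ends_with_backslash(const unsigned char *line, size_t len)
{
    return len > 0 && line[len - 1] == 0x5C;
}

/* Assumed Shift JIS lead-byte ranges; the byte after a lead byte is a
   trail byte and must not be interpreted on its own. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Encoding-aware check: scan from the start of the line so that trail
   bytes are skipped together with their lead bytes. */
static int sjis_ends_with_backslash(const unsigned char *line, size_t len)
{
    size_t i = 0;
    int last_is_backslash = 0;

    while (i < len) {
        if (sjis_is_lead_byte(line[i]) && i + 1 < len) {
            i += 2;                 /* two-byte character */
            last_is_backslash = 0;
        } else {
            last_is_backslash = (line[i] == 0x5C);
            i += 1;                 /* single-byte character */
        }
    }
    return last_is_backslash;
}
```

For the two bytes 0x83 0x5C (a katakana character in Shift JIS), the naive check reports a continuation and the encoding-aware check does not. The scan has to start at the beginning of the line because a trail byte cannot be told apart from a single-byte character by looking at it in isolation.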
Assuming a safe encoding, characters not in the base set listed in
the standard (C99 5.2.1) are syntax errors if they appear outside
strings, character constants, or comments. In strings and character
constants, they are taken literally - converted blindly to numeric
codes, or copied to the assembly output verbatim, depending on the
context. If you use the C99 universal character escapes, you get
UTF8, no exceptions. These too are only supported in string and
character constants.
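As an illustration of what turning a universal character escape into UTF8 involves, here is a minimal sketch of encoding one Unicode code point. The function name ucn_to_utf8 is hypothetical and this is not taken from cpplib; it implements only the standard UTF-8 bit layout and rejects surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-8 into out, which must hold 4 bytes.
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies beyond U+10FFFF. */
static size_t ucn_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the code point written as \u20AC in a string literal would be stored as the three bytes 0xE2 0x82 0xAC.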
We intend to improve this as follows:
- cpplib will be reworked so that it can handle any character set
in wide use, whether or not it is a strict superset of 7-bit
ASCII. This means that cpplib will never confuse non-ASCII
characters with C punctuators, comment delimiters, or whatever.
- In comments, naturally any character will be permitted to appear.
- All Unicode code points which are permitted by C99 Annex D to
appear in identifiers, will be accepted in identifiers. All
source-file characters which, when translated to Unicode,
correspond to permitted code points, will also be accepted. In
assembly output, identifiers will be encoded in UTF8, and then
reencoded using some mangling scheme if the assembler cannot
handle UTF8 identifiers. (Does the new C++ ABI have anything to
say about this? What does the Java compiler do?)
U+0024 will be permitted in identifiers if and only if $ is permitted.
- In strings and character constants, GCC will translate from the
character set of the file (selectable on a per-file basis), to
the current execution character set (chosen once per
compilation). This may or may not be Unicode. UCN escapes will
also be converted from Unicode to the execution character set;
this happens independent of the source character set.
- Each file referenced by the compiler may state its own character
set with a #pragma, or rely on the default established by the user
with locale or a command-line option. The #pragma, if used, must be
the first line in the file. This will not prevent the multiple include
optimization from working. GCC will also recognize MULE
(Multilingual Emacs) magic comments, byte order marks, and any
other reasonable in-band method of specifying a file's character set.
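Of the in-band methods listed, byte order marks are the simplest to detect. The following is a minimal sketch, assuming a hypothetical helper charset_from_bom that inspects the first bytes of a file buffer; it is not part of cpplib. It handles only the UTF-8 and UTF-16 marks and deliberately ignores UTF-32, whose little-endian mark begins with the same two bytes as UTF-16LE.

```c
#include <stddef.h>
#include <string.h>

/* Inspect the start of buf for a byte order mark. On a match, store the
   mark's length in *bom_len and return a charset name; otherwise store 0
   and return NULL so the caller can fall back to the default charset. */
static const char *charset_from_bom(const unsigned char *buf, size_t len,
                                    size_t *bom_len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *bom_len = 3;
        return "UTF-8";
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return "UTF-16BE";
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return "UTF-16LE";
    }
    *bom_len = 0;
    return NULL;
}
```

A caller would skip bom_len bytes before tokenizing, and would consult the #pragma, the locale, or a command-line option only when the function returns NULL.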
It's worth noting that the standard C library facilities for
"multibyte character sets" are not adequate to implement the above.
The basic problem is that neither C89 nor C99 gives you any way to
specify the character set of a file directly. You can manipulate the
"locale," which indirectly specifies the character set, but that's a
global change. Further, locale names are not defined by the C
standard, nor is there any consistent map between them and character sets.
The Single Unix Specification, and possibly also POSIX, provides the
iconv interfaces, which mostly circumvent these limitations. We may
require these interfaces
to be present for complete non-ASCII support to be functional.
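A minimal sketch of how the iconv interfaces allow a per-conversion choice of character set, without touching the global locale, might look like this. The wrapper name convert_charset is hypothetical, and the charset names accepted by iconv_open vary between platforms, so the names used in any call are assumptions about the host system.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert in[0..in_len) from charset 'from' to charset 'to', writing at
   most out_cap bytes to out. Returns the number of bytes written, or
   (size_t)-1 if the conversion is unsupported, the input is invalid, or
   the output buffer is too small. */
static size_t convert_charset(const char *from, const char *to,
                              const char *in, size_t in_len,
                              char *out, size_t out_cap)
{
    iconv_t cd = iconv_open(to, from);      /* note: target comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;                 /* iconv takes char **, not const */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &out_left);  /* flush shift state */

    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

Because the source and target character sets are arguments to each call, every file can be converted from its own character set to a single execution character set chosen once per compilation. On some systems the iconv functions live in a separate library that must be linked explicitly.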
One final note: EBCDIC is, and will be, supported as a source
character set if and only if GCC is compiled for a host (not a target)
which uses EBCDIC natively.