This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: cpplib project web page update
This is a proper CVS diff against the CVS source, with Gerald's
</p> marks added and a spelling mistake fixed.
Neil.
	* proj-cpplib.html: Update to reflect recent progress.
Index: wwwdocs/htdocs/proj-cpplib.html
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/proj-cpplib.html,v
retrieving revision 1.7
diff -u -p -r1.7 proj-cpplib.html
--- proj-cpplib.html 2000/02/01 00:44:53 1.7
+++ proj-cpplib.html 2000/05/06 01:02:07
@@ -13,19 +13,19 @@ gcc. It is not yet linked into the C an
because the interface is likely to change and there are still some
major bugs in that area. There remain a number of bugs which need to
be stomped out, and some missing features. We also badly need more
-testing.
+testing.</p>
<h2>How to help test</h2>
<p>The number one priority for testing is cross-platform work. Simply
bootstrap the compiler and run the test suite on as many different OS
and hardware combinations as you can. I only have access to a very
-few.
+few.</p>
<p>The number two priority is large packages that (ab)use the
preprocessor heavily. The compiler itself is pretty good at that, but
doesn't cover all the bases. If you've got cycles to burn, please
-try one or more of:
+try one or more of:</p>
<ul>
<li>BSD 'make world'
@@ -44,12 +44,12 @@ try one or more of:
<p>Old grotty pre-ANSI code is particularly good for exposing bad
assumptions and missed corner cases; you may have more trouble with
-bugs in the package than bugs in the compiler, though.
+bugs in the package than bugs in the compiler, though.</p>
<p>A bug report saying 'package FOO won't compile on system BAR' is
-useless. At this stage what I need are short testcases with no system
+useless. At this stage what I need are short test cases with no system
dependencies. Aim for less than fifty lines and no #includes at all.
-I recognize this won't always be possible.
+I recognize this won't always be possible.</p>
<p>Also, please file off everything that would cause us legal trouble
if we were to roll your test case into the distributed test suite.
@@ -58,56 +58,80 @@ don't sweat it too much. An example of
includes a 200-line comment detailing inner workings of your program.
(A 200-line comment might be what you need to provoke a bug, but its
contents are unlikely to matter. Try running it through
-<code>"tr A-Za-z x"</code>.)
+<code>"tr A-Za-z x"</code>.)</p>
<p>As usual, report bugs to <a
href="mailto:gcc-bugs@gcc.gnu.org">gcc-bugs@gcc.gnu.org</a>. But
-please read the rest of this document first!
+please read the rest of this document first!</p>
-<h2>Known Bugs</h2>
+<h2>Fixed Bugs and Nits</h2>
+
+<p>These have either been fixed in the latest snapshot, or are
+awaiting uncommenting of the relevant code.</p>
<p><ol>
<li>Under some conditions the line numbers seen by the compiler
- proper are incorrect. It shows up most obviously as bad line
- numbers in warnings when bootstrapping the compiler. I have not
- been able to reproduce this with an input file of less than a
- couple thousand lines. Help would be greatly appreciated.
-
- <li>cpplib will silently mangle input files containing ASCII NUL.
- The cause of the bug is well known, but we weren't able to come
- to consensus on what to do about it. My personal preference is
- to issue a warning and strip the NUL; other people feel it
- should be preserved or considered a hard error.
+ proper were incorrect.
+
+ <li>cpplib used to silently mangle input files containing ASCII NUL.
+ Handling now depends on the context. In comments, they are
+ ignored. In string and character constants, they are warned
+ about but preserved. Anywhere else they are treated as whitespace,
+      and a warning is emitted.
+
+ <li>Trigraphs no longer provoke warnings within comments.
+ <li>C89 Amendment 1 "alternate spellings" of punctuators are now
+ recognized. These are
+<pre> <: :> <% %> %: %:%:</pre>
+ which correspond, respectively, to
+<pre> [ ] { } # ##</pre>
+
+ <li>Someone once requested warnings about stray whitespace in the
+ input under various circumstances. With -traditional, cpplib
+ now warns about directives with initial whitespace that were
+      available before C89, and conversely warns about other
+ directives unavailable at that time without initial whitespace.
+ Additionally, cpplib now warns if pure whitespace separates a
+ backslash from a subsequent newline character. This looks like
+ a line continuation sequence, but isn't.
+
+ <li>The handling of <code>#define</code> and <code>#if</code> now
+ uses the same lexical analysis code as the rest of cpplib. This
+ is essential to adding support for the new preprocessor features
+ in C9x and C89 Amendment 1.
+
+ <li>cpplib no longer makes two separate passes over the input file,
+      which should improve performance.
+
+ <li>The code is now stricter in its use of <code>char *</code> and
+ <code>unsigned char *</code> for improved consistency.
+
+ <li>The lexer now parses tokens a logical line at a time. The
+      resulting token lists should be usable pretty much directly by
+ the C or C++ front ends, so when linked up they won't have to do
+ any rescanning of tokens.
+
+</ol></p>
+
+<h2>Known Bugs</h2>
+
+<p><ol>
<li>Character sets that are <em>not</em> strict supersets of ASCII
may cause cpplib to mangle the input file, even in comments or
strings. Unfortunately, that includes important character sets
such as Shift JIS and UCS2. (Please see the discussion of <a
href="#charset">character set issues</a>, below.)
- <li>Trigraphs provoke warnings everywhere in the input file, even in
- comments. This is obnoxious, but difficult to fix due to the
- brain-dead semantics of trigraphs and backslash-newline.
-
<li>Code that does perverse things with directives inside macro
arguments can cause the preprocessor to dump core. cccp dealt
with this by disallowing all directives there, but it might be
nice to permit conditionals at least.
-
</ol>
<h2>Missing User-visible Features</h2>
<p><ol>
-
- <li>C89 Amendment 1 "alternate spellings" of punctuators are not
- recognized. These are
-<pre> <: :> <% %> %: %:%:</pre>
- which correspond, respectively, to
-<pre> [ ] { } # ##</pre>
- The preprocessor must be aware of all of them, even though it
- uses only <code>%:</code> and <code>%:%:</code> itself.
-
<li>Character sets that are strict supersets of ASCII are safe to
use, but extended characters cannot appear in identifiers. This
has to be coordinated with the front end, and requires library
@@ -134,11 +158,6 @@ please read the rest of this document fi
and not in a reloadable format. The front end must cooperate
also.
- <li>Someone once requested warnings about stray whitespace in the
- input, notably trailing whitespace after a backslash. If that
- happens, you have something that looks like a line-continuation
- backslash, but isn't.
-
<li>Better support for languages other than C would be nice. People
want to preprocess Fortran, Chill, and assembly language. Chill
has been kludged in, Fortran and assembly still have serious
@@ -150,24 +169,11 @@ please read the rest of this document fi
function-like macros; object macros should probably be ANSI-ish
all the time.
-</ol>
+</ol></p>
<h2>Internal work that needs doing</h2>
<ol>
- <li>The handling of <code>#define</code> and <code>#if</code> must
- be fixed so it uses the same lexical analysis code as the rest of
- cpplib (i.e. <code>cpp_get_token</code>). This is essential to
- adding support for the new preprocessor features in C9x and C89
- Amendment 1.
-
- <li>cpplib makes two separate passes over the input file, which
- causes a number of headaches, such as the trigraph warnings
- inside comments. It's also a performance problem. Semantic
- issues make a one-pass lexer impractical, but a two pass scheme
- with the first pass called coroutine fashion from the first
- should work better.
-
<li>The macro expander could use a total rewrite. We currently
re-tokenize macros every time they are expanded. It'd be better
to tokenize when the macro is defined and remember it for later.
@@ -178,10 +184,6 @@ please read the rest of this document fi
and <code>long</code> are used interchangeably; this is worse,
but I think most of the instances have been removed.
- <li>Likewise, the code uses <code>char *</code>, <code>unsigned char
- *</code>, and <code>U_CHAR *</code> interchangeably. This is
- more of a consistency issue and annoyance than a real problem.
-
<li>VMS support has suffered extreme bit rot. There may be problems
with support for DOS, Windows, MVS, and other non-Unixy
platforms. I can fix none of these myself.
@@ -195,22 +197,14 @@ please read the rest of this document fi
<ol>
- <li>The lexer should do more work - enough that when cpplib is
- linked into the C or C++ front end, the front end doesn't have
- to do any rescanning of tokens.
-
<li>The library interface needs to be tidied up. Internal
implementation details are exposed all over the place.
Extracting all the information the library provides is
difficult.
- <li><code>cpp_get_token</code> must be changed to return exactly one
- token per invocation. For performance, there should be a
- <code>cpp_get_tokens</code> call that returns a lineful.
-
<li>Front ends need to use cpplib's line and column numbering
interface directly. cpplib needs to stop inserting #line
- directives into the output. (The standalone preprocessor in
+ directives into the output. (The stand-alone preprocessor in
cppmain.c counts as a front end.)
<li>When cpplib is linked into front ends <code>-save-temps</code>
@@ -241,7 +235,7 @@ please read the rest of this document fi
filesystem overhead as well as the work of lexical analysis.
<li>Wrapper headers - files containing only an #include of another
- file - should be optimized out on reinclusion. (Just tweak the
+ file - should be optimized out on re-inclusion. (Just tweak the
hash table entry of the wrapper to point to the file it reads.)
<li>When a macro is defined to itself, bypass the macro expander
@@ -264,7 +258,7 @@ The subset of ASCII that is included in
does not include all the punctuation C uses; some of the missing
punctuation may be present but at a different place than where it is
in ASCII. The subset described in ISO646 may not be the smallest
-subset out there.
+subset out there.</p>
<p>Furthermore, the C standard's solutions for these problems are all
more or less hideous. None rises above the status of kludge.
@@ -273,7 +267,7 @@ solve. Digraphs are okay, but but nonin
solution. <code>iso646.h</code> merely shifts the problem from one
place to another, and is not a complete solution either. UCN escapes
assume Unicode, which makes them unsuitable for most Japanese and some
-Chinese environments.
+Chinese environments.</p>
<p>Compounding the problem, the standard C library features for
processing non-ASCII character sets are sadly lacking, even in the new
@@ -283,27 +277,27 @@ sets into three classes: unibyte, multib
characters can be further subdivided into shifted and unshifted
encodings. ASCII and most of its strict supersets - ISO 8859-x,
KOI8-R, etc - are unibyte, which means that all characters are exactly
-one byte long. This is obviously the easiest to deal with.
+one byte long. This is obviously the easiest to deal with.</p>
<p>UCS2 and UCS4, and no other sets that I know of, are wide; this
means that all characters are N bytes long, for some N greater than
one. Handling these requires mechanical code changes throughout the
lexer, which is then incapable of handling unibyte encodings; you have
to add a translator. Memory requirements obviously at least double.
-However, no structural changes are needed.
+However, no structural changes are needed.</p>
<p>UTF-8 and a few others are unshifted multibyte encodings. That
means that not all characters are one byte long, but given any one
byte you can tell if it's a one-byte character, the first byte of a
longer character, or one of the trailing bytes of a longer character,
without any additional information. These are almost as easy to deal
-with as unibyte encodings.
+with as unibyte encodings.</p>
<p>Finally, JISx and a few others are shifted multibyte encodings,
meaning that you must remember state as you walk down a string in
order to interpret it. These are the worst to handle. Unfortunately,
this category includes most of the character sets used in Asian
-countries.
+countries.</p>
<p>The C standard library has no way of processing multibyte
encodings, shifted or not, other than translating them into some
@@ -314,35 +308,35 @@ unibyte subset. That's true for UTF8 an
the usual English letters, Arabic numbers, and the underscore in
identifiers. If you want to permit other alphanumeric characters in
identifiers, you've got to find out what they are, and that requires
-converting to wide encoding first.
+converting to wide encoding first.</p>
<p>So what's wrong with converting to wide encoding? First, it's
slow. Obscenely slow, with most C libraries. It may be acceptably
fast to convert an entire file all at once, but that doubles or
quadruples your memory consumption. Typical C source files are on a
par with data cache sizes as is; double it and you're in main memory
-and slowed to a crawl.
+and slowed to a crawl.</p>
<p>Second, the normal wide encoding is Unicode, and conversion from
some sets (JISx, again) to Unicode and back loses information. [This
-is the infamous "Han unification" problem.]
+is the infamous "Han unification" problem.]</p>
<p>Third, there is no portable way to tell the library what multibyte
encoding you want to convert from. You can only specify it indirectly
by way of the locale. Locale strings are not standardized, and
-setting the locale changes other behavior that we want left alone.
+setting the locale changes other behavior that we want left alone.</p>
<p>It is possible to walk down a multibyte string without converting
it, using <code>mbrlen</code> or equivalent. That's the slowest
possible mode you can put the conversion library in, though. Nor does
-it tell you anything about the characters you're hopping over.
+it tell you anything about the characters you're hopping over.</p>
<p>End of rant. So what's cpplib likely to support in the near
future? We will verify that it is safe to use any charset that is a
strict superset of ASCII (unibyte or unshifted multibyte) in strings,
character constants, and comments. We'll also support UCN escapes in
those locations. If you write them in strings, the result will be
-in UTF-8.
+in UTF-8.</p>
<p>Support for shifted multibyte charsets will come next, and will
involve some sort of library that provides all of the useful
@@ -350,7 +344,7 @@ involve some sort of library that provid
arbitrary character set, <em>without</em> conversion. This will also
require us to have some way to specify what character set an input
file uses; the scheme MULE (Multilingual Emacs) uses is one
-possibility, and a #pragma is another.
+possibility, and a #pragma is another.</p>
<p>Support for additional alphanumeric characters in identifiers will
be added much later, because it presents ABI issues as well as
@@ -358,10 +352,10 @@ compiler-guts issues. Arbitrary bytes u
assembly labels nor in object-file string tables, so there needs to be
a mangling scheme. That scheme might be charset dependent,
independent, or neutral, and you can make a case for all three. All
-these things must be debated before we can implement anything.
+these things must be debated before we can implement anything.</p>
<p>There's one exception - <code>\u0024</code> will be legal in
-identifiers if and only if <code>$</code> is also legal.
+identifiers if and only if <code>$</code> is also legal.</p>
<address>Zack Weinberg,
<a href="mailto:zack@wolery.cumb.org">zack@wolery.cumb.org</a>