This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: cpplib project web page update
This is a proper CVS diff against the CVS source, with Gerald's
</p> marks added and a spelling mistake fixed.
Neil.
	* proj-cpplib.html: Update to reflect recent progress.
Index: wwwdocs/htdocs/proj-cpplib.html
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/proj-cpplib.html,v
retrieving revision 1.7
diff -u -p -r1.7 proj-cpplib.html
--- proj-cpplib.html 2000/02/01 00:44:53 1.7
+++ proj-cpplib.html 2000/05/06 01:02:07
@@ -13,19 +13,19 @@ gcc. It is not yet linked into the C an
because the interface is likely to change and there are still some
major bugs in that area. There remain a number of bugs which need to
be stomped out, and some missing features. We also badly need more
-testing.
+testing.</p>
<h2>How to help test</h2>
<p>The number one priority for testing is cross-platform work. Simply
bootstrap the compiler and run the test suite on as many different OS
and hardware combinations as you can. I only have access to a very
-few.
+few.</p>
<p>The number two priority is large packages that (ab)use the
preprocessor heavily. The compiler itself is pretty good at that, but
doesn't cover all the bases. If you've got cycles to burn, please
-try one or more of:
+try one or more of:</p>
<ul>
<li>BSD 'make world'
@@ -44,12 +44,12 @@ try one or more of:
<p>Old grotty pre-ANSI code is particularly good for exposing bad
assumptions and missed corner cases; you may have more trouble with
-bugs in the package than bugs in the compiler, though.
+bugs in the package than bugs in the compiler, though.</p>
<p>A bug report saying 'package FOO won't compile on system BAR' is
-useless. At this stage what I need are short testcases with no system
+useless. At this stage what I need are short test cases with no system
dependencies. Aim for less than fifty lines and no #includes at all.
-I recognize this won't always be possible.
+I recognize this won't always be possible.</p>
<p>Also, please file off everything that would cause us legal trouble
if we were to roll your test case into the distributed test suite.
@@ -58,56 +58,80 @@ don't sweat it too much. An example of
includes a 200-line comment detailing inner workings of your program.
(A 200-line comment might be what you need to provoke a bug, but its
contents are unlikely to matter. Try running it through
-<code>"tr A-Za-z x"</code>.)
+<code>"tr A-Za-z x"</code>.)</p>
<p>As usual, report bugs to <a
href="mailto:gcc-bugs@gcc.gnu.org">gcc-bugs@gcc.gnu.org</a>. But
-please read the rest of this document first!
+please read the rest of this document first!</p>
-<h2>Known Bugs</h2>
+<h2>Fixed Bugs and Nits</h2>
+
+<p>These have either been fixed in the latest snapshot, or are
+awaiting uncommenting of the relevant code.</p>
<p><ol>
<li>Under some conditions the line numbers seen by the compiler
- proper are incorrect. It shows up most obviously as bad line
- numbers in warnings when bootstrapping the compiler. I have not
- been able to reproduce this with an input file of less than a
- couple thousand lines. Help would be greatly appreciated.
-
- <li>cpplib will silently mangle input files containing ASCII NUL.
- The cause of the bug is well known, but we weren't able to come
- to consensus on what to do about it. My personal preference is
- to issue a warning and strip the NUL; other people feel it
- should be preserved or considered a hard error.
+ proper were incorrect.
+
+ <li>cpplib used to silently mangle input files containing ASCII NUL.
+ Handling now depends on the context. In comments, they are
+ ignored. In string and character constants, they are warned
+ about but preserved. Anywhere else they are treated as whitespace,
+      and a warning is emitted.
+
+ <li>Trigraphs no longer provoke warnings within comments.
+ <li>C89 Amendment 1 "alternate spellings" of punctuators are now
+ recognized. These are
+<pre> <: :> <% %> %: %:%:</pre>
+ which correspond, respectively, to
+<pre> [ ] { } # ##</pre>
+
+ <li>Someone once requested warnings about stray whitespace in the
+ input under various circumstances. With -traditional, cpplib
+ now warns about directives with initial whitespace that were
+      available before C89, and conversely warns about other
+ directives unavailable at that time without initial whitespace.
+ Additionally, cpplib now warns if pure whitespace separates a
+ backslash from a subsequent newline character. This looks like
+ a line continuation sequence, but isn't.
+
+ <li>The handling of <code>#define</code> and <code>#if</code> now
+ uses the same lexical analysis code as the rest of cpplib. This
+ is essential to adding support for the new preprocessor features
+ in C9x and C89 Amendment 1.
+
+ <li>cpplib no longer makes two separate passes over the input file,
+      which should improve performance.
+
+ <li>The code is now stricter in its use of <code>char *</code> and
+ <code>unsigned char *</code> for improved consistency.
+
+ <li>The lexer now parses tokens a logical line at a time. The
+      resulting token lists should be usable pretty much directly by
+ the C or C++ front ends, so when linked up they won't have to do
+ any rescanning of tokens.
+
+</ol></p>
+
+<h2>Known Bugs</h2>
+
+<p><ol>
<li>Character sets that are <em>not</em> strict supersets of ASCII
may cause cpplib to mangle the input file, even in comments or
strings. Unfortunately, that includes important character sets
such as Shift JIS and UCS2. (Please see the discussion of <a
href="#charset">character set issues</a>, below.)
- <li>Trigraphs provoke warnings everywhere in the input file, even in
- comments. This is obnoxious, but difficult to fix due to the
- brain-dead semantics of trigraphs and backslash-newline.
-
<li>Code that does perverse things with directives inside macro
arguments can cause the preprocessor to dump core. cccp dealt
with this by disallowing all directives there, but it might be
nice to permit conditionals at least.
-
</ol>
<h2>Missing User-visible Features</h2>
<p><ol>
-
- <li>C89 Amendment 1 "alternate spellings" of punctuators are not
- recognized. These are
-<pre> <: :> <% %> %: %:%:</pre>
- which correspond, respectively, to
-<pre> [ ] { } # ##</pre>
- The preprocessor must be aware of all of them, even though it
- uses only <code>%:</code> and <code>%:%:</code> itself.
-
<li>Character sets that are strict supersets of ASCII are safe to
use, but extended characters cannot appear in identifiers. This
has to be coordinated with the front end, and requires library
@@ -134,11 +158,6 @@ please read the rest of this document fi
and not in a reloadable format. The front end must cooperate
also.
- <li>Someone once requested warnings about stray whitespace in the
- input, notably trailing whitespace after a backslash. If that
- happens, you have something that looks like a line-continuation
- backslash, but isn't.
-
<li>Better support for languages other than C would be nice. People
want to preprocess Fortran, Chill, and assembly language. Chill
has been kludged in, Fortran and assembly still have serious
@@ -150,24 +169,11 @@ please read the rest of this document fi
function-like macros; object macros should probably be ANSI-ish
all the time.
-</ol>
+</ol></p>
<h2>Internal work that needs doing</h2>
<ol>
- <li>The handling of <code>#define</code> and <code>#if</code> must
- be fixed so it uses the same lexical analysis code as the rest of
- cpplib (i.e. <code>cpp_get_token</code>). This is essential to
- adding support for the new preprocessor features in C9x and C89
- Amendment 1.
-
- <li>cpplib makes two separate passes over the input file, which
- causes a number of headaches, such as the trigraph warnings
- inside comments. It's also a performance problem. Semantic
- issues make a one-pass lexer impractical, but a two pass scheme
- with the first pass called coroutine fashion from the first
- should work better.
-
<li>The macro expander could use a total rewrite. We currently
re-tokenize macros every time they are expanded. It'd be better
to tokenize when the macro is defined and remember it for later.
@@ -178,10 +184,6 @@ please read the rest of this document fi
and <code>long</code> are used interchangeably; this is worse,
but I think most of the instances have been removed.
- <li>Likewise, the code uses <code>char *</code>, <code>unsigned char
- *</code>, and <code>U_CHAR *</code> interchangeably. This is
- more of a consistency issue and annoyance than a real problem.
-
<li>VMS support has suffered extreme bit rot. There may be problems
with support for DOS, Windows, MVS, and other non-Unixy
platforms. I can fix none of these myself.
@@ -195,22 +197,14 @@ please read the rest of this document fi
<ol>
- <li>The lexer should do more work - enough that when cpplib is
- linked into the C or C++ front end, the front end doesn't have
- to do any rescanning of tokens.
-
<li>The library interface needs to be tidied up. Internal
implementation details are exposed all over the place.
Extracting all the information the library provides is
difficult.
- <li><code>cpp_get_token</code> must be changed to return exactly one
- token per invocation. For performance, there should be a
- <code>cpp_get_tokens</code> call that returns a lineful.
-
<li>Front ends need to use cpplib's line and column numbering
interface directly. cpplib needs to stop inserting #line
- directives into the output. (The standalone preprocessor in
+ directives into the output. (The stand-alone preprocessor in
cppmain.c counts as a front end.)
<li>When cpplib is linked into front ends <code>-save-temps</code>
@@ -241,7 +235,7 @@ please read the rest of this document fi
filesystem overhead as well as the work of lexical analysis.
<li>Wrapper headers - files containing only an #include of another
- file - should be optimized out on reinclusion. (Just tweak the
+ file - should be optimized out on re-inclusion. (Just tweak the
hash table entry of the wrapper to point to the file it reads.)
<li>When a macro is defined to itself, bypass the macro expander
@@ -264,7 +258,7 @@ The subset of ASCII that is included in
does not include all the punctuation C uses; some of the missing
punctuation may be present but at a different place than where it is
in ASCII. The subset described in ISO646 may not be the smallest
-subset out there.
+subset out there.</p>
<p>Furthermore, the C standard's solutions for these problems are all
more or less hideous. None rises above the status of kludge.
@@ -273,7 +267,7 @@ solve. Digraphs are okay, but but nonin
solution. <code>iso646.h</code> merely shifts the problem from one
place to another, and is not a complete solution either. UCN escapes
assume Unicode, which makes them unsuitable for most Japanese and some
-Chinese environments.
+Chinese environments.</p>
<p>Compounding the problem, the standard C library features for
processing non-ASCII character sets are sadly lacking, even in the new
@@ -283,27 +277,27 @@ sets into three classes: unibyte, multib
characters can be further subdivided into shifted and unshifted
encodings. ASCII and most of its strict supersets - ISO 8859-x,
KOI8-R, etc - are unibyte, which means that all characters are exactly
-one byte long. This is obviously the easiest to deal with.
+one byte long. This is obviously the easiest to deal with.</p>
<p>UCS2 and UCS4, and no other sets that I know of, are wide; this
means that all characters are N bytes long, for some N greater than
one. Handling these requires mechanical code changes throughout the
lexer, which is then incapable of handling unibyte encodings; you have
to add a translator. Memory requirements obviously at least double.
-However, no structural changes are needed.
+However, no structural changes are needed.</p>
<p>UTF-8 and a few others are unshifted multibyte encodings. That
means that not all characters are one byte long, but given any one
byte you can tell if it's a one-byte character, the first byte of a
longer character, or one of the trailing bytes of a longer character,
without any additional information. These are almost as easy to deal
-with as unibyte encodings.
+with as unibyte encodings.</p>
<p>Finally, JISx and a few others are shifted multibyte encodings,
meaning that you must remember state as you walk down a string in
order to interpret it. These are the worst to handle. Unfortunately,
this category includes most of the character sets used in Asian
-countries.
+countries.</p>
<p>The C standard library has no way of processing multibyte
encodings, shifted or not, other than translating them into some
@@ -314,35 +308,35 @@ unibyte subset. That's true for UTF8 an
the usual English letters, Arabic numbers, and the underscore in
identifiers. If you want to permit other alphanumeric characters in
identifiers, you've got to find out what they are, and that requires
-converting to wide encoding first.
+converting to wide encoding first.</p>
<p>So what's wrong with converting to wide encoding? First, it's
slow. Obscenely slow, with most C libraries. It may be acceptably
fast to convert an entire file all at once, but that doubles or
quadruples your memory consumption. Typical C source files are on a
par with data cache sizes as is; double it and you're in main memory
-and slowed to a crawl.
+and slowed to a crawl.</p>
<p>Second, the normal wide encoding is Unicode, and conversion from
some sets (JISx, again) to Unicode and back loses information. [This
-is the infamous "Han unification" problem.]
+is the infamous "Han unification" problem.]</p>
<p>Third, there is no portable way to tell the library what multibyte
encoding you want to convert from. You can only specify it indirectly
by way of the locale. Locale strings are not standardized, and
-setting the locale changes other behavior that we want left alone.
+setting the locale changes other behavior that we want left alone.</p>
<p>It is possible to walk down a multibyte string without converting
it, using <code>mbrlen</code> or equivalent. That's the slowest
possible mode you can put the conversion library in, though. Nor does
-it tell you anything about the characters you're hopping over.
+it tell you anything about the characters you're hopping over.</p>
<p>End of rant. So what's cpplib likely to support in the near
future? We will verify that it is safe to use any charset that is a
strict superset of ASCII (unibyte or unshifted multibyte) in strings,
character constants, and comments. We'll also support UCN escapes in
those locations. If you write them in strings, the result will be
-in UTF-8.
+in UTF-8.</p>
<p>Support for shifted multibyte charsets will come next, and will
involve some sort of library that provides all of the useful
@@ -350,7 +344,7 @@ involve some sort of library that provid
arbitrary character set, <em>without</em> conversion. This will also
require us to have some way to specify what character set an input
file uses; the scheme MULE (Multilingual Emacs) uses is one
-possibility, and a #pragma is another.
+possibility, and a #pragma is another.</p>
<p>Support for additional alphanumeric characters in identifiers will
be added much later, because it presents ABI issues as well as
@@ -358,10 +352,10 @@ compiler-guts issues. Arbitrary bytes u
assembly labels nor in object-file string tables, so there needs to be
a mangling scheme. That scheme might be charset dependent,
independent, or neutral, and you can make a case for all three. All
-these things must be debated before we can implement anything.
+these things must be debated before we can implement anything.</p>
<p>There's one exception - <code>\u0024</code> will be legal in
-identifiers if and only if <code>$</code> is also legal.
+identifiers if and only if <code>$</code> is also legal.</p>
<address>Zack Weinberg,
<a href="mailto:zack@wolery.cumb.org">zack@wolery.cumb.org</a>