This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



webpage: update cpplib todo list, add gcc-micro.html


This patch contains a complete rewrite of the cpplib to-do list and adds
my list of optimizer problems (gcc-micro.html) to the official website.

zw

===================================================================
Index: projects.html
--- projects.html	1999/09/20 07:41:33	1.7
+++ projects.html	2000/01/29 07:45:11
@@ -13,6 +13,10 @@ very much anymore, but who knows?
 <p>There is a separate project list for the <a
 href="proj-cpplib.html">C preprocessor</a>.
 
+<p>We also have a page detailing <a href="gcc-micro.html">optimizer
+inadequacies</a>, if you'd prefer to think about it in terms of problems
+instead of features.
+
 <h2>Changes to support C99 draft</h2>
 
 <p>The (not yet published) next revision of the C standard requires a
===================================================================
Index: gcc-micro.html
--- /dev/null	Tue May  5 13:32:27 1998
+++ gcc-micro.html	Fri Jan 28 23:45:11 2000
@@ -0,0 +1,822 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
+                      "http://www.w3.org/TR/html4/loose.dtd">
+<html><head>
+<title>Micro-optimizations</title>
+<link rev="made" href="mailto:zack@wolery.cumb.org">
+</head>
+
+<body bgcolor="white" text="black" link="#0000EE" vlink="#551A8B" alink="red">
+<h1 align="center">Micro-optimizations that GCC should perform</h1>
+
+<p>This page lists places where GCC's code generation is suboptimal and
+the problem can be shown in a few lines of assembly output, hence
+"micro-optimizations."  I'll be updating it as I notice new issues.
+Please send suggestions to <a
+href="mailto:zack@wolery.cumb.org">zack@wolery.cumb.org</a>.
+
+<p>Note: unless otherwise specified, all examples have been compiled
+with the current CVS tree as of the date of the example, on x86, with
+<code>-O2 -fomit-frame-pointer -fschedule-insns</code>.  (The x86 back
+end disables <code>-fschedule-insns</code>, which is something that
+should be revisited, because it always gives better code when I turn
+it back on.)
+
+<p><strong>Contents:</strong>
+<ol>
+<li><a href="#invert">Inverting conditionals</a>
+<li><a href="#csefail">Failure of common subexpression elimination</a>
+<li><a href="#storemerge">Store merging</a>
+<li><a href="#gcsereg">Global CSE and hard registers</a>
+<li><a href="#volatile">Volatile inhibits too many optimizations</a>
+<li><a href="#rndmode">Unnecessary changes of rounding mode</a>
+<li><a href="#regshuf">Register shuffling and <code>long long</code></a>
+<li><a href="#fpmove">Moving floating point through integer registers</a>
+</ol>
+
+<hr>
+<h2><a name="invert">Inverting conditionals</a></h2>
+
+<p>(14 Jan 2000) Frequently GCC produces better code if you write a
+conditional one way than if you write it the opposite way.  Here is a
+simple example.
+
+<p><pre>
+static const unsigned char
+trigraph_map[] = {
+  '|', 0, 0, 0, 0, 0, '^',
+  '[', ']', 0, 0, 0, '~',
+  0, '\\', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+  0, '{', '#', '}'
+};
+
+unsigned char
+map1 (c)
+     unsigned char c;
+{
+  if (c &gt;= '!' &amp;&amp; c &lt;= '&gt;')
+    return trigraph_map[c - '!'];
+  return 0;
+}
+
+unsigned char
+map2 (c)
+     unsigned char c;
+{
+  if (c &lt; '!' || c &gt; '&gt;')
+    return 0;
+  return trigraph_map[c - '!'];
+}
+</pre>
+
+<p>Assembly output for <code>map1</code> and <code>map2</code> is,
+surprisingly, different:
+
+<p><pre>
+map1:
+	movb	4(%esp), %cl
+	xorl	%eax, %eax
+	movb	%cl, %dl
+	subb	$33, %dl
+	cmpb	$29, %dl
+	ja	.L4
+	movzbl	%cl, %eax
+	movzbl	trigraph_map-33(%eax), %eax
+.L4:
+	ret
+
+map2:
+	movb	4(%esp), %cl
+	xorl	%eax, %eax
+	movb	%cl, %dl
+	subb	$33, %dl
+	cmpb	$29, %dl
+	ja	.L7
+	movzbl	%cl, %eax
+	movzbl	trigraph_map-33(%eax), %eax
+	ret
+	.p2align 4,,7
+.L7:
+	ret
+</pre>
+
+<p>Admittedly, the difference is small - a redundant <code>'ret'</code>
+instruction and a padding directive, and six bytes wasted in the
+object file.  The problem is worse for larger blocks of conditional
+code, though.
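+
+<p>A minimal sketch, not part of the original test case: the two
+spellings are semantically identical, so any difference in the output
+is purely a code generation artifact.  The brute-force check below
+confirms this for all 256 inputs.  The helper
+<code>count_mismatches</code> is a hypothetical name introduced here,
+and the functions use prototype-style declarations for brevity.

```c
#include <assert.h>

/* Same table as above: one entry for each of '!' (33) through '>' (62). */
const unsigned char trigraph_map[] = {
  '|', 0, 0, 0, 0, 0, '^',
  '[', ']', 0, 0, 0, '~',
  0, '\\', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, '{', '#', '}'
};

unsigned char map1(unsigned char c)
{
  if (c >= '!' && c <= '>')
    return trigraph_map[c - '!'];
  return 0;
}

unsigned char map2(unsigned char c)
{
  if (c < '!' || c > '>')
    return 0;
  return trigraph_map[c - '!'];
}

/* Hypothetical helper: count the inputs on which the two spellings
   disagree.  */
int count_mismatches(void)
{
  int c, bad = 0;

  /* The table must cover exactly the range the conditionals test.  */
  assert(sizeof trigraph_map == '>' - '!' + 1);

  for (c = 0; c < 256; c++)
    if (map1((unsigned char) c) != map2((unsigned char) c))
      bad++;
  return bad;
}
```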
+
+<hr>
+<h2><a name="csefail">Failure of common subexpression elimination</a></h2>
+
+<p>(14 Jan 2000) The same code also illustrates a failing in CSE.
+Once again, the source is
+
+<p><pre>
+unsigned char
+map1 (c)
+     unsigned char c;
+{
+  if (c &gt;= '!' &amp;&amp; c &lt;= '&gt;')
+    return trigraph_map[c - '!'];
+  return 0;
+}
+</pre>
+
+<p>and the assembly is
+
+<p><pre>
+map1:
+	movb	4(%esp), %cl
+	xorl	%eax, %eax
+	movb	%cl, %dl
+	subb	$33, %dl
+	cmpb	$29, %dl
+	ja	.L4
+	movzbl	%cl, %eax
+	movzbl	trigraph_map-33(%eax), %eax
+.L4:
+	ret
+</pre>
+
+<p>If we were writing this code by hand, we would do it thus:
+
+<p><pre>
+map1:
+	movb	4(%esp), %cl
+	xorl	%eax, %eax
+	subb	$33, %cl
+	cmpb	$29, %cl
+	ja	.L4
+	movzbl	%cl, %eax
+	movzbl	trigraph_map(%eax), %eax
+.L4:
+	ret
+</pre>
+
+<p>This does not save a runtime subtract - <code>trigraph_map-33</code>
+happens at load time.  It does, however, save a register, which would
+be important if this function were to be inlined.  It also puts the
+<code>'ret'</code> instruction at the alignment the processor likes for
+jump targets, which is important because we happen to know that the
+jump will almost always be taken.
+
+<p>Some marginally more detailed analysis: Local CSE can't help because
+the two subtracts are in different basic blocks.  Global CSE does not
+merge the subtracts because they appear to occur in different modes.
+We have RTL like so:
+
+<p><pre>
+(insn 13 7 14 (parallel[ 
+            (set (reg:QI 27)
+                (plus:QI (reg/v:QI 25)
+                    (const_int -33 [0xffffffdf])))
+            (clobber (reg:CC 17 flags))
+        ] ) 183 {*addqi_1} (nil)
+    (nil))
+
+;; ...
+
+(insn 17 44 19 (parallel[ 
+            (set (reg:SI 29)
+                (zero_extend:SI (reg/v:QI 25)))
+            (clobber (reg:CC 17 flags))
+        ] ) 106 {*zero_extendqisi2_movzbw_and} (nil)
+    (nil))
+
+(insn 19 17 21 (parallel[ 
+            (set (reg:SI 30)
+                (plus:SI (reg:SI 29)
+                    (const_int -33 [0xffffffdf])))
+            (clobber (reg:CC 17 flags))
+        ] ) 174 {*addsi_1} (nil)
+    (nil))
+</pre>
+
+<p>I suspect that this is conservatism on the part of the optimizer.  In
+general, doing the zero_extend and then the subtract can give a
+different result than the other way around, because an 8-bit subtract
+wraps modulo 256.  However, we know that this cannot be the case here,
+because control will never reach insn 17 unless (reg:QI 25) is at
+least 33.
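+
+<p>To illustrate the point about modes, here is a rough C model of the
+two computations.  It is only a sketch: the function names are
+hypothetical, <code>unsigned char</code> stands in for QImode, and
+<code>unsigned int</code> (assumed to be 32 bits) stands in for SImode.

```c
#include <assert.h>

/* Subtract in 8-bit (QImode-like) arithmetic, then zero-extend.  */
unsigned int sub_then_extend(unsigned char c)
{
  unsigned char narrow = (unsigned char) (c - 33);  /* wraps modulo 256 */
  return narrow;
}

/* Zero-extend first, then subtract in wide (SImode-like) arithmetic.  */
unsigned int extend_then_sub(unsigned char c)
{
  unsigned int wide = c;
  return wide - 33u;  /* wraps modulo UINT_MAX + 1, not modulo 256 */
}

/* Returns 1 if both orders agree for every c in [lo, hi], else 0.  */
int orders_agree(int lo, int hi)
{
  int c;

  assert(0 <= lo && lo <= hi && hi <= 255);
  for (c = lo; c <= hi; c++)
    if (sub_then_extend((unsigned char) c)
        != extend_then_sub((unsigned char) c))
      return 0;
  return 1;
}
```

+<p>The two orders disagree only for inputs below 33, which is exactly
+the range the preceding branch excludes.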
+
+<hr>
+<h2><a name="storemerge">Store merging</a></h2>
+
+<p>(14 Jan 2000) GCC frequently generates multiple narrow writes to
+adjacent memory locations.  Memory writes are expensive; it would be
+better if they were combined.  For example:
+
+<p><pre>
+struct rtx_def
+{
+  unsigned short code;
+  int mode : 8;
+  unsigned int jump : 1;
+  unsigned int call : 1;
+  unsigned int unchanging : 1;
+  unsigned int volatil : 1;
+  unsigned int in_struct : 1;
+  unsigned int used : 1;
+  unsigned integrated : 1;
+  unsigned frame_related : 1;
+};
+
+void
+i1(struct rtx_def *d)
+{
+  memset((char *)d, 0, sizeof(struct rtx_def));
+  d-&gt;code = 12;
+  d-&gt;mode = 23;
+}
+
+void
+i2(struct rtx_def *d)
+{
+  d-&gt;code = 12;
+  d-&gt;mode = 23;
+
+  d-&gt;jump = d-&gt;call = d-&gt;unchanging = d-&gt;volatil
+    = d-&gt;in_struct = d-&gt;used = d-&gt;integrated = d-&gt;frame_related = 0;
+}
+</pre>
+
+<p>compiles to (I have converted the constants to hexadecimal to make the
+situation clearer):
+
+<p><pre>
+i1:
+	movl	4(%esp), %eax
+	movl	$0x0, (%eax)
+	movb	$0x17, 2(%eax)
+	movw	$0x0c, (%eax)
+	ret
+
+i2:
+	movl	4(%esp), %eax
+	movb	$0x0, 3(%eax)
+	movw	$0x0c, (%eax)
+	movb	$0x17, 2(%eax)
+	ret
+</pre>
+
+<p>Both versions ought to compile to
+
+<p><pre>
+i3:
+	movl	4(%esp), %eax
+	movl	$0x17000c, (%eax)
+	ret
+</pre>
+
+<p>Other architectures <em>have</em> to do this optimization, so GCC is
+capable of it.  GCC simply needs to be taught that it's a win on this
+architecture too.  It might be nice if it would do the same thing for
+a more general function where the values assigned to
+<code>'code'</code> and <code>'mode'</code> were not constant, but the
+advantage is less obvious here.
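+
+<p>A small sketch of why the single wide store is equivalent.  It
+assumes the layout implied by the assembly above (<code>code</code> in
+bytes 0-1, <code>mode</code> in byte 2, the eight flag bits in byte 3)
+and models little-endian memory as an explicit byte array, so it does
+not depend on the host's bit-field layout.  All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the first four bytes of *d as they sit in x86 memory.  */
struct word_image { uint8_t byte[4]; };

/* Three narrow stores, in the style of i2.  */
struct word_image narrow_stores(void)
{
  struct word_image w = { { 0xff, 0xff, 0xff, 0xff } };  /* stale data */

  w.byte[3] = 0x00;   /* movb $0x0, 3(%eax): the eight flag bits */
  w.byte[0] = 0x0c;   /* movw $0x0c, (%eax): low byte of code */
  w.byte[1] = 0x00;   /*   ... and its high byte */
  w.byte[2] = 0x17;   /* movb $0x17, 2(%eax): mode */
  return w;
}

/* One wide store, in the style of i3.  */
struct word_image merged_store(void)
{
  const uint32_t value = 0x0017000cu;
  struct word_image w;
  int i;

  for (i = 0; i < 4; i++)
    w.byte[i] = (uint8_t) (value >> (8 * i));  /* little-endian order */
  return w;
}

/* Returns 1 if the two memory images are byte-for-byte identical.  */
int images_equal(struct word_image a, struct word_image b)
{
  int i;

  for (i = 0; i < 4; i++)
    if (a.byte[i] != b.byte[i])
      return 0;
  return 1;
}

void check_store_merging(void)
{
  assert(images_equal(narrow_stores(), merged_store()));
}
```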
+
+<hr>
+<h2><a name="gcsereg">Global CSE and hard registers</a></h2>
+
+<p>(16 Jan 2000) Global CSE is not capable of operating on hard
+registers.  This causes it to miss obvious optimizations.  For
+example, consider this C++ fragment:
+
+<p><pre>
+struct A
+{
+  A (int);
+};
+
+struct B : virtual public A
+{
+  B ();
+};
+
+B::B ()
+  : A (3)
+{
+}
+</pre>
+
+<p>This compiles as follows (exception handling labels edited out for
+clarity):
+
+<p><pre>
+__1Bi:
+	subl	$24, %esp
+	pushl	%ebx
+	movl	36(%esp), %edx
+	movl	32(%esp), %ebx
+	testl	%edx, %edx
+	je	.L3
+	leal	4(%ebx), %eax
+	movl	%eax, (%ebx)
+.L3:
+	testl	%edx, %edx
+	je	.L4
+	subl	$8, %esp
+	leal	4(%ebx), %eax
+	pushl	$3
+	pushl	%eax
+	call	__1Ai
+	addl	$16, %esp
+.L4:
+	movl	%ebx, %eax
+	popl	%ebx
+	addl	$24, %esp
+	ret
+</pre>
+
+<p>Notice how the test of <code>%edx</code> and the load of
+<code>%eax</code> both occur twice.  We would like code more like this
+to be generated:
+
+<p><pre>
+__1Bi:
+	subl	$24, %esp
+	pushl	%ebx
+	movl	36(%esp), %edx
+	movl	32(%esp), %ebx
+	testl	%edx, %edx
+	je	.L4
+	leal	4(%ebx), %eax
+	movl	%eax, (%ebx)
+	subl	$8, %esp
+	pushl	$3
+	pushl	%eax
+	call	__1Ai
+	addl	$16, %esp
+.L4:
+	movl	%ebx, %eax
+	popl	%ebx
+	addl	$24, %esp
+	ret
+</pre>
+
+<p>This is also a decent example of stack space wastage.  The i386
+architecture wants 16-byte stack alignment right before every call
+instruction, and we try to align doubles on the stack as well.
+However, none of the variables in this function need more than 4 byte
+alignment, and there's no reason to keep the stack pointer aligned in
+the middle of the function.  All the same constraints are satisfied by
+this version:
+
+<p><pre>
+__1Bi:
+	pushl	%ebx
+	movl	12(%esp), %edx
+	movl	8(%esp), %ebx
+	testl	%edx, %edx
+	je	.L4
+	leal	4(%ebx), %eax
+	movl	%eax, (%ebx)
+	pushl	$3
+	pushl	%eax
+	call	__1Ai
+	addl	$8, %esp
+.L4:
+	movl	%ebx, %eax
+	popl	%ebx
+	ret
+</pre>
+
+<p>Only part of the problem is with alignment.  The other part is that
+stack slots are frequently allocated for variables that wound up in
+registers.
+
+<hr>
+<h2><a name="volatile">Volatile inhibits too many optimizations</a></h2>
+
+<p>(17 Jan 2000) gcc refuses to perform in-memory operations on
+volatile variables, on architectures that have those operations.
+Compare:
+
+<p><pre>
+extern int a;
+extern volatile int b;
+
+void inca(void) { a++; } 
+
+void incb(void) { b++; }
+</pre>
+
+<p>compiles to:
+
+<p><pre>
+inca:
+        incl    a
+        ret
+
+incb:
+        movl    b, %eax
+        incl    %eax
+        movl    %eax, b
+        ret
+</pre>
+
+<p>Note that this is a policy decision.  Changing the behavior is
+trivial - permit <code>general_operand</code> to accept volatile
+variables.  To date the GCC team has chosen not to do so.
+
+<p>The C standard is maddeningly ambiguous about the semantics of
+volatile variables.  It <em>happens</em> that on x86 the two
+functions above have identical semantics.  On other platforms that
+have in-memory operations, that may not be the case, and the C
+standard may take issue with the difference - we aren't sure.
+
+<hr>
+<h2><a name="rndmode">Unnecessary changes of rounding mode</a></h2>
+
+<p>(17 Jan 2000) gcc does not remember the state of the floating point
+control register, so it changes it more than necessary.  Consider the
+following:
+
+<p><pre>
+void
+d2i2(const double a, const double b, int * const i, int * const j)
+{
+	*i = a;
+	*j = b;
+}
+</pre>
+
+<p>This performs two conversions from <code>'double'</code> to
+<code>'int'</code>.  The example compiles as follows:
+
+<p><pre>
+d2i2:
+	subl	$24, %esp
+	pushl	%ebx
+	movl	48(%esp), %edx
+	movl	52(%esp), %ecx
+	fldl	32(%esp)
+	fldl	40(%esp)
+	fxch	%st(1)
+	fnstcw	12(%esp)
+	movl	12(%esp), %ebx
+	movb	$12, 13(%esp)
+	fldcw	12(%esp)
+	movl	%ebx, 12(%esp)
+	fistpl	8(%esp)
+	fldcw	12(%esp)
+	movl	8(%esp), %eax
+	movl	%eax, (%edx)
+	fnstcw	12(%esp)
+	movl	12(%esp), %edx
+	movb	$12, 13(%esp)
+	fldcw	12(%esp)
+	movl	%edx, 12(%esp)
+	fistpl	8(%esp)
+	fldcw	12(%esp)
+	movl	8(%esp), %eax
+	movl	%eax, (%ecx)
+	popl	%ebx
+	addl	$24, %esp
+	ret
+</pre>
+
+<p>For those who are unfamiliar with the, um, unique design of the x86
+floating point unit, it has an eight-slot stack and each entry holds a
+value in an extended format.  Values can be moved between top-of-stack
+and memory, but cannot be moved between top-of-stack and the integer
+registers.  The control word, which is a separate value, cannot be
+moved to or from the integer registers either.
+
+<p>On x86, converting a <code>'double'</code> to <code>'int'</code>,
+when <code>'double'</code> is in 64-bit IEEE format, requires setting
+the rounding-control bits of the control word to round toward zero
+(the truncation C requires) instead of the default, round to nearest.
+In the code above, you can clearly see that the control word is saved,
+changed, and restored around each individual conversion.  It would be
+perfectly possible to do it only once, thus:
+
+<p><pre>
+d2i2:
+	subl	$24, %esp
+	pushl	%ebx
+	movl	48(%esp), %edx
+	movl	52(%esp), %ecx
+	fldl	32(%esp)
+	fldl	40(%esp)
+	fxch	%st(1)
+	fnstcw	12(%esp)
+	movl	12(%esp), %ebx
+	movb	$12, 13(%esp)
+	fldcw	12(%esp)
+	movl	%ebx, 12(%esp)
+	fistpl	8(%esp)
+	movl	8(%esp), %eax
+	movl	%eax, (%edx)
+	fistpl	8(%esp)
+	fldcw	12(%esp)
+	movl	8(%esp), %eax
+	movl	%eax, (%ecx)
+	popl	%ebx
+	addl	$24, %esp
+	ret
+</pre>
+
+<p>Other obvious improvements in this code include storing directly
+from the floating-point stack to the target addresses, and reordering
+the loads to avoid the <code>'fxch'</code> instruction.  You can't
+reorder the stores in C because <code>'i'</code> and <code>'j'</code>
+might point at the same location.
+
+<p><pre>
+d2i2:
+	subl	$24, %esp
+	pushl	%ebx
+	movl	48(%esp), %edx
+	movl	52(%esp), %ecx
+	fldl	40(%esp)
+	fldl	32(%esp)
+	fnstcw	12(%esp)
+	movl	12(%esp), %ebx
+	movb	$12, 13(%esp)
+	fldcw	12(%esp)
+	movl	%ebx, 12(%esp)
+	fistpl	(%edx)
+	fistpl	(%ecx)
+	fldcw	12(%esp)
+	popl	%ebx
+	addl	$24, %esp
+	ret
+</pre>
+
+<p>As usual, we can also reduce the amount of wasted stack space:
+
+<p><pre>
+d2i2:
+	pushl	%ebx
+	movl	24(%esp), %edx
+	movl	28(%esp), %ecx
+	fldl	16(%esp)
+	fldl	8(%esp)
+	fnstcw	24(%esp)
+	movl	24(%esp), %ebx
+	movb	$12, 25(%esp)
+	fldcw	24(%esp)
+	fistpl	(%edx)
+	fistpl	(%ecx)
+	movl	%ebx, 24(%esp)
+	fldcw	24(%esp)
+	popl	%ebx
+	ret
+</pre>
+
+<p>This version recycles the stack slot of one of the parameters as
+temporary storage for the control word.
+
+<p>The four versions of this routine occupy respectively 97, 72, 54,
+and 48 bytes of text.  Version 2 will be dramatically faster than
+version 1; 3 will be somewhat faster than 2, and 4 will be about the
+same as 3, but will waste less memory.
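+
+<p>For reference, a sketch of what the <code>movb $12, 13(%esp)</code>
+instructions above do to the saved control word.  The rounding control
+field occupies bits 10-11 of the x87 control word; the function and
+enumerator names are hypothetical, and <code>0x037f</code> is used as
+the example starting value because it is what <code>fninit</code>
+leaves in the register.

```c
#include <assert.h>

/* Encodings of the x87 rounding control field (bits 10-11).  */
enum x87_rounding {
  RC_NEAREST  = 0,
  RC_DOWN     = 1,
  RC_UP       = 2,
  RC_TRUNCATE = 3
};

/* Extract the rounding control field from a 16-bit control word.  */
enum x87_rounding rounding_control(unsigned int cw)
{
  return (enum x87_rounding) ((cw >> 10) & 3u);
}

/* Mimic "movb $12, 13(%esp)": overwrite the high byte of the copy of
   the control word that was stored at 12(%esp).  */
unsigned int set_high_byte(unsigned int cw, unsigned int high)
{
  assert(cw <= 0xffffu && high <= 0xffu);
  return (cw & 0x00ffu) | (high << 8);
}
```

+<p>Starting from <code>0x037f</code>, the modified word is
+<code>0x0c7f</code>, whose rounding control field selects truncation.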
+
+<hr>
+<h2><a name="regshuf">Register shuffling and <code>long long</code></a></h2>
+
+<p>(22 Jan 2000) GCC has a number of problems doing 64-bit arithmetic
+on architectures with 32-bit words.  The following is just one of the
+most obvious issues.
+
+<p><pre>
+extern void big(long long u);
+void doit(unsigned int a, unsigned int b, char *id)
+{
+  big(*id);
+  big(a);
+  big(b);
+}
+</pre>
+
+<p>compiles to:
+
+<p><pre>
+doit:
+	subl	$20, %esp
+	pushl	%esi
+	pushl	%ebx
+	movl	40(%esp), %ecx
+	subl	$8, %esp
+	movl	40(%esp), %ebx
+	movl	44(%esp), %esi
+	movsbl	(%ecx), %eax
+	cltd
+*	pushl	%edx
+*	pushl	%eax
+	call	big
+	subl	$8, %esp
+	xorl	%edx, %edx
+	movl	%ebx, %eax
+*	pushl	%edx
+*	pushl	%eax
+	call	big
+	addl	$24, %esp
+	xorl	%edx, %edx
+	movl	%esi, %eax
+*	pushl	%edx
+*	pushl	%eax
+	call	big
+	addl	$16, %esp
+	popl	%ebx
+	popl	%esi
+	addl	$20, %esp
+	ret
+</pre>
+
+<p>Notice how (in the lines marked with <code>*</code>) the argument to
+<code>big</code> is invariably shuffled such that its high word is in
+<code>%edx</code> and its low word is in <code>%eax</code>, and then
+pushed.  This is because gcc is incapable of manipulating the two
+halves separately.  It should be able to generate code like this:
+
+<p><pre>
+doit:
+	subl	$20, %esp
+	pushl	%esi
+	pushl	%ebx
+	movl	40(%esp), %ecx
+	subl	$8, %esp
+	movl	40(%esp), %ebx
+	movl	44(%esp), %esi
+	movsbl	(%ecx), %eax
+	cltd
+	pushl	%edx
+	pushl	%eax
+	call	big
+	subl	$8, %esp
+	xorl	%edx, %edx
+	pushl	%edx
+	pushl	%ebx
+	call	big
+	addl	$24, %esp
+	xorl	%edx, %edx
+	pushl	%edx
+	pushl	%esi
+	call	big
+	addl	$16, %esp
+	popl	%ebx
+	popl	%esi
+	addl	$20, %esp
+	ret
+</pre>
+
+<p>Also, the choice to fetch all arguments from the stack at the very
+beginning is questionable.  It might be better to use one callee-save
+register to hold zero and retrieve args from the stack when needed.
+This, with the usual tweaks to stack adjusts, makes the code much
+shorter.
+
+<p><pre>
+doit:
+	pushl	%ebx
+	xorl	%ebx, %ebx
+	movl	8(%esp), %ecx
+	movsbl	(%ecx), %eax
+	cltd
+	pushl	%edx
+	pushl	%eax
+	call	big
+	addl	$8, %esp
+	movl	12(%esp), %eax
+	pushl	%ebx
+	pushl	%eax
+	call	big
+	addl	$8, %esp
+	movl	16(%esp), %eax
+	pushl	%ebx
+	pushl	%eax
+	call	big
+	addl	$8, %esp
+	popl	%ebx
+	ret
+</pre>
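+
+<p>The idea of treating the two halves separately can be sketched in C.
+This is only an illustration with hypothetical names, assuming 32-bit
+words and two's complement: widening an unsigned value needs no work on
+the low word at all, and the high word is a constant zero, while
+widening a signed <code>char</code> needs the sign copied into the high
+word (which is what <code>cltd</code> computes).

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two independent 32-bit words.  */
struct halves { uint32_t low, high; };

/* Widen an unsigned 32-bit value: the high word is always zero, so the
   low word can be pushed from wherever it already lives.  */
struct halves widen_unsigned(uint32_t a)
{
  struct halves h;

  h.low = a;
  h.high = 0;
  return h;
}

/* Widen a signed char: the high word is all copies of the sign bit.  */
struct halves widen_signed_char(signed char c)
{
  struct halves h;

  h.low = (uint32_t) (int32_t) c;
  h.high = c < 0 ? 0xffffffffu : 0u;
  return h;
}

/* Reassemble the halves, for checking only.  */
int64_t join(struct halves h)
{
  return (int64_t) (((uint64_t) h.high << 32) | h.low);
}

void check_widening(void)
{
  assert(join(widen_unsigned(0xffffffffu)) == 4294967295LL);
  assert(join(widen_signed_char(-1)) == -1);
}
```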
+
+<hr>
+<h2><a name="fpmove">Moving floating point through integer registers</a></h2>
+
+<p>(22 Jan 2000) GCC 2.96 on x86 knows how to move <code>float</code>
+quantities using integer instructions.  This is normally a win because
+floating point moves take more cycles.  However, it increases the
+pressure on the minuscule integer register file and therefore can end
+up making things worse.
+
+<p><pre>
+void
+fcpy(float *a, float *b, float *aa, float *bb, int n)
+{
+	int i;
+	for(i = 0; i &lt; n; i++) {
+		aa[i]=a[i];
+		bb[i]=b[i];
+	}
+}
+</pre>
+
+<p>I've compiled this three times and present the results side by
+side.  Only the inner loop is shown.
+
+<p><pre>
+  2.95 @ -O2            2.96 @ -O2                  2.96 @ -O2 -fomit-fp
+  .L6:                  .L6:                        .L6:
+                        movl  8(%ebp), %ebx         
+  flds  (%edi,%eax,4)   movl  (%ebx,%edx,4), %eax   movl  (%ebp,%edx,4), %eax
+  fstps (%ebx,%eax,4)   movl  %eax, (%esi,%edx,4)   movl  %eax, (%esi,%edx,4)
+                        movl  20(%ebp), %ebx        
+  flds  (%esi,%eax,4)   movl  (%edi,%edx,4), %eax   movl  (%edi,%edx,4), %eax
+  fstps (%ecx,%eax,4)   movl  %eax, (%ebx,%edx,4)   movl  %eax, (%ebx,%edx,4)
+  incl  %eax            incl  %edx                  incl  %edx               
+  cmpl  %edx,%eax       cmpl  %ecx, %edx            cmpl  %ecx, %edx         
+  jl    .L6             jl    .L6                   jl    .L6                
+</pre>
+
+<p>The loop requires seven registers: four base pointers, an index, a
+limit, and a scratch.  All but the scratch must be integer.  The x86
+has only six integer registers under normal conditions.  gcc 2.95 uses
+a float register for the scratch, so the loop just fits.  2.96 tries
+to use an integer register, and has to spill two pointers onto the
+stack to make everything fit.  Adding <code>-fomit-frame-pointer</code>
+ makes a seventh integer register available, and the loop fits again.
+
+<p>We see here numerous optimizer idiocies.  First, it ought to
+recognize that a load - even from L1 cache - is more expensive than a
+floating point move, and go back to the FP registers.  Second, instead
+of spilling the pointers, it should spill the limit register.  The
+limit is only used once and the <code>'cmpl'</code> instruction can
+take a memory operand.  Third, the loop optimizer has failed to do
+anything at all.  It should rewrite the code thus:
+
+<p><pre>
+void
+fcpy(float *a, float *b, float *aa, float *bb, int n)
+{
+	int i;
+	for(i = n; i &gt; 0; i--) {
+		*aa++ = *a++;
+		*bb++ = *b++;
+	}
+}
+</pre>
+
+<p>which compiles to this inner loop:
+
+<p><pre>
+.L6:
+        movl    (%edi), %eax
+        addl    $4, %edi
+        movl    %eax, (%ebx)
+        addl    $4, %ebx
+        movl    (%esi), %eax
+        addl    $4, %esi
+        movl    %eax, (%ecx)
+        addl    $4, %ecx
+        addl    $-1, %edx
+        jg      .L6
+</pre>
+
+<p>Yes, more adds are necessary, but this loop is going to be bound by
+memory bandwidth anyway, and the rewrite gets rid of the limit register.
+Thus the loop fits in the integer registers again.  Note that I have
+no idea why it isn't using the <code>'decl'</code> instruction.
+
+<p>If this were Fortran, we could do even better:
+
+<p><pre>
+void
+fcpy(float *a, float *b, float *aa, float *bb, int n)
+{
+	int i;
+	for(i = n; i &gt; 0; i--) {
+		aa[i] = a[i];
+		bb[i] = b[i];
+	}
+}
+</pre>
+
+<p>which compiles to:
+
+<p><pre>
+.L6:
+        movl    (%ebp,%ecx,4), %eax
+        movl    (%edi,%ecx,4), %edx
+        movl    %eax, (%esi,%ecx,4)
+        movl    %edx, (%ebx,%ecx,4)
+        addl    $-1, %ecx
+        jg      .L6
+</pre>
+
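+<p>at least with <code>-fomit-frame-pointer</code>.  (As written, this
+version touches elements 1 through <code>n</code> rather than 0 through
+<code>n-1</code>; a real transformation would also bias the base
+pointers by one element.)  You can't make that transformation in C
+because the compiler isn't allowed to assume that the vectors pointed
+to by <code>a</code>, <code>b</code>, <code>aa</code>, and
+<code>bb</code> do not overlap.  In Fortran it is.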
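+
+<p>C99's <code>restrict</code> qualifier lets the programmer make the
+same no-overlap promise by hand.  The following is a sketch under that
+assumption, not something I have compiled with the tree described
+above; <code>fcpy_restrict</code> and <code>fcpy_check</code> are
+hypothetical names, and the indices are biased by one so that the
+counted-down loop touches elements 0 through <code>n-1</code>.

```c
#include <assert.h>

/* C99 sketch: restrict promises that the four arrays do not overlap,
   so the loads and stores may be reordered freely.  */
void fcpy_restrict(const float *restrict a, const float *restrict b,
                   float *restrict aa, float *restrict bb, int n)
{
  int i;

  assert(n >= 0);

  /* Count down so the loop test is a comparison against zero; i - 1
     keeps the accesses within elements 0 .. n-1.  */
  for (i = n; i > 0; i--) {
    aa[i - 1] = a[i - 1];
    bb[i - 1] = b[i - 1];
  }
}

/* Returns 1 if a three-element copy produced the expected result.  */
int fcpy_check(void)
{
  float a[3] = { 1.0f, 2.0f, 3.0f };
  float b[3] = { 4.0f, 5.0f, 6.0f };
  float aa[3] = { 0.0f, 0.0f, 0.0f };
  float bb[3] = { 0.0f, 0.0f, 0.0f };
  int i;

  fcpy_restrict(a, b, aa, bb, 3);
  for (i = 0; i < 3; i++)
    if (aa[i] != a[i] || bb[i] != b[i])
      return 0;
  return 1;
}
```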
+
+<p>Then there's the question of loop unrolling, loop splitting, etc.
+but high-level transformations like those are outside the scope of
+this document.
+
+<hr>
+<p>Last modified: 22 Jan 2000
+<p>Zack Weinberg, <a
+href="mailto:zack@wolery.cumb.org">&lt;zack@wolery.cumb.org&gt;</a>
+
+</body>
+</html>
===================================================================
Index: proj-cpplib.html
--- proj-cpplib.html	1999/09/28 20:55:30	1.5
+++ proj-cpplib.html	2000/01/29 07:45:11
@@ -1,217 +1,176 @@
-<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
+		      "http://www.w3.org/TR/html4/loose.dtd">
 <html><head>
 <title>cpplib TODO</title>
+<link rev="made" href="mailto:zack@wolery.cumb.org">
 </head>
 
 <body bgcolor="white" text="black" link="#0000EE" vlink="#551A8B" alink="red">
-<h1 align=center>Projects relating to cpplib</h1>
+<h1 align="center">Projects relating to cpplib</h1>
 
-cpplib is almost ready to replace cccp as the standalone C
-preprocessor used by gcc.  A bit more work is necessary before it can
-be used directly from cc1 and the other front ends.
+<p>As of 28 January 2000, cpplib is the default C preprocessor used by
+gcc.  It is not yet linked into the C and C++ front ends by default,
+because the interface is likely to change and there are still some
+major bugs in that area.  There remain a number of bugs which need to
+be stomped out, and some missing features.  We also badly need more
+testing.
+
+<h2>How to help test</h2>
+
+<p>The number one priority for testing is cross-platform work.  Simply
+bootstrap the compiler and run the test suite on as many different OS
+and hardware combinations as you can.  I only have access to a very
+few.  
+
+<p>The number two priority is large packages that (ab)use the
+preprocessor heavily.  The compiler itself is pretty good at that, but
+doesn't cover all the bases.  If you've got cycles to burn, please
+try one or more of:
+
+<ul>
+  <li>BSD 'make world'
+  <li>Binutils
+  <li>Emacs
+  <li>GNOME
+  <li>GNU libc
+  <li>Guile
+  <li>Linux kernel (esp. non-i386)
+  <li>Mozilla
+  <li>Obfuscated C Contest entries
+  <li>Perl
+  <li>X11
+  <li>... and anything else you can think of.
+</ul>
+
+<p>Old grotty pre-ANSI code is particularly good for exposing bad
+assumptions and missed corner cases; you may have more trouble with
+bugs in the package than bugs in the compiler, though.
+
+<p>A bug report saying 'package FOO won't compile on system BAR' is
+useless.  At this stage what I need are short testcases with no system
+dependencies.  Aim for less than fifty lines and no #includes at all.
+I recognize this won't always be possible.
+
+<p>Also, please file off everything that would cause us legal trouble
+if we were to roll your test case into the distributed test suite.
+Short test cases will almost always fall under fair use guidelines, so
+don't sweat it too much.  An example of a problem is if your test case
+includes a 200-line comment detailing inner workings of your program.
+(A 200-line comment might be what you need to provoke a bug, but its
+contents are unlikely to matter.   Try running it through 
+<code>"tr A-Za-z x"</code>.)
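+
+<p>If <code>tr</code> is not to hand, the same scrubbing can be sketched
+in a few lines of C.  This is only an illustration;
+<code>blank_letters</code> is a hypothetical name, and it assumes an
+ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>

/* In-place equivalent of "tr A-Za-z x": every ASCII letter in S
   becomes 'x'; everything else (digits, punctuation, whitespace) is
   left alone.  Returns the number of characters replaced.  */
int blank_letters(char *s)
{
  int replaced = 0;

  assert(s != NULL);
  for (; *s != '\0'; s++)
    if ((*s >= 'A' && *s <= 'Z') || (*s >= 'a' && *s <= 'z')) {
      *s = 'x';
      replaced++;
    }
  return replaced;
}
```

+<p>Comment delimiters and line structure survive, so the scrubbed test
+case should still provoke the same preprocessor behavior.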
+
+<p>As usual, report bugs to <a
+href="mailto:gcc-bugs@gcc.gnu.org">gcc-bugs@gcc.gnu.org</a>.  But
+please read the rest of this document first!
+
+<h2>Known Bugs</h2>
+
+<p><ol>
+  <li>Under some conditions the line numbers seen by the compiler
+      proper are incorrect.  It shows up most obviously as bad line
+      numbers in warnings when bootstrapping the compiler.  I have not
+      been able to reproduce this with an input file of less than a
+      couple thousand lines.  Help would be greatly appreciated.
+
+  <li>cpplib will silently mangle input files containing ASCII NUL.
+      The cause of the bug is well known, but we weren't able to come
+      to consensus on what to do about it.  My personal preference is
+      to issue a warning and strip the NUL; other people feel it
+      should be preserved or considered a hard error.
+
+  <li>Character sets that are <em>not</em> strict supersets of ASCII
+      may cause cpplib to mangle the input file, even in comments or
+      strings.  Unfortunately, that includes important character sets
+      such as Shift JIS and UCS2.  (Please see the discussion of <a
+      href="#charset">character set issues</a>, below.)
+
+  <li>Trigraphs provoke warnings everywhere in the input file, even in
+      comments.  This is obnoxious, but difficult to fix due to the
+      brain-dead semantics of trigraphs and backslash-newline.
+
+  <li>Code that does perverse things with directives inside macro
+      arguments can cause the preprocessor to dump core.  cccp dealt
+      with this by disallowing all directives there, but it might be
+      nice to permit conditionals at least.
 
-<p>In rough priority order, the things that need to be done before cccp
-is retired:
+</ol>
 
-<ol>
-  <li>Fix the handling of <code>#define</code> and <code>#if</code>
-      so that they use the same lexical analysis code as the rest of
-      cpplib (i.e. <code>cpp_get_token</code>).  This is essential to
-      adding support for the new preprocessor features in C9x and C89
-      Amendment 1.  It also will enable the removal of the last global
-      variable in cpplib - meaning the library will be reentrant as
-      long as different <code>cpp_reader</code> objects are in use.
-      (For the curious: it's the presence or absence of <code>$</code>
-      in <code>is_idchar</code> and <code>is_idstart</code>.)
-  
-  <li>Implement C89 Amendment 1 "alternate spellings" of punctuators:<br>
+<h2>Missing User-visible Features</h2>
+
+<p><ol>
+
+  <li>C89 Amendment 1 "alternate spellings" of punctuators are not
+      recognized. These are
 <pre>		&lt;:  :&gt;  &lt;%  %&gt;  %:  %:%:</pre>
       which correspond, respectively, to
 <pre>		[   ]   {   }   #   ##</pre>
       The preprocessor must be aware of all of them, even though it
       uses only <code>%:</code> and <code>%:%:</code> itself.
-
-  <li>Support multi-byte characters in comments, identifiers, string
-      constants, and character constants.  Consensus on the egcs
-      development list was that this can be limited to systems with
-      support for reentrant multi-byte functions and for the
-      <code>nl_langinfo</code> interface.  cpplib will make no attempt
-      to interpret or translate multibyte characters.
-
-      <p>cpplib contains some optimizations which may not be
-      valid in the presence of multibyte characters.  The code to read
-      files and perform translation phases 1 through 3
-      (<code>read_and_prescan</code> in cppfiles.c) may
-      break if the bytes corresponding to <code>\</code>, <code>?</code>,
-      <code>^M</code>, and <code>^J</code> in ASCII can appear
-      "inside" a multibyte character.  Shift JIS has some characters
-      like this, but it is not clear to me whether the specific case
-      that will trigger problems can occur.
-
-      <p>A question for character set experts:  Are there multibyte
-      encodings for which the length of a multibyte sequence cannot be
-      determined by examining only the first character of that
-      sequence?  If so, which ones are they?
-
-  <li>Ignore ASCII NUL in an input file, with a warning.  Right now it
-      silently mangles the output.
-
-      <p>This is easy to fix; in <code>read_and_prescan</code>, don't
-      assume that NUL ends the current chunk of input.  But you have
-      to not wreck performance while you're at it.
-      <code>read_and_prescan</code> is fragile.
-
-  <li>Fix the memory leak in <code>#undef</code>. Someone thought it
-      was necessary to support
-<p>
-<pre>		#define foo(arg) blah arg blah
-		foo(bar
-		#undef foo
-		baz)
-</pre>
-<p>
-      which is undefined (C9x 6.10.3.11).  The Right Thing is to
-      detect this in <code>macarg</code> and treat it as an error.
-
-  <li>Support the <code>-lint</code> switch at least as well as cccp
-      does.  This is easy once someone tells me what the exact
-      syntax of lint comments are.  I believe that the regexp
-      <code>/^\s*\/\*\s*[A-Z0-9]+\s*\*\/\s*$/</code> correctly
-      describes the syntax, but would like confirmation from someone
-      who has actually used lint.  In particular, is it true that lint
-      comments always appear on lines by themselves?
-
-  <li>Support cccp's <code>-Wwhite-space</code> feature; this warns
-      about <code>/\\\s+$/</code>, which is not a line-continuation
-      backslash, but looks like one.  (Is there anything else it
-      should warn about?)
-      
-  <li>More testing.  I would like, at least, reports that bootstrap
-      completes and the testsuite gets no regressions versus cccp on
-      most major platforms.  Other tests that would be useful:  build
-      X11; test Imake outside the X11 tree; build the complete
-      (free|net|open)BSD tree; compile Emacs (both FSF and X).  I
-      already test glibc compiles on a regular basis.
 
-      <p>Test results for non-Intel and/or non-Linux platforms are
-      particularly desirable.
-</ol>
-
-To make cpplib usable from within language front ends, we need:
+  <li>Character sets that are strict supersets of ASCII are safe to
+      use, but extended characters cannot appear in identifiers.  This
+      has to be coordinated with the front end, and requires library
+      support which is usually not adequate.  See <a
+      href="#charset">character set issues</a>, below.
+
+  <li>C99 universal character escapes (<code>\uxxxx</code>,
+      <code>\Uxxxxxxxx</code>) are not recognized.  They are harmless
+      in comments, and will be passed on to the compiler safely if
+      they appear elsewhere, but cannot be used in macro names or #if
+      directives.  The C front end doesn't handle them either.
+
+  <li>C99's <code>_Pragma</code> intrinsic is not supported.  This
+      needs to be done in conjunction with the front end.
+
+  <li>cccp had some marginal support for translating lint directives
+      into #pragmas which the front end could see.  Of course, the
+      front end never did anything with them.  I don't intend to put
+      this back till the front end can use them.
+
+  <li>Precompiled headers are commonly requested; this entails the
+      ability for cpp to dump out and reload all its internal state.
+      You can get some of this with the debug switches, but not all,
+      and not in a reloadable format.  The front end must cooperate
+      also.
+
+  <li>Someone once requested warnings about stray whitespace in the
+      input, notably trailing whitespace after a backslash.  If that
+      happens, you have something that looks like a line-continuation
+      backslash, but isn't.
+
+  <li>Better support for languages other than C would be nice.  People
+      want to preprocess Fortran, Chill, and assembly language.  Chill
+      has been kludged in; Fortran and assembly still have serious
+      issues (notably, comment and string detection).
 
-<ol start=9>
-  <li>The public interface and private implementation details of
-      cpplib are currently mixed together in cpplib.h.
-      This must be cleaned up.
-  
-  <li>When cc1 is invoked on an already-preprocessed (.i) file,
-      the preprocessor must not be run again.  This should work, I'm
-      not sure.
+  <li><code>#define TOKEN TOKEN</code> should not cause infinite
+      recursion on the buffer stack when <code>-traditional</code> is
+      on.  All the interesting uses of traditional macro recursion use
+      function-like macros; object macros should probably be ANSI-ish
+      all the time.
 
-  <li>When cpplib is linked into front ends <code>-save-temps</code>
-      does not preserve an .i file.  This is the temp
-      file you usually want when tracking compiler bugs; its loss is
-      intolerable.  The simple fix: in the gcc driver, when
-      <code>-save-temps</code> is given, revert to using the external
-      preprocessor.
 </ol>
 
-Once that is done, more optimizations are possible:
+<h2>Internal work that needs doing</h2>
 
-<ol start=12>
-  <li>cc1 and cpp do quite a bit of duplicate bookkeeping of source
-      file, line, etc.  This should be eliminated.
-
-  <li>cc1 should take advantage of the partial lexical analysis done
-      by cpplib.  Maybe cpplib should do more complete lexical
-      analysis of C - at least identify all the different punctuators.
-
-      <p>To do that cleanly, <code>cpp_get_token</code> must return
-      exactly one token per invocation, except at EOF.  We need a
-      mechanism to queue a list of tokens for output.  This should fall
-      out of the macro-expansion rewrite.
-
-      <p>The main problem here is the directives like
-      <code>#pragma</code> and <code>#ident</code> that are passed on
-      to the language front-end for interpretation.  It would be a
-      good idea to add extended syntax to the front end that will fit
-      into the grammar (instead of requiring a special hook, as is
-      currently done) and translate.
-</ol>
-
-Some longer term projects which are largely independent of using
-cpplib directly from language front ends:
+<ol>
+  <li>The handling of <code>#define</code> and <code>#if</code> must
+      be fixed so it uses the same lexical analysis code as the rest of
+      cpplib (i.e. <code>cpp_get_token</code>).  This is essential to
+      adding support for the new preprocessor features in C9x and C89
+      Amendment 1.
 
-<ol start=15>
-  <li>Implement C9x UCN escapes.  These look like <code>\uXXXX</code>
-      or <code>\UXXXXXXXX</code> where each X is a hexadecimal digit.
-      They are legal in identifiers, string constants, and character
-      constants, and must be validated against some constraints when
-      parsed.  They designate characters from ISO/IEC 10646 (aka
-      Unicode) which are not in the "source character set".  cpplib
-      will not interpret them beyond the constraints in C9x, except
-      that it will map <code>\u0024</code> to <code>$</code>,
-      <code>\u0040</code> to <code>@</code>, and <code>\u0060</code>
-      to <code>`</code>.  All other UCN escapes with numbers below
-      <code>00A0</code> are illegal.  (Yes, this does mean that
-      <code>\u0024</code> will be legal in identifiers if and only if
-      the <code>$</code>-in-identifier extension is enabled.)
-
-  <li>It Would Be Nice if cpplib recognized when a multibyte character
-      was equivalent to a UCN escape; e.g. the sequences
-      <code>G&oacute;mez</code> and <code>G\u00F3mez</code> should be
-      treated as the same identifer.  This unfortunately would require
-      converting arbitrary multibyte characters to Unicode, and there
-      is no portable way to do that (<code>mbtowc</code> does not
-      necessarily produce Unicode).  However, cc1 has to do it, so
-      whatever solution we adopt there can be used in cpp also.
+  <li>cpplib makes two separate passes over the input file, which
+      causes a number of headaches, such as the trigraph warnings
+      inside comments.  It's also a performance problem.  Semantic
+      issues make a one-pass lexer impractical, but a two-pass scheme
+      with the first pass called coroutine fashion from the second
+      should work better.
 
   <li>The macro expander could use a total rewrite.  We currently
       re-tokenize macros every time they are expanded.  It'd be better
       to tokenize when the macro is defined and remember it for later.
-      Also, the macro expander is recursive and allocates large arrays
-      on the stack, which is asking for trouble.
-
-  <li>It might be worthwhile to cache file buffers after processing by
-      <code>read_and_prescan</code>.  My limited survey of header files
-      indicates that headers which don't contain idempotence
-      <code>#ifdef</code>s are generally included multiple times
-      (examples:  stddef.h, tree.def).
-      Caching would avoid the expense of rereading from the disk (or OS
-      cache) and the expense of redoing translation phases 1-3.  I
-      spent a lot of time bumming cycles out of
-      <code>read_and_prescan</code>, but it's still an expensive
-      operation.  However, the memory cost may be prohibitive.
-
-  <li>Wrapper headers - files containing only an include of another
-      file - should be optimized out on reinclusion.
-
-  <li><code>#define TOKEN TOKEN</code> should not cause infinite
-      recursion on the buffer stack when <code>-traditional</code> is
-      on.  GNU libc uses this construct heavily; it is therefore
-      impossible to use <code>-traditional</code> on systems that use
-      it.  Actually, all the interesting uses of traditional-mode
-      macro recursion involve macros with arguments, so maybe
-      object-like macros should always behave as specified in C89.
-      <p>The specific case where an object-like macro is defined to
-      itself can be optimized: give them their own hashtable code,
-      don't bother allocating a <code>DEFINITION</code> structure, and
-      skip all the processing done by <code>macroexpand</code>.
-
-  <li>Support for C9x's <code>_Pragma("...")</code> built-in macro
-      needs to be added eventually.  Ideally <code>#pragma</code> and
-      <code>_Pragma()</code> would go through the same interface, but
-      this may be difficult.
-      <p>An idea for implementation: invent a destringizing operator
-      symmetric with the existing stringizer.  Then _Pragma could be
-      implemented by the equivalent of
-<pre>		#define _Pragma(arg) #pragma #$arg</pre>
-      where <code>#$</code> is the destringizer.  This has almost the
-      right semantics for _Pragma according to C9x.  (The resultant
-      line is supposed to be processed as a directive, which wouldn't
-      happen if you took the above literally.)  Problem: a strictly
-      conforming program could contain <code>#$</code> in a context
-      where it would be interpreted as the destringizing operator.
 
   <li>The code uses <code>long</code>, <code>unsigned long</code>, and
       <code>size_t</code> interchangeably.  This is wrong, and needs to
@@ -223,28 +182,192 @@ cpplib directly from language front ends
       *</code>, and <code>U_CHAR *</code> interchangeably.  This is
       more of a consistency issue and annoyance than a real problem.
 
+  <li>VMS support has suffered extreme bit rot.  There may be problems
+      with support for DOS, Windows, MVS, and other non-Unixy
+      platforms.  I can fix none of these myself.
+
   <li>We use too much stack.  Large arrays should be moved to static
       storage (if constant) or the heap (if not).
+
+</ol>
+
+<h2>Integrating cpplib with the front ends</h2>
+
+<ol>
+
+  <li>The lexer should do more work - enough that when cpplib is
+      linked into the C or C++ front end, the front end doesn't have
+      to do any rescanning of tokens.
+
+  <li>The library interface needs to be tidied up.  Internal
+      implementation details are exposed all over the place.
+      Extracting all the information the library provides is
+      difficult.  
+
+  <li><code>cpp_get_token</code> must be changed to return exactly one
+      token per invocation.  For performance, there should be a
+      <code>cpp_get_tokens</code> call that returns a lineful.
+
+  <li>Front ends need to use cpplib's line and column numbering
+      interface directly.  cpplib needs to stop inserting #line
+      directives into the output.  (The standalone preprocessor in
+      cppmain.c counts as a front end.)
+
+  <li>When cpplib is linked into front ends <code>-save-temps</code>
+      does not preserve an .i file.  This is the temp
+      file you usually want when tracking compiler bugs; its loss is
+      intolerable.  The simple fix: in the gcc driver, when
+      <code>-save-temps</code> is given, revert to using the external
+      preprocessor.
 
-  <li>VMS support has bit-rotted to the point of total brokenness.
-      Someone who knows VMS needs to look at this.  EBCDIC support
-      (i.e. the MVS port) <i>may</i> be functional, but I wouldn't
-      swear to it.  The MVS port may also need system-specific code.
-
-  <li>More generally, there is quite a bit of Unix-specific code in
-      cppfiles.c.  It might be a good idea to reduce this.  Use of
-      stdio instead of POSIX I/O primitives is an obvious change.
-      (This might also make line-ending and multibyte character
-      support easier.)  Other things, like include search paths, are
-      harder.
 </ol>
+
+<h2>Optimizations</h2>
+
+<ol>
+
+  <li>It might be worthwhile to cache file buffers in memory after
+      lexical analysis, but before directive processing and macro
+      expansion.  My limited survey of header files indicates
+      that headers which don't contain wrapper <code>#ifdef</code>s
+      are generally included multiple times (examples: stddef.h,
+      tree.def).  Caching would avoid a good deal of work.  However,
+      the memory cost may be prohibitive.
+
+  <li>A complement to the usual one-huge-file scheme of precompiled
+      headers would be to cache files on disk after lexical analysis.
+      You could run a cruncher over <code>/usr/include</code> and save
+      the results in a <code>.jar</code> file or similar, bypassing
+      filesystem overhead as well as the work of lexical analysis.
+
+  <li>Wrapper headers - files containing only an #include of another
+      file - should be optimized out on reinclusion.  (Just tweak the
+      hash table entry of the wrapper to point to the file it reads.)
+
+  <li>When a macro is defined to itself, bypass the macro expander
+      entirely.
+
+  <li>Consider reading files with <code>mmap</code> rather than
+      <code>read</code>.  (Portability issues; may not be a real win.)
 
-<p>
+</ol>
+
+<h2><a name="charset">Character set issues</a></h2>
+
+<p>Proper character set handling is a hard problem.  Users want to be
+able to write comments and strings in their native language.  They
+want the strings to come out in their native language and not
+gibberish after translation to object code.  Some users also want to
+use their own alphabet for identifiers in their code.  There is no
+one-to-one or many-to-one map between languages and character sets.
+The subset of ASCII that is included in most modern day character sets
+does not include all the punctuation C uses; some of the missing
+punctuation may be present but at a different place than where it is
+in ASCII.  The subset described in ISO646 may not be the smallest
+subset out there.
+
+<p>Furthermore, the C standard's solutions for these problems are all
+more or less hideous.  None rises above the status of kludge.
+Trigraphs are nonintuitive and cause far more problems than they
+solve.  Digraphs are okay, but nonintuitive and not a complete
+solution.  <code>iso646.h</code> merely shifts the problem from one
+place to another, and is not a complete solution either.  UCN escapes
+assume Unicode, which makes them unsuitable for most Japanese and some
+Chinese environments.
+
+<p>Compounding the problem, the standard C library features for
+processing non-ASCII character sets are sadly lacking, even in the new
+standard (which no one's finished implementing yet).  To explain why,
+some background is necessary.  You can divide all existing character
+sets into three classes: unibyte, multibyte, and wide.  Multibyte
+characters can be further subdivided into shifted and unshifted
+encodings.  ASCII and most of its strict supersets - ISO 8859-x,
+KOI8-R, etc - are unibyte, which means that all characters are exactly
+one byte long.  This is obviously the easiest to deal with.
+
+<p>UCS2 and UCS4, and no other sets that I know of, are wide; this
+means that all characters are N bytes long, for some N greater than
+one.  Handling these requires mechanical code changes throughout the
+lexer, which is then incapable of handling unibyte encodings; you have
+to add a translator.  Memory requirements obviously at least double.
+However, no structural changes are needed.
+
+<p>UTF-8 and a few others are unshifted multibyte encodings.  That
+means that not all characters are one byte long, but given any one
+byte you can tell if it's a one-byte character, the first byte of a
+longer character, or one of the trailing bytes of a longer character,
+without any additional information.  These are almost as easy to deal
+with as unibyte encodings.
+
+<p>Finally, JISx and a few others are shifted multibyte encodings,
+meaning that you must remember state as you walk down a string in
+order to interpret it.  These are the worst to handle.  Unfortunately,
+this category includes most of the character sets used in Asian
+countries.
+
+<p>The C standard library has no way of processing multibyte
+encodings, shifted or not, other than translating them into some
+unspecified wide encoding.  For unshifted multibyte encodings, you can
+fake it as long as the only characters you're interested in
+manipulating (as opposed to passing through unexamined) are in the
+unibyte subset.  That's true for UTF-8 and C as long as you only allow
+the usual English letters, Arabic numbers, and the underscore in
+identifiers.  If you want to permit other alphanumeric characters in
+identifiers, you've got to find out what they are, and that requires
+converting to wide encoding first.
+
+<p>So what's wrong with converting to wide encoding?  First, it's
+slow.  Obscenely slow, with most C libraries.  It may be acceptably
+fast to convert an entire file all at once, but that doubles or
+quadruples your memory consumption.  Typical C source files are on a
+par with data cache sizes as is; double it and you're in main memory
+and slowed to a crawl.
+
+<p>Second, the normal wide encoding is Unicode, and conversion from
+some sets (JISx, again) to Unicode and back loses information.  [This
+is the infamous "Han unification" problem.]
+
+<p>Third, there is no portable way to tell the library what multibyte
+encoding you want to convert from.  You can only specify it indirectly
+by way of the locale.  Locale strings are not standardized, and
+setting the locale changes other behavior that we want left alone.
+
+<p>It is possible to walk down a multibyte string without converting
+it, using <code>mbrlen</code> or equivalent.  That's the slowest
+possible mode you can put the conversion library in, though.  Nor does
+it tell you anything about the characters you're hopping over.
+
+<p>End of rant.  So what's cpplib likely to support in the near
+future?  We will verify that it is safe to use any charset that is a
+strict superset of ASCII (unibyte or unshifted multibyte) in strings,
+character constants, and comments.  We'll also support UCN escapes in
+those locations.  If you write them in strings, the result will be
+in UTF-8.
+
+<p>Support for shifted multibyte charsets will come next, and will
+involve some sort of library that provides all of the useful
+<code>string.h</code> and <code>ctype.h</code> functions for an
+arbitrary character set, <em>without</em> conversion.  This will also
+require us to have some way to specify what character set an input
+file uses; the scheme MULE (Multilingual Emacs) uses is one
+possibility, and a #pragma is another.
+
+<p>Support for additional alphanumeric characters in identifiers will
+be added much later, because it presents ABI issues as well as
+compiler-guts issues.  Arbitrary bytes usually aren't legal in
+assembly labels nor in object-file string tables, so there needs to be
+a mangling scheme.  That scheme might be charset dependent,
+independent, or neutral, and you can make a case for all three.  All
+these things must be debated before we can implement anything.
+
+<p>There's one exception - <code>\u0024</code> will be legal in
+identifiers if and only if <code>$</code> is also legal.
+
+<p><hr>
 <address>Zack Weinberg,
-<a href="mailto:zack@rabi.columbia.edu">zack@rabi.columbia.edu</a>
+<a href="mailto:zack@wolery.cumb.org">zack@wolery.cumb.org</a>
 </address>
-<br><small><i>Last modified on May 15, 1999.</i></small>
-
+<br>Last modified 28 Jan 2000.
 <hr>
 
 <p><a href="projects.html">Back to the projects page</a>
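
A note on the "stray whitespace" item in the new to-do list: the check
being asked for is small. The following is only a sketch of one way it
could be done, not code from cpplib; the function name
backslash_then_trailing_space is hypothetical, and it assumes the
caller hands it one physical line at a time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of cpplib.  Returns 1 if LINE ends in
   a backslash followed by at least one space or tab, i.e. something
   that looks like a line continuation but is not one.  A trailing
   "\n" or "\r\n" is ignored.  */
int backslash_then_trailing_space(const char *line)
{
    size_t end = strlen(line);
    size_t ws_end;

    /* Ignore the line terminator, if the caller left it in.  */
    while (end > 0 && (line[end - 1] == '\n' || line[end - 1] == '\r'))
        end--;

    /* Skip backwards over trailing spaces and tabs.  */
    ws_end = end;
    while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t'))
        end--;

    /* No trailing whitespace: a final backslash is a real continuation.  */
    if (end == ws_end)
        return 0;

    return end > 0 && line[end - 1] == '\\';
}
```

A caller would presumably issue a warning when this returns 1 and
leave the line otherwise untouched.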

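To illustrate the claim in the character set section that, in an
unshifted multibyte encoding such as UTF-8, any single byte can be
classified without context: the sketch below looks only at the top two
bits of each byte. The names classify_utf8_byte and utf8_count_chars
are made up for this example, and it deliberately does no validation
(overlong forms, truncated sequences, and bytes that can never occur
in UTF-8 are not rejected).

```c
#include <stddef.h>

enum utf8_byte_kind { UTF8_SINGLE, UTF8_LEAD, UTF8_TRAIL };

/* Classify one byte from its bit pattern alone:
     0xxxxxxx  a complete one-byte character (the ASCII subset)
     10xxxxxx  a trailing byte of a longer character
     11xxxxxx  the first byte of a longer character
   No state is carried from one byte to the next.  */
enum utf8_byte_kind classify_utf8_byte(unsigned char b)
{
    if (b < 0x80)
        return UTF8_SINGLE;
    if ((b & 0xC0) == 0x80)
        return UTF8_TRAIL;
    return UTF8_LEAD;
}

/* Count characters by counting every byte that is not a trailing
   byte.  Assumes S is well-formed, NUL-terminated UTF-8.  */
size_t utf8_count_chars(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (classify_utf8_byte((unsigned char) *s) != UTF8_TRAIL)
            n++;
    return n;
}
```

With this, the six-byte UTF-8 spelling of "Gómez" counts as five
characters, and a lexer that only cares about ASCII punctuation can
pass lead and trailing bytes through unexamined. Nothing comparable
works for a shifted encoding, where the meaning of a byte depends on
the shift state accumulated so far.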