This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Some experiments with cpplib's output end


Neil recently reported that with his improvements to macro expansion,
the standalone cpp spent more than 50% of its time in cpp_output_token
and its subroutines.  This is a secondary application - normally
c-lex.c calls cpp_get_token directly - but it's still fairly important
to make it go fast.

We can blame GNU libc for being slow and bloated, but it's interesting
to look a bit more deeply.  Here's an excerpt from the call graph
profile of a basically stock cpp0 (I've got some patches in there but
they affect only comment skipping).  This was compiled by a gcc-3.0.2
snapshot.  The test case was combine.c, with RTL checking enabled to
bulk up the file, run 900 times.

total execution time 622.38s
               32.41  130.06 417852000/417852000     scan_translation_unit [4]
[6]     26.1   32.41  130.06 417852000         cpp_output_token [6]
               73.61   49.81 392669100/393003000     fputs_unlocked [7]
                2.61    4.02  25182900/25182900      fwrite [37]
                0.00    0.01      2700/70200         __overflow [126]

                0.06    0.04    333900/393003000     print_line [64]
               73.61   49.81 392669100/393003000     cpp_output_token [6]
[7]     19.8   73.67   49.85 393003000         fputs_unlocked [7]
               46.37    3.48 393003000/423131400     _IO_new_file_xsputn [12]

                0.58    0.04   4945500/423131400     vfprintf [26]
                2.97    0.22  25182900/423131400     fwrite [37]
               46.37    3.48 393003000/423131400     fputs_unlocked [7]
[12]     8.6   49.93    3.75 423131400         _IO_new_file_xsputn [12]
                2.72    0.00  10936800/10971900      mempcpy [60]
                0.08    0.85    326700/396900        _IO_file_overflow [77]
                0.09    0.00    326700/14269500      _IO_default_xsputn [47]

[Note that __overflow and _IO_file_overflow are, fundamentally, the
same routine.  _IO_new_file_xsputn punts to _IO_default_xsputn to
handle the chunk of text left over after an overflow has occurred.]

That's 417 million calls to cpp_output_token, nearly all of which
cause a call to fputs_unlocked.  (The rest cause a call to fwrite, but
since there are only 25 million of those, we're going to ignore them
for now.)  Looking at cpplex.c we see that these happen in one of two
places:

  switch (TOKEN_SPELL (token))
    {
    case SPELL_OPERATOR:
      {
	const unsigned char *spelling;

	if (token->flags & DIGRAPH)
	  spelling
	    = digraph_spellings[(int) token->type - (int) CPP_FIRST_DIGRAPH];
	else if (token->flags & NAMED_OP)
	  goto spell_ident;
	else
	  spelling = TOKEN_NAME (token);

*	ufputs (spelling, fp);
      }
      break;

    spell_ident:
    case SPELL_IDENT:
*     ufputs (NODE_NAME (token->val.node), fp);
    break;

(ufputs is a wrapper around fputs_unlocked that adapts cpplib's
universal use of unsigned char to libc's universal use of plain char.)

So nearly all tokens are categorized as SPELL_OPERATOR or SPELL_IDENT.
This is not surprising - SPELL_IDENT tokens are identifiers or
numbers, SPELL_OPERATOR tokens are punctuators, and your
typical program has far more of those than anything else.  (The other
possibilities are strings, which is where the 25 million calls to
fwrite come from, and "other", which a well-formed C program has none
of.)

All C punctuators are one to four characters long.  It is reasonable
to suppose that fputs has been tuned for longer strings and
therefore has excessive overhead for very short ones.  If we replace
that call to ufputs() with a loop calling putc, we get this profile:

total execution time 553.16s
               35.47   60.02 417852000/417852000     scan_translation_unit [4]
[9]     17.3   35.47   60.02 417852000         cpp_output_token [9]
               29.31   22.79 134260200/134594100     fputs_unlocked [11]
                2.62    5.06  25182900/25182900      fwrite [37]
                0.01    0.23     74700/142200        __overflow [108]

                0.07    0.06    333900/134594100     print_line [64]
               29.31   22.79 134260200/134594100     cpp_output_token [9]
[11]     9.4   29.38   22.84 134594100         fputs_unlocked [11]
               19.72    3.12 134594100/164722500     _IO_new_file_xsputn [16]

                0.72    0.11   4945500/164722500     vfprintf [27]
                3.69    0.58  25182900/164722500     fwrite [37]
               19.72    3.12 134594100/164722500     fputs_unlocked [11]
[16]     5.1   24.14    3.82 164722500         _IO_new_file_xsputn [16]
                2.94    0.00  10936800/10971900      mempcpy [53]
                0.10    0.70    254700/396900        _IO_file_overflow [74]
                0.08    0.00    254700/14197500      _IO_default_xsputn [44]

That's an 11% improvement all by itself.  We've knocked down the
number of calls to fputs_unlocked by a factor of four.  Let's see what
happens if we do the same thing for identifiers, which are mostly
pretty short, too.

total execution time 536.84s
               56.90    7.77 417852000/417852000    scan_translation_unit [4]
[9]     12.1   56.90    7.77 417852000         cpp_output_token [9]
                2.20    4.83  25182900/25182900     fwrite [35]
                0.05    0.69    258300/336600       __overflow [84]

                0.04    0.06    334800/334800       print_line [62]
[138]    0.0    0.04    0.06    334800         fputs_unlocked [138]
                0.05    0.02    334800/30469500     _IO_new_file_xsputn [38]

                0.05    0.02    334800/30469500     fputs_unlocked [138]
                0.67    0.28   4951800/30469500     vfprintf [24]
                3.40    1.43  25182900/30469500     fwrite [35]
[38]     1.1    4.11    1.73  30469500         _IO_new_file_xsputn [38]
                1.55    0.00   5392800/5425200      mempcpy [68]
                0.01    0.15     60300/396900       _IO_file_overflow [78]
                0.02    0.00     60300/14003100     _IO_default_xsputn [42]

That did not help as much; only 3% improvement over the previous.
We've squelched fputs_unlocked, but now cpp_output_token has to do
more work, which means we don't win as much as we might like.
Nonetheless this is definitely better than what we had.

It's possible to squeeze out another four seconds by using
fwrite_unlocked instead of fwrite, but this involves more autoconf
gook.  I am particularly unfond of it in this case, since the default
chosen by glibc is so completely and obviously wrong.

zw


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]