Re: gcc 3.1 is still very slow, compared to 2.95.3
- From: Julian Seward <jseward at acm dot org>
- To: gcc at gcc dot gnu dot org
- Cc: njn25 at cam dot ac dot uk
- Date: Sun, 19 May 2002 06:11:35 +0100
- Subject: Re: gcc 3.1 is still very slow, compared to 2.95.3
- References: <3CE6B61C.4F3A3ED@acm.org>
- Reply-to: jseward at acm dot org
Here are some numbers for 2.95.3 vs 3.1 on a 1133 MHz PIII-T with 16K
I1, 16K D1, 512K L2 caches and 133 MHz SDRAM.
These were measured using valgrind-20020518 simulating the above cache
arrangement. You should regard what follows with some skepticism (I
do!), but that aside, assuming they are true, the results are
interesting.
In both cases I compiled preprocessed bzip2.c in bzip2-0.1pl2 with -O2:
.../cc1 ./bzip2.i -quiet -O2 -version -o bzip2.s
This is a 3000+ line C file with some large procedures and many loops.
I suspect it is very similar to the bzip2 in SPEC CPU2000.
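For reference, with a present-day valgrind the same measurement would be
set up by running cc1 under the cachegrind tool with the cache geometry
spelled out, roughly:

  valgrind --tool=cachegrind --I1=16384,4,16 --D1=16384,4,16 \
           --LL=524288,8,32 .../cc1 ./bzip2.i -quiet -O2 -version -o bzip2.s

The --tool/--I1/--D1/--LL option names are taken from current cachegrind
documentation (each cache argument is total size, associativity, line
size); the 20020518 snapshot's options may well have been spelled
differently, so treat this as a sketch of the setup rather than the
literal command used here.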
Compile times:
2.95.3: 0m1.450s, 0m1.490s, 0m1.460s
3.1: 0m2.340s, 0m2.360s, 0m2.390s
Counts:

            Instruction refs            Data read refs              Data write refs
          I1refs  I1miss  L2miss      D1refs  D1miss  L2miss      D1refs  D1miss  L2miss
2.95.3    1,184M   22.9M    371k        408M   7.03M   97.9k        224M   2.68M    284k
3.1       1,847M   31.0M    822k        563M   11.8M    378k        377M   13.6M    670k
Note that each instruction is regarded as causing one I1 ref, so the
I1 refs count is the instruction count.
Condensed counts, with ratios relative to 2.95.3 (the L1 figure is the
sum of the I1, D1-read and D1-write misses above, and likewise for L2):

          Instructions        L1 misses         L2 misses
2.95.3    1,184M              32.6M              753k
3.1       1,847M (1.55)       56.4M (1.73)      1870k (2.48)
That's an alarming increase in the L2 miss rate. I'm particularly
surprised (suspicious) at the claimed more-than-doubling of the L2
misses caused by insn refs. Or then again, perhaps not, given that
the text size is getting on for doubled and the L2 also has to
cope with a hammering from D1:
   text    data     bss     dec     hex  filename
2883663    6516  692928 3583107  36ac83  gcc-3.1/..../3.1/cc1
1658059   17804   87076 1762939  1ae67b  gcc-2.95.3/..../2.95.3/cc1

(Sizes are in bytes; dec and hex are the text+data+bss total in decimal
and hex.)
I was struck by the 5-fold increase in D1 write misses. Sorting
functions by D1 write misses gives
       Writes   D1-w miss   L2-w miss
   79,686,641  11,105,948     181,718   ../sysdeps/i386/memset.c:memset
    2,230,720     279,758     140,231   genrtl.c:gen_rtx_fmt_i0
      430,998     121,051           0   gcc/recog.c:preprocess_constraints
      388,072     101,271      20,187   gcc/sbitmap.c:sbitmap_vector_alloc
which doesn't really lead anywhere. Either it's untrue, or someone is
calling memset like crazy where they weren't in 2.95.3.
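As a methodology note: with a current valgrind, per-function breakdowns
like the one above are what cg_annotate prints when pointed at the
cachegrind output file, e.g.

  cg_annotate cachegrind.out.<pid>

and it can sort the function list on any of the recorded events. The
tool name and output-file naming are the modern ones; I'm assuming the
20020518 snapshot's annotation script behaved similarly.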
The read-miss high-rollers are more evenly spread out:
        Reads   D1-r miss   L2-r miss
    6,001,122     913,636       4,101   gcc/alias.c:init_alias_analysis
      699,650     460,290      77,616   gcc/ggc-page.c:ggc_pop_context
    7,529,883     419,953         806   gcc/rtlanal.c:note_stores
    3,091,375     260,500       3,086   libiberty/hashtab.c:htab_traverse
   14,293,623     240,896       1,259   gcc/regclass.c:reg_scan_mark_refs
    6,692,484     224,000         579   gcc/rtlanal.c:find_reg_note
      480,479     221,851         165   gcc/regclass.c:reg_scan
    7,728,398     210,269         710   gcc/rtlanal.c:side_effects_p
    3,692,958     188,313       3,424   gcc/recog.c:extract_insn
    2,059,888     173,745         245   gcc/cse.c:cse_end_of_basic_block
    2,765,888     119,995         392   gcc/jump.c:mark_jump_label
      897,396     116,789          61   gcc/cfgbuild.c:find_basic_blocks_1
    3,656,876     114,462      18,092   /usr/lib/bison.simple:yyparse_1
    1,115,343     111,289         168   gcc/flow.c:propagate_block
      875,912     101,947          82   gcc/flow.c:insn_dead_p
      834,122     101,700         376   gcc/recog.c:preprocess_constraints
Hmm. Maybe you folks can make sense of this, or tell me that it's
nonsense. I guess a smart thing to do would be to use Rabbit to see
if the numbers collected by this PIII's hardware counters bear any
relationship to what's showing up here.
Careful testing on Athlons has shown that valgrind's cache miss
numbers generally agree with what the hardware measures, so there's no
immediate reason to regard the above as untrue. It just _seems_ a bit
unlikely.
J
---------
Details of the simulated caches (total size, line size, associativity):
I1 cache: 16384 B, 16 B, 4-way associative
D1 cache: 16384 B, 16 B, 4-way associative
L2 cache: 524288 B, 32 B, 8-way associative
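To make the arrangement above concrete, here is a toy sketch of how a
simulator with the I1/D1 geometry (16384 B, 16 B lines, 4-way, hence
256 sets) maps a reference to a set and tracks misses with LRU
replacement. It is purely my own illustration, not valgrind's code:
the real simulator also models the L2 level, distinguishes reads from
writes, and handles accesses that straddle a line boundary.

  #include <stdio.h>

  #define LINE_BITS  4       /* 16-byte lines                          */
  #define SETS       256     /* 16384 B / (16 B line * 4 ways)         */
  #define WAYS       4

  /* Each set keeps its resident line tags in recency order:           */
  /* slot 0 is most recently used, slot set_len-1 least recently.      */
  static unsigned long set_tag[SETS][WAYS];
  static int           set_len[SETS];

  static unsigned long refs, misses;

  /* Simulate one reference to byte address 'addr'; returns 1 on miss. */
  static int cache_ref(unsigned long addr)
  {
      unsigned long line = addr >> LINE_BITS;
      unsigned long set  = line % SETS;   /* low 8 bits of line number */
      unsigned long tag  = line / SETS;   /* the rest names the line   */
      int i, pos = -1, miss;

      refs++;
      for (i = 0; i < set_len[set]; i++)
          if (set_tag[set][i] == tag) { pos = i; break; }

      miss = (pos < 0);
      if (miss) {
          misses++;
          if (set_len[set] < WAYS)        /* fill an empty way, else   */
              set_len[set]++;             /* reuse (evict) the LRU one */
          pos = set_len[set] - 1;
      }
      /* Promote the referenced line to most-recently-used.            */
      for (i = pos; i > 0; i--)
          set_tag[set][i] = set_tag[set][i - 1];
      set_tag[set][0] = tag;

      return miss;
  }

  int main(void)
  {
      unsigned long a;
      /* Sweep a 64K region twice: four times the cache size, so with  */
      /* LRU every line is evicted before reuse and all 8192 refs miss.*/
      for (a = 0; a < 2 * 65536; a += 16)
          cache_ref(a % 65536);
      printf("refs %lu  misses %lu\n", refs, misses);
      return 0;
  }

With 16 B lines and 4 ways a 16 KB cache has 256 sets, so bits 4-11 of
the address select the set and the remaining high bits form the tag.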