Re: gcc 3.1 is still very slow, compared to 2.95.3
- From: Julian Seward <jseward at acm dot org>
- To: gcc at gcc dot gnu dot org
- Cc: njn25 at cam dot ac dot uk
- Date: Sun, 19 May 2002 06:11:35 +0100
- Subject: Re: gcc 3.1 is still very slow, compared to 2.95.3
- References: <3CE6B61C.4F3A3ED@acm.org>
- Reply-to: jseward at acm dot org
Here are some numbers for 2.95.3 vs 3.1 on a 1133 MHz PIII-T with 16K
I1, 16K D1, 512K L2 caches and 133 MHz SDRAM.
These were measured using valgrind-20020518 simulating the above cache
arrangement. You should regard what follows with some skepticism (I
do!), but that aside, assuming they are true, the results are
interesting.
In both cases I compiled preprocessed bzip2.c in bzip2-0.1pl2 with -O2:
.../cc1 ./bzip2.i -quiet -O2 -version -o bzip2.s
This is a 3000+ line C file with some large procedures and many loops.
I suspect it is very similar to the bzip2 in SPEC CPU2000.
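For reference, with a present-day valgrind the same measurement would be
set up by running cc1 under the cachegrind tool with the cache geometry
spelled out, roughly:

  valgrind --tool=cachegrind --I1=16384,4,16 --D1=16384,4,16 \
           --LL=524288,8,32 .../cc1 ./bzip2.i -quiet -O2 -version -o bzip2.s

The --tool/--I1/--D1/--LL option names are taken from current cachegrind
documentation (each cache argument is total size, associativity, line
size); the 20020518 snapshot's options may well have been spelled
differently, so treat this as a sketch of the setup rather than the
literal command used here.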
Compile times:
2.95.3: 0m1.450s, 0m1.490s, 0m1.460s
3.1: 0m2.340s, 0m2.360s, 0m2.390s
Counts:

            Instruction refs            Data read refs              Data write refs
          I1refs  I1miss  L2miss      D1refs  D1miss  L2miss      D1refs  D1miss  L2miss
2.95.3    1,184M   22.9M    371k        408M   7.03M   97.9k        224M   2.68M    284k
3.1       1,847M   31.0M    822k        563M   11.8M    378k        377M   13.6M    670k
Note that each instruction is regarded as causing one I1 ref, so the
I1 refs count is the instruction count.
Condensed counts, with ratios relative to 2.95.3 (the L1 figure is the
sum of the I1, D1-read and D1-write misses above, and likewise for L2):

          Instructions        L1 misses         L2 misses
2.95.3    1,184M              32.6M              753k
3.1       1,847M (1.55)       56.4M (1.73)      1870k (2.48)
That's an alarming increase in the L2 miss rate. I'm particularly
surprised (suspicious) at the claimed more-than-doubling of the L2
misses caused by insn refs. Or then again, perhaps not, given that
the text size is getting on for doubled and the L2 also has to
cope with a hammering from D1:
   text    data     bss     dec     hex  filename
2883663    6516  692928 3583107  36ac83  gcc-3.1/..../3.1/cc1
1658059   17804   87076 1762939  1ae67b  gcc-2.95.3/..../2.95.3/cc1

(Sizes are in bytes; dec and hex are the text+data+bss total in decimal
and hex.)
I was struck by the 5-fold increase in D1 write misses. Sorting
functions by D1 write misses gives
       Writes   D1-w miss   L2-w miss
   79,686,641  11,105,948     181,718   ../sysdeps/i386/memset.c:memset
    2,230,720     279,758     140,231   genrtl.c:gen_rtx_fmt_i0
      430,998     121,051           0   gcc/recog.c:preprocess_constraints
      388,072     101,271      20,187   gcc/sbitmap.c:sbitmap_vector_alloc
which doesn't really lead anywhere. Either it's untrue, or someone is
calling memset like crazy where they weren't in 2.95.3.
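As a methodology note: with a current valgrind, per-function breakdowns
like the one above are what cg_annotate prints when pointed at the
cachegrind output file, e.g.

  cg_annotate cachegrind.out.<pid>

and it can sort the function list on any of the recorded events. The
tool name and output-file naming are the modern ones; I'm assuming the
20020518 snapshot's annotation script behaved similarly.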
The read-miss high-rollers are more evenly spread out:
        Reads   D1-r miss   L2-r miss
    6,001,122     913,636       4,101   gcc/alias.c:init_alias_analysis
      699,650     460,290      77,616   gcc/ggc-page.c:ggc_pop_context
    7,529,883     419,953         806   gcc/rtlanal.c:note_stores
    3,091,375     260,500       3,086   libiberty/hashtab.c:htab_traverse
   14,293,623     240,896       1,259   gcc/regclass.c:reg_scan_mark_refs
    6,692,484     224,000         579   gcc/rtlanal.c:find_reg_note
      480,479     221,851         165   gcc/regclass.c:reg_scan
    7,728,398     210,269         710   gcc/rtlanal.c:side_effects_p
    3,692,958     188,313       3,424   gcc/recog.c:extract_insn
    2,059,888     173,745         245   gcc/cse.c:cse_end_of_basic_block
    2,765,888     119,995         392   gcc/jump.c:mark_jump_label
      897,396     116,789          61   gcc/cfgbuild.c:find_basic_blocks_1
    3,656,876     114,462      18,092   /usr/lib/bison.simple:yyparse_1
    1,115,343     111,289         168   gcc/flow.c:propagate_block
      875,912     101,947          82   gcc/flow.c:insn_dead_p
      834,122     101,700         376   gcc/recog.c:preprocess_constraints
Hmm. Maybe you folks can make sense of this, or tell me that it's
nonsense. I guess a smart thing to do would be to use Rabbit to see
if the numbers collected by this PIII's hardware counters bear any
relationship to what's showing up here.
Careful testing on Athlons has shown that valgrind's cache miss
numbers generally agree with what the hardware measures, so there's no
immediate reason to regard the above as untrue. It just _seems_ a bit
unlikely.
J
---------
Details of the simulated caches (total size, line size, associativity):
I1 cache: 16384 B, 16 B, 4-way associative
D1 cache: 16384 B, 16 B, 4-way associative
L2 cache: 524288 B, 32 B, 8-way associative
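To make the arrangement above concrete, here is a toy sketch of how a
simulator with the I1/D1 geometry (16384 B, 16 B lines, 4-way, hence
256 sets) maps a reference to a set and tracks misses with LRU
replacement. It is purely my own illustration, not valgrind's code:
the real simulator also models the L2 level, distinguishes reads from
writes, and handles accesses that straddle a line boundary.

  #include <stdio.h>

  #define LINE_BITS  4       /* 16-byte lines                          */
  #define SETS       256     /* 16384 B / (16 B line * 4 ways)         */
  #define WAYS       4

  /* Each set keeps its resident line tags in recency order:           */
  /* slot 0 is most recently used, slot set_len-1 least recently.      */
  static unsigned long set_tag[SETS][WAYS];
  static int           set_len[SETS];

  static unsigned long refs, misses;

  /* Simulate one reference to byte address 'addr'; returns 1 on miss. */
  static int cache_ref(unsigned long addr)
  {
      unsigned long line = addr >> LINE_BITS;
      unsigned long set  = line % SETS;   /* low 8 bits of line number */
      unsigned long tag  = line / SETS;   /* the rest names the line   */
      int i, pos = -1, miss;

      refs++;
      for (i = 0; i < set_len[set]; i++)
          if (set_tag[set][i] == tag) { pos = i; break; }

      miss = (pos < 0);
      if (miss) {
          misses++;
          if (set_len[set] < WAYS)        /* fill an empty way, else   */
              set_len[set]++;             /* reuse (evict) the LRU one */
          pos = set_len[set] - 1;
      }
      /* Promote the referenced line to most-recently-used.            */
      for (i = pos; i > 0; i--)
          set_tag[set][i] = set_tag[set][i - 1];
      set_tag[set][0] = tag;

      return miss;
  }

  int main(void)
  {
      unsigned long a;
      /* Sweep a 64K region twice: four times the cache size, so with  */
      /* LRU every line is evicted before reuse and all 8192 refs miss.*/
      for (a = 0; a < 2 * 65536; a += 16)
          cache_ref(a % 65536);
      printf("refs %lu  misses %lu\n", refs, misses);
      return 0;
  }

With 16 B lines and 4 ways a 16 KB cache has 256 sets, so bits 4-11 of
the address select the set and the remaining high bits form the tag.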