This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
cache misses in gcc 3.3
- From: Andi Kleen <ak at muc dot de>
- To: gcc at gcc dot gnu dot org
- Date: Sun, 9 Feb 2003 23:13:11 +0100
- Subject: cache misses in gcc 3.3
I ran a slightly older gcc 3.3 on linux in cachegrind
(http://developer.kde.org/~sewardj/) on an Athlon with 256K cache.
cachegrind is a cache simulator and gives detailed reports about
the caching behaviour of a program.
The test case was one large file from the linux 2.4 kernel source
(tcp_input.c) compiled with -O2
The interesting columns are I2mr and D2mr (ignoring L1 cache for now)
The top cache missers are not surprising. Top three is comfortably
hold by the garbage collector (ggc_set_mark, gt_ggc_mx_lang_tree_node,
memset). It was run with the too small 4MB heap default.
Then for_each_rtx - lots of optimization passes walking through
the code all the time.
side_effects_p is a pig regarding the dcache. It seems to walk through
a significant part of the RTL quite often. Perhaps some simple
prefetching could speed it up?
constrain_operands seems to have the same problem. Suprisingly it gets
quite a lot of icache misses too. It is not that big a function
so that's a big surprising (perhaps -Os for it would help)
One general observation is that the instruction cache is less stressed
than the data cache. This suggests that it is better to implement
new algorithms with code than with state machines driven with tables.
One particularly interesting case for this I investigated nearer is the
bison parser (bison.simple). In my test it took >10% and much longer
than lexical analysis, which is quite slow (on fast compilers lexical
analysis is slower than the parser). It shows quite a bit dcache misses
too.
I investigated this a bit further using vg_annotate. The hot parts
are data cache misses when the bison parser fetches a new state from
its state tables. The problem seems to be that the other passes in
rest_of_compilation blow away all cache state and when the parser runs
again it has to refetch all the LALR(1) tables from main memory, which
is costly (perhaps it needs a few prefetches as short time help?)
I suspect using an handwritten top-down parser would be somewhat faster
here because it encodes the grammer in code, not data and I would expect it
to have a smaller working set and be less affected by data caches overflowing.
-Andi
--------------------------------------------------------------------------------
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
L2 cache: 262144 B, 64 B, 8-way associative
Command: /pkg/gcc-3.3-030124/lib/gcc-lib/i686-pc-linux-gnu/3.3/cc1 -O2 tcp_input.i
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
--------------------------------------------------------------------------------
18,446,744,072,045,728,445 5,923,770 1,150,972 805,699,845 7,724,585 2,865,482 522,750,208 1,233,655 809,434 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
--------------------------------------------------------------------------------
172,398,900 38 37 53,636,310 250,492 116,947 17,621,244 2,379 839 ggc-page.c:ggc_set_mark
100,861,678 22,609 911 6,094,224 5,070 161 79,872,781 423,396 331,747 ???:memset
93,221,935 371 371 22,777,108 863,697 837,499 19,204,130 1,982 909 gt-c-decl.h:gt_ggc_mx_lang_tree_node
80,421,115 3,047 1,419 22,739,407 21,348 1,210 13,869,790 2,535 136 rtlanal.c:for_each_rtx
67,914,814 12,937 1,589 21,950,185 86,360 21,983 11,618,234 17,148 3,309 ggc-page.c:ggc_alloc
45,466,105 442 442 9,796,518 314,064 267,408 6,490,270 122 52 gtype-desc.c:gt_ggc_mx_rtx_def
42,527,220 172,435 18,354 16,389,146 15,361 1,838 9,103,320 2,923 61 cse.c:cse_insn
42,296,870 37,486 1,253 14,394,579 107,831 13,903 7,455,819 1,527 37 rtlanal.c:side_effects_p
40,474,855 104,575 4,790 8,641,610 10,605 1,630 4,106,599 4,087 652 recog.c:constrain_operands
39,103,794 3,800 3,675 11,887,615 6,397 3,136 3,202,125 999 599 regclass.c:record_reg_classes
31,810,357 23,838 2,146 12,820,501 175,648 27,730 3,690,166 3,189 651 bison.simple:yyparse
31,293,300 6,213 3,920 10,455,781 8,520 2,636 3,898,817 917 102 cse.c:rtx_cost
30,689,036 15,007 2,731 8,847,622 9,206 3,580 7,307,109 605 56 tree-inline.c:walk_tree
30,221,891 4,735 2,848 9,736,810 15,819 2,650 5,148,742 128 14 flow.c:mark_set_1