This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Faster compilation speed


How about reordering the rows and columns in the table used by yyparse to improve locality? Build an instrumented version of yyparse that records the number of times each transition is taken, then use that data to interchange rows and columns so that frequent transitions land in the same cache line (or at least in memory locations that do not conflict). It would be a kind of feedback-directed optimization (like -fprofile-arcs/-fbranch-probabilities) for bison.
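
A minimal sketch of the row half of that idea, written in Python for brevity and not as a patch to bison. It assumes a hypothetical instrumented parser has already produced a visit count per state, and it works on a plain dense table; the names `reorder_states`, `table`, and `hits` are illustrative only. Bison's generated tables are packed (yytable/yycheck), so a real implementation would have to operate on that layout instead.

```python
def reorder_states(table, hits):
    """Renumber parser states so frequently visited rows are adjacent.

    table[s][sym] is the next state for state s on symbol sym (-1 = error).
    hits[s] is how often state s was visited during the profiling run.
    Returns (new_table, old_to_new).
    """
    # Hottest states first; sorted() is stable, so ties keep their order.
    order = sorted(range(len(table)), key=lambda s: -hits[s])
    old_to_new = {old: new for new, old in enumerate(order)}

    # Move each row to its new position and rewrite the state numbers
    # stored inside it, leaving error entries untouched.
    new_table = [
        [old_to_new[t] if t >= 0 else t for t in table[old]]
        for old in order
    ]
    return new_table, old_to_new


# Toy 3-state, 2-symbol table with made-up visit counts.
table = [[1, 2], [-1, 2], [0, -1]]
hits = [5, 1, 9]
new_table, old_to_new = reorder_states(table, hits)
```

The caller also has to remap the start state through `old_to_new`. Columns could be permuted the same way using per-symbol counts.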

-Will

Jeff Sturm wrote:
On Tue, 13 Aug 2002, David Edelsohn wrote:

	Here's an interesting (aka depressing) data point.  My previous
cache miss statistics were for GCC -O2.  At -O0, GCC's cache miss
statistics stay the same or get up to 20% *worse*.  In comparison, the
cache statistics for IBM's compiler without optimization enabled *improve*
up to 50% for the same reload.c and insn-recog.c input files compared to
optimized.

Here's a data point on alpha-linux:

cc1 -quiet -O2 reload.i
issues/cycles = 0.51  issues/dcache_miss = 26.93

Without optimization:

cc1 -quiet  reload.i
issues/cycles = 0.52  issues/dcache_miss = 31.29

This is on an ev56 with a direct-mapped cache.  To get some idea where the
misses are taking place, I experimented with iprobe's sampling mode.
Omitting results below the 1% sample threshold, I get:

function                    | issues | access | misses | i/m |  a/m
----------------------------+--------+--------+--------+-----+-----
yyparse                     |   2924 |    848 |    148 |  20 |  5.7
gt_ggc_mx_lang_tree_node    |   1336 |    612 |     74 |  18 |  8.2
verify_flow_info            |   1388 |    408 |    129 |  11 |  3.1
copy_rtx_if_shared          |   2120 |   1012 |     53 |  40 | 19.0
propagate_one_insn          |   3636 |    504 |     52 |  70 |  9.6
find_temp_slot_from_address |    728 |    232 |    126 |   6 |  1.8
ggc_mark_rtx_children_1     |   1580 |    316 |     40 |  40 |  7.9
extract_insn                |   1576 |    476 |     52 |  30 |  9.1
record_reg_classes          |   3848 |    944 |     65 |  59 | 14.5
reg_scan_mark_refs          |   1472 |    632 |     66 |  22 |  9.5
find_reloads                |   7680 |   3104 |    148 |  52 | 20.9
subst_reloads               |   4772 |   2736 |    169 |  28 | 16.1
side_effects_p              |   1344 |    564 |     43 |  31 | 13.1
for_each_rtx                |   4924 |   1464 |     75 |  66 | 19.5
ggc_alloc                   |   2424 |    728 |    111 |  22 |  6.5
ggc_set_mark                |   3392 |    976 |    107 |  32 |  9.1

(Each sample reported is 2^14 events.)
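
For anyone wanting to rework these numbers, here is a small sketch of how the derived columns can be recomputed. It reads i/m as issues per miss and a/m as accesses per miss, which is an assumption, though it matches the printed values to within rounding. Only four rows are copied in; the helper names are made up for illustration.

```python
SAMPLE_SIZE = 2 ** 14  # events per sample, as stated above

# (issues, accesses, misses) sample counts copied from four table rows.
samples = {
    "yyparse": (2924, 848, 148),
    "verify_flow_info": (1388, 408, 129),
    "find_temp_slot_from_address": (728, 232, 126),
    "find_reloads": (7680, 3104, 148),
}


def ratios(issues, accesses, misses):
    """Return (issues per miss, accesses per miss)."""
    return issues / misses, accesses / misses


def estimated_misses(misses):
    """Scale a sample count up to an approximate event count."""
    return misses * SAMPLE_SIZE


# Rank from worst to best data-cache behaviour (fewest accesses per miss).
worst_first = sorted(samples, key=lambda f: ratios(*samples[f])[1])

for name in worst_first:
    i_m, a_m = ratios(*samples[name])
    print(f"{name:28s} i/m={i_m:5.1f} a/m={a_m:5.1f}")
```

Sorting by a/m rather than by raw miss count separates functions that miss often simply because they run a lot from those that miss on a large share of their accesses.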

yyparse performs badly (as would any table-driven parser), but how about
verify_flow_info and find_temp_slot_from_address?  Both are reporting
awful cache behavior.

Jeff


