This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Tremendous performance regression in 1.1.2 -> mainline
- To: gcc at gcc dot gnu dot org
- Subject: Tremendous performance regression in 1.1.2 -> mainline
- From: Brad Lucier <lucier at math dot purdue dot edu>
- Date: Thu, 6 Apr 2000 11:56:22 -0500 (EST)
- Cc: lucier at math dot purdue dot edu, feeley at iro dot umontreal dot ca, hosking at cs dot purdue dot edu
To follow up on my e-mail at
http://gcc.gnu.org/ml/gcc/2000-03/msg00860.html
I decided to compare the performance of egcs-1.1.2 with the mainline
compiler on that code.
With egcs-1.1.2 on alpha-redhat-linux:
popov-75% time /usr/lib/gcc-lib/alpha-redhat-linux/egcs-2.91.66/cc1 -O1 _std.i
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___std ___init_proc ____20___std
time in parse: 1.827072
time in jump: 1.662128
time in cse: 1.168272
time in loop: 0.006832
time in flow: 13.635696
time in combine: 0.992592
time in local-alloc: 0.440176
time in global-alloc: 1.127280
time in shorten-branch: 0.068320
time in final: 0.349408
20.894u 1.003s 0:21.96 99.6% 0+0k 0+0io 326pf+0w
With gcc version 2.96 20000331:
popov-76% time /export/u10/egcs-test/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 _std.i
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___std {GC 27190k -> 8575k in 0.085} {GC 12863k -> 8778k in 0.091} {GC 12534k -> 9230k in 0.096} ___init_proc {GC 18065k -> 1705k in 0.019} ____20___std
time in parse: 2.065216 (0%)
time in jump: 1.479616 (0%)
time in cse: 0.730048 (0%)
time in loop: 0.006832 (0%)
time in flow: 1466.980704 (99%)
time in combine: 0.996496 (0%)
time in local-alloc: 0.452864 (0%)
time in global-alloc: 1.095072 (0%)
time in flow2: 5.055680 (0%)
time in shorten-branch: 0.056608 (0%)
time in final: 0.897920 (0%)
time in varconst: 0.007808 (0%)
time in gc: 0.289872 (0%)
1482.946u 0.711s 24:48.62 99.6% 0+0k 0+0io 621pf+0w
So the current version is about 70 times slower than egcs-1.1.2
(1482.9 vs. 20.9 user seconds); almost all the time is spent in the
flow pass, specifically in compute_flow_dominators.
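As a rough check of the figures above, the slowdown can be recomputed from the two -O1 runs. This is only a sketch: the variable names are made up for illustration, and it assumes the per-pass "time in flow" values are directly comparable with the user times reported by `time`.

```python
# User CPU seconds reported by `time`, and seconds spent in the flow
# pass, copied from the two -O1 runs quoted above.
egcs_user, egcs_flow = 20.894, 13.635696
mainline_user, mainline_flow = 1482.946, 1466.980704

total_ratio = mainline_user / egcs_user     # whole-compile slowdown
flow_ratio = mainline_flow / egcs_flow      # slowdown of the flow pass alone
flow_share = mainline_flow / mainline_user  # fraction of mainline time in flow

print(f"total: {total_ratio:.1f}x  flow: {flow_ratio:.1f}x  "
      f"share of mainline time in flow: {flow_share:.1%}")
```

The whole compile comes out roughly 71 times slower, while the flow pass by itself is more than 100 times slower, which matches the 99% share printed by cc1.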
Lately I've been compiling code like this with 2.95.1 with -O2 since
Marc Feeley was kind enough to change the Gambit-C code generator for
floating-point arithmetic to get around the problem in gcc's register
allocator for IEEE floating-point on the 21264. (BTW, I've been getting
tremendous code on my 21264; the bottleneck now is memory access time, not
the actual FP operations.) So I was somewhat worried that there might be
several places in the current gcc with this kind of performance hit.
But with -O2, things are not significantly worse than -O1 (which is
bad enough):
popov-78% time /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O2 _std.i
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___std {GC 27822k -> 8578k in 0.188} {GC 11787k -> 8943k in 0.208} {GC 14947k -> 9413k in 0.227} {GC 15181k -> 9818k in 0.239} ___init_proc {GC 21431k -> 1711k in 0.035} {GC 5696k -> 1810k in 0.052} ____20___std
time in parse: 4.316848 (0%)
time in integration: 0.000976 (0%)
time in jump: 14.361840 (1%)
time in cse: 3.089040 (0%)
time in gcse: 2.694736 (0%)
time in loop: 0.223504 (0%)
time in cse2: 2.815760 (0%)
time in flow: 1501.634560 (95%)
time in combine: 2.682048 (0%)
time in regmove: 0.750544 (0%)
time in sched: 8.382864 (1%)
time in local-alloc: 2.263344 (0%)
time in global-alloc: 4.171424 (0%)
time in flow2: 7.216544 (0%)
time in peephole2: 0.036112 (0%)
time in sched2: 6.834928 (0%)
time in shorten-branch: 0.120048 (0%)
time in final: 1.255136 (0%)
time in varconst: 0.009760 (0%)
time in gc: 0.949648 (0%)
1574.513u 2.770s 26:21.30 99.7% 0+0k 0+0io 661pf+0w
The profile information for -O2 tells almost nothing new; here are all
the procedures that take longer than parse:
Flat profile:
Each sample counts as 0.000976562 seconds.
     % cumulative    self              self     total
  time   seconds  seconds    calls  ms/call   ms/call  name
 85.21    268.42   268.42   738118     0.36      0.36  sbitmap_intersection_of_preds
  1.86    274.28     5.86    12070     0.49      0.55  compute_block_backward_dependences
  1.08    277.67     3.39   738121     0.00      0.00  sbitmap_a_and_b
  1.03    280.90     3.23       36    89.74     89.74  mark_critical_edges
  0.72    283.18     2.28                              htab_traverse
  0.61    285.11     1.93        3   642.25  91266.58  compute_flow_dominators
  0.55    286.85     1.74  7901172     0.00      0.00  bitmap_operation
  0.47    288.32     1.47       18    81.43     82.04  delete_unreachable_blocks
  0.44    289.72     1.40        3   466.47  91953.93  flow_loops_find
  0.43    291.06     1.34        6   223.63    538.83  calculate_global_regs_live
  0.41    292.34     1.28  1879299     0.00      0.00  rtx_renumbered_equal_p
  0.41    293.62     1.28    24143     0.05      0.05  count_or_remove_death_notes
  0.27    294.47     0.85  6712608     0.00      0.00  make_label_edge
  0.23    295.20     0.73   947453     0.00      0.00  find_cross_jump
  0.19    295.80     0.60   430642     0.00      0.00  constrain_operands
  0.18    296.37     0.57  6734272     0.00      0.00  make_edge
  0.17    296.91     0.55     5850     0.09      0.09  clear_table
  0.14    297.36     0.45  8637288     0.00      0.00  find_reg_note
  0.11    297.69     0.33  1425216     0.00      0.00  ggc_alloc_obj
  0.10    298.02     0.33        1   328.12    328.13  flow_depth_first_order_compute
  0.10    298.33     0.31       18    17.36    103.93  make_edges
  0.10    298.64     0.31      724     0.43      0.43  flow_loop_exits_find
  0.10    298.94     0.30        1   301.76 311841.42  yyparse
...
-----------------------------------------------
                1.93  271.87       3/3           flow_loops_find [7]
[8]     86.9    1.93  271.87       3         compute_flow_dominators [8]
              268.42    0.04  738118/738118      sbitmap_intersection_of_preds [9]
                3.39    0.00  738121/738121      sbitmap_a_and_b [15]
                0.02    0.00       3/14          sbitmap_vector_alloc [206]
                0.00    0.00       3/3           sbitmap_vector_ones [732]
                0.00    0.00       3/11          sbitmap_vector_zero [761]
                0.00    0.00       3/72575       sbitmap_zero [700]
-----------------------------------------------
              268.42    0.04  738118/738118      compute_flow_dominators [8]
[9]     85.2  268.42    0.04  738118         sbitmap_intersection_of_preds [9]
                0.04    0.00  738118/738118      sbitmap_copy [274]
-----------------------------------------------
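The call graph above shows compute_flow_dominators spending essentially all of its time in sbitmap_intersection_of_preds. For readers unfamiliar with this kind of computation, here is a minimal sketch of the textbook iterative bit-vector dominator algorithm. It illustrates the general technique only and is not taken from the GCC source; the names compute_dominators, preds, and diamond are hypothetical, and Python integers stand in for bit vectors.

```python
def compute_dominators(preds, entry=0):
    """Textbook iterative dominator computation on bit vectors.

    preds[b] lists the predecessor blocks of block b.  Returns
    (dom, intersections): bit i of dom[b] is set when block i
    dominates block b, and intersections counts how many times the
    dominator sets of a block's predecessors were intersected.
    """
    n = len(preds)
    all_blocks = (1 << n) - 1
    dom = [all_blocks] * n        # start from "everything dominates"
    dom[entry] = 1 << entry       # the entry is dominated only by itself
    intersections = 0
    changed = True
    while changed:                # repeat until a full pass changes nothing
        changed = False
        for b in range(n):
            if b == entry:
                continue
            new = all_blocks
            for p in preds[b]:    # intersect the sets of all predecessors
                new &= dom[p]
            intersections += 1
            new |= 1 << b         # every block dominates itself
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom, intersections


# Diamond-shaped CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
diamond = [[], [0], [0], [1, 2]]
dom, count = compute_dominators(diamond)
print([bin(d) for d in dom], count)
```

In a scheme like this, every pass performs one predecessor intersection per block, and each intersection touches a bit vector as long as the number of blocks, so the work per pass can grow roughly with the square of the block count. Whether that is what makes _std.i so slow is not established by the profile alone.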
Brad Lucier