This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
New performance measurements
- From: Richard Guenther <rguenth at tat dot physik dot uni-tuebingen dot de>
- To: gcc at gcc dot gnu dot org
- Date: Tue, 27 Jan 2004 11:01:49 +0100 (CET)
- Subject: New performance measurements
Hi!
After Mark's improvement to g++ performance I did another profiled run
of my POOMA testcase. This time on ia64, and it looks a lot better.
g++ (GCC) 3.4.0 20040127 (prerelease)
But there is one confusing entry:
Flat profile:
Each sample counts as 0.000976562 seconds.
   %  cumulative     self               self    total
  time   seconds  seconds     calls   s/call   s/call  name
  4.94     13.93    13.93  74858479     0.00     0.00  ggc_alloc
  4.17     25.69    11.76 200210137     0.00     0.00  ggc_set_mark
  3.76     36.29    10.61   4636253     0.00     0.00  gt_ggc_mx_lang_tree_node
  3.24     45.42     9.13    100014     0.00     0.00  dfa_clean_insn_cache
  2.37     52.13     6.70   3649669     0.00     0.00  walk_tree
  2.32     58.66     6.54 366372947     0.00     0.00  bitmap_set_bit
  2.18     64.81     6.15      1339     0.00     0.03  gcse_main
  1.98     70.40     5.59  11016326     0.00     0.00  propagate_one_insn
  1.89     75.74     5.35                              __umoddi3
  1.84     80.94     5.20  82384068     0.00     0.00  mark_set_1
  1.82     86.08     5.13     19118     0.00     0.00  init_alias_analysis
  1.62     90.64     4.56    117400     0.00     0.00  free_deps
  1.50     94.86     4.22  57231942     0.00     0.00  htab_find_slot_with_hash
  1.23     98.34     3.47 187133886     0.00     0.00  alloc_INSN_LIST
  1.07    101.35     3.01   6554201     0.00     0.00  constrain_operands
dfa_clean_insn_cache is suspiciously high in the profile (I didn't
notice it at all on ia32). Looking at the callers
                0.24    0.00       2678/100014         sched_init [81]
                8.89    0.00      97336/100014         ia64_sched_finish [26]
[47]     3.2    9.13    0.00     100014            dfa_clean_insn_cache [47]
there seems to be an imbalance between init and finish calls!? Maybe there
is something obvious to improve here.
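To put a number on the imbalance, the two caller lines can be reduced to call shares and a ratio. This is only a quick sketch in Python, not anything from GCC; the helper names caller_shares and finish_per_init are made up for illustration, and the figures are copied from the call graph above.

```python
# Callers of dfa_clean_insn_cache, copied from the gprof call graph:
# caller -> (self seconds attributed to this caller, number of calls).
CALLERS = {
    "sched_init": (0.24, 2678),
    "ia64_sched_finish": (8.89, 97336),
}


def caller_shares(callers):
    """Return {caller: fraction of all calls made to the callee}."""
    total_calls = sum(calls for _, calls in callers.values())
    return {name: calls / total_calls for name, (_, calls) in callers.items()}


def finish_per_init(callers):
    """Number of ia64_sched_finish calls per sched_init call."""
    return callers["ia64_sched_finish"][1] / callers["sched_init"][1]


if __name__ == "__main__":
    for name, share in caller_shares(CALLERS).items():
        print(f"{name:20s} {share:6.1%} of calls")
    print(f"finish/init ratio: {finish_per_init(CALLERS):.1f}")
```

With these numbers there are roughly 36 cache cleanups from ia64_sched_finish for every one from sched_init.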
Also, all the ggc stuff at the very top of the profile doesn't make me happy
for a 16GB machine either... (the compilation needs about 1.8GB of RAM).
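For a rough idea of how much the collector costs in total, the three ggc-related rows at the top of the flat profile can simply be added up. A minimal sketch in Python; the grouping of these three functions as "ggc stuff" and the helper name ggc_totals are my assumptions, and the values are taken from the flat profile.

```python
# Top three garbage-collector rows of the flat profile:
# function -> (percent of total time, self seconds).
GGC_ENTRIES = {
    "ggc_alloc": (4.94, 13.93),
    "ggc_set_mark": (4.17, 11.76),
    "gt_ggc_mx_lang_tree_node": (3.76, 10.61),
}


def ggc_totals(entries):
    """Sum the percentage and the self time over the given profile rows."""
    percent = sum(pct for pct, _ in entries.values())
    seconds = sum(secs for _, secs in entries.values())
    return percent, seconds


if __name__ == "__main__":
    percent, seconds = ggc_totals(GGC_ENTRIES)
    print(f"ggc rows: {percent:.2f}% of samples, {seconds:.2f} s self time")
```

That is close to 13% of the profiled time, a bit over 36 seconds, in these three entries alone.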
The next offender would be walk_tree; its callgraph looks like
                               93090617                walk_tree <cycle 1> [32]
                                    728                break_out_target_exprs <cycle 1> [2727]
                                   3122                for_each_template_parm <cycle 1> [2478]
                                   6193                for_each_template_parm_r <cycle 1> [3286]
                                  97953                walk_tree_without_duplicates <cycle 1> [1038]
                                 139286                cxx_unsave_expr_now <cycle 1> [1170]
                                 194929                copy_body <cycle 1> [1552]
                                 198322                cgraph_create_edges <cycle 1> [788]
                                 239054                record_call_1 <cycle 1> [319]
                                 490646                remap_decl <cycle 1> [314]
                                 929116                expand_call_inline <cycle 1> [111]
                                1347782                cp_walk_subtrees <cycle 1> [223]
                0.00    0.00       2538/85141820       optimize_inline_calls [1412]
[32]     5.3    6.70    8.38    3649669+93090617   walk_tree <cycle 1> [32]
                0.35    5.63   17440519/17440519       cp_unsave_r [66]
                0.54    0.90   43519924/52040240       htab_find_slot [132]
                0.69    0.00   51452506/53622979       first_rtl_op [227]
                0.03    0.10     951918/951918         inline_forbidden_p_1 [593]
                0.05    0.00    1622033/1622033        calls_setjmp_r [866]
                0.05    0.00    3669705/31134895       cp_is_overload_p [323]
                0.02    0.00     593309/593309         c_estimate_num_insns_1 [1228]
                0.01    0.00     245861/245861         no_linkage_helper [1417]
                0.01    0.00     117465/117465         find_reachable_label_1 [1555]
                0.00    0.00       7355/7355           bot_replace [2723]
                0.00    0.00       2424/2424           local_variable_p_walkfn [2801]
                0.00    0.00       6917/6917           nullify_returns_r [3303]
                               43519924                htab_find_slot_with_hash <cycle 1> [68]
                               37225555                cp_walk_subtrees <cycle 1> [223]
                               31547283                mark_local_for_remap_r <cycle 1> [163]
                               16823873                copy_body_r <cycle 1> [48]
                               15491406                expand_call_inline <cycle 1> [111]
                                8459533                clear_decl_rtl <cycle 1> [176]
                                7747832                record_call_1 <cycle 1> [319]
                                1106716                simplify_aggr_init_exprs_r <cycle 1> [1152]
                                   9138                for_each_template_parm_r <cycle 1> [3286]
                                   1036                bot_manip <cycle 1> [2615]
                               93090617                walk_tree <cycle 1> [32]
for_each_template_parm is way down now - nice. It seems that for my testcase
inlining consumes the most time here (I have leafify enabled).
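To see where the 8.38 seconds charged to walk_tree's children actually go, the non-cycle child rows can be ranked by self plus children time. A rough sketch in Python; the helper name child_time_shares is hypothetical, the numbers are copied from the call graph above, and children with 0.00 in both columns are left out.

```python
# Non-cycle children of walk_tree: name -> (self seconds, children seconds).
CHILDREN = {
    "cp_unsave_r": (0.35, 5.63),
    "htab_find_slot": (0.54, 0.90),
    "first_rtl_op": (0.69, 0.00),
    "inline_forbidden_p_1": (0.03, 0.10),
    "calls_setjmp_r": (0.05, 0.00),
    "cp_is_overload_p": (0.05, 0.00),
    "c_estimate_num_insns_1": (0.02, 0.00),
    "no_linkage_helper": (0.01, 0.00),
    "find_reachable_label_1": (0.01, 0.00),
}

# Call counts from the primary line "3649669+93090617".
EXTERNAL_CALLS = 3649669
RECURSIVE_CALLS = 93090617


def child_time_shares(children):
    """Return (total seconds, [(name, share)] sorted by descending share)."""
    totals = {name: s + c for name, (s, c) in children.items()}
    total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return total, [(name, secs / total) for name, secs in ranked]


if __name__ == "__main__":
    total, ranked = child_time_shares(CHILDREN)
    print(f"children total: {total:.2f} s")
    for name, share in ranked[:3]:
        print(f"  {name:20s} {share:6.1%}")
    print(f"recursive calls per outside call: "
          f"{RECURSIVE_CALLS / EXTERNAL_CALLS:.1f}")
```

The rows add up to the 8.38 seconds on the primary line, with cp_unsave_r taking about 71% of it. Whether that supports the inlining reading depends on what drives cp_unsave_r, which this excerpt of the call graph does not show.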
For bitmap_set_bit I suppose either it has a very lame implementation for
ia64, or, looking at parts of the callgraph
[...]
                0.05    0.00    2633237/366372947      update_life_info <cycle 10> [16]
                0.15    0.00    8276426/366372947      sched_analyze [33]
                0.15    0.00    8414793/366372947      mark_used_reg [136]
                0.64    0.00   35853193/366372947      mark_set_1 [41]
                1.11    0.00   62294012/366372947      simple_loop_p [65]
                1.21    0.00   67601188/366372947      variable_initial_values [108]
                3.11    0.00  174367076/366372947      sched_analyze_insn [37]
[62]     2.3    6.54    0.00  366372947            bitmap_set_bit [62]
                0.00    0.00      15071/74858479       ggc_alloc [34]
the DFA automaton for ia64 is really costly.
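The caller rows shown above can be turned into call shares to see who is responsible for the 366 million bitmap_set_bit calls. A small sketch in Python; the helper name call_shares is made up for illustration, the counts are copied from the call graph, and since part of the caller list is elided ("[...]") the listed shares do not add up to 100%.

```python
# Total calls to bitmap_set_bit, from its primary line in the call graph.
TOTAL_CALLS = 366372947

# Callers that are visible in the excerpt: name -> number of calls.
LISTED_CALLERS = {
    "update_life_info": 2633237,
    "sched_analyze": 8276426,
    "mark_used_reg": 8414793,
    "mark_set_1": 35853193,
    "simple_loop_p": 62294012,
    "variable_initial_values": 67601188,
    "sched_analyze_insn": 174367076,
}


def call_shares(callers, total):
    """Return {caller: fraction of all calls to the callee}."""
    return {name: calls / total for name, calls in callers.items()}


if __name__ == "__main__":
    shares = call_shares(LISTED_CALLERS, TOTAL_CALLS)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:26s} {share:6.1%}")
    print(f"covered by listed callers: {sum(shares.values()):.1%}")
```

sched_analyze_insn alone accounts for a bit under half of all calls, and the three scheduler and loop related callers at the bottom of the list make up the bulk of the rest.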
Maybe someone can try improving performance and memory consumption for
ia64, too.
Thanks,
Richard.
--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/