This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
3.4 speed improvements
- From: Richard Guenther <rguenth at tat dot physik dot uni-tuebingen dot de>
- To: gcc at gcc dot gnu dot org
- Date: Tue, 2 Mar 2004 15:03:53 +0100 (CET)
- Subject: 3.4 speed improvements
Hi!
I'm looking for possible compile-speed improvements in 3.4 for C++
template metaprograms. For one testcase, an instrumented compiler
currently shows the following profile:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  4.16      5.33     5.33 33367827     0.00     0.00  htab_find_slot_with_hash
  3.83     10.24     4.91  2431135     0.00     0.00  walk_tree
  3.36     14.55     4.31  2469189     0.00     0.00  gt_ggc_mx_lang_tree_node
  2.42     17.65     3.10 50981527     0.00     0.00  ggc_set_mark
  1.87     20.05     2.40 26251310     0.00     0.00  find_empty_slot_for_expand
  1.81     22.37     2.32 30142145     0.00     0.00  ggc_alloc
  1.69     24.54     2.17 20714377     0.00     0.00  comp_template_args
  1.61     26.60     2.06 23261989     0.00     0.00  comptypes
  1.51     28.53     1.93 11930469     0.00     0.00  splay_tree_splay_helper
  1.47     30.42     1.89  1359271     0.00     0.00  cse_insn
  1.44     32.26     1.84     7224     0.00     0.00  init_alias_analysis
  1.36     34.00     1.74 11660381     0.00     0.00  copy_node
  1.22     35.57     1.57  5980276     0.00     0.00  for_each_rtx
  1.14     37.03     1.46 26854739     0.00     0.00  template_args_equal
  0.96     38.26     1.23   107817     0.00     0.00  retrieve_specialization
  0.90     39.42     1.16   118786     0.00     0.00  alloc_page
  0.87     40.53     1.11   411863     0.00     0.00  lookup_template_class
where htab_find_slot_with_hash is on top. A micro-optimization in
walk_tree, replacing the htab_find_slot call with a version specialized
for pointer hashing, gives a 1.5% speedup in (instrumented) compilation,
but the lookup still leads the profile:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  3.68      4.64     4.64 27632814     0.00     0.00  htab_find_slot_pointer
  3.54      9.11     4.47  2431135     0.00     0.00  walk_tree
I don't know if it's worth micro-optimizing the pointer hash.
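For readers unfamiliar with the idea, here is a minimal sketch of what a pointer-specialized set lookup can look like. It is not the patch measured above and not the libiberty hashtab API; the names ptr_set, ptr_hash, ptr_set_init and ptr_set_insert are made up for illustration. The point is only that hashing and comparing the address inline avoids the hash and equality callbacks a generic table goes through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical open-addressing set of pointers (illustration only). */
struct ptr_set {
    const void **slots;  /* NULL marks an empty slot */
    size_t size;         /* number of slots, always a power of two */
    size_t count;        /* number of stored pointers */
};

/* Hash the address itself: drop low bits that alignment usually keeps
   at zero, then multiply by an odd constant to spread the rest. */
static size_t ptr_hash(const void *p)
{
    uintptr_t x = (uintptr_t)p >> 3;
    return (size_t)(x * (uintptr_t)0x9E3779B1u);
}

/* Returns 1 on success, 0 if memory could not be allocated. */
static int ptr_set_init(struct ptr_set *s, size_t size_pow2)
{
    assert(size_pow2 != 0 && (size_pow2 & (size_pow2 - 1)) == 0);
    s->slots = malloc(size_pow2 * sizeof *s->slots);
    if (s->slots == NULL)
        return 0;
    for (size_t i = 0; i < size_pow2; i++)
        s->slots[i] = NULL;
    s->size = size_pow2;
    s->count = 0;
    return 1;
}

static int ptr_set_insert(struct ptr_set *s, const void *p);

/* Double the table and reinsert every stored pointer. */
static int ptr_set_grow(struct ptr_set *s)
{
    struct ptr_set bigger;
    if (!ptr_set_init(&bigger, s->size * 2))
        return 0;
    for (size_t i = 0; i < s->size; i++)
        if (s->slots[i] != NULL)
            ptr_set_insert(&bigger, s->slots[i]);
    free(s->slots);
    *s = bigger;
    return 1;
}

/* Returns 1 if p was already present, 0 if it was newly inserted,
   -1 if the table needed to grow and allocation failed. */
static int ptr_set_insert(struct ptr_set *s, const void *p)
{
    assert(p != NULL);
    /* Keep the load factor at or below 3/4 so probing terminates. */
    if ((s->count + 1) * 4 > s->size * 3 && !ptr_set_grow(s))
        return -1;
    size_t mask = s->size - 1;
    size_t i = ptr_hash(p) & mask;
    while (s->slots[i] != NULL) {
        if (s->slots[i] == p)   /* plain pointer compare, no callback */
            return 1;
        i = (i + 1) & mask;     /* linear probing */
    }
    s->slots[i] = p;
    s->count++;
    return 0;
}
```

A tree walk that must not visit a node twice would call ptr_set_insert on each node and skip the subtree when it returns 1. Whether this beats a tuned generic table in practice depends on the hash quality and the load factor, so it would need measuring.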
Looking further, much of the top of the profile comes from the cycle
containing comp_template_args, comptypes and template_args_equal, for
which some callgraph excerpts show:
                          164236            compparms <cycle 1> [963]
                          419950            standard_conversion <cycle 1> [375]
                          596065            lookup_base_r <cycle 1> [1047]
                          692740            lookup_field_r <cycle 1> [409]
                         1365604            is_properly_derived_from <cycle 1> [1404]
                         3527451            find_substitution <cycle 1> [256]
                        15438289            template_args_equal <cycle 1> [149]
[103]  1.9  2.38  0.00  23261989+89962  comptypes <cycle 1> [103]
                        17557771            ix86_comp_type_attributes <cycle 1> [368]
                        10126344            comp_template_args <cycle 1> [116]
                           94376            resolve_typename_type <cycle 1> [2660]
                           24337            cp_tree_equal <cycle 1> [630]
                            2168            comp_array_types <cycle 1> [3895]
                             463            compparms <cycle 1> [963]
                             211            comp_template_parms <cycle 1> [1825]
                              30            lookup_base <cycle 1> [3259]
                           89962            comptypes <cycle 1> [103]

                         1119011            template_args_equal <cycle 1> [149]
                         2702781            register_specialization <cycle 1> [443]
                         3087042            lookup_template_class <cycle 1> [158]
                         3670234            retrieve_specialization <cycle 1> [169]
                        10126344            comptypes <cycle 1> [103]
[116]  1.6  2.06  0.00  20714377        comp_template_args <cycle 1> [116]
                        26854739            template_args_equal <cycle 1> [149]

                        26854739            comp_template_args <cycle 1> [116]
[149]  1.1  1.40  0.00  26854739        template_args_equal <cycle 1> [149]
                        15438289            comptypes <cycle 1> [103]
                         4214373            cp_tree_equal <cycle 1> [630]
                         1119011            comp_template_args <cycle 1> [116]
I'm far from understanding what is going on here, so maybe someone else
can spot some low-hanging fruit?
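To make the shape of that cycle concrete, here is a toy model of the mutual recursion the callgraph shows: comparing two template instances compares their argument lists, which compares each argument, which compares types again. Everything here is an assumption for illustration (the struct layout, the toy_ names, the counter); it is not how the C++ front end represents or compares types.

```c
#include <stddef.h>
#include <string.h>

/* Toy type representation (hypothetical, not GCC's tree nodes). */
enum kind { SIMPLE_TYPE, TEMPLATE_INSTANCE };

struct type {
    enum kind kind;
    const char *name;           /* type name or template name */
    size_t nargs;               /* number of template arguments */
    const struct type **args;   /* arguments; only types in this toy */
};

/* Counts calls, loosely like the "calls" column of a profile. */
static unsigned long n_comptypes;

static int toy_comptypes(const struct type *a, const struct type *b);

/* One template argument against another: back into type comparison. */
static int toy_template_args_equal(const struct type *a,
                                   const struct type *b)
{
    return toy_comptypes(a, b);
}

/* Two argument lists are equal if they match element by element. */
static int toy_comp_template_args(const struct type *a,
                                  const struct type *b)
{
    if (a->nargs != b->nargs)
        return 0;
    for (size_t i = 0; i < a->nargs; i++)
        if (!toy_template_args_equal(a->args[i], b->args[i]))
            return 0;
    return 1;
}

/* Structural comparison with a pointer-equality shortcut. */
static int toy_comptypes(const struct type *a, const struct type *b)
{
    n_comptypes++;
    if (a == b)
        return 1;
    if (a->kind != b->kind || strcmp(a->name, b->name) != 0)
        return 0;
    if (a->kind == TEMPLATE_INSTANCE)
        return toy_comp_template_args(a, b);
    return 1;
}
```

In this model the cost of one comparison grows with the nesting depth and argument count of the instances unless both sides are the same object, which gives a feel for how deeply nested template metaprograms can drive call counts into the tens of millions.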
The time report shows parsing and name lookup among the top entries; output (instrumented):
Execution times (seconds)
garbage collection    :  17.43 ( 7%) usr   0.00 ( 0%) sys  17.55 ( 7%) wall
callgraph construction:   1.19 ( 0%) usr   0.00 ( 0%) sys   1.19 ( 0%) wall
callgraph optimization:   0.28 ( 0%) usr   0.10 ( 2%) sys   0.39 ( 0%) wall
cfg construction      :   2.07 ( 1%) usr   0.01 ( 0%) sys   2.30 ( 1%) wall
cfg cleanup           :   4.02 ( 2%) usr   0.01 ( 0%) sys   4.09 ( 2%) wall
trivially dead code   :   4.53 ( 2%) usr   0.02 ( 0%) sys   4.92 ( 2%) wall
life analysis         :   6.27 ( 3%) usr   0.00 ( 0%) sys   6.65 ( 3%) wall
life info update      :   2.41 ( 1%) usr   0.01 ( 0%) sys   2.71 ( 1%) wall
alias analysis        :   4.33 ( 2%) usr   0.00 ( 0%) sys   4.46 ( 2%) wall
register scan         :   2.62 ( 1%) usr   0.01 ( 0%) sys   2.78 ( 1%) wall
rebuild jump labels   :   1.57 ( 1%) usr   0.01 ( 0%) sys   1.63 ( 1%) wall
preprocessing         :   1.44 ( 1%) usr   0.26 ( 6%) sys   1.74 ( 1%) wall
parser                :  32.15 (13%) usr   1.08 (24%) sys  33.61 (13%) wall
name lookup           :  14.73 ( 6%) usr   1.10 (24%) sys  16.01 ( 6%) wall
expand                :  26.78 (11%) usr   0.42 ( 9%) sys  28.56 (11%) wall
varconst              :   0.99 ( 0%) usr   0.02 ( 0%) sys   1.02 ( 0%) wall
integration           :  39.83 (16%) usr   0.52 (11%) sys  42.05 (16%) wall
jump                  :   1.14 ( 0%) usr   0.01 ( 0%) sys   1.15 ( 0%) wall
CSE                   :  18.99 ( 8%) usr   0.02 ( 0%) sys  20.14 ( 8%) wall
global CSE            :   8.48 ( 3%) usr   0.01 ( 0%) sys   8.77 ( 3%) wall
...
TOTAL                 : 250.51             4.54           264.72
If anyone is interested, the preprocessed testcase is at
http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/TestSymmetrize.ii.gz
Thanks,
Richard.
--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/