This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



3.4 speed improvements


Hi!

I'm looking for possible compile-speed improvements in 3.4 for C++
template metaprograms.  Currently, for one testcase, an instrumented
compiler has the following profile:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  4.16      5.33     5.33 33367827     0.00     0.00  htab_find_slot_with_hash
  3.83     10.24     4.91  2431135     0.00     0.00  walk_tree
  3.36     14.55     4.31  2469189     0.00     0.00  gt_ggc_mx_lang_tree_node
  2.42     17.65     3.10 50981527     0.00     0.00  ggc_set_mark
  1.87     20.05     2.40 26251310     0.00     0.00  find_empty_slot_for_expand
  1.81     22.37     2.32 30142145     0.00     0.00  ggc_alloc
  1.69     24.54     2.17 20714377     0.00     0.00  comp_template_args
  1.61     26.60     2.06 23261989     0.00     0.00  comptypes
  1.51     28.53     1.93 11930469     0.00     0.00  splay_tree_splay_helper
  1.47     30.42     1.89  1359271     0.00     0.00  cse_insn
  1.44     32.26     1.84     7224     0.00     0.00  init_alias_analysis
  1.36     34.00     1.74 11660381     0.00     0.00  copy_node
  1.22     35.57     1.57  5980276     0.00     0.00  for_each_rtx
  1.14     37.03     1.46 26854739     0.00     0.00  template_args_equal
  0.96     38.26     1.23   107817     0.00     0.00  retrieve_specialization
  0.90     39.42     1.16   118786     0.00     0.00  alloc_page
  0.87     40.53     1.11   411863     0.00     0.00  lookup_template_class

where htab_find_slot_with_hash is on top.  A micro-optimization in
walk_tree, replacing the htab_find_slot call with a version specialized
for pointer hashing, gives a 1.5% speed increase in (instrumented)
compilation, but the lookup still leads the profile:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  3.68      4.64     4.64 27632814     0.00     0.00  htab_find_slot_pointer
  3.54      9.11     4.47  2431135     0.00     0.00  walk_tree

I don't know if it's worth micro-optimizing the pointer hash.
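
To make the idea concrete, here is a minimal sketch of what a table
specialized for pointer keys could look like.  It is not the patch
described above and not libiberty's hashtab; it assumes the table is only
used as a "seen this address before?" set.  All names (ptr_set, ptr_hash,
ptr_set_insert, count_distinct) are hypothetical.  The point is that the
hash and the equality test are inlined instead of going through function
pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pointer set: open addressing with linear probing and a
   power-of-two capacity.  NULL marks an empty slot, so NULL itself
   cannot be stored.  Deletion is not supported in this sketch.  */
struct ptr_set {
  const void **slots;
  size_t capacity;   /* always a power of two */
  size_t count;      /* number of stored pointers */
};

/* Hash the address directly: drop the low (alignment) bits, fold in some
   higher bits, and mask to the table size.  */
static size_t ptr_hash(const void *p, size_t capacity)
{
  uintptr_t v = (uintptr_t) p >> 3;
  v ^= v >> 9;
  return (size_t) v & (capacity - 1);
}

/* Returns 1 on success, 0 on allocation failure.  */
static int ptr_set_init(struct ptr_set *s, size_t capacity_pow2)
{
  assert(capacity_pow2 != 0 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
  s->slots = malloc(capacity_pow2 * sizeof *s->slots);
  if (s->slots == NULL)
    return 0;
  for (size_t i = 0; i < capacity_pow2; i++)
    s->slots[i] = NULL;
  s->capacity = capacity_pow2;
  s->count = 0;
  return 1;
}

static int ptr_set_insert(struct ptr_set *s, const void *p);

/* Double the capacity and reinsert every stored pointer.  */
static int ptr_set_grow(struct ptr_set *s)
{
  struct ptr_set bigger;
  if (!ptr_set_init(&bigger, s->capacity * 2))
    return 0;
  for (size_t i = 0; i < s->capacity; i++)
    if (s->slots[i] != NULL)
      ptr_set_insert(&bigger, s->slots[i]);
  free(s->slots);
  *s = bigger;
  return 1;
}

/* Returns 1 if P was newly inserted, 0 if it was already present,
   and -1 if the table could not grow.  */
static int ptr_set_insert(struct ptr_set *s, const void *p)
{
  /* Keep the load factor at or below 3/4.  */
  if ((s->count + 1) * 4 > s->capacity * 3 && !ptr_set_grow(s))
    return -1;
  size_t i = ptr_hash(p, s->capacity);
  while (s->slots[i] != NULL) {
    if (s->slots[i] == p)
      return 0;                       /* plain pointer comparison */
    i = (i + 1) & (s->capacity - 1);  /* linear probe */
  }
  s->slots[i] = p;
  s->count++;
  return 1;
}

/* Usage sketch: count the distinct addresses in an array.  */
static size_t count_distinct(const void *const *ptrs, size_t n)
{
  struct ptr_set s;
  if (!ptr_set_init(&s, 8))
    return 0;
  for (size_t i = 0; i < n; i++)
    ptr_set_insert(&s, ptrs[i]);
  size_t distinct = s.count;
  free(s.slots);
  return distinct;
}
```

Whether something like this beats the generic table by more than the 1.5%
measured above would have to be checked by profiling; the sketch only
shows where the indirect calls go away.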


Looking further, many of the top profile entries belong to the call cycle
containing comp_template_args, comptypes and template_args_equal.  Some
callgraph excerpts:

                              164236             compparms <cycle 1> [963]
                              419950             standard_conversion <cycle 1> [375]
                              596065             lookup_base_r <cycle 1> [1047]
                              692740             lookup_field_r <cycle 1> [409]
                             1365604             is_properly_derived_from <cycle 1> [1404]
                             3527451             find_substitution <cycle 1> [256]
                             15438289             template_args_equal <cycle 1> [149]
[103]    1.9    2.38    0.00 23261989+89962   comptypes <cycle 1> [103]
                             17557771             ix86_comp_type_attributes <cycle 1> [368]
                             10126344             comp_template_args <cycle 1> [116]
                               94376             resolve_typename_type <cycle 1> [2660]
                               24337             cp_tree_equal <cycle 1> [630]
                                2168             comp_array_types <cycle 1> [3895]
                                 463             compparms <cycle 1> [963]
                                 211             comp_template_parms <cycle 1> [1825]
                                  30             lookup_base <cycle 1> [3259]
                               89962             comptypes <cycle 1> [103]


                             1119011             template_args_equal <cycle 1> [149]
                             2702781             register_specialization <cycle 1> [443]
                             3087042             lookup_template_class <cycle 1> [158]
                             3670234             retrieve_specialization <cycle 1> [169]
                             10126344             comptypes <cycle 1> [103]
[116]    1.6    2.06    0.00 20714377         comp_template_args <cycle 1> [116]
                             26854739             template_args_equal <cycle 1> [149]


                             26854739             comp_template_args <cycle 1> [116]
[149]    1.1    1.40    0.00 26854739         template_args_equal <cycle 1> [149]
                             15438289             comptypes <cycle 1> [103]
                             4214373             cp_tree_equal <cycle 1> [630]
                             1119011             comp_template_args <cycle 1> [116]

I'm far from understanding what is going on here, so maybe someone else
can spot some low-hanging fruit?
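
For readers unfamiliar with the cycle, here is a toy model of the
recursion pattern the excerpts suggest: comptypes calls
comp_template_args, which calls template_args_equal once per argument,
which calls comptypes again.  This is not GCC's code.  The struct
toy_type, the toy_ functions, the counters, and the pointer-equality
shortcuts are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a type node; NOT GCC's tree representation.  */
struct toy_type {
  int code;                      /* kind of type */
  size_t nargs;                  /* number of template arguments */
  const struct toy_type **args;  /* template arguments (types only here) */
};

/* Call counters, mimicking the "calls" column of the profile.  */
static unsigned long n_comptypes, n_comp_template_args, n_template_args_equal;

static int toy_comptypes(const struct toy_type *a, const struct toy_type *b);

/* Compare one pair of template arguments.  */
static int toy_template_args_equal(const struct toy_type *a,
                                   const struct toy_type *b)
{
  n_template_args_equal++;
  if (a == b)
    return 1;                    /* same node: no recursion needed */
  return toy_comptypes(a, b);    /* otherwise compare structurally */
}

/* Compare two template argument lists element by element.  */
static int toy_comp_template_args(const struct toy_type *a,
                                  const struct toy_type *b)
{
  assert(a != NULL && b != NULL);
  n_comp_template_args++;
  if (a->nargs != b->nargs)
    return 0;
  for (size_t i = 0; i < a->nargs; i++)
    if (!toy_template_args_equal(a->args[i], b->args[i]))
      return 0;
  return 1;
}

/* Compare two types; this re-enters the cycle for template arguments.  */
static int toy_comptypes(const struct toy_type *a, const struct toy_type *b)
{
  n_comptypes++;
  if (a == b)
    return 1;
  if (a->code != b->code)
    return 0;
  return toy_comp_template_args(a, b);
}

/* Compare wrap<vec<leaf>, vec<leaf>> against a structurally identical
   copy whose vec<leaf> node is a different object.  */
static int toy_demo(void)
{
  const struct toy_type leaf = { 1, 0, NULL };
  const struct toy_type *inner_args[] = { &leaf };
  const struct toy_type inner1 = { 2, 1, inner_args };
  const struct toy_type inner2 = { 2, 1, inner_args };
  const struct toy_type *outer1_args[] = { &inner1, &inner1 };
  const struct toy_type *outer2_args[] = { &inner2, &inner2 };
  const struct toy_type outer1 = { 3, 2, outer1_args };
  const struct toy_type outer2 = { 3, 2, outer2_args };

  n_comptypes = n_comp_template_args = n_template_args_equal = 0;
  return toy_comptypes(&outer1, &outer2);
}
```

In this toy, one top-level comparison already causes 3 comptypes calls,
3 comp_template_args calls and 4 template_args_equal calls, because the
two vec<leaf> nodes are distinct objects and have to be compared
structurally each time they are met.  If equal types were always
represented by the same node, the pointer check alone would answer.
Whether that explains the numbers above is a guess, not a finding.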

The time report (instrumented compiler) shows parsing and name lookup
among the top entries:

Execution times (seconds)
 garbage collection    :  17.43 ( 7%) usr   0.00 ( 0%) sys  17.55 ( 7%) wall
 callgraph construction:   1.19 ( 0%) usr   0.00 ( 0%) sys   1.19 ( 0%) wall
 callgraph optimization:   0.28 ( 0%) usr   0.10 ( 2%) sys   0.39 ( 0%) wall
 cfg construction      :   2.07 ( 1%) usr   0.01 ( 0%) sys   2.30 ( 1%) wall
 cfg cleanup           :   4.02 ( 2%) usr   0.01 ( 0%) sys   4.09 ( 2%) wall
 trivially dead code   :   4.53 ( 2%) usr   0.02 ( 0%) sys   4.92 ( 2%) wall
 life analysis         :   6.27 ( 3%) usr   0.00 ( 0%) sys   6.65 ( 3%) wall
 life info update      :   2.41 ( 1%) usr   0.01 ( 0%) sys   2.71 ( 1%) wall
 alias analysis        :   4.33 ( 2%) usr   0.00 ( 0%) sys   4.46 ( 2%) wall
 register scan         :   2.62 ( 1%) usr   0.01 ( 0%) sys   2.78 ( 1%) wall
 rebuild jump labels   :   1.57 ( 1%) usr   0.01 ( 0%) sys   1.63 ( 1%) wall
 preprocessing         :   1.44 ( 1%) usr   0.26 ( 6%) sys   1.74 ( 1%) wall
 parser                :  32.15 (13%) usr   1.08 (24%) sys  33.61 (13%) wall
 name lookup           :  14.73 ( 6%) usr   1.10 (24%) sys  16.01 ( 6%) wall
 expand                :  26.78 (11%) usr   0.42 ( 9%) sys  28.56 (11%) wall
 varconst              :   0.99 ( 0%) usr   0.02 ( 0%) sys   1.02 ( 0%) wall
 integration           :  39.83 (16%) usr   0.52 (11%) sys  42.05 (16%) wall
 jump                  :   1.14 ( 0%) usr   0.01 ( 0%) sys   1.15 ( 0%) wall
 CSE                   :  18.99 ( 8%) usr   0.02 ( 0%) sys  20.14 ( 8%) wall
 global CSE            :   8.48 ( 3%) usr   0.01 ( 0%) sys   8.77 ( 3%) wall
...
 TOTAL                 : 250.51             4.54           264.72


If anyone is interested, the preprocessed testcase is at
http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/TestSymmetrize.ii.gz

Thanks,

Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/

