This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



New performance measurements


Hi!

After Mark's improvement to g++ performance, I did another profiled run
of my POOMA testcase, this time on ia64 with
g++ (GCC) 3.4.0 20040127 (prerelease)
It looks a lot better, but there is one confusing entry:

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  4.94     13.93    13.93 74858479     0.00     0.00  ggc_alloc
  4.17     25.69    11.76 200210137     0.00     0.00  ggc_set_mark
  3.76     36.29    10.61  4636253     0.00     0.00  gt_ggc_mx_lang_tree_node
  3.24     45.42     9.13   100014     0.00     0.00  dfa_clean_insn_cache
  2.37     52.13     6.70  3649669     0.00     0.00  walk_tree
  2.32     58.66     6.54 366372947     0.00     0.00  bitmap_set_bit
  2.18     64.81     6.15     1339     0.00     0.03  gcse_main
  1.98     70.40     5.59 11016326     0.00     0.00  propagate_one_insn
  1.89     75.74     5.35                             __umoddi3
  1.84     80.94     5.20 82384068     0.00     0.00  mark_set_1
  1.82     86.08     5.13    19118     0.00     0.00  init_alias_analysis
  1.62     90.64     4.56   117400     0.00     0.00  free_deps
  1.50     94.86     4.22 57231942     0.00     0.00  htab_find_slot_with_hash
  1.23     98.34     3.47 187133886     0.00     0.00  alloc_INSN_LIST
  1.07    101.35     3.01  6554201     0.00     0.00  constrain_operands


dfa_clean_insn_cache is suspiciously high in the profile (I didn't
notice it at all on ia32).  Looking at the callers

                0.24    0.00    2678/100014      sched_init [81]
                8.89    0.00   97336/100014      ia64_sched_finish [26]
[47]     3.2    9.13    0.00  100014         dfa_clean_insn_cache [47]

there seems to be an imbalance between the init and finish calls: 97336 of
the 100014 calls come from ia64_sched_finish and only 2678 from sched_init.
Maybe there is something obvious to improve.
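To put numbers on that imbalance, here is a minimal sketch that recomputes each caller's share from the two caller rows quoted above. The figures are copied from the call graph; the variable names are only illustrative.

```python
# Caller rows for dfa_clean_insn_cache, copied from the gprof call graph above.
# Each value is (seconds attributed to this caller, calls made by this caller).
dfa_callers = {
    "sched_init": (0.24, 2678),
    "ia64_sched_finish": (8.89, 97336),
}

total_secs = sum(secs for secs, _ in dfa_callers.values())
total_calls = sum(calls for _, calls in dfa_callers.values())

for name, (secs, calls) in dfa_callers.items():
    print(f"{name:20s} {calls:7d} calls ({calls / total_calls:6.1%}), "
          f"{secs:5.2f} s ({secs / total_secs:6.1%})")

# Number of cleanups requested at finish time for each one requested at init time.
finish_per_init = dfa_callers["ia64_sched_finish"][1] / dfa_callers["sched_init"][1]
print(f"finish/init call ratio: {finish_per_init:.1f}")
```

The two rows add up to the 100014 calls and 9.13 seconds on the primary line, so the breakdown is complete for this function.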

Also, seeing all the ggc functions at the top of the profile doesn't make
me happy on a 16GB machine either (the compilation needs about 1.8GB of RAM).

The next offender would be walk_tree.  Its call graph looks like

                             93090617             walk_tree <cycle 1> [32]
                                 728             break_out_target_exprs <cycle 1> [2727]
                                3122             for_each_template_parm <cycle 1> [2478]
                                6193             for_each_template_parm_r <cycle 1> [3286]
                               97953             walk_tree_without_duplicates <cycle 1> [1038]
                              139286             cxx_unsave_expr_now <cycle 1> [1170]
                              194929             copy_body <cycle 1> [1552]
                              198322             cgraph_create_edges <cycle 1> [788]
                              239054             record_call_1 <cycle 1> [319]
                              490646             remap_decl <cycle 1> [314]
                              929116             expand_call_inline <cycle 1> [111]
                             1347782             cp_walk_subtrees <cycle 1> [223]
                0.00    0.00    2538/85141820     optimize_inline_calls [1412]
[32]     5.3    6.70    8.38 3649669+93090617 walk_tree <cycle 1> [32]
                0.35    5.63 17440519/17440519     cp_unsave_r [66]
                0.54    0.90 43519924/52040240     htab_find_slot [132]
                0.69    0.00 51452506/53622979     first_rtl_op [227]
                0.03    0.10  951918/951918      inline_forbidden_p_1 [593]
                0.05    0.00 1622033/1622033     calls_setjmp_r [866]
                0.05    0.00 3669705/31134895     cp_is_overload_p [323]
                0.02    0.00  593309/593309      c_estimate_num_insns_1 [1228]
                0.01    0.00  245861/245861      no_linkage_helper [1417]
                0.01    0.00  117465/117465      find_reachable_label_1 [1555]
                0.00    0.00    7355/7355        bot_replace [2723]
                0.00    0.00    2424/2424        local_variable_p_walkfn [2801]
                0.00    0.00    6917/6917        nullify_returns_r [3303]
                             43519924             htab_find_slot_with_hash <cycle 1> [68]
                             37225555             cp_walk_subtrees <cycle 1> [223]
                             31547283             mark_local_for_remap_r <cycle 1> [163]
                             16823873             copy_body_r <cycle 1> [48]
                             15491406             expand_call_inline <cycle 1> [111]
                             8459533             clear_decl_rtl <cycle 1> [176]
                             7747832             record_call_1 <cycle 1> [319]
                             1106716             simplify_aggr_init_exprs_r <cycle 1> [1152]
                                9138             for_each_template_parm_r <cycle 1> [3286]
                                1036             bot_manip <cycle 1> [2615]
                             93090617             walk_tree <cycle 1> [32]

for_each_template_parm is way down now, which is nice.  For my testcase,
inlining seems to consume most of the time here (I have leafify enabled).
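As a rough cross-check of the walk_tree entry, the sketch below sums the caller counts listed above it and compares the result with the "3649669+93090617" figure on the primary line. It assumes the usual gprof reading of that field, namely calls from other functions plus recursive calls; the dictionary name is only illustrative.

```python
# Callers of walk_tree, copied from the call graph above: name -> call count.
# All but optimize_inline_calls are members of <cycle 1>.
walk_tree_callers = {
    "break_out_target_exprs": 728,
    "for_each_template_parm": 3122,
    "for_each_template_parm_r": 6193,
    "walk_tree_without_duplicates": 97953,
    "cxx_unsave_expr_now": 139286,
    "copy_body": 194929,
    "cgraph_create_edges": 198322,
    "record_call_1": 239054,
    "remap_decl": 490646,
    "expand_call_inline": 929116,
    "cp_walk_subtrees": 1347782,
    "optimize_inline_calls": 2538,
}

# The two halves of "3649669+93090617" on the walk_tree primary line.
NONRECURSIVE_CALLS = 3649669
RECURSIVE_CALLS = 93090617

calls_from_others = sum(walk_tree_callers.values())
recursion_per_entry = RECURSIVE_CALLS / NONRECURSIVE_CALLS

print("calls from other functions:", calls_from_others)
print(f"recursive calls per outside call: {recursion_per_entry:.1f}")

# Largest callers by call count, for a quick look at who drives the walks.
for name, calls in sorted(walk_tree_callers.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{name:30s} {calls / calls_from_others:6.1%}")
```

Call counts say nothing about time spent inside each walk, so this only shows who starts the walks, not which caller is the most expensive.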

For bitmap_set_bit I suppose that either it has a very poor implementation
on ia64 or, looking at parts of the call graph

                [...]
                0.05    0.00 2633237/366372947     update_life_info <cycle 10> [16]
                0.15    0.00 8276426/366372947     sched_analyze [33]
                0.15    0.00 8414793/366372947     mark_used_reg [136]
                0.64    0.00 35853193/366372947     mark_set_1 [41]
                1.11    0.00 62294012/366372947     simple_loop_p [65]
                1.21    0.00 67601188/366372947     variable_initial_values [108]
                3.11    0.00 174367076/366372947     sched_analyze_insn [37]
[62]     2.3    6.54    0.00 366372947         bitmap_set_bit [62]
                0.00    0.00   15071/74858479     ggc_alloc [34]

the DFA automaton for ia64 is really costly.
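To see where the bitmap_set_bit calls come from, the sketch below computes each listed caller's share of the 366372947 total. The excerpt above is truncated ("[...]"), so the listed rows are not expected to cover everything; the names used here are only illustrative.

```python
# Callers of bitmap_set_bit shown in the (truncated) call graph excerpt above:
# name -> number of calls out of the 366372947 total.
bitmap_callers = {
    "update_life_info": 2633237,
    "sched_analyze": 8276426,
    "mark_used_reg": 8414793,
    "mark_set_1": 35853193,
    "simple_loop_p": 62294012,
    "variable_initial_values": 67601188,
    "sched_analyze_insn": 174367076,
}
BITMAP_TOTAL_CALLS = 366372947

listed_calls = sum(bitmap_callers.values())
bitmap_shares = {name: calls / BITMAP_TOTAL_CALLS
                 for name, calls in bitmap_callers.items()}

for name, share in sorted(bitmap_shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:25s} {share:6.1%}")

# Fraction of all calls explained by the rows that were quoted.
print(f"listed rows cover {listed_calls / BITMAP_TOTAL_CALLS:.1%} of all calls")
```

By call count, sched_analyze_insn alone accounts for a little under half of the calls, which fits the suspicion that the scheduler is the main source of this cost.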

Maybe someone can try improving performance and memory consumption for
ia64, too.

Thanks,

Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/

