This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Please benchmark --param large-function-insns=3000 (Was Re: Some updates on tree-ssa and PR8361)


> On Fri, 9 Jan 2004, Diego Novillo wrote:
> >> Given that you also report a 1% reduction in the size of the
> >> generated binary, I really will try to perform a round of benchmarks
> >> this weekend, comparing 2.95, 3.3.2 and mainline against tree-ssa.
> > Thanks.  The daily build scripts are now able to build DLV, but we also
> > have more RAM now (1Gb).  Low memory machines seem to be getting scarce
> > around here :/
> 
> Perhaps we should give developers slower machines with less memory? :->
> 
> I could not run benchmarks, as tree-ssa currently seems to generate
> incorrect code for DLV (or there is a very intricate bug in the single
> third-party library there, which I can hardly debug), but the time and
> memory consumption figures for PR8361 are interesting in their own right:
> 
>                 -O2 time[s]    -O3 time[s]   memory[MB]
>             ---------------------------------------------
>   3.2.3            50.48           53.64     109 
>   3.3.2            51.88           54.50     142
>   3.3.3-cvs        51.60           54.35     144
>   mainline         63.90           65.77     202
>   tree-ssa         52.14           54.59     216
> 
> In terms of memory consumption, both mainline and tree-ssa have regressed
> by about 50%, though due to the work by Jeff, you, and others, tree-ssa is
> now nearly on the level of mainline.
> 
> In terms of compilation time, tree-ssa is now on the level of previous 3.x
> releases, and in fact faster than mainline.
> 
> Summary:
> 
> - Mainline/3.4 has seriously regressed: 20% time, 50% memory.
> 
> - tree-ssa has improved significantly, and is more or less on par 
>   with mainline.
> 
> - In general, we do need to reduce memory consumption (and compile time).

I've been poking around the mainline -O2 compilation times.  The
problem seems to be that overall more inlining is performed: now that we
do out-of-order inlining, we discover about 40% more inlining sites than
we did without unit-at-a-time.  This is of course good in some way :)

Looking at the profiles, there doesn't seem to be much to obviously cut
down.  I am testing a speedup to aliasing that makes about a 2%
difference, but that seems to be all I can do.

Another bottleneck that may be avoidable is for_each_template_parm_r,
which is called 7 million times for no reason apparent to me.  It calls
the callback relatively few times, so perhaps some kind of cache would
solve it, but that accounts for only about 3% of compilation time.
(It is the major reason the hashtable is at the top of the profile:
080f2740 3370      3.8577     cc1plus                  htab_find_slot_with_hash
0007ae60 3102      3.5509     libc.so.6                memset
0832d790 2569      2.9408     cc1plus                  walk_tree
08323dd0 2513      2.8767     cc1plus                  ggc_set_mark
081f0770 2135      2.4440     cc1plus                  gt_ggc_mx_lang_tree_node
083235f0 1731      1.9815     cc1plus                  ggc_alloc
00077d50 1706      1.9529     libc.so.6                index
08309c00 1663      1.9037     cc1plus                  for_each_rtx
082853b0 1443      1.6518     cc1plus                  fixup_var_refs_1
c0257713 1389      1.5900     vmlinux                  acpi_processor_idle
08248da0 1305      1.4939     cc1plus                  cse_insn
080f2f30 1074      1.2294     cc1plus                  splay_tree_splay_helper
c0128a40 1074      1.2294     vmlinux                  do_softirq
082736d0 995       1.1390     cc1plus                  mark_set_1
0828ed30 909       1.0406     cc1plus                  find_loads
080f24c0 825       0.9444     cc1plus                  find_empty_slot_for_expand
08274b50 786       0.8998     cc1plus                  propagate_one_insn
08082cd0 724       0.8288     cc1plus                  init_alias_analysis
08309480 699       0.8002     cc1plus                  rtx_equal_p
082ef790 661       0.7567     cc1plus                  reg_scan_mark_refs
08309eb0 654       0.7487     cc1plus                  note_stores)
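A real fix would of course live in C inside the compiler, but the caching idea itself is simple enough to sketch. The toy below (Node, walk, and the clean set are all hypothetical, not GCC code) memoizes the negative answer "this subtree contains no template parm" keyed on node identity, so repeated walks over the same subtrees do no work:

```python
# Toy illustration of memoizing a tree walk's negative results.
# Nothing here is GCC code; names and structure are made up.

class Node:
    def __init__(self, has_parm=False, children=()):
        self.has_parm = has_parm
        self.children = list(children)

visits = 0      # counts node visits, to show the saving
clean = set()   # ids of subtrees known to contain no template parm

def walk(node):
    """Return True if the subtree contains a 'template parm'."""
    global visits
    if node is None or id(node) in clean:
        return False                 # known clean: skip the whole subtree
    visits += 1
    if node.has_parm or any(walk(c) for c in node.children):
        return True                  # positive results are not cached here
    clean.add(id(node))              # memoize the negative result
    return False

# A chain of five clean nodes: the second walk touches nothing.
tree = Node(children=[Node(children=[Node(children=[Node(children=[Node()])])])])
walk(tree); first = visits
walk(tree); second = visits - first
print(first, second)   # 5 0
```

The same shape would apply in the compiler: a per-tree flag or pointer set consulted before recursing, invalidated whenever the tree is mutated.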

However, the rest of the difference seems to be coming from inherently
quadratic global analyses, like liveness, GCSE and similar beasts,
meaning that we do have huge function bodies.

By passing --param large-function-insns=1000 I can get a 40% speedup
that makes us faster than 3.0.4 (at least for a profiledbootstrap-ed
compiler).  I tested your benchmarks and it does seem to bring some
regressions (up to 9%), but it makes 3.4 consistently faster than 3.0.4.

I've tested --param large-function-insns=3000, which makes little
performance difference (a 4% regression in one benchmark, small
speedups in the other benchmarks, but overall times are comparable)
while still giving over a 20% compile-time speedup, so I am thinking
about changing that default.
Does this seem like a reasonable step?  I would be interested to hear
about any performance regressions caused by this switch.
I plan to do SPEC testing, but I verified that the limit does not
trigger in gcc/perl/vpr at all, so it is unlikely to bring any change
there.
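For anyone who wants to try the switch themselves, the measurement boils down to command lines along these lines (pr8361.cc stands in for whatever testcase you use; -ftime-report produces per-pass tables like the one quoted below):

```shell
# Default inlining limit (10000 insns in this era):
g++ -O2 -ftime-report pr8361.cc -o /dev/null

# Reduced limit under discussion:
g++ -O2 --param large-function-insns=1000 -ftime-report pr8361.cc -o /dev/null
```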

I mistakenly removed the timings for 3000, but the timings with the
default setting (10000) and the reduced one (1000) are:
 garbage collection    :   2.36 ( 6%)  1.91 ( 7%)
 callgraph construction:   0.18 ( 0%)  0.18 ( 1%)
 callgraph optimization:   0.02 ( 0%)  0.02 ( 0%)
 cfg construction      :   0.48 ( 1%)  0.25 ( 1%)
 cfg cleanup           :   0.85 ( 2%)  0.54 ( 2%)
 trivially dead code   :   0.71 ( 2%)  0.40 ( 1%)
 life analysis         :   1.37 ( 4%)  0.83 ( 3%)
 life info update      :   0.85 ( 2%)  0.57 ( 2%)
 alias analysis        :   1.15 ( 3%)  0.90 ( 3%)
 register scan         :   0.52 ( 1%)  0.28 ( 1%)
 rebuild jump labels   :   0.31 ( 1%)  0.17 ( 1%)
 preprocessing         :   0.10 ( 0%)  0.11 ( 0%)
 parser                :   3.85 (11%)  3.93 (15%)
 name lookup           :   2.30 ( 6%)  2.17 ( 8%)
 expand                :   3.40 ( 9%)  2.15 ( 8%)
 varconst              :   0.07 ( 0%)  0.05 ( 0%)
 integration           :   2.71 ( 7%)  2.05 ( 8%)
 jump                  :   0.18 ( 0%)  0.12 ( 0%)
 CSE                   :   3.13 ( 9%)  2.25 ( 8%)
 global CSE            :   2.65 ( 7%)  1.26 ( 5%)
 loop analysis         :   0.40 ( 1%)  0.26 ( 1%)
 bypass jumps          :   0.30 ( 1%)  0.24 ( 1%)
 CSE 2                 :   1.22 ( 3%)  0.93 ( 3%)
 branch prediction     :   0.43 ( 1%)  0.29 ( 1%)
 flow analysis         :   0.04 ( 0%)  0.01 ( 0%)
 combiner              :   0.74 ( 2%)  0.65 ( 2%)
 if-conversion         :   0.15 ( 0%)  0.06 ( 0%)
 regmove               :   0.51 ( 1%)  0.17 ( 1%)
 local alloc           :   0.62 ( 2%)  0.40 ( 1%)
 global alloc          :   1.38 ( 4%)  1.00 ( 4%)
 reload CSE regs       :   0.74 ( 2%)  0.43 ( 2%)
 flow 2                :   0.12 ( 0%)  0.15 ( 1%)
 if-conversion 2       :   0.06 ( 0%)  0.06 ( 0%)
 peephole 2            :   0.17 ( 0%)  0.13 ( 0%)
 rename registers      :   0.19 ( 1%)  0.11 ( 0%)
 scheduling 2          :   0.71 ( 2%)  0.68 ( 3%)
 reorder blocks        :   0.13 ( 0%)  0.10 ( 0%)
 shorten branches      :   0.23 ( 1%)  0.16 ( 1%)
 final                 :   0.33 ( 1%)  0.23 ( 1%)
 symout                :   0.01 ( 0%)  0.01 ( 0%)
 rest of compilation   :   0.75 ( 2%)  0.45 ( 2%)
 TOTAL                 :  36.43       26.68      
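As a throwaway aid for eyeballing pairs of -ftime-report columns like the ones above, something like the following ranks passes by absolute time saved (biggest_wins is a made-up helper, and the regex assumes exactly the quoted layout):

```python
import re

# Match lines shaped like the -ftime-report excerpts quoted above, e.g.
#  " global CSE            :   2.65 ( 7%)  1.26 ( 5%)"
LINE = re.compile(r'^\s*(.+?)\s*:\s*([\d.]+)\s*\(\s*\d+%\)\s*([\d.]+)')

def biggest_wins(report, n=3):
    """Return the n passes with the largest absolute time saving."""
    rows = []
    for line in report.splitlines():
        m = LINE.match(line)
        if m:
            name, before, after = m.group(1), float(m.group(2)), float(m.group(3))
            rows.append((before - after, name))
    rows.sort(reverse=True)
    return [name for _, name in rows[:n]]

sample = """\
 garbage collection    :   2.36 ( 6%)  1.91 ( 7%)
 global CSE            :   2.65 ( 7%)  1.26 ( 5%)
 expand                :   3.40 ( 9%)  2.15 ( 8%)
"""
print(biggest_wins(sample))   # ['global CSE', 'expand', 'garbage collection']
```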

Honza

