Bug 13776 - [4.0/4.1 Regression] Many C++ compile-time regressions for MICO's ORB code
Summary: [4.0/4.1 Regression] Many C++ compile-time regressions for MICO's ORB code
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: tree-ssa
Importance: P2 normal
Target Milestone: 4.0.0
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
Duplicates: 14408
Depends on: 13953 14440 14719 17278 18507
Blocks:
Reported: 2004-01-20 18:39 UTC by Karel Gardas
Modified: 2005-03-02 21:32 UTC (History)
CC List: 9 users

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2004-03-31 19:33:00


Attachments
C example (91.53 KB, text/plain)
2004-03-29 02:35 UTC, Andrew Pinski
ir.ii.bz2 (166.37 KB, application/octet-stream)
2004-10-25 13:09 UTC, Karel Gardas
Top 10 functions for all preprocessed mico files at -O2 (9.43 KB, text/plain)
2004-11-24 23:22 UTC, Steven Bosscher
profile report using shark (8.22 KB, text/plain)
2004-11-25 00:31 UTC, Andrew Pinski
A patch to turn off local-alloc, which buys 5% for ir.cc (540 bytes, patch)
2005-01-01 19:54 UTC, Steven Bosscher

Description Karel Gardas 2004-01-20 18:39:09 UTC
Hello,

there are many C++ compile-time regressions in the tree-ssa branch in comparison with
gcc-3_4-branch. I have tested it on MICO's ORB core sources and sent a more
detailed report to the gcc developer mailing list:
http://gcc.gnu.org/ml/gcc/2004-01/msg01516.html

If you are curious, you can download a tarball of the preprocessed files here:
http://www.mico.org/~karel/orb-ii-gcc35_040120.tar.bz2

Cheers,

Karel
Comment 1 Wolfgang Bangerth 2004-01-20 18:52:49 UTC

*** This bug has been marked as a duplicate of 13775 ***
Comment 2 Karel Gardas 2004-01-20 18:57:39 UTC
Sorry, I don't understand -- this bug report is about a regression in 3.5-tree-ssa,
while 13775 is about a regression in 3.4.0. I thought they should be separate
bug reports for different sets of people (working on different branches). Should I
reopen the bug in this case?
Thanks,
Karel
Comment 3 Dara Hazeghi 2004-01-20 19:03:45 UTC
I think Wolfgang's rationale is that the problem is compilation speed, and fixing that problem will 
fix both bugs. Not sure I agree though...
Comment 4 Karel Gardas 2004-01-20 19:17:28 UTC
Hmm, well, fixing 13775 might also fix 13777, but certainly not this problem, which
is a regression in tree-ssa. So I am reopening this bug, especially since I know that
tree-ssa developers are keen to see such regressions.
Comment 5 Andrew Pinski 2004-01-20 19:29:13 UTC
I am wondering how much of this is due to the current work that was done after the last merge into 
the tree-ssa.
Comment 6 Wolfgang Bangerth 2004-01-20 19:42:11 UTC
My bad, I misread it. Sorry 
  W. 
Comment 7 Mark Mitchell 2004-01-25 21:03:49 UTC
Measurements made in comparison with 3.4.0 040114.
Comment 8 Diego Novillo 2004-03-03 15:06:36 UTC
Subject:  [Fwd: [tree-ssa] 20% compile time regression vs.
	3.4]


Adding to PR notes.  More related C++ compile time regressions.


Diego.

-----Forwarded Message-----
From: Richard Guenther <rguenth@tat.physik.uni-tuebingen.de>
To: gcc@gcc.gnu.org
Subject: [tree-ssa] 20% compile time regression vs. 3.4
Date: Wed, 03 Mar 2004 15:46:01 +0100

Hi!

I thought it was time for another 3.4 vs. tree-ssa compile-time
comparison.  For -O2 compile time we regressed quite a bit (20%).  The
main problem areas are (first 3.4, second tree-ssa):

 garbage collection    :  12.19 ( 7%) usr   0.00 ( 0%) sys  12.50 ( 7%) wall
 garbage collection    :  17.26 ( 8%) usr   0.02 ( 0%) sys  17.45 ( 8%) wall

tree-ssa uses about double the amount of memory

 parser                :  14.59 ( 9%) usr   1.26 (27%) sys  16.41 ( 9%) wall
 parser                :  18.29 ( 8%) usr   1.42 (27%) sys  19.94 ( 9%) wall

I cannot make any sense out of this - are there significant changes to the
parser!?  Maybe a much larger libstdc++?

 integration           :  17.86 (11%) usr   0.29 ( 6%) sys  18.34 (10%) wall
 integration           :  21.62 (10%) usr   0.18 ( 3%) sys  22.19 (10%) wall

probably different inlining choices

and finally some tree-ssa optimizer numbers stick out

 tree gimplify         :   3.39 ( 2%) usr   0.04 ( 1%) sys   3.48 ( 1%) wall
 tree eh               :   2.71 ( 1%) usr   0.01 ( 0%) sys   2.77 ( 1%) wall
 tree CFG construction :   1.69 ( 1%) usr   0.12 ( 2%) sys   1.87 ( 1%) wall
 tree CFG cleanup      :   2.89 ( 1%) usr   0.02 ( 0%) sys   2.98 ( 1%) wall
 tree PTA              :   0.49 ( 0%) usr   0.03 ( 1%) sys   0.52 ( 0%) wall
 tree alias analysis   :   0.71 ( 0%) usr   0.01 ( 0%) sys   0.75 ( 0%) wall
 tree PHI insertion    :   2.14 ( 1%) usr   0.04 ( 1%) sys   2.25 ( 1%) wall
 tree SSA rewrite      :   2.94 ( 1%) usr   0.01 ( 0%) sys   3.03 ( 1%) wall
 tree SSA other        :   3.77 ( 2%) usr   0.33 ( 6%) sys   4.17 ( 2%) wall
 tree operand scan     :   2.95 ( 1%) usr   0.46 ( 8%) sys   3.51 ( 2%) wall
 dominator optimization:  14.06 ( 6%) usr   0.20 ( 4%) sys  14.60 ( 6%) wall
 tree SRA              :   0.29 ( 0%) usr   0.00 ( 0%) sys   0.31 ( 0%) wall
 tree CCP              :   2.29 ( 1%) usr   0.00 ( 0%) sys   2.39 ( 1%) wall
 tree split crit edges :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.28 ( 0%) wall
 tree PRE              :   6.11 ( 3%) usr   0.06 ( 1%) sys   6.40 ( 3%) wall
 tree linearize phis   :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
 tree forward propagate:   1.37 ( 1%) usr   0.00 ( 0%) sys   1.42 ( 1%) wall
 tree conservative DCE :   2.71 ( 1%) usr   0.02 ( 0%) sys   2.80 ( 1%) wall
 tree aggressive DCE   :   1.40 ( 1%) usr   0.00 ( 0%) sys   1.45 ( 1%) wall
 tree DSE              :   3.30 ( 2%) usr   0.03 ( 1%) sys   3.42 ( 1%) wall
 tree copy headers     :   1.80 ( 1%) usr   0.01 ( 0%) sys   1.84 ( 1%) wall
 tree SSA to normal    :   3.18 ( 1%) usr   0.13 ( 2%) sys   3.39 ( 1%) wall
 tree NRV optimization :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 tree rename SSA copies:   0.83 ( 0%) usr   0.03 ( 1%) sys   0.88 ( 0%) wall

namely DOM (again) and PRE.

This is with the famous tramp3d-v2.cpp testcase you can find at
http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/tramp3d-v2.cpp.gz

g++-ssa (GCC) 3.5-tree-ssa 20040303 (merged 20040227)
g++ (GCC) 3.4.0 20040301 (prerelease)

compiled with -O2 -c tramp3d-v2.cpp -Dleafify=fooblah -ftime-report to
disable leafify effects.  The 3.4 compiler was built with profiledbootstrap while
the ssa one was only bootstrapped.  Of course checking was disabled.

Thanks,

Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
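
A rough way to compare two -ftime-report listings like the ones quoted above is to parse each row and divide the user times. The following is only a sketch in Python, not part of GCC; the names ROW, parse_time_report and usr_ratio are made up for illustration, and the sample rows are copied from the figures in this comment.

```python
import re

# One row of -ftime-report output as it appears in this thread, e.g.
#  " parser                :  14.59 ( 9%) usr   1.26 (27%) sys  16.41 ( 9%) wall"
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*"
    r"(?P<usr>\d+\.\d+)\s*\(\s*\d+%\)\s*usr\s+"
    r"(?P<sys>\d+\.\d+)\s*\(\s*\d+%\)\s*sys\s+"
    r"(?P<wall>\d+\.\d+)\s*\(\s*\d+%\)\s*wall\s*$"
)


def parse_time_report(text):
    """Return {pass name: (usr, sys, wall)} for every line that looks like a row."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m.group("name")] = (
                float(m.group("usr")),
                float(m.group("sys")),
                float(m.group("wall")),
            )
    return rows


def usr_ratio(old, new):
    """User-time ratio new/old for passes that appear in both reports."""
    return {
        name: new[name][0] / old[name][0]
        for name in old
        if name in new and old[name][0] > 0
    }


# Rows copied from the 3.4 and tree-ssa figures quoted in this comment.
REPORT_34 = """
 garbage collection    :  12.19 ( 7%) usr   0.00 ( 0%) sys  12.50 ( 7%) wall
 parser                :  14.59 ( 9%) usr   1.26 (27%) sys  16.41 ( 9%) wall
 integration           :  17.86 (11%) usr   0.29 ( 6%) sys  18.34 (10%) wall
"""
REPORT_SSA = """
 garbage collection    :  17.26 ( 8%) usr   0.02 ( 0%) sys  17.45 ( 8%) wall
 parser                :  18.29 ( 8%) usr   1.42 (27%) sys  19.94 ( 9%) wall
 integration           :  21.62 (10%) usr   0.18 ( 3%) sys  22.19 (10%) wall
"""

if __name__ == "__main__":
    ratios = usr_ratio(parse_time_report(REPORT_34), parse_time_report(REPORT_SSA))
    for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
        print(f"{name:24s} {ratio:.2f}x")
```

Passes that exist in only one compiler (the tree-* passes here) are simply skipped by usr_ratio, so they still have to be read off the tree-ssa listing by hand.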

Comment 9 Richard Biener 2004-03-03 15:39:20 UTC
Testcase for the last report can be found attached to PR14408.
Comment 10 Andrew Pinski 2004-03-03 16:27:43 UTC
*** Bug 14408 has been marked as a duplicate of this bug. ***
Comment 11 Richard Biener 2004-03-12 14:20:57 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

Compilation times in the mode that matters to me (leafify enabled) degraded 
by half an order of magnitude:

g++-ssa (GCC) 3.5-tree-ssa 20040311 (merged 20040307)

bellatrix:/tmp$ g++-ssa -O2 -o tramp3d-v2 tramp3d-v2.cpp -static -ftime-report

Execution times (seconds)
  garbage collection    :  38.48 ( 3%) usr   0.15 ( 1%) sys  38.90 ( 3%) wall
  callgraph construction:   1.49 ( 0%) usr   0.00 ( 0%) sys   1.49 ( 0%) wall
  callgraph optimization:   1.54 ( 0%) usr   0.08 ( 0%) sys   1.63 ( 0%) wall
  cfg construction      :   1.35 ( 0%) usr   0.17 ( 1%) sys   1.52 ( 0%) wall
  cfg cleanup           :   5.68 ( 0%) usr   0.03 ( 0%) sys   5.74 ( 0%) wall
  trivially dead code   :   5.11 ( 0%) usr   0.01 ( 0%) sys   5.16 ( 0%) wall
  life analysis         :   9.41 ( 1%) usr   0.05 ( 0%) sys   9.59 ( 1%) wall
  life info update      :   5.78 ( 0%) usr   0.00 ( 0%) sys   5.81 ( 0%) wall
  alias analysis        :   7.50 ( 1%) usr   0.03 ( 0%) sys   7.58 ( 1%) wall
  register scan         :   3.92 ( 0%) usr   0.01 ( 0%) sys   3.95 ( 0%) wall
  rebuild jump labels   :   1.37 ( 0%) usr   0.00 ( 0%) sys   1.41 ( 0%) wall
  preprocessing         :   0.51 ( 0%) usr   0.10 ( 0%) sys   0.64 ( 0%) wall
  parser                :  18.60 ( 2%) usr   1.47 ( 7%) sys  20.14 ( 2%) wall
  name lookup           :   6.55 ( 1%) usr   1.53 ( 7%) sys   8.10 ( 1%) wall
  integration           :  67.76 ( 6%) usr   1.46 ( 7%) sys  69.59 ( 6%) wall
  tree gimplify         :   3.44 ( 0%) usr   0.04 ( 0%) sys   3.50 ( 0%) wall
  tree eh               :   7.64 ( 1%) usr   0.25 ( 1%) sys   7.96 ( 1%) wall
  tree CFG construction :   4.54 ( 0%) usr   0.53 ( 3%) sys   5.07 ( 0%) wall
  tree CFG cleanup      :   9.67 ( 1%) usr   0.08 ( 0%) sys   9.81 ( 1%) wall
  tree PTA              :   1.26 ( 0%) usr   0.05 ( 0%) sys   1.31 ( 0%) wall
  tree alias analysis   :   1.37 ( 0%) usr   0.01 ( 0%) sys   1.38 ( 0%) wall
  tree PHI insertion    :  74.91 ( 6%) usr   0.24 ( 1%) sys  75.62 ( 6%) wall
  tree SSA rewrite      :   7.46 ( 1%) usr   0.21 ( 1%) sys   7.72 ( 1%) wall
  tree SSA other        :  10.67 ( 1%) usr   0.79 ( 4%) sys  11.58 ( 1%) wall
  tree operand scan     :   6.77 ( 1%) usr   0.61 ( 3%) sys   7.41 ( 1%) wall
  dominator optimization:  46.56 ( 4%) usr   1.54 ( 8%) sys  48.39 ( 4%) wall
  tree SRA              :   0.79 ( 0%) usr   0.02 ( 0%) sys   0.83 ( 0%) wall
  tree CCP              :   4.88 ( 0%) usr   0.02 ( 0%) sys   4.97 ( 0%) wall
  tree split crit edges :   0.64 ( 0%) usr   0.06 ( 0%) sys   0.70 ( 0%) wall
  tree PRE              : 583.13 (49%) usr   6.18 (30%) sys 592.99 (48%) wall
  tree linearize phis   :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall
  tree forward propagate:   3.51 ( 0%) usr   0.00 ( 0%) sys   3.53 ( 0%) wall
  tree conservative DCE :   6.95 ( 1%) usr   0.08 ( 0%) sys   7.05 ( 1%) wall
  tree aggressive DCE   :   2.89 ( 0%) usr   0.03 ( 0%) sys   2.93 ( 0%) wall
  tree DSE              :   6.33 ( 1%) usr   0.19 ( 1%) sys   6.56 ( 1%) wall
  tree copy headers     :   5.00 ( 0%) usr   0.05 ( 0%) sys   5.09 ( 0%) wall
  tree SSA to normal    :   9.60 ( 1%) usr   0.42 ( 2%) sys  10.09 ( 1%) wall
  tree rename SSA copies:   2.11 ( 0%) usr   0.05 ( 0%) sys   2.17 ( 0%) wall
  dominance frontiers   :   0.86 ( 0%) usr   0.00 ( 0%) sys   0.89 ( 0%) wall
  control dependences   :   0.49 ( 0%) usr   0.00 ( 0%) sys   0.51 ( 0%) wall
  expand                :  42.02 ( 3%) usr   1.38 ( 7%) sys  43.63 ( 4%) wall
  varconst              :   0.82 ( 0%) usr   0.01 ( 0%) sys   0.83 ( 0%) wall
  jump                  :   8.24 ( 1%) usr   0.36 ( 2%) sys   8.64 ( 1%) wall
  CSE                   :  14.22 ( 1%) usr   0.08 ( 0%) sys  14.39 ( 1%) wall
  global CSE            :  67.80 ( 6%) usr   0.84 ( 4%) sys  69.06 ( 6%) wall
  loop analysis         :  11.90 ( 1%) usr   0.02 ( 0%) sys  12.02 ( 1%) wall
  bypass jumps          :   2.48 ( 0%) usr   0.12 ( 1%) sys   2.60 ( 0%) wall
  CSE 2                 :   6.33 ( 1%) usr   0.04 ( 0%) sys   6.38 ( 1%) wall
  branch prediction     :   8.78 ( 1%) usr   0.03 ( 0%) sys   8.88 ( 1%) wall
  flow analysis         :   0.28 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall
  combiner              :   6.32 ( 1%) usr   0.08 ( 0%) sys   6.46 ( 1%) wall
  if-conversion         :   1.72 ( 0%) usr   0.02 ( 0%) sys   1.74 ( 0%) wall
  regmove               :   2.47 ( 0%) usr   0.00 ( 0%) sys   2.48 ( 0%) wall
  local alloc           :   6.16 ( 1%) usr   0.03 ( 0%) sys   6.25 ( 1%) wall
  global alloc          :  13.97 ( 1%) usr   0.21 ( 1%) sys  14.22 ( 1%) wall
  reload CSE regs       :   5.61 ( 0%) usr   0.07 ( 0%) sys   5.75 ( 0%) wall
  flow 2                :   1.31 ( 0%) usr   0.07 ( 0%) sys   1.41 ( 0%) wall
  if-conversion 2       :   0.90 ( 0%) usr   0.00 ( 0%) sys   0.91 ( 0%) wall
  peephole 2            :   0.94 ( 0%) usr   0.02 ( 0%) sys   0.97 ( 0%) wall
  rename registers      :   1.66 ( 0%) usr   0.07 ( 0%) sys   1.75 ( 0%) wall
  scheduling 2          :   8.51 ( 1%) usr   0.13 ( 1%) sys   8.67 ( 1%) wall
  machine dep reorg     :   1.74 ( 0%) usr   0.00 ( 0%) sys   1.76 ( 0%) wall
  reorder blocks        :   1.06 ( 0%) usr   0.04 ( 0%) sys   1.10 ( 0%) wall
  shorten branches      :   1.87 ( 0%) usr   0.05 ( 0%) sys   1.94 ( 0%) wall
  reg stack             :   0.36 ( 0%) usr   0.00 ( 0%) sys   0.36 ( 0%) wall
  final                 :   2.55 ( 0%) usr   0.18 ( 1%) sys   2.74 ( 0%) wall
  symout                :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
  rest of compilation   :   4.92 ( 0%) usr   0.07 ( 0%) sys   4.99 ( 0%) wall
  TOTAL                 :1201.61            20.47          1229.69

Look at the PRE times!!!

Also, the resulting binary segfaults, so it is miscompiled (for both 
leafify enabled and disabled compilation). Ugh.

For reference, the leafify patch still sits at 
http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/leafify-ssa-2

Building an instrumented compiler now.

Richard.
Comment 12 Richard Biener 2004-03-12 23:34:26 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

Richard Guenther wrote:
> Compilation times in mode that matters to me (leafify enabled) degraded 
> half an order of magnitude:
> 
> g++-ssa (GCC) 3.5-tree-ssa 20040311 (merged 20040307)
> 
> bellatrix:/tmp$ g++-ssa -O2 -o tramp3d-v2 tramp3d-v2.cpp -static 
> -ftime-report

instrumented compiler gives:

Flat profile:

Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  Ks/call  Ks/call  name
  15.51    183.22   183.22   726080     0.00     0.00  process_left_occs_and_kills
   8.55    284.25   101.03    16644     0.00     0.00  create_and_insert_occ_in_preorder_dt_order
   6.72    363.67    79.42   184430     0.00     0.00  compute_global_livein
   6.54    440.93    77.26    16644     0.00     0.00  rename_1
   3.13    477.86    36.93    16644     0.00     0.00  clear_all_eref_arrays
   3.05    513.86    36.00    16644     0.00     0.00  compute_down_safety
   2.49    543.32    29.46 152057062     0.00     0.00  expr_lexically_eq
   2.09    568.00    24.68   201843     0.00     0.00  cgraph_remove_node
   1.89    590.33    22.33   432753     0.00     0.00  alloc_page
   1.70    610.44    20.11                             eref_compare
   1.70    630.47    20.03   158882     0.00     0.00  compute_transp
   1.46    647.70    17.23   416131     0.00     0.00  cgraph_remove_edge
   1.42    664.43    16.73 14808480     0.00     0.00  gt_ggc_mx_lang_tree_node
   1.13    677.82    13.39 226466163     0.00     0.00  ggc_set_mark
   1.07    690.43    12.61     1482     0.00     0.00  collect_expressions
   0.90    701.05    10.62 15658237     0.00     0.00  walk_tree

i.e. PRE seems to do something very stupid?

Richard.
Comment 13 Daniel Berlin 2004-03-13 02:02:18 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120


>
> instrumented compiler gives:
>
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>    %   cumulative   self              self     total
>   time   seconds   seconds    calls  Ks/call  Ks/call  name
>   15.51    183.22   183.22   726080     0.00     0.00  process_left_occs_and_kills


This one is O(n^2) or O(n^3) in the number of vuses.  Known problem; a 
fix is really complicated.


>    8.55    284.25   101.03    16644     0.00     0.00  create_and_insert_occ_in_preorder_dt_order

Hmmmm.
It is attempting to PRE 16644 things.
How many basic blocks do you have?
We shouldn't end up trying to PRE that many expressions, since we 
only try to PRE things that occur at least twice.


>    6.72    363.67    79.42   184430     0.00     0.00  compute_global_livein
>    6.54    440.93    77.26    16644     0.00     0.00  rename_1
>    3.13    477.86    36.93    16644     0.00     0.00  clear_all_eref_arrays
>    3.05    513.86    36.00    16644     0.00     0.00  compute_down_safety
>    2.49    543.32    29.46 152057062     0.00     0.00  expr_lexically_eq
>    2.09    568.00    24.68   201843     0.00     0.00  cgraph_remove_node
>    1.89    590.33    22.33   432753     0.00     0.00  alloc_page
>    1.70    610.44    20.11                             eref_compare
>    1.70    630.47    20.03   158882     0.00     0.00  compute_transp
>    1.46    647.70    17.23   416131     0.00     0.00  cgraph_remove_edge
>    1.42    664.43    16.73 14808480     0.00     0.00  gt_ggc_mx_lang_tree_node
>    1.13    677.82    13.39 226466163     0.00     0.00  ggc_set_mark
>    1.07    690.43    12.61     1482     0.00     0.00  collect_expressions
>    0.90    701.05    10.62 15658237     0.00     0.00  walk_tree
>
> i.e. PRE seems to do something very stupid?
>

You must have an incredibly large number of basic blocks or something, 
or a very weird flowgraph.
How many BB's are we talking about?

I can't fix the algorithmic properties of the SSAPRE algorithm we use, 
which is what you are running into, i'm betting.

I'm working on a new PRE implementation that is O(n^2) memory usage in 
the number of phi nodes, but should be a bit faster overall.

Comment 14 Diego Novillo 2004-03-13 02:08:15 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
	3.5-tree-ssa 040120

On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote:

> I can't fix the algorithmic properties of the SSAPRE algorithm we use, 
> which is what you are running into, i'm betting.
> 
Could we add thresholds to back away from overly complicated functions?


Diego.
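
As an illustration of what such a threshold could look like, here is a minimal sketch in Python. It is not GCC code: FunctionStats, should_run_expensive_pass and both limits are hypothetical names and values chosen only for this example, and it assumes the cost of the pass grows with the product of basic blocks and candidate expressions.

```python
from dataclasses import dataclass


@dataclass
class FunctionStats:
    """Hypothetical per-function summary an optimizer driver might keep."""
    name: str
    n_basic_blocks: int
    n_candidate_exprs: int


# Illustrative limits only; real values would have to be tuned by measurement.
MAX_BLOCKS = 2000
MAX_WORK = 5_000_000  # basic blocks * candidate expressions


def should_run_expensive_pass(fn, max_blocks=MAX_BLOCKS, max_work=MAX_WORK):
    """Back away from functions whose estimated cost exceeds the limits."""
    if fn.n_basic_blocks > max_blocks:
        return False
    return fn.n_basic_blocks * fn.n_candidate_exprs <= max_work


if __name__ == "__main__":
    for fn in (
        FunctionStats("small_kernel", 40, 100),
        FunctionStats("huge_inlined_kernel", 8000, 500),
    ):
        verdict = "run" if should_run_expensive_pass(fn) else "skip"
        print(f"{fn.name}: {verdict}")
```

The trade-off is that skipped functions lose the optimization entirely, so a guard like this only makes sense once it is known which property of the function drives the cost.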

Comment 15 Daniel Berlin 2004-03-13 02:08:32 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> You must have an incredibly large number of basic blocks or something,
> or a very weird flowgraph.
> How many BB's are we talking about?
>
> I can't fix the algorithmic properties of the SSAPRE algorithm we use,
> which is what you are running into, i'm betting.
>
> I'm working on a new PRE implementation that is O(n^2) memory usage in
> the number of phi nodes, but should be a bit faster overall.
>

Regardless, i'll see if i can find a machine with enough memory to look 
at these.

Comment 16 Daniel Berlin 2004-03-13 02:10:19 UTC
(In reply to comment #14)
> Subject: Re:  [tree-ssa] Many C++ compile-time regression in
>         3.5-tree-ssa 040120
> 
> On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote:
> 
> > I can't fix the algorithmic properties of the SSAPRE algorithm we use, 
> > which is what you are running into, i'm betting.
> > 
> Could we add thresholds to back away from overly complicated functions?
> 
> 
> Diego.
> 
> 


I need to know what exactly the properties of these functions are, it's unclear.
As i said, i'm working on it.
Comment 17 Richard Biener 2004-03-13 11:43:38 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at gcc dot gnu dot org wrote:
> ------- Additional Comments From dberlin at gcc dot gnu dot org  2004-03-13 02:10 -------
> (In reply to comment #14)
> 
>>Subject: Re:  [tree-ssa] Many C++ compile-time regression in
>>        3.5-tree-ssa 040120
>>
>>On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote:
>>
>>
>>>I can't fix the algorithmic properties of the SSAPRE algorithm we use, 
>>>which is what you are running into, i'm betting.
>>>
>>
>>Could we add thresholds to back away from overly complicated functions?
>>
>>
>>Diego.
>>
>>
> 
> 
> 
> I need to know what exactly the properties of these functions are, it's unclear.
> As i said, i'm working on it.

Remember you need to patch the compiler to support 
__attribute__((leafify)) to trigger the problem with the tramp3d-v2.cpp 
testcase.  I suspect the huge number of basic blocks comes from inlining, 
since presumably at least one new basic block is inserted per inlined 
function, no?  So with a lot of C++ abstraction inside a leafified 
function you get a lot of basic blocks.  But I suppose a lot of them 
could be eliminated easily?

Richard.

Comment 18 Richard Biener 2004-03-13 11:46:09 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dnovillo at redhat dot com wrote:
> ------- Additional Comments From dnovillo at redhat dot com  2004-03-13 02:08 -------
> Subject: Re:  [tree-ssa] Many C++ compile-time regression in
> 	3.5-tree-ssa 040120
> 
> On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote:
> 
> 
>>I can't fix the algorithmic properties of the SSAPRE algorithm we use, 
>>which is what you are running into, i'm betting.
>>
> 
> Could we add thresholds to back away from overly complicated functions?

Or just "split" them up using some sort of windowing?  It looks clearly wrong 
not to limit an O(n^2) or O(n^3) algorithm.

Richard.
Comment 19 Daniel Berlin 2004-03-13 15:57:47 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120


On Mar 13, 2004, at 6:46 AM, rguenth at tat dot physik dot 
uni-tuebingen dot de wrote:
>>>
>>
>> Could we add thresholds to back away from overly complicated 
>> functions?
>
> Or just "split" them up using sort of windowing?  It looks clearly 
> wrong
> to not limit a O(n^2) or O(n^3) algorithm.
>

It's only collecting expressions that is O(n^2). The other parts of the 
algorithm just have a large constant.

Also, it *is* splitting up the function. It performs PRE one expression 
at a time.

We can't perform it one basic block at a time or anything with the 
current algorithm (and it wouldn't make sense to, because you can't 
find the optimal insertion points).

Comment 20 Daniel Berlin 2004-03-13 16:00:07 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

>
> and a lot more of int d farther away? Also, how are bb's marked? I see 
> <bb 0>: but no more, and some gotos reference <bb 18> and <bb 16> 
> (with a label, too)?
>
> Can I get summaries somehow here?  Or just dump one interesting 
> function rather than all of the program?
>
> Also, how do I dump some stuff about the PRE pass?  Specifying 
> -fdump-tree-pre just dumps the trees after PRE with no information 
> about the PRE pass itself.

-fdump-tree-pre-stats-details. But i already know what it is going to 
show in this case, based on the profile.
I just need other properties of the functions, which i'm attempting to 
get.

Comment 21 Richard Biener 2004-03-13 16:44:02 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

Note that the reported times are a huge regression compared to
g++ (GCC) 3.5-tree-ssa 20040209 (merged 20040126)
which shows

Execution times (seconds)
  garbage collection    :  26.06 ( 7%) usr   0.00 ( 0%) sys  26.06 ( 6%) wall
  callgraph construction:   1.29 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
  callgraph optimization:   1.39 ( 0%) usr   0.05 ( 1%) sys   1.44 ( 0%) wall
  cfg construction      :   1.71 ( 0%) usr   0.02 ( 0%) sys   1.73 ( 0%) wall
  cfg cleanup           :   3.49 ( 1%) usr   0.02 ( 0%) sys   3.51 ( 1%) wall
  trivially dead code   :   2.85 ( 1%) usr   0.01 ( 0%) sys   2.86 ( 1%) wall
  life analysis         :   5.88 ( 1%) usr   0.00 ( 0%) sys   5.88 ( 1%) wall
  life info update      :   2.81 ( 1%) usr   0.00 ( 0%) sys   2.81 ( 1%) wall
  alias analysis        :   4.46 ( 1%) usr   0.02 ( 0%) sys   4.48 ( 1%) wall
  register scan         :   1.96 ( 0%) usr   0.00 ( 0%) sys   1.96 ( 0%) wall
  rebuild jump labels   :   0.88 ( 0%) usr   0.01 ( 0%) sys   0.89 ( 0%) wall
  preprocessing         :   0.61 ( 0%) usr   0.15 ( 3%) sys   0.76 ( 0%) wall
  parser                :  19.47 ( 5%) usr   1.10 (23%) sys  21.16 ( 5%) wall
  name lookup           :  12.05 ( 3%) usr   1.54 (33%) sys  13.75 ( 3%) wall
  integration           :  47.73 (12%) usr   0.14 ( 3%) sys  47.87 (12%) wall
  tree gimplify         :   3.05 ( 1%) usr   0.06 ( 1%) sys   3.19 ( 1%) wall
  tree eh               :   5.32 ( 1%) usr   0.01 ( 0%) sys   5.34 ( 1%) wall
  tree CFG construction :   2.74 ( 1%) usr   0.08 ( 2%) sys   2.82 ( 1%) wall
  tree CFG cleanup      :   6.10 ( 2%) usr   0.00 ( 0%) sys   6.10 ( 2%) wall
  tree alias analysis   :   1.11 ( 0%) usr   0.00 ( 0%) sys   1.11 ( 0%) wall
  tree PHI insertion    :  17.62 ( 4%) usr   0.01 ( 0%) sys  17.63 ( 4%) wall
  tree SSA rewrite      :   5.90 ( 1%) usr   0.02 ( 0%) sys   5.92 ( 1%) wall
  tree SSA other        :  10.18 ( 3%) usr   0.04 ( 1%) sys  10.22 ( 3%) wall
  dominator optimization:  31.18 ( 8%) usr   0.25 ( 5%) sys  31.43 ( 8%) wall
  tree SRA              :   0.42 ( 0%) usr   0.00 ( 0%) sys   0.42 ( 0%) wall
  tree CCP              :   6.99 ( 2%) usr   0.05 ( 1%) sys   7.04 ( 2%) wall
  tree split crit edges :   0.53 ( 0%) usr   0.01 ( 0%) sys   0.54 ( 0%) wall
  tree PRE              :  67.53 (17%) usr   0.08 ( 2%) sys  67.92 (17%) wall
  tree conservative DCE :   5.12 ( 1%) usr   0.01 ( 0%) sys   5.13 ( 1%) wall
  tree aggressive DCE   :   2.41 ( 1%) usr   0.00 ( 0%) sys   2.41 ( 1%) wall
  tree SSA to normal    :   5.69 ( 1%) usr   0.17 ( 4%) sys   5.86 ( 1%) wall
  dominance frontiers   :   0.65 ( 0%) usr   0.00 ( 0%) sys   0.65 ( 0%) wall
  control dependences   :   0.35 ( 0%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall
  expand                :  20.80 ( 5%) usr   0.07 ( 1%) sys  20.88 ( 5%) wall
  varconst              :   0.81 ( 0%) usr   0.04 ( 1%) sys   0.85 ( 0%) wall
  jump                  :   1.72 ( 0%) usr   0.13 ( 3%) sys   1.86 ( 0%) wall
  CSE                   :   8.43 ( 2%) usr   0.00 ( 0%) sys   8.43 ( 2%) wall
  global CSE            :  10.58 ( 3%) usr   0.15 ( 3%) sys  10.74 ( 3%) wall
  loop analysis         :   2.59 ( 1%) usr   0.01 ( 0%) sys   2.60 ( 1%) wall
  bypass jumps          :   1.95 ( 0%) usr   0.03 ( 1%) sys   1.98 ( 0%) wall
  CSE 2                 :   3.57 ( 1%) usr   0.00 ( 0%) sys   3.57 ( 1%) wall
  branch prediction     :   4.66 ( 1%) usr   0.01 ( 0%) sys   4.69 ( 1%) wall
  flow analysis         :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall
  combiner              :   3.53 ( 1%) usr   0.00 ( 0%) sys   3.53 ( 1%) wall
  if-conversion         :   0.92 ( 0%) usr   0.00 ( 0%) sys   0.92 ( 0%) wall
  regmove               :   1.29 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
  local alloc           :   3.36 ( 1%) usr   0.01 ( 0%) sys   3.37 ( 1%) wall
  global alloc          :   7.97 ( 2%) usr   0.13 ( 3%) sys   8.10 ( 2%) wall
  reload CSE regs       :   3.79 ( 1%) usr   0.00 ( 0%) sys   3.79 ( 1%) wall
  flow 2                :   0.78 ( 0%) usr   0.00 ( 0%) sys   0.78 ( 0%) wall
  if-conversion 2       :   0.46 ( 0%) usr   0.00 ( 0%) sys   0.46 ( 0%) wall
  peephole 2            :   0.83 ( 0%) usr   0.01 ( 0%) sys   0.84 ( 0%) wall
  rename registers      :   1.16 ( 0%) usr   0.05 ( 1%) sys   1.21 ( 0%) wall
  scheduling 2          :   4.62 ( 1%) usr   0.06 ( 1%) sys   4.68 ( 1%) wall
  reorder blocks        :   0.73 ( 0%) usr   0.00 ( 0%) sys   0.73 ( 0%) wall
  shorten branches      :   1.16 ( 0%) usr   0.02 ( 0%) sys   1.18 ( 0%) wall
  reg stack             :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall
  final                 :   1.79 ( 0%) usr   0.13 ( 3%) sys   1.92 ( 0%) wall
  symout                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
  rest of compilation   :   2.76 ( 1%) usr   0.02 ( 0%) sys   2.78 ( 1%) wall
  TOTAL                 : 396.23             4.72           402.15

So apparently PRE got a factor of 10 slower!?

Richard.
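
To put numbers on the comparison, the user times from the 20040311 listing in comment 11 and the 20040209 listing above can be divided directly. This is just a worked-arithmetic sketch in Python; the dictionary and function names are made up for illustration, and only a handful of rows are copied over.

```python
# User-time seconds copied from the two -ftime-report listings in this thread.
USR_20040209 = {
    "tree PRE": 67.53,
    "tree PHI insertion": 17.62,
    "global CSE": 10.58,
    "integration": 47.73,
    "TOTAL": 396.23,
}
USR_20040311 = {
    "tree PRE": 583.13,
    "tree PHI insertion": 74.91,
    "global CSE": 67.80,
    "integration": 67.76,
    "TOTAL": 1201.61,
}


def slowdown(old, new):
    """Ratio new/old for every entry of the older report."""
    return {name: new[name] / old[name] for name in old}


def slowdown_without(old, new, excluded):
    """Ratio of the totals after subtracting one pass from both of them."""
    return (new["TOTAL"] - new[excluded]) / (old["TOTAL"] - old[excluded])


if __name__ == "__main__":
    for name, ratio in slowdown(USR_20040209, USR_20040311).items():
        print(f"{name:20s} {ratio:.1f}x")
    rest = slowdown_without(USR_20040209, USR_20040311, "tree PRE")
    print(f"{'TOTAL minus PRE':20s} {rest:.1f}x")
```

By these figures PRE is roughly 8.6 times slower rather than a full factor of 10, the total is about 3 times slower, and everything except PRE taken together is a little under 2 times slower.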
Comment 22 Richard Biener 2004-03-13 16:53:26 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at dberlin dot org wrote:
> ------- Additional Comments From dberlin at dberlin dot org  2004-03-13 16:00 -------
> Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120
> 
> 
>>and a lot more of int d farther away? Also, how are bb's marked? I see 
>><bb 0>: but no more, and some gotos reference <bb 18> and <bb 16> 
>>(with a label, too)?
>>
>>Can I get summaries somehow here?  Or just dump one interesting 
>>function rather than all of the program?
>>
>>Also, how do I dump some stuff about the PRE pass?  Specifying 
>>-fdump-tree-pre just dumps the trees after PRE with no information 
>>about the PRE pass itself.
> 
> 
> -fdump-tree-pre-stats-details. But i already know what it is going to 
> show in this case, based on the profile.
> I just need other properties of the functions, which i'm attempting to 
> get.

I also see we're running PRE before DCE - the functions probably contain 
a lot of dead code - would it be sensible and profitable to move the 
first DCE pass before PRE?  Can this be specified on the command line or 
where would I need to change the source to do this?

Richard.
Comment 23 Daniel Berlin 2004-03-13 17:00:31 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

>
> So appearantly PRE got a factor of 10 slower!?
>
Highly unlikely.
There haven't been any PRE changes in between the two compilers.

Something else changed, like inlining or something.
>
You are likely inlining *way* too much again or something.
--Dan

Comment 24 Daniel Berlin 2004-03-13 17:02:32 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

>
> I also see we're running PRE before DCE - the functions probably 
> contain
>   a lot of dead code - would it be sensible and profitable to move the
> first DCE pass before PRE?

No we aren't.
We run 3 DCE passes before PRE.

NEXT_PASS (pass_build_cfg);
...
NEXT_PASS (pass_dce);
...
NEXT_PASS (DUP_PASS (pass_dce));
...
NEXT_PASS (DUP_PASS (pass_dce));
NEXT_PASS (pass_split_crit_edges);
NEXT_PASS (pass_pre);

>  Can this be specified on the command line or
> where would I need to change the source to do this?
>
> Richard.
>
>
> -- 
>
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13776

Comment 25 Daniel Berlin 2004-03-13 17:07:17 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> So appearantly PRE got a factor of 10 slower!?
>

Note that the other functions got a factor of 3-5 slower too. As I 
said, PRE just has a larger constant, so it's more noticeable.
This tells me something else important changed, probably in cgraph or 
something.
There is little i can do, a lot of the portions wasting time are 
already O(n) (compute_down_safety for example).
The only things to do are (1) reduce the number of expressions we PRE, (2) give 
up PRE entirely on such functions, or (3) change PRE algorithms.

I'm actually working on 2 and 3, rather than 1.
1 is tricky: we already give up on expressions that occur once, which 
makes us lose some load motion.
2 requires figuring out what properties of this function make it 
such a pain in the ass, which is what i'm doing.
3 is being worked on in the background; i'm waiting for Steven to 
get back to get more work done.


Comment 26 Daniel Berlin 2004-03-14 04:47:26 UTC
There are about 100 functions here with more than a couple thousand bb's.
PRE takes about 2-3 seconds for each of these functions.
Which means i have to microoptimize it in order to get rid of the cumulative time effect.
A lot of it is simply iterating over large lists looking for certain types of nodes (like EPHIs), where the 
lists are O(n_basic_blocks), and we only need to look at 10 entries or so.  This doesn't matter when the 
numbers are close, but when you have 8000 bb's to walk 20 times, instead of walking 40 entries 20 
times, it matters.
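
The difference between scanning every block and keeping a side index of the interesting ones can be sketched as follows. This is a toy model in Python, not the tree-ssa-pre data structures: blocks are plain strings, and find_ephis_by_scan and IndexedBlocks are hypothetical names used only for illustration.

```python
def find_ephis_by_scan(blocks):
    """Walk every block on each query; cost is proportional to len(blocks)."""
    return [i for i, kind in enumerate(blocks) if kind == "ephi"]


class IndexedBlocks:
    """Keeps the positions of the interesting blocks in a side list built once."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.ephi_index = [i for i, kind in enumerate(self.blocks) if kind == "ephi"]

    def find_ephis(self):
        # Each query touches only the indexed entries, not every block.
        return self.ephi_index


N_BLOCKS = 8000   # basic blocks in one large function
N_WALKS = 20      # how often the list is walked

blocks = ["plain"] * N_BLOCKS
for i in range(0, N_BLOCKS, 200):   # mark 40 blocks as carrying an EPHI
    blocks[i] = "ephi"

indexed = IndexedBlocks(blocks)

if __name__ == "__main__":
    scan_visits = N_BLOCKS * N_WALKS
    index_visits = len(indexed.find_ephis()) * N_WALKS
    print("entries visited by scanning:", scan_visits)
    print("entries visited via index:  ", index_visits)
```

With the numbers from the comment, scanning visits 8000 * 20 = 160000 entries, while the index visits 40 * 20 = 800, at the price of keeping the index up to date whenever nodes are added or removed.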

Comment 27 Richard Biener 2004-03-14 12:10:39 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at gcc dot gnu dot org wrote:
> ------- Additional Comments From dberlin at gcc dot gnu dot org  2004-03-14 04:47 -------
> There are about 100 functions here with > a couple thousand bb's.
> PRE takes about 2-3 seconds for each of these functions.
> Which means I have to micro-optimize it in order to get rid of the cumulative time effect.
> A lot of it is simply iterating over large lists looking for certain types of nodes (like EPHIs), where the 
> lists are O(n_basic_blocks), and we only need to look at 10 entries or so.  This doesn't matter when the 
> numbers are close, but when you have 8000 bb's to walk 20 times, instead of walking 40 entries 20 
> times, it matters.

Yes.  I suppose simply storing those nodes separately does not work, nor 
does using a hash table for storing them, no?

Another way would be to reduce the number of bb's somehow?  I cannot 
think of how 8000 bb's can accumulate in one of my math kernels other 
than by inlining and maybe loop header copying.  Can't we merge some 
bb's before doing PRE?

Richard.
Comment 28 Richard Biener 2004-03-14 13:54:05 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at gcc dot gnu dot org wrote:
> ------- Additional Comments From dberlin at gcc dot gnu dot org  2004-03-14 04:47 -------
> There are about 100 functions here with > a couple thousand bb's.
> PRE takes about 2-3 seconds for each of these functions.
> Which means I have to micro-optimize it in order to get rid of the cumulative time effect.
> A lot of it is simply iterating over large lists looking for certain types of nodes (like EPHIs), where the 
> lists are O(n_basic_blocks), and we only need to look at 10 entries or so.  This doesn't matter when the 
> numbers are close, but when you have 8000 bb's to walk 20 times, instead of walking 40 entries 20 
> times, it matters.

The nice thing is that with -fno-exceptions the results look a _lot_ 
better:

Execution times (seconds)
  garbage collection    :  21.13 ( 7%) usr   0.01 ( 0%) sys  21.20 ( 7%) wall
  callgraph construction:   1.45 ( 0%) usr   0.01 ( 0%) sys   1.46 ( 0%) wall
  callgraph optimization:   1.51 ( 0%) usr   0.09 ( 1%) sys   1.61 ( 1%) wall
  cfg construction      :   0.52 ( 0%) usr   0.05 ( 1%) sys   0.57 ( 0%) wall
  cfg cleanup           :   1.67 ( 1%) usr   0.00 ( 0%) sys   1.67 ( 1%) wall
  trivially dead code   :   2.27 ( 1%) usr   0.01 ( 0%) sys   2.28 ( 1%) wall
  life analysis         :   5.01 ( 2%) usr   0.00 ( 0%) sys   5.02 ( 2%) wall
  life info update      :   3.11 ( 1%) usr   0.00 ( 0%) sys   3.17 ( 1%) wall
  alias analysis        :   4.02 ( 1%) usr   0.01 ( 0%) sys   4.03 ( 1%) wall
  register scan         :   1.97 ( 1%) usr   0.00 ( 0%) sys   1.97 ( 1%) wall
  rebuild jump labels   :   0.54 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall
  preprocessing         :   0.69 ( 0%) usr   0.20 ( 3%) sys   1.72 ( 1%) wall
  parser                :  18.39 ( 6%) usr   1.03 (16%) sys  19.44 ( 6%) wall
  name lookup           :   6.74 ( 2%) usr   1.43 (23%) sys   8.18 ( 3%) wall
  integration           :  58.53 (19%) usr   0.43 ( 7%) sys  58.99 (19%) wall
  tree gimplify         :   3.43 ( 1%) usr   0.05 ( 1%) sys   3.48 ( 1%) wall
  tree eh               :   0.76 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall
  tree CFG construction :   1.54 ( 1%) usr   0.13 ( 2%) sys   1.67 ( 1%) wall
  tree CFG cleanup      :   1.84 ( 1%) usr   0.01 ( 0%) sys   1.85 ( 1%) wall
  tree PTA              :   0.68 ( 0%) usr   0.00 ( 0%) sys   0.68 ( 0%) wall
  tree alias analysis   :   1.07 ( 0%) usr   0.01 ( 0%) sys   1.08 ( 0%) wall
  tree PHI insertion    :   1.37 ( 0%) usr   0.06 ( 1%) sys   1.43 ( 0%) wall
  tree SSA rewrite      :   3.53 ( 1%) usr   0.06 ( 1%) sys   3.59 ( 1%) wall
  tree SSA other        :   4.69 ( 2%) usr   0.41 ( 7%) sys   5.12 ( 2%) wall
  tree operand scan     :   3.57 ( 1%) usr   0.27 ( 4%) sys   3.85 ( 1%) wall
  dominator optimization:  16.32 ( 5%) usr   0.52 ( 8%) sys  16.84 ( 5%) wall
  tree SRA              :   0.43 ( 0%) usr   0.00 ( 0%) sys   0.43 ( 0%) wall
  tree CCP              :   1.51 ( 0%) usr   0.01 ( 0%) sys   1.52 ( 0%) wall
  tree split crit edges :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall
  tree PRE              :  17.34 ( 6%) usr   0.05 ( 1%) sys  17.40 ( 6%) wall
  tree linearize phis   :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall
  tree forward propagate:   1.01 ( 0%) usr   0.00 ( 0%) sys   1.01 ( 0%) wall
  tree conservative DCE :   2.54 ( 1%) usr   0.01 ( 0%) sys   2.55 ( 1%) wall
  tree aggressive DCE   :   0.83 ( 0%) usr   0.00 ( 0%) sys   0.83 ( 0%) wall
  tree DSE              :   1.86 ( 1%) usr   0.07 ( 1%) sys   1.93 ( 1%) wall
  tree copy headers     :   1.39 ( 0%) usr   0.01 ( 0%) sys   1.40 ( 0%) wall
  tree SSA to normal    :   3.01 ( 1%) usr   0.04 ( 1%) sys   3.05 ( 1%) wall
  tree rename SSA copies:   0.69 ( 0%) usr   0.07 ( 1%) sys   0.77 ( 0%) wall
  dominance frontiers   :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall
  control dependences   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall
  expand                :  31.02 (10%) usr   0.24 ( 4%) sys  31.41 (10%) wall
  varconst              :   0.94 ( 0%) usr   0.01 ( 0%) sys   0.99 ( 0%) wall
  jump                  :   1.77 ( 1%) usr   0.14 ( 2%) sys   1.97 ( 1%) wall
  CSE                   :   9.85 ( 3%) usr   0.03 ( 0%) sys   9.90 ( 3%) wall
  global CSE            :  14.32 ( 5%) usr   0.17 ( 3%) sys  14.49 ( 5%) wall
  loop analysis         :   4.19 ( 1%) usr   0.01 ( 0%) sys   4.21 ( 1%) wall
  bypass jumps          :   1.19 ( 0%) usr   0.01 ( 0%) sys   1.20 ( 0%) wall
  CSE 2                 :   4.24 ( 1%) usr   0.00 ( 0%) sys   4.24 ( 1%) wall
  branch prediction     :   1.49 ( 0%) usr   0.03 ( 0%) sys   1.54 ( 0%) wall
  flow analysis         :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall
  combiner              :   3.80 ( 1%) usr   0.01 ( 0%) sys   3.82 ( 1%) wall
  if-conversion         :   0.61 ( 0%) usr   0.01 ( 0%) sys   0.63 ( 0%) wall
  regmove               :   2.17 ( 1%) usr   0.00 ( 0%) sys   2.20 ( 1%) wall
  local alloc           :   3.22 ( 1%) usr   0.03 ( 0%) sys   3.25 ( 1%) wall
  global alloc          :   7.58 ( 3%) usr   0.21 ( 3%) sys   7.79 ( 3%) wall
  reload CSE regs       :   3.25 ( 1%) usr   0.02 ( 0%) sys   3.27 ( 1%) wall
  flow 2                :   0.65 ( 0%) usr   0.00 ( 0%) sys   0.65 ( 0%) wall
  if-conversion 2       :   0.30 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall
  peephole 2            :   0.62 ( 0%) usr   0.02 ( 0%) sys   0.64 ( 0%) wall
  rename registers      :   0.97 ( 0%) usr   0.04 ( 1%) sys   1.01 ( 0%) wall
  scheduling 2          :   5.57 ( 2%) usr   0.04 ( 1%) sys   5.66 ( 2%) wall
  machine dep reorg     :   1.21 ( 0%) usr   0.00 ( 0%) sys   1.21 ( 0%) wall
  reorder blocks        :   0.78 ( 0%) usr   0.00 ( 0%) sys   0.80 ( 0%) wall
  shorten branches      :   0.82 ( 0%) usr   0.02 ( 0%) sys   0.84 ( 0%) wall
  reg stack             :   0.20 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall
  final                 :   1.41 ( 0%) usr   0.14 ( 2%) sys   1.56 ( 1%) wall
  rest of compilation   :   2.63 ( 1%) usr   0.04 ( 1%) sys   2.67 ( 1%) wall
  TOTAL                 : 302.38             6.29           310.24

So the question is where the difference comes from and whether it needs to be 
there ;)

Richard.
Comment 29 Daniel Berlin 2004-03-14 15:38:36 UTC
Subject: Bug 13776


On Mar 14, 2004, at 8:35 AM, Richard Guenther wrote:

> Daniel Berlin wrote:
>> This adds a DOM pass in between split critical edges and PRE, and 
>> works for me on i686 and powerpc
>> Tell me if it helps
>
> It made things worse in total, even PRE degraded some, but that may be 
> in the noise.
>
> Richard.

I don't even get close to these numbers.
I've got your leafify patch installed (the one linked from the bug 
report)
Even at -O2, on a checking enabled compiler, with tramp3d-v2 from the 
bug report, with the following sizes:

[root@dberlin dberlin]# ls -trl tramp3d-v2.ii
-rw-r--r--    1 root     root      2962361 Feb  5 10:27 tramp3d-v2.ii
generated from
[root@dberlin dberlin]# ls -l tramp3d-v2.cpp
-rw-r--r--    1 dberlin  dberlin   1952077 Feb  5 10:14 tramp3d-v2.cpp

I get (without any changes to PRE):
[root@dberlin gcc]# ./cc1plus -O2 ~dberlin/tramp3d-v2.ii
...
Execution times (seconds)
  garbage collection    :  46.23 (15%) usr   0.27 ( 3%) sys  46.66 (15%) wall
  callgraph construction:   0.68 ( 0%) usr   0.01 ( 0%) sys   0.72 ( 0%) wall
  callgraph optimization:   0.80 ( 0%) usr   0.07 ( 1%) sys   0.92 ( 0%) wall
  cfg construction      :   0.46 ( 0%) usr   0.04 ( 0%) sys   0.50 ( 0%) wall
  cfg cleanup           :   1.82 ( 1%) usr   0.02 ( 0%) sys   1.84 ( 1%) wall
  CFG verifier          :   8.07 ( 3%) usr   0.03 ( 0%) sys   8.15 ( 3%) wall
  trivially dead code   :   1.28 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
  life analysis         :   2.96 ( 1%) usr   0.01 ( 0%) sys   2.97 ( 1%) wall
  life info update      :   1.52 ( 0%) usr   0.01 ( 0%) sys   1.56 ( 0%) wall
  alias analysis        :   2.64 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall
  register scan         :   1.23 ( 0%) usr   0.02 ( 0%) sys   1.25 ( 0%) wall
  rebuild jump labels   :   0.38 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall
  preprocessing         :   0.29 ( 0%) usr   0.17 ( 2%) sys   0.46 ( 0%) wall
  parser                :  13.65 ( 4%) usr   1.27 (16%) sys  20.56 ( 6%) wall
  name lookup           :   4.99 ( 2%) usr   2.00 (25%) sys   7.07 ( 2%) wall
  integration           :  28.17 ( 9%) usr   0.19 ( 2%) sys  28.57 ( 9%) wall
  tree gimplify         :   2.08 ( 1%) usr   0.05 ( 1%) sys   2.19 ( 1%) wall
  tree eh               :   2.86 ( 1%) usr   0.08 ( 1%) sys   2.96 ( 1%) wall
  tree CFG construction :   1.60 ( 1%) usr   0.09 ( 1%) sys   1.71 ( 1%) wall
  tree CFG cleanup      :   3.99 ( 1%) usr   0.04 ( 0%) sys   4.04 ( 1%) wall
  tree PTA              :   0.47 ( 0%) usr   0.01 ( 0%) sys   0.49 ( 0%) wall
  tree alias analysis   :   0.61 ( 0%) usr   0.00 ( 0%) sys   0.61 ( 0%) wall
  tree PHI insertion    :   9.15 ( 3%) usr   0.07 ( 1%) sys   9.26 ( 3%) wall
  tree SSA rewrite      :   3.30 ( 1%) usr   0.01 ( 0%) sys   3.32 ( 1%) wall
  tree SSA other        :   3.63 ( 1%) usr   0.51 ( 6%) sys   4.20 ( 1%) wall
  tree operand scan     :   3.62 ( 1%) usr   0.59 ( 7%) sys   4.22 ( 1%) wall
  dominator optimization:  15.57 ( 5%) usr   0.46 ( 6%) sys  16.09 ( 5%) wall
  tree SRA              :   0.31 ( 0%) usr   0.01 ( 0%) sys   0.32 ( 0%) wall
  tree CCP              :   1.56 ( 1%) usr   0.02 ( 0%) sys   1.58 ( 0%) wall
  tree split crit edges :   0.57 ( 0%) usr   0.03 ( 0%) sys   0.61 ( 0%) wall
  tree PRE              :  34.92 ( 9%) usr   0.14 ( 2%) sys  35.20 ( 9%) wall
  tree linearize phis   :   0.03 ( 0%) usr   0.02 ( 0%) sys   0.05 ( 0%) wall
  tree forward propagate:   1.12 ( 0%) usr   0.02 ( 0%) sys   1.14 ( 0%) wall
  tree conservative DCE :   3.02 ( 1%) usr   0.03 ( 0%) sys   3.06 ( 1%) wall
  tree aggressive DCE   :   0.78 ( 0%) usr   0.01 ( 0%) sys   0.79 ( 0%) wall
  tree DSE              :   2.18 ( 1%) usr   0.01 ( 0%) sys   2.20 ( 1%) wall
  tree copy headers     :   2.15 ( 1%) usr   0.02 ( 0%) sys   2.19 ( 1%) wall
  tree SSA to normal    :   2.42 ( 1%) usr   0.13 ( 2%) sys   2.61 ( 1%) wall
  tree NRV optimization :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
  tree rename SSA copies:   0.71 ( 0%) usr   0.04 ( 0%) sys   0.75 ( 0%) wall
  tree SSA verifier     :  25.23 ( 8%) usr   0.23 ( 3%) sys  25.52 ( 8%) wall
  tree STMT verifier    :   3.72 ( 1%) usr   0.03 ( 0%) sys   3.76 ( 1%) wall
  callgraph verifier    :   7.79 ( 3%) usr   0.25 ( 3%) sys   8.09 ( 3%) wall
  dominance frontiers   :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall
  control dependences   :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall
  expand                :  16.03 ( 5%) usr   0.19 ( 2%) sys  16.41 ( 5%) wall
  varconst              :   0.66 ( 0%) usr   0.05 ( 1%) sys   1.06 ( 0%) wall
  jump                  :   1.17 ( 0%) usr   0.15 ( 2%) sys   1.41 ( 0%) wall
  CSE                   :   8.76 ( 3%) usr   0.05 ( 1%) sys   8.84 ( 3%) wall
  global CSE            :   5.01 ( 2%) usr   0.13 ( 2%) sys   5.15 ( 2%) wall
  loop analysis         :   1.21 ( 0%) usr   0.01 ( 0%) sys   1.24 ( 0%) wall
  bypass jumps          :   0.94 ( 0%) usr   0.00 ( 0%) sys   0.94 ( 0%) wall
  CSE 2                 :   3.59 ( 1%) usr   0.02 ( 0%) sys   3.78 ( 1%) wall
  branch prediction     :   2.25 ( 1%) usr   0.01 ( 0%) sys   2.31 ( 1%) wall
  flow analysis         :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall
  combiner              :   2.58 ( 1%) usr   0.03 ( 0%) sys   2.64 ( 1%) wall
  if-conversion         :   0.57 ( 0%) usr   0.00 ( 0%) sys   0.57 ( 0%) wall
  regmove               :   0.85 ( 0%) usr   0.00 ( 0%) sys   0.86 ( 0%) wall
  local alloc           :   1.80 ( 1%) usr   0.01 ( 0%) sys   1.84 ( 1%) wall
  global alloc          :   5.34 ( 2%) usr   0.10 ( 1%) sys   5.50 ( 2%) wall
  reload CSE regs       :   2.24 ( 1%) usr   0.00 ( 0%) sys   2.25 ( 1%) wall
  flow 2                :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.34 ( 0%) wall
  if-conversion 2       :   0.35 ( 0%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall
  peephole 2            :   0.38 ( 0%) usr   0.00 ( 0%) sys   0.39 ( 0%) wall
  rename registers      :   1.43 ( 0%) usr   0.04 ( 0%) sys   1.52 ( 0%) wall
  scheduling 2          :   2.28 ( 1%) usr   0.08 ( 1%) sys   2.38 ( 1%) wall
  reorder blocks        :   0.49 ( 0%) usr   0.01 ( 0%) sys   0.50 ( 0%) wall
  shorten branches      :   0.70 ( 0%) usr   0.01 ( 0%) sys   0.71 ( 0%) wall
  reg stack             :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
  final                 :   1.03 ( 0%) usr   0.14 ( 2%) sys   1.38 ( 0%) wall
  symout                :   0.02 ( 0%) usr   0.03 ( 0%) sys   0.06 ( 0%) wall
  rest of compilation   :   1.48 ( 0%) usr   0.04 ( 0%) sys   1.54 ( 0%) wall
  TOTAL                 : 310.60             8.12           327.62
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --disable-checking to disable checks.


With my changes to PRE, I get the same numbers, except PRE is at 28 
seconds instead of 36.

I certainly get *nowhere close* to 600 seconds in PRE, or the numbers 
you get overall.
I can't fix a problem I can't reproduce; I can only take stabs at it.
Can someone else please verify his numbers so I know whether it's my 
test setup or his?

Comment 30 Richard Biener 2004-03-14 18:00:10 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at dberlin dot org wrote:
> ------- Additional Comments From dberlin at dberlin dot org  2004-03-14 15:38 -------
> Subject: Bug 13776
> 
> 
> On Mar 14, 2004, at 8:35 AM, Richard Guenther wrote:
> 
> 
>>Daniel Berlin wrote:
>>
>>>This adds a DOM pass in between split critical edges and PRE, and 
>>>works for me on i686 and powerpc
>>>Tell me if it helps
>>
>>It made things worse in total, even PRE degraded some, but that may be 
>>in the noise.
>>
>>Richard.
> 
> 
> I don't even get close to these numbers.
> I've got your leafify patch installed (the one linked from the bug 
> report)
> Even at -O2, on a checking enabled compiler, with tramp3d-v2 from the 
> bug report, with the following sizes:
> 
> [root@dberlin dberlin]# ls -trl tramp3d-v2.ii
> -rw-r--r--    1 root     root      2962361 Feb  5 10:27 tramp3d-v2.ii
> generated from
> [root@dberlin dberlin]# ls -l tramp3d-v2.cpp
> -rw-r--r--    1 dberlin  dberlin   1952077 Feb  5 10:14 tramp3d-v2.cpp

That's the correct one.

> I get (without any changes to PRE):
> [root@dberlin gcc]# ./cc1plus -O2 ~dberlin/tramp3d-v2.ii
> ...
> Execution times (seconds)
>   garbage collection    :  46.23 (15%) usr   0.27 ( 3%) sys  46.66 (15%) wall
>   callgraph construction:   0.68 ( 0%) usr   0.01 ( 0%) sys   0.72 ( 0%) wall
>   callgraph optimization:   0.80 ( 0%) usr   0.07 ( 1%) sys   0.92 ( 0%) wall
>   cfg construction      :   0.46 ( 0%) usr   0.04 ( 0%) sys   0.50 ( 0%) wall
>   cfg cleanup           :   1.82 ( 1%) usr   0.02 ( 0%) sys   1.84 ( 1%) wall
>   CFG verifier          :   8.07 ( 3%) usr   0.03 ( 0%) sys   8.15 ( 3%) wall
>   trivially dead code   :   1.28 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
>   life analysis         :   2.96 ( 1%) usr   0.01 ( 0%) sys   2.97 ( 1%) wall
>   life info update      :   1.52 ( 0%) usr   0.01 ( 0%) sys   1.56 ( 0%) wall
>   alias analysis        :   2.64 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall
>   register scan         :   1.23 ( 0%) usr   0.02 ( 0%) sys   1.25 ( 0%) wall
>   rebuild jump labels   :   0.38 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall
>   preprocessing         :   0.29 ( 0%) usr   0.17 ( 2%) sys   0.46 ( 0%) wall
>   parser                :  13.65 ( 4%) usr   1.27 (16%) sys  20.56 ( 6%) wall
>   name lookup           :   4.99 ( 2%) usr   2.00 (25%) sys   7.07 ( 2%) wall
>   integration           :  28.17 ( 9%) usr   0.19 ( 2%) sys  28.57 ( 9%) wall
>   tree gimplify         :   2.08 ( 1%) usr   0.05 ( 1%) sys   2.19 ( 1%) wall
>   tree eh               :   2.86 ( 1%) usr   0.08 ( 1%) sys   2.96 ( 1%) wall
>   tree CFG construction :   1.60 ( 1%) usr   0.09 ( 1%) sys   1.71 ( 1%) wall
>   tree CFG cleanup      :   3.99 ( 1%) usr   0.04 ( 0%) sys   4.04 ( 1%) wall
>   tree PTA              :   0.47 ( 0%) usr   0.01 ( 0%) sys   0.49 ( 0%) wall
>   tree alias analysis   :   0.61 ( 0%) usr   0.00 ( 0%) sys   0.61 ( 0%) wall
>   tree PHI insertion    :   9.15 ( 3%) usr   0.07 ( 1%) sys   9.26 ( 3%) wall
>   tree SSA rewrite      :   3.30 ( 1%) usr   0.01 ( 0%) sys   3.32 ( 1%) wall
>   tree SSA other        :   3.63 ( 1%) usr   0.51 ( 6%) sys   4.20 ( 1%) wall
>   tree operand scan     :   3.62 ( 1%) usr   0.59 ( 7%) sys   4.22 ( 1%) wall
>   dominator optimization:  15.57 ( 5%) usr   0.46 ( 6%) sys  16.09 ( 5%) wall
>   tree SRA              :   0.31 ( 0%) usr   0.01 ( 0%) sys   0.32 ( 0%) wall
>   tree CCP              :   1.56 ( 1%) usr   0.02 ( 0%) sys   1.58 ( 0%) wall
>   tree split crit edges :   0.57 ( 0%) usr   0.03 ( 0%) sys   0.61 ( 0%) wall
>   tree PRE              :  34.92 ( 9%) usr   0.14 ( 2%) sys  35.20 ( 9%) wall
>   tree linearize phis   :   0.03 ( 0%) usr   0.02 ( 0%) sys   0.05 ( 0%) wall
>   tree forward propagate:   1.12 ( 0%) usr   0.02 ( 0%) sys   1.14 ( 0%) wall
>   tree conservative DCE :   3.02 ( 1%) usr   0.03 ( 0%) sys   3.06 ( 1%) wall
>   tree aggressive DCE   :   0.78 ( 0%) usr   0.01 ( 0%) sys   0.79 ( 0%) wall
>   tree DSE              :   2.18 ( 1%) usr   0.01 ( 0%) sys   2.20 ( 1%) wall
>   tree copy headers     :   2.15 ( 1%) usr   0.02 ( 0%) sys   2.19 ( 1%) wall
>   tree SSA to normal    :   2.42 ( 1%) usr   0.13 ( 2%) sys   2.61 ( 1%) wall
>   tree NRV optimization :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
>   tree rename SSA copies:   0.71 ( 0%) usr   0.04 ( 0%) sys   0.75 ( 0%) wall
>   tree SSA verifier     :  25.23 ( 8%) usr   0.23 ( 3%) sys  25.52 ( 8%) wall
>   tree STMT verifier    :   3.72 ( 1%) usr   0.03 ( 0%) sys   3.76 ( 1%) wall
>   callgraph verifier    :   7.79 ( 3%) usr   0.25 ( 3%) sys   8.09 ( 3%) wall
>   dominance frontiers   :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall
>   control dependences   :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall
>   expand                :  16.03 ( 5%) usr   0.19 ( 2%) sys  16.41 ( 5%) wall
>   varconst              :   0.66 ( 0%) usr   0.05 ( 1%) sys   1.06 ( 0%) wall
>   jump                  :   1.17 ( 0%) usr   0.15 ( 2%) sys   1.41 ( 0%) wall
>   CSE                   :   8.76 ( 3%) usr   0.05 ( 1%) sys   8.84 ( 3%) wall
>   global CSE            :   5.01 ( 2%) usr   0.13 ( 2%) sys   5.15 ( 2%) wall
>   loop analysis         :   1.21 ( 0%) usr   0.01 ( 0%) sys   1.24 ( 0%) wall
>   bypass jumps          :   0.94 ( 0%) usr   0.00 ( 0%) sys   0.94 ( 0%) wall
>   CSE 2                 :   3.59 ( 1%) usr   0.02 ( 0%) sys   3.78 ( 1%) wall
>   branch prediction     :   2.25 ( 1%) usr   0.01 ( 0%) sys   2.31 ( 1%) wall
>   flow analysis         :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall
>   combiner              :   2.58 ( 1%) usr   0.03 ( 0%) sys   2.64 ( 1%) wall
>   if-conversion         :   0.57 ( 0%) usr   0.00 ( 0%) sys   0.57 ( 0%) wall
>   regmove               :   0.85 ( 0%) usr   0.00 ( 0%) sys   0.86 ( 0%) wall
>   local alloc           :   1.80 ( 1%) usr   0.01 ( 0%) sys   1.84 ( 1%) wall
>   global alloc          :   5.34 ( 2%) usr   0.10 ( 1%) sys   5.50 ( 2%) wall
>   reload CSE regs       :   2.24 ( 1%) usr   0.00 ( 0%) sys   2.25 ( 1%) wall
>   flow 2                :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.34 ( 0%) wall
>   if-conversion 2       :   0.35 ( 0%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall
>   peephole 2            :   0.38 ( 0%) usr   0.00 ( 0%) sys   0.39 ( 0%) wall
>   rename registers      :   1.43 ( 0%) usr   0.04 ( 0%) sys   1.52 ( 0%) wall
>   scheduling 2          :   2.28 ( 1%) usr   0.08 ( 1%) sys   2.38 ( 1%) wall
>   reorder blocks        :   0.49 ( 0%) usr   0.01 ( 0%) sys   0.50 ( 0%) wall
>   shorten branches      :   0.70 ( 0%) usr   0.01 ( 0%) sys   0.71 ( 0%) wall
>   reg stack             :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
>   final                 :   1.03 ( 0%) usr   0.14 ( 2%) sys   1.38 ( 0%) wall
>   symout                :   0.02 ( 0%) usr   0.03 ( 0%) sys   0.06 ( 0%) wall
>   rest of compilation   :   1.48 ( 0%) usr   0.04 ( 0%) sys   1.54 ( 0%) wall
>   TOTAL                 : 310.60             8.12           327.62
> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --disable-checking to disable checks.
> 
> 
> With my changes to PRE, i get the same numbers, except PRE is at 28 
> seconds instead of 36.
> 
> I certainly get *nowhere close* to 600 seconds in PRE, or the numbers 
> you get overall.
> I can't fix a problem i can't reproduce, i can only take stabs at it.
> Can someone else please verify his numbers so i know whether it's my 
> test setup or his?

I even have checking disabled.  GC time seems to be identical, parsing 
is 13.5s vs 18.4s - the first big difference is integration, which 
suggests that leafifying is not enabled?  Maybe the patch applied 
"wrong", I attached a complete diff of my local changes.

Anyway, I'm running on a 1GHz Athlon with 1GB of ram, compiler is 
bootstrapped with checking disabled.

Richard.
Index: gcc/c-common.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/c-common.c,v
retrieving revision 1.344.2.63
diff -u -u -r1.344.2.63 c-common.c
--- gcc/c-common.c	2 Mar 2004 18:41:21 -0000	1.344.2.63
+++ gcc/c-common.c	14 Mar 2004 17:51:26 -0000
@@ -746,6 +746,7 @@
 static tree handle_noinline_attribute (tree *, tree, tree, int, bool *);
 static tree handle_always_inline_attribute (tree *, tree, tree, int,
 					    bool *);
+static tree handle_leafify_attribute (tree *, tree, tree, int, bool *);
 static tree handle_used_attribute (tree *, tree, tree, int, bool *);
 static tree handle_unused_attribute (tree *, tree, tree, int, bool *);
 static tree handle_const_attribute (tree *, tree, tree, int, bool *);
@@ -807,6 +808,8 @@
 			      handle_noinline_attribute },
   { "always_inline",          0, 0, true,  false, false,
 			      handle_always_inline_attribute },
+  { "leafify",                0, 0, true,  false, false,
+                              handle_leafify_attribute },
   { "used",                   0, 0, true,  false, false,
 			      handle_used_attribute },
   { "unused",                 0, 0, false, false, false,
@@ -4458,6 +4461,29 @@
 
   return NULL_TREE;
 }
+
+/* Handle a "leafify" attribute; arguments as in
+   struct attribute_spec.handler.  */
+
+static tree
+handle_leafify_attribute (tree *node, tree name,
+                          tree args ATTRIBUTE_UNUSED,
+                          int flags ATTRIBUTE_UNUSED, bool *no_add_attrs)
+{
+  if (TREE_CODE (*node) == FUNCTION_DECL)
+    {
+      /* Do nothing else, just set the attribute.  We'll get at
+         it later with lookup_attribute.  */
+    }
+  else
+    {
+      warning ("`%s' attribute ignored", IDENTIFIER_POINTER (name));
+      *no_add_attrs = true;
+    }
+
+  return NULL_TREE;
+}
+
 
 /* Handle a "used" attribute; arguments as in
    struct attribute_spec.handler.  */
Index: gcc/cgraphunit.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cgraphunit.c,v
retrieving revision 1.1.4.39
diff -u -u -r1.1.4.39 cgraphunit.c
--- gcc/cgraphunit.c	4 Mar 2004 15:38:34 -0000	1.1.4.39
+++ gcc/cgraphunit.c	14 Mar 2004 17:51:26 -0000
@@ -1045,7 +1045,7 @@
   else
     e->callee->global.inlined_to = e->caller;
 
-  /* Recursivly clone all bodies.  */
+  /* Recursivly clone all inlined bodies.  */
   for (e = e->callee->callees; e; e = e->next_callee)
     if (!e->inline_failed)
       cgraph_clone_inlined_nodes (e, duplicate);
@@ -1192,7 +1192,7 @@
     recursive = what->decl == to->global.inlined_to->decl;
   else
     recursive = what->decl == to->decl;
-  /* Marking recursive function inlinine has sane semantic and thus we should
+  /* Marking recursive function inline has sane semantic and thus we should
      not warn on it.  */
   if (recursive && reason)
     *reason = (what->local.disregard_inline_limits
@@ -1440,6 +1440,67 @@
   free (heap_node);
 }
 
+/* Find callgraph nodes closing a circle in the graph.  The
+   resulting hashtab can be used to avoid walking the circles.
+   Uses the cgraph nodes ->aux field which needs to be zero
+   before and will be zero after operation.  */
+
+static void
+cgraph_find_cycles (struct cgraph_node *node, htab_t cycles)
+{
+  struct cgraph_edge *e;
+
+  if (node->aux)
+    {
+      void **slot;
+      slot = htab_find_slot (cycles, node, INSERT);
+      if (!*slot)
+	{
+	  if (cgraph_dump_file)
+	    fprintf (cgraph_dump_file, "Cycle contains %s\n", cgraph_node_name (node));
+	  *slot = node;
+	}
+      return;
+    }
+
+  node->aux = node;
+  for (e = node->callees; e; e = e->next_callee)
+    {
+       cgraph_find_cycles (e->callee, cycles); 
+    }
+  node->aux = 0;
+}
+
+/* Leafify the cgraph node.  We have to be careful in recursing
+   as to not run endlessly in circles of the callgraph.
+   We do so by using a hashtab of cycle entering nodes as generated
+   by cgraph_find_cycles.  */
+
+static void
+cgraph_leafify_node (struct cgraph_node *node, htab_t cycles)
+{
+  struct cgraph_edge *e;
+
+  for (e = node->callees; e; e = e->next_callee)
+    {
+      /* Inline call, if possible, and recurse.  Be sure we are not
+	 entering callgraph circles here.  */
+      if (e->inline_failed
+	  && e->callee->local.inlinable
+	  && !cgraph_recursive_inlining_p (node, e->callee,
+				  	   &e->inline_failed)
+	  && !htab_find (cycles, e->callee))
+	{
+	  if (cgraph_dump_file)
+    	    fprintf (cgraph_dump_file, " inlining %s", cgraph_node_name (e->callee));
+          cgraph_mark_inline_edge (e);
+	  cgraph_leafify_node (e->callee, cycles);
+	}
+      else if (cgraph_dump_file)
+	fprintf (cgraph_dump_file, " !inlining %s", cgraph_node_name (e->callee));
+    }
+}
+
 /* Decide on the inlining.  We do so in the topological order to avoid
    expenses on updating datastructures.  */
 
@@ -1477,6 +1538,24 @@
       struct cgraph_edge *e;
 
       node = order[i];
+
+      /* Handle nodes to be leafified, but don't update overall unit size.  */
+      if (lookup_attribute ("leafify", DECL_ATTRIBUTES (node->decl)) != NULL)
+        {
+	  int old_overall_insns = overall_insns;
+	  htab_t cycles;
+  	  if (cgraph_dump_file)
+    	    fprintf (cgraph_dump_file,
+	     	     "Leafifying %s\n", cgraph_node_name (node));
+	  cycles = htab_create (7, htab_hash_pointer, htab_eq_pointer, NULL);
+	  cgraph_find_cycles (node, cycles);
+	  cgraph_leafify_node (node, cycles);
+	  htab_delete (cycles);
+	  overall_insns = old_overall_insns;
+	  /* We don't need to consider always_inline functions inside the leafified
+	     function anymore.  */
+	  continue;
+        }
 
       for (e = node->callees; e; e = e->next_callee)
 	if (e->callee->local.disregard_inline_limits)
Index: gcc/doc/extend.texi
===================================================================
RCS file: /cvs/gcc/gcc/gcc/doc/extend.texi,v
retrieving revision 1.82.2.36
diff -u -u -r1.82.2.36 extend.texi
--- gcc/doc/extend.texi	2 Mar 2004 18:42:50 -0000	1.82.2.36
+++ gcc/doc/extend.texi	14 Mar 2004 17:51:30 -0000
@@ -1893,7 +1893,7 @@
 attributes when making a declaration.  This keyword is followed by an
 attribute specification inside double parentheses.  The following
 attributes are currently defined for functions on all targets:
-@code{noreturn}, @code{noinline}, @code{always_inline},
+@code{noreturn}, @code{noinline}, @code{always_inline}, @code{leafify},
 @code{pure}, @code{const}, @code{nothrow},
 @code{format}, @code{format_arg}, @code{no_instrument_function},
 @code{section}, @code{constructor}, @code{destructor}, @code{used},
@@ -1969,6 +1969,14 @@
 Generally, functions are not inlined unless optimization is specified.
 For functions declared inline, this attribute inlines the function even
 if no optimization level was specified.
+
+@cindex @code{leafify} function attribute
+@item leafify
+Generally, inlining into a function is limited.  For a function marked with
+this attribute, every call inside this function will be inlined, if possible.
+Whether the function itself is considered for inlining depends on its size and
+the current inlining parameters.  The @code{leafify} attribute only works
+reliably in unit-at-a-time mode.
 
 @cindex @code{pure} function attribute
 @item pure
Index: libstdc++-v3/include/c_std/std_cmath.h
===================================================================
RCS file: /cvs/gcc/gcc/libstdc++-v3/include/c_std/std_cmath.h,v
retrieving revision 1.5.6.7
diff -u -u -r1.5.6.7 std_cmath.h
--- libstdc++-v3/include/c_std/std_cmath.h	3 Jan 2004 23:05:32 -0000	1.5.6.7
+++ libstdc++-v3/include/c_std/std_cmath.h	14 Mar 2004 17:51:55 -0000
@@ -330,9 +330,31 @@
   { return __builtin_modfl(__x, __iptr); }
 
   template<typename _Tp>
-    inline _Tp
+    inline _Tp __attribute__((always_inline))
     __pow_helper(_Tp __x, int __n)
     {
+      if (__builtin_constant_p(__n))
+        switch (__n) {
+        case -1:
+          return _Tp(1)/__x;
+        case 0:
+          return _Tp(1);
+        case 1:
+          return __x;
+        case 2:
+          return __x*__x;
+#if ! __OPTIMIZE_SIZE__
+        case -2:
+          return _Tp(1)/(__x*__x);
+        case 3:
+          return __x*__x*__x;
+        case 4:
+          {
+             _Tp __y = __x*__x;
+             return __y*__y;
+          }
+#endif
+        }
       return __n < 0
         ? _Tp(1)/__cmath_power(__x, -__n)
         : __cmath_power(__x, __n);
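
For readers following the patch above: the idea behind `cgraph_find_cycles` can be sketched on a toy graph as below. This is an illustration under simplifying assumptions, not the patch's code. The names `toy_graph`, `find_cycles`, `on_stack` and `closes_cycle` are hypothetical; an adjacency matrix stands in for the callee edge lists, a boolean array stands in for the `->aux` marker, and another boolean array stands in for the hash table.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8

/* Hypothetical toy call graph; not GCC's cgraph structures.  */
struct toy_graph
{
  int n_nodes;
  bool calls[MAX_NODES][MAX_NODES]; /* calls[a][b]: a calls b.  */
};

/* Depth-first walk from NODE.  ON_STACK marks nodes on the current
   walk, as ->aux does in the patch.  CLOSES_CYCLE records every node
   reached again while it is still on the walk, as the hash table does
   in the patch.  */
static void
find_cycles (const struct toy_graph *g, int node,
             bool on_stack[MAX_NODES], bool closes_cycle[MAX_NODES])
{
  assert (node >= 0 && node < g->n_nodes);

  if (on_stack[node])
    {
      /* We came back to a node that is still being visited.  */
      closes_cycle[node] = true;
      return;
    }

  on_stack[node] = true;
  for (int callee = 0; callee < g->n_nodes; callee++)
    if (g->calls[node][callee])
      find_cycles (g, callee, on_stack, closes_cycle);
  /* Clear the marker again so the array is all-false after the walk.  */
  on_stack[node] = false;
}
```

Like the patch, the sketch keeps no separate "already finished" set, so a callee reachable along several paths is walked once per path. The leafify walk in the patch then refuses to inline any callee recorded this way, which is what keeps the recursion from running in circles.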
Comment 31 Richard Biener 2004-03-14 18:22:36 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at dberlin dot org wrote:
> ------- Additional Comments From dberlin at dberlin dot org  2004-03-14 15:38 -------
> Subject: Bug 13776
> With my changes to PRE, i get the same numbers, except PRE is at 28 
> seconds instead of 36.
> 
> I certainly get *nowhere close* to 600 seconds in PRE, or the numbers 
> you get overall.
> I can't fix a problem i can't reproduce, i can only take stabs at it.
> Can someone else please verify his numbers so i know whether it's my 
> test setup or his?

A way to check whether leafify is working correctly is to look at the 
assembly generated for, for instance,

_ZN14MultiArgKernelI9MultiArg5I5FieldI22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEEd9BrickViewES9_S9_S9_S9_E15EvaluateLocLoopIN6Forgas5VXUpdILi3EEELi3EEE3runEv

It should be straight-line code without calls. Note that without 
-funroll-loops or -fpeel-loops the code contains a lot of explicit 
loops that iterate three times, so it is easier to read with 
-funroll-loops enabled.

Richard.

Comment 32 Daniel Berlin 2004-03-14 22:21:41 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

>>
>
> A way to check if leafify is working correctly is to look at the
> assembler generated for f.i.
>
> _ZN14MultiArgKernelI9MultiArg5I5FieldI22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEEd9BrickViewES9_S9_S9_S9_E15EvaluateLocLoopIN6Forgas5VXUpdILi3EEELi3EEE3runEv
>
> it should be straight-line code without calls.

It is.

Comment 33 Daniel Berlin 2004-03-14 22:23:47 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

>
> I even have checking disabled.  GC time seems to be identical, parsing
> is 13.5s vs 18.4s - the first big difference is integration, which
> suggests that leafifying is not enabled?

As I showed in the next comment, the leafified functions have no 
function calls.

>   Maybe the patch applied
> "wrong", I attached a complete diff of my local changes.

I have exactly these changes installed.
(I verified it by hand and by comparing the applied diffs).


What platform are you doing this on?

Comment 34 Richard Biener 2004-03-14 22:28:39 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dberlin at dberlin dot org wrote:
> ------- Additional Comments From dberlin at dberlin dot org  2004-03-14 22:23 -------
> Subject: Re:  [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120
> 
> 
>>I even have checking disabled.  GC time seems to be identical, parsing
>>is 13.5s vs 18.4s - the first big difference is integration, which
>>suggests that leafifying is not enabled?
> 
> 
> As I showed in the next comment, the leafified functions have no 
> function calls.
> 
> 
>>  Maybe the patch applied
>>"wrong", I attached a complete diff of my local changes.
> 
> 
> I have exactly these changes installed.
> (I verified it by hand and by comparing the applied diffs).

Ok.

> 
> What platform are you doing this on?

On ia32, I'm trying to bootstrap on ia64 now.  I'm configuring with
--enable-languages="c,c++" --enable-threads=posix --enable-__cxa_atexit 
--disable-libunwind-exceptions --disable-mudflap --disable-checking

Richard.
Comment 35 Daniel Berlin 2004-03-14 22:52:05 UTC
(In reply to comment #34)
> > What platform are you doing this on?
>
> On ia32, I'm trying to bootstrap on ia64 now.  I'm configuring with
> --enable-languages="c,c++" --enable-threads=posix --enable-__cxa_atexit 
> --disable-libunwind-exceptions --disable-mudflap --disable-checking
>

Hmmm.
I reconfigured with exactly those flags and re-bootstrapped, and now I get the same numbers you do.
Memory usage was also way up.
However, after that I ran configure with no options, bootstrapped again, and got the numbers I posted.
Can you run configure without any options at all, bootstrap, and see what numbers you get?

Comment 36 Daniel Berlin 2004-03-14 23:14:49 UTC
These are my numbers when configured with just --disable-checking (with the leafify patch, etc.):
Execution times (seconds)
 garbage collection    :  21.30 ( 9%) usr   0.12 ( 1%) sys  22.05 ( 8%) wall
 callgraph construction:   0.73 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall
 callgraph optimization:   0.73 ( 0%) usr   0.03 ( 0%) sys   0.78 ( 0%) wall
 cfg construction      :   0.54 ( 0%) usr   0.04 ( 0%) sys   0.58 ( 0%) wall
 cfg cleanup           :   2.08 ( 1%) usr   0.05 ( 1%) sys   2.17 ( 1%) wall
 trivially dead code   :   1.45 ( 1%) usr   0.01 ( 0%) sys   1.48 ( 1%) wall
 life analysis         :   4.52 ( 2%) usr   0.01 ( 0%) sys   4.64 ( 2%) wall
 life info update      :   2.23 ( 1%) usr   0.01 ( 0%) sys   2.26 ( 1%) wall
 alias analysis        :   2.66 ( 1%) usr   0.03 ( 0%) sys   2.86 ( 1%) wall
 register scan         :   1.73 ( 1%) usr   0.00 ( 0%) sys   1.73 ( 1%) wall
 rebuild jump labels   :   0.52 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall
 preprocessing         :   0.63 ( 0%) usr   0.16 ( 2%) sys   0.80 ( 0%) wall
 parser                :  13.73 ( 6%) usr   1.55 (19%) sys  20.68 ( 8%) wall
 name lookup           :   5.70 ( 2%) usr   2.05 (25%) sys   7.89 ( 3%) wall
 integration           :  27.48 (11%) usr   0.21 ( 3%) sys  28.53 (11%) wall
 tree gimplify         :   1.96 ( 1%) usr   0.02 ( 0%) sys   2.02 ( 1%) wall
 tree eh               :   3.06 ( 1%) usr   0.13 ( 2%) sys   3.35 ( 1%) wall
 tree CFG construction :   1.65 ( 1%) usr   0.07 ( 1%) sys   1.80 ( 1%) wall
 tree CFG cleanup      :   3.53 ( 1%) usr   0.03 ( 0%) sys   3.76 ( 1%) wall
 tree PTA              :   0.64 ( 0%) usr   0.00 ( 0%) sys   0.64 ( 0%) wall
 tree alias analysis   :   0.70 ( 0%) usr   0.00 ( 0%) sys   0.72 ( 0%) wall
 tree PHI insertion    :  11.00 ( 5%) usr   0.07 ( 1%) sys  11.31 ( 4%) wall
 tree SSA rewrite      :   3.34 ( 1%) usr   0.06 ( 1%) sys   3.55 ( 1%) wall
 tree SSA other        :   4.79 ( 2%) usr   0.64 ( 8%) sys   5.57 ( 2%) wall
 tree operand scan     :   4.10 ( 2%) usr   0.63 ( 8%) sys   4.80 ( 2%) wall
 dominator optimization:  14.61 ( 6%) usr   0.54 ( 7%) sys  15.46 ( 6%) wall
 tree SRA              :   0.27 ( 0%) usr   0.02 ( 0%) sys   0.29 ( 0%) wall
 tree CCP              :   1.58 ( 1%) usr   0.02 ( 0%) sys   1.65 ( 1%) wall
 tree split crit edges :   0.22 ( 0%) usr   0.00 ( 0%) sys   0.22 ( 0%) wall
 tree PRE              :  26.66 (11%) usr   0.17 ( 2%) sys  27.40 (10%) wall
 tree linearize phis   :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall
 tree forward propagate:   1.25 ( 1%) usr   0.01 ( 0%) sys   1.28 ( 0%) wall
 tree conservative DCE :   2.54 ( 1%) usr   0.05 ( 1%) sys   2.70 ( 1%) wall
 tree aggressive DCE   :   1.09 ( 0%) usr   0.01 ( 0%) sys   1.10 ( 0%) wall
 tree DSE              :   2.52 ( 1%) usr   0.01 ( 0%) sys   2.64 ( 1%) wall
 tree copy headers     :   2.22 ( 1%) usr   0.06 ( 1%) sys   2.32 ( 1%) wall
 tree SSA to normal    :   2.74 ( 1%) usr   0.15 ( 2%) sys   2.90 ( 1%) wall
 tree rename SSA copies:   0.59 ( 0%) usr   0.03 ( 0%) sys   0.66 ( 0%) wall
 dominance frontiers   :   0.42 ( 0%) usr   0.00 ( 0%) sys   0.42 ( 0%) wall
 control dependences   :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall
 expand                :  15.77 ( 6%) usr   0.26 ( 3%) sys  16.61 ( 6%) wall
 varconst              :   0.54 ( 0%) usr   0.03 ( 0%) sys   0.89 ( 0%) wall
 jump                  :   1.16 ( 0%) usr   0.14 ( 2%) sys   1.37 ( 1%) wall
 CSE                   :   7.87 ( 3%) usr   0.04 ( 0%) sys   8.19 ( 3%) wall
 global CSE            :   6.11 ( 3%) usr   0.09 ( 1%) sys   6.30 ( 2%) wall
 loop analysis         :   1.41 ( 1%) usr   0.00 ( 0%) sys   1.41 ( 1%) wall
 bypass jumps          :   1.10 ( 0%) usr   0.00 ( 0%) sys   1.12 ( 0%) wall
 CSE 2                 :   3.16 ( 1%) usr   0.02 ( 0%) sys   3.20 ( 1%) wall
 branch prediction     :   2.52 ( 1%) usr   0.08 ( 1%) sys   2.73 ( 1%) wall
 flow analysis         :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall
 combiner              :   3.49 ( 1%) usr   0.01 ( 0%) sys   3.62 ( 1%) wall
 if-conversion         :   0.70 ( 0%) usr   0.01 ( 0%) sys   0.74 ( 0%) wall
 regmove               :   1.01 ( 0%) usr   0.01 ( 0%) sys   1.04 ( 0%) wall
 mode switching        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 local alloc           :   2.88 ( 1%) usr   0.02 ( 0%) sys   2.97 ( 1%) wall
 global alloc          :   6.36 ( 3%) usr   0.17 ( 2%) sys   6.91 ( 3%) wall
 reload CSE regs       :   2.86 ( 1%) usr   0.00 ( 0%) sys   3.21 ( 1%) wall
 flow 2                :   0.52 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall
 if-conversion 2       :   0.39 ( 0%) usr   0.00 ( 0%) sys   0.40 ( 0%) wall
 peephole 2            :   0.51 ( 0%) usr   0.02 ( 0%) sys   0.54 ( 0%) wall
 rename registers      :   0.73 ( 0%) usr   0.05 ( 1%) sys   0.79 ( 0%) wall
 scheduling 2          :   2.85 ( 1%) usr   0.05 ( 1%) sys   3.02 ( 1%) wall
 reorder blocks        :   0.28 ( 0%) usr   0.01 ( 0%) sys   0.30 ( 0%) wall
 shorten branches      :   0.54 ( 0%) usr   0.02 ( 0%) sys   0.56 ( 0%) wall
 reg stack             :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall
 final                 :   1.10 ( 0%) usr   0.13 ( 2%) sys   1.43 ( 1%) wall
 symout                :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall
 rest of compilation   :   1.83 ( 1%) usr   0.01 ( 0%) sys   1.87 ( 1%) wall
 TOTAL                 : 243.59             8.18           264.59
Comment 37 Andrew Pinski 2004-03-14 23:45:32 UTC
I think this one:
 integration           :  27.48 (11%) usr   0.21 ( 3%) sys  28.53 (11%) wall
is caused by GIMPLE having more trees to copy, so doing inlining later (i.e. after the 
first DCE happens) may help, but the inliner would then need to be a BB inliner.
Comment 38 Richard Biener 2004-03-15 09:03:09 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

On Mon, 14 Mar 2004, dberlin at gcc dot gnu dot org wrote:

> ------- Additional Comments From dberlin at gcc dot gnu dot org  2004-03-14 23:14 -------
> these are my numbers when configured with just --disable-checking (with the leafify patch, etc)

The results with just --disable-checking are the same.  Humm.
--disable-libunwind-exceptions should make no difference for me, as I
don't have libunwind installed - maybe it's making the difference for you?

Confused,

Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
Comment 39 Andrew Pinski 2004-03-28 07:38:40 UTC
While profiling the build of libstdc++, I noticed that comptypes was not being 
tail-called/sibcalled because its return type is bool, so this depends on PR 14440.
Comment 40 Andrew Pinski 2004-03-28 18:45:23 UTC
I noticed that the bsi functions were not being optimized well because bsi is a struct 
that contains structs, so I am marking this as depending on PR 13953, which is about SRA 
optimizing structs containing structs.
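
Editorial sketch of the pattern described above, with hypothetical types (InnerIter and OuterIter are stand-ins, not GCC's real bsi layout). The iterator is a struct that wraps another struct; scalar replacement of aggregates would ideally reduce the first loop to the plain pointer loop in the second function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for an iterator struct that contains a struct.
struct InnerIter { const int *ptr; const int *end; };
struct OuterIter { InnerIter inner; int block_id; };

inline OuterIter start(const int *data, std::size_t n, int block_id) {
  return OuterIter{InnerIter{data, data + n}, block_id};
}
inline bool at_end(OuterIter it) { return it.inner.ptr == it.inner.end; }
inline OuterIter next(OuterIter it) { ++it.inner.ptr; return it; }
inline int value(OuterIter it) {
  assert(!at_end(it));
  return *it.inner.ptr;
}

// Loop written against the nested-struct iterator.
int sum_with_iterator(const int *data, std::size_t n) {
  int total = 0;
  for (OuterIter it = start(data, n, 0); !at_end(it); it = next(it))
    total += value(it);
  return total;
}

// What the loop looks like once every field lives in its own scalar.
int sum_scalarized(const int *data, std::size_t n) {
  int total = 0;
  const int *end = data + n;
  for (const int *ptr = data; ptr != end; ++ptr)
    total += *ptr;
  return total;
}
```

Both functions compute the same result; the difference is only in how much work the optimizer has to do to see through the nested aggregate.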
Comment 41 Andrew Pinski 2004-03-29 02:32:19 UTC
I am attaching a C example where tree-ssa is slower:
[zhivago2:~/src/testspeed] pinskia% time ~/gcc-tree-ssa/bin/gcc fold-const.i -S 
18.640u 1.480s 0:21.38 94.1%    0+0k 0+5io 0pf+0w
[zhivago2:~/src/testspeed] pinskia% time ~/fsf-clean-nocheck/bin/gcc fold-const.i -S 
9.060u 0.540s 0:09.93 96.6%     0+0k 0+4io 0pf+0w
Comment 42 Andrew Pinski 2004-03-29 02:35:25 UTC
Created attachment 6011 [details]
C example

Here is the C example.  It is fold-const.c from a cross compiler from
powerpc-apple-darwin to powerpc64-apple-darwin.
Comment 43 Richard Biener 2004-03-29 12:14:10 UTC
I set up a nightly tester on ia64-linux that does a bootstrap for c,c++ and
builds the tramp3d-v3.cpp testcase and does a performance check on the resulting
binary.  Stats can be viewed at
http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/monitor-summary.html

Testing is done with an unpatched tree-ssa branch (i.e. w/o leafify).
The summary plot is updated manually, so it can lag behind if I forget to update it.
Comment 44 Steven Bosscher 2004-03-31 19:32:53 UTC
C is also slower, here's the top of the oprofile on amd64 for 
"-fno-tree-pre -O3" on a subset of Diego Novillo's cc1-i-files. 
 
vma      samples  %           symbol name 
00730fa0 117920   10.1391     htab_find_slot_with_hash 
00731350 53286     4.5817     iterative_hash 
004802b0 22184     1.9074     bitmap_bit_p 
006a3e20 20801     1.7885     ggc_alloc_stat 
006717e0 19669     1.6912     for_each_rtx 
006c5590 18536     1.5938     walk_tree 
00730d00 16933     1.4559     find_empty_slot_for_expand 
0064d5d0 16794     1.4440     constrain_operands 
006579f0 16467     1.4159     reg_scan_mark_refs 
00701db0 13922     1.1971     reg_is_remote_constant_p 
00402b60 12999     1.1177     yyparse 
004af330 12958     1.1142     cse_insn 
00501050 12339     1.0609     mark_set_1 
00671d00 12320     1.0593     note_stores 
00523570 11714     1.0072     compute_transp 
004a9270 10726     0.9223     count_reg_usage 
 
 
Comment 45 Richard Biener 2004-03-31 19:37:01 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org  2004-03-31 19:32 -------
> C is also slower, here's the top of the oprofile on amd64 for 
> "-fno-tree-pre -O3" on a subset of Diego Novillo's cc1-i-files. 
>  
> vma      samples  %           symbol name 
> 00730fa0 117920   10.1391     htab_find_slot_with_hash 

We have a lot of pointer hashing in gcc now and I see the above, too. 
We can possibly micro-optimize the pointer hashing by introducing a 
"specialization" of the libiberty hashfn for pointers where we can 
inline both the hashing function and the comparison function.  It will 
introduce some code duplication, though (if this only was using C++ and 
templates...).
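
Editorial sketch of what a pointer-only table with inlined hashing and comparison could look like if templates or classes were available. PointerSet and all of its members are hypothetical; this is not libiberty's htab API, and the multiplicative hash is just one plausible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pointer-only hash set using open addressing.
class PointerSet {
public:
  PointerSet() : slots_(16, nullptr), count_(0) {}

  // Returns true if p was newly inserted, false if it was already present.
  bool insert(const void *p) {
    assert(p != nullptr);  // nullptr marks an empty slot
    if ((count_ + 1) * 2 > slots_.size())
      grow();  // keep the load factor at or below one half
    const std::size_t i = find_slot(slots_, p);
    if (slots_[i] == p)
      return false;
    slots_[i] = p;
    ++count_;
    return true;
  }

  bool contains(const void *p) const {
    return p != nullptr && slots_[find_slot(slots_, p)] == p;
  }

  std::size_t size() const { return count_; }

private:
  // The hash and the pointer comparison are visible at every call site,
  // so no indirect call through a function pointer is needed.
  static std::size_t hash(const void *p) {
    const std::uint64_t v = reinterpret_cast<std::uintptr_t>(p);
    return static_cast<std::size_t>((v * 0x9e3779b97f4a7c15ULL) >> 32);
  }

  static std::size_t find_slot(const std::vector<const void *> &slots,
                               const void *p) {
    const std::size_t mask = slots.size() - 1;  // size is a power of two
    std::size_t i = hash(p) & mask;
    while (slots[i] != nullptr && slots[i] != p)
      i = (i + 1) & mask;  // linear probing
    return i;
  }

  void grow() {
    std::vector<const void *> bigger(slots_.size() * 2, nullptr);
    for (const void *q : slots_)
      if (q != nullptr)
        bigger[find_slot(bigger, q)] = q;
    slots_.swap(bigger);
  }

  std::vector<const void *> slots_;
  std::size_t count_;
};
```

The sketch has no deletion support, which keeps probing simple; a real replacement would need it.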

Richard.
Comment 46 Zack Weinberg 2004-03-31 19:53:23 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

"rguenth at tat dot physik dot uni-tuebingen dot de" <gcc-bugzilla@gcc.gnu.org> writes:

> We have a lot of pointer hashing in gcc now and I see the above, too. 
> We can possibly micro-optimize the pointer hashing by introducing a 
> "specialization" of the libiberty hashfn for pointers where we can 
> inline both the hashing function and the comparison function.  It will 
> introduce some code duplication, though (if this only was using C++ and 
> templates...).

Something I've wanted to do for a long time is do poor-man's templates
on hashtab.[ch] with macros.  But I never seem to get sufficient round
tuits.

zw
Comment 47 Richard Biener 2004-03-31 20:01:23 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

zack at codesourcery dot com wrote:
> ------- Additional Comments From zack at codesourcery dot com  2004-03-31 19:53 -------
> Subject: Re:  [tree-ssa] Many C++ compile-time regression in
>  3.5-tree-ssa 040120
> 
> "rguenth at tat dot physik dot uni-tuebingen dot de" <gcc-bugzilla@gcc.gnu.org> writes:
> 
> 
>>We have a lot of pointer hashing in gcc now and I see the above, too. 
>>We can possibly micro-optimize the pointer hashing by introducing a 
>>"specialization" of the libiberty hashfn for pointers where we can 
>>inline both the hashing function and the comparison function.  It will 
>>introduce some code duplication, though (if this only was using C++ and 
>>templates...).
> 
> 
> Something I've wanted to do for a long time is do poor-man's templates
> on hashtab.[ch] with macros.  But I never seem to get sufficient round
> tuits.

I think it would pay for pointer hashing only, as this is the main use. 
  I did some experiments some time ago with a stripped down pointer-only 
hash just replacing the walk_tree hashtab and it still was #1 in the 
profile with little change in time (but I didn't measure overall 
performance change).

Richard.
Comment 48 Steven Bosscher 2004-03-31 20:44:01 UTC
I agree that a special pointer hasher would be nice.  Should be easy, 
just duplicate the code of iterative_hash in hashtab.c and specialize 
it for void *. 
 
But that doesn't reduce the number of find_slot calls.  It's not like 
the tables are sparse and we're getting tons of collisions.  We just 
use the hash table that much, and we should be looking into ways for 
speeding it up. 
 
 
Comment 49 Steven Bosscher 2004-04-04 12:45:49 UTC
I did some profiling of iterative_hash on tree-ssa.  Not 
immediately related to this PR, perhaps, but part of the problem. 
 
  %   cumulative   self              self     total 
 time   seconds   seconds    calls   s/call   s/call  name 
  2.75      1.66     1.66  2329935     0.00     0.00  iterative_hash 
  2.29      3.04     1.38   235027     0.00     0.00  walk_tree 
  2.04      4.27     1.23  1419091     0.00     0.00  ggc_alloc_stat 
  1.87      5.40     1.13  1020397     0.00     0.00  htab_find_slot_with_hash 
  1.74      6.45     1.05  1295674     0.00     0.00  mark_set_1 
  1.67      7.46     1.01   396445     0.00     0.00  iterative_hash_expr 
  1.64      8.45     0.99  2947490     0.00     0.00  bitmap_bit_p 
  1.59      9.41     0.96   321482     0.00     0.00  for_each_rtx 
  1.54     10.34     0.93  1566242     0.00     0.00  bitmap_set_bit 
  1.42     11.20     0.86   770792     0.00     0.00  et_splay 
 
Right now, this function seems to be used only on the tree-ssa 
branch, and mostly in the tree optimizers via iterative_hash_expr: 
 
----------------------------------------------- 
                             1423028             iterative_hash_expr [35] 
                0.00    0.00      40/396445      pre_expression [433] 
                0.00    0.00     162/396445      process_delayed_rename [971] 
                0.03    0.04   10126/396445      gimple_tree_hash [516] 
                0.39    0.67  151915/396445      avail_expr_hash [71] 
                0.60    1.03  234202/396445      true_false_expr_hash [52] 
[35]     4.6    1.01    1.74  396445+1423028 iterative_hash_expr [35] 
                1.65    0.00 2308918/2329935     iterative_hash [53] 
                0.06    0.00  383567/1028690     first_rtl_op [321] 
                0.03    0.00  546018/635717      commutative_tree_code [699] 
                             1423028             iterative_hash_expr [35] 
----------------------------------------------- 
                0.00    0.00     919/2329935     build_type_attribute_variant <cycle 12> [1420]
                0.00    0.00     940/2329935     build_array_type [1299]
                0.00    0.00    4814/2329935     build_function_type <cycle 12> [671]
                0.01    0.00   14344/2329935     type_hash_list [900] 
                1.65    0.00 2308918/2329935     iterative_hash_expr [35] 
[53]     2.8    1.66    0.00 2329935         iterative_hash [53] 
----------------------------------------------- 
 
So ~95% of all iterative_hash_expr calls are from DOM, which could use 
a little help in terms of compilation speed: ~12% for this particular 
test case pt.i. 
 
I also did some coverage testing on iterative_hash: 
 
        -:  794:hashval_t iterative_hash (k_in, length, initval) 
        -:  795:     const PTR k_in;               /* the key */ 
        -:  796:     register size_t  length;      /* the length of the key */ 
        -:  797:     register hashval_t  initval;  /* the previous hash, or an arbitrary value */
 13721488:  798:{
 13721488:  799:  register const unsigned char *k = (const unsigned char *)k_in;
 13721488:  800:  register hashval_t a,b,c,len;
        -:  801:
        -:  802:  /* Set up the internal state */
 13721488:  803:  len = length;
 13721488:  804:  a = b = 0x9e3779b9;  /* the golden ratio; an arbitrary value */
 13721488:  805:  c = initval;           /* the previous hash value */
        -:  806:
        -:  807:  /*---------------------------------------- handle most of the key */
        -:  808:#ifndef WORDS_BIGENDIAN
        -:  809:  /* On a little-endian machine, if the data is 4-byte aligned we can hash
        -:  810:     by word for better speed.  This gives nondeterministic results on
        -:  811:     big-endian machines.  */
 13721488:  812:  if (sizeof (hashval_t) == 4 && (((size_t)k)&3) == 0) 
branch  0 taken 0% 
 13724520:  813:    while (len >= 12)    /* aligned */ 
branch  0 taken 1% 
branch  1 taken 100% 
        -:  814:      { 
     3032:  815:        a += *(hashval_t *)(k+0); 
     3032:  816:        b += *(hashval_t *)(k+4); 
     3032:  817:        c += *(hashval_t *)(k+8); 
     3032:  818:        mix(a,b,c); 
     3032:  819:        k += 12; len -= 12; 
branch  0 taken 100% 
        -:  820:      } 
        -:  821:  else /* unaligned */ 
        -:  822:#endif 
    #####:  823:    while (len >= 12) 
branch  0 never executed 
branch  1 never executed 
        -:  824:      { 
    #####:  825:        a += (k[0] +((hashval_t)k[1]<<8) +((hashval_t)k[2]<<16) +((hashval_t)k[3]<<24));
    #####:  826:        b += (k[4] +((hashval_t)k[5]<<8) +((hashval_t)k[6]<<16) +((hashval_t)k[7]<<24));
    #####:  827:        c += (k[8] +((hashval_t)k[9]<<8) +((hashval_t)k[10]<<16)+((hashval_t)k[11]<<24));
    #####:  828:        mix(a,b,c); 
    #####:  829:        k += 12; len -= 12; 
branch  0 never executed 
        -:  830:      } 
        -:  831: 
        -:  832:  /*------------------------------------- handle the last 11 bytes */
 13721488:  833:  c += length;
 13721488:  834:  switch(len)              /* all the case statements fall through */
branch  0 taken 0% 
branch  1 taken 0% 
branch  2 taken 0% 
branch  3 taken 0% 
branch  4 taken 0% 
branch  5 taken 1% 
branch  6 taken 0% 
branch  7 taken 1% 
branch  8 taken 99% 
branch  9 taken 1% 
branch 10 taken 1% 
branch 11 taken 0% 
branch 12 taken 1% 
        -:  835:    { 
    #####:  836:    case 11: c+=((hashval_t)k[10]<<24); 
    #####:  837:    case 10: c+=((hashval_t)k[9]<<16); 
    #####:  838:    case 9 : c+=((hashval_t)k[8]<<8); 
        -:  839:      /* the first byte of c is reserved for the length */ 
    #####:  840:    case 8 : b+=((hashval_t)k[7]<<24); 
      129:  841:    case 7 : b+=((hashval_t)k[6]<<16); 
      129:  842:    case 6 : b+=((hashval_t)k[5]<<8); 
      181:  843:    case 5 : b+=k[4]; 
 13719971:  844:    case 4 : a+=((hashval_t)k[3]<<24); 
 13719977:  845:    case 3 : a+=((hashval_t)k[2]<<16); 
 13719979:  846:    case 2 : a+=((hashval_t)k[1]<<8); 
 13719979:  847:    case 1 : a+=k[0]; 
        -:  848:      /* case 0: nothing left to add */ 
        -:  849:    } 
 13721488:  850:  mix(a,b,c); 
        -:  851:  /*-------------------------------------------- report the result */
 13721488:  852:  return c; 
        -:  853:} 
 
So it seems that a specialized version for 4 byte objects really would 
help here. 
 
(Xeon is 32bit, so the 8 byte case is important for 64bit targets??) 
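
Editorial sketch of the specialization suggested above. The coverage data shows nearly every call taking the 4-byte tail path, so a fixed-size variant can skip the length loop and the switch. The names hash_bytes_short and hash_word are hypothetical, and the mixing step is written from memory of Bob Jenkins' lookup2 scheme, so treat it as an assumption rather than a copy of hashtab.c. The bytes are assembled in little-endian order, as in the generic tail code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using hashval = std::uint32_t;

// Mixing step in the style of Bob Jenkins' lookup2 (assumed, see lead-in).
inline void mix(hashval &a, hashval &b, hashval &c) {
  a -= b; a -= c; a ^= (c >> 13);
  b -= c; b -= a; b ^= (a << 8);
  c -= a; c -= b; c ^= (b >> 13);
  a -= b; a -= c; a ^= (c >> 12);
  b -= c; b -= a; b ^= (a << 16);
  c -= a; c -= b; c ^= (b >> 5);
  a -= b; a -= c; a ^= (c >> 3);
  b -= c; b -= a; b ^= (a << 10);
  c -= a; c -= b; c ^= (b >> 15);
}

// Generic tail handling for keys shorter than 12 bytes.
hashval hash_bytes_short(const unsigned char *k, std::size_t length,
                         hashval initval) {
  assert(length < 12);
  hashval a = 0x9e3779b9u, b = 0x9e3779b9u, c = initval;
  c += static_cast<hashval>(length);
  switch (length) {  // every case falls through to the next one
    case 11: c += hashval(k[10]) << 24;  // fall through
    case 10: c += hashval(k[9]) << 16;   // fall through
    case 9:  c += hashval(k[8]) << 8;    // fall through
    case 8:  b += hashval(k[7]) << 24;   // fall through
    case 7:  b += hashval(k[6]) << 16;   // fall through
    case 6:  b += hashval(k[5]) << 8;    // fall through
    case 5:  b += k[4];                  // fall through
    case 4:  a += hashval(k[3]) << 24;   // fall through
    case 3:  a += hashval(k[2]) << 16;   // fall through
    case 2:  a += hashval(k[1]) << 8;    // fall through
    case 1:  a += k[0];                  // fall through
    default: break;
  }
  mix(a, b, c);
  return c;
}

// Specialization for a key that is exactly one 32-bit word.
inline hashval hash_word(std::uint32_t key, hashval initval) {
  hashval a = 0x9e3779b9u + key, b = 0x9e3779b9u, c = initval + 4u;
  mix(a, b, c);
  return c;
}
```

For the 8-byte case raised in the parenthetical, the same approach would add the upper word into b before mixing.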
 
Comment 50 Richard Biener 2004-04-10 14:04:02 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

Again the automatic tester at 
http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/monitor-summary.html
caught some compile time regressions for tree-ssa.
While bootstrap time didn't change (much), tramp3d-v3 compile time got a 
hit between Wednesday and Thursday, same for runtime.  You'll also note 
that mainline runtime was improving a lot yesterday.

There aren't many changes on tree-ssa right now, so I suspect 
the change causing the regression is

2004-04-07  Diego Novillo  <dnovillo@redhat.com>

         * gimplify.c (gimplify_call_expr): Remove argument POST_P.
         Update all callers.
         Don't use POST_P when gimplifying the call expression.

(the tree is updated at 3am CEST, incident happened with the update
on Thursday)

Richard.
Comment 51 Diego Novillo 2004-04-10 14:58:59 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
	3.5-tree-ssa 040120

On Sat, 2004-04-10 at 10:04, rguenth at tat dot physik dot uni-tuebingen
dot de wrote:

> There aren't that much changes on tree-ssa right now, so I suspect 
> changes causing the regression be
> 
> 2004-04-07  Diego Novillo  <dnovillo@redhat.com>
> 
>          * gimplify.c (gimplify_call_expr): Remove argument POST_P.
>          Update all callers.
>          Don't use POST_P when gimplifying the call expression.
> 
Hmm, odd.  This is a correctness fix.  Side effects in function call
arguments must occur before the actual call takes place.

What may be happening here is that we are getting fewer commoning
opportunities for call-clobbered variables.  Before, foo (a++) would
expand to:

foo (a);
a = a + 1;

But now, it expands to:

t = a;
a = a + 1;
foo (t);

If 'a' is call-clobbered, the second form will not allow us to common
out 'a + 1' because of the clobbering of 'a' by the call to foo.
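
Editorial sketch of the two expansions above as ordinary C++, to show why the reordering is a correctness fix. The global a stands in for a call-clobbered variable and a_seen_by_foo is a hypothetical probe that records what the callee observes.

```cpp
#include <cassert>

static int a = 0;              // stands in for a call-clobbered variable
static int a_seen_by_foo = 0;  // hypothetical probe, set inside foo

static void foo(int arg) {
  assert(arg == 5);   // the callee receives the old value in every form
  a_seen_by_foo = a;  // but what it sees in 'a' depends on the lowering
}

// The source form: the increment is sequenced before the call.
int source_form() {
  a = 5;
  foo(a++);
  return a_seen_by_foo;
}

// Old expansion: the side effect is delayed until after the call.
int old_lowering() {
  a = 5;
  foo(a);
  a = a + 1;
  return a_seen_by_foo;
}

// New expansion: the side effect happens before the call.
int new_lowering() {
  a = 5;
  int t = a;
  a = a + 1;
  foo(t);
  return a_seen_by_foo;
}
```

The old expansion lets foo observe the stale value of a, which differs from the source form; the new expansion matches it.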

However, it is a bit surprising that this would cause a significant
decline in compile time.  Would you have a pre-patched cc1plus binary to
compare dump files?


Thanks.  Diego.

Comment 52 Richard Biener 2004-04-10 15:20:00 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

dnovillo at redhat dot com wrote:
> ------- Additional Comments From dnovillo at redhat dot com  2004-04-10 14:58 -------
> Subject: Re:  [tree-ssa] Many C++ compile-time regression in
> 	3.5-tree-ssa 040120
> 
> On Sat, 2004-04-10 at 10:04, rguenth at tat dot physik dot uni-tuebingen
> dot de wrote:
> 
> 
>>There aren't that much changes on tree-ssa right now, so I suspect 
>>changes causing the regression be
>>
>>2004-04-07  Diego Novillo  <dnovillo@redhat.com>
>>
>>         * gimplify.c (gimplify_call_expr): Remove argument POST_P.
>>         Update all callers.
>>         Don't use POST_P when gimplifying the call expression.
>>
> 
> Hmm, odd.  This is a correctness fix.  Side effects in function call
> arguments must occur before the actual call takes place.
> 
> What may be happening here is that we are getting fewer commoning
> opportunities for call-clobbered variables.  Before, foo (a++) would
> expand to:
> 
> foo (a);
> a = a + 1;
> 
> But now, it expands to:
> 
> t = a;
> a = a + 1;
> foo (t);
> 
> If 'a' is call-clobbered, the second form will not allow us to common
> out 'a + 1' because of the clobbering of 'a' by the call to foo.
> 
> However, it is a bit surprising that this would cause a significant
> decline in compile time.  Would you have a pre-patched cc1plus binary to
> compare dump files?

Yes, I have cc1plus binaries from all days lying around (though with 
checking disabled).  Just tell me what to do.

Richard.
Comment 53 Richard Biener 2004-04-10 15:36:17 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

> However, it is a bit surprising that this would cause a significant
> decline in compile time.  Would you have a pre-patched cc1plus binary to
> compare dump files?

Ok, I tried to just diff tree-optimized dumps, but noise is papering 
over the differences (temps are differently numbered).

At least, before the compile time increase the dump had 1003736 lines, 
and after it now has 1048682 lines.  So there is a difference.

Richard.
Comment 54 Richard Biener 2004-04-10 15:48:14 UTC
Subject: Re:  [tree-ssa] Many C++ compile-time regression in
 3.5-tree-ssa 040120

> However, it is a bit surprising that this would cause a significant
> decline in compile time.  Would you have a pre-patched cc1plus binary to
> compare dump files?

Cutting off the numbers from the vars and killing <Dxxx> reveals:

@@ -256,6 +256,7 @@ virtual Smarts::Runnable::~Runnable() (t
  {
    bool T.;
    int T.;
+  int (*__vtbl_ptr_type) () * T.;

  <bb 0>:
    this->_vptr.Runnable = &_ZTVN6Smarts8RunnableE[2];

(and similar in all destructors)

  int fillLocStorage(int, Loc<Dim>&, constT1&) [with int Dim = 3, T1 = 
Loc<3>] (currIndex, loc, a)
  {
+  int currIndex.;
    int ;
    int d;
    int T.;
@@ -1581,6 +1503,7 @@ int fillLocStorage(int, Loc<Dim>&, const
    struct Domain<1,DomainTraits<Loc<1> > > * T.;
    struct Loc<1> * T.;
    struct Loc<1> & T.;
+  int currIndex.;
    struct Domain<3,DomainTraits<Loc<3> > > * loc.;
    int retval.;
    int retval.;
@@ -1595,13 +1518,17 @@ int fillLocStorage(int, Loc<Dim>&, const
    i = 0;

  <L0>:;
+  currIndex. = currIndex + 1;
    *(int &)(struct Domain<1,DomainTraits<Loc<1> > > *)(struct Loc<1> 
*)(struct Loc<1> &)((struct Loc<1> *)((long unsigned int)currIndex * 4) 
+ (struct Loc<1> *)(struct UninitializedVector<Loc<1>,3,int> *)(struct 
Domain<3,DomainTraits<Loc<3> > > *)loc) = ((struct 
DomainBase<DomainTraits<Loc<1> > > *)(struct 
Domain<1,DomainTraits<Loc<1> > > *)(struct Loc<1> &)((struct Loc<1> 
*)((long unsigned int)i * 4) + (struct Loc<1> *)(struct 
UninitializedVector<Loc<1>,3,int> *)(struct Domain<3,DomainTraits<Loc<3> 
 > > *)a))->domain_m;
-  currIndex = currIndex + 1;
    i = i + 1;
-  if (i <= 2) goto <L0>; else goto <L10>;
+  if (i <= 2) goto <L13>; else goto <L10>;
+
+<L13>:;
+  currIndex = currIndex.;
+  goto <bb 1> (<L0>);

  <L10>:;
-  return currIndex;
+  return currIndex.;

  }

It looks like DOM is now missing some optimization.

Beyond that, the diff is mostly re-ordering of functions and noise (label 
number changes, bb number changes).  The dump files are huge (both 
around 50MB uncompressed); if you want to download them, I can put them 
in an accessible location.

Comment 55 Giovanni Bajo 2004-06-30 03:06:08 UTC
Karel,

all the main optimization issues that we spotted looking at the MICO 
regressions are supposed to be fixed now. It would be very cool if you could 
prepare an updated performance comparison table between 3.4.0 and today's 
mainline, so that we can check how mainline is doing now.

Thanks
Comment 56 Karel Gardas 2004-07-08 18:16:34 UTC
Subject: Re:  [3.5 Regression] [tree-ssa] Many
 C++ compile-time regression in 3.5-tree-ssa 040120


Giovanni,

I have done a comparison of 3.4.0, 3.4.1RC1 and trunk from 2004-06-30 and
posted all results here: http://gcc.gnu.org/ml/gcc/2004-07/msg00391.html

Cheers,

Karel

Comment 57 Giovanni Bajo 2004-08-30 01:40:40 UTC
Karel, would you mind posting an updated table using a recent mainline? Thanks.
Comment 58 Karel Gardas 2004-08-31 09:15:02 UTC
Subject: Re:  [3.5 Regression] [tree-ssa] Many
 C++ compile-time regression in 3.5-tree-ssa 040120


Hi,

An updated table for GCC 3.4.1 and the main trunk as of 2004-08-30 is here:
http://gcc.gnu.org/ml/gcc/2004-08/msg01594.html

Cheers,
Karel

Comment 59 Andrew Pinski 2004-10-23 21:26:22 UTC
Can you post new results? A huge amount has changed since August 31, and there have 
been some compile-time improvements in that time.
Comment 60 Karel Gardas 2004-10-25 12:03:24 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120


Sure! Here we go: http://gcc.gnu.org/ml/gcc/2004-10/msg00952.html
The results are really promising, although some interesting regressions
are still present.

Cheers,
Karel

Comment 61 Richard Biener 2004-10-25 13:02:05 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120

And
http://gcc.gnu.org/ml/gcc/2004-10/msg00955.html

Comment 62 Karel Gardas 2004-10-25 13:08:59 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120


In recent testing ir.cc seems to be a big culprit. I have attached it,
preprocessed by 4.0.0-041024, for your experiments.

Cheers,
Karel
Comment 63 Karel Gardas 2004-10-25 13:09:01 UTC
Created attachment 7408 [details]
ir.ii.bz2
Comment 64 Karel Gardas 2004-10-25 13:20:50 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120


Updated table with GCC 3.4.2 and 4.0.0-041024 results is available here:
http://gcc.gnu.org/ml/gcc/2004-10/msg00952.html -- still some regressions
mainly on -O1 and -O2.

Cheers,
Karel

Comment 65 Andrew Pinski 2004-11-16 01:51:52 UTC
ir.cc           47.17   69.26   -31.89  72.42   129.49  -44.07  100.1   165.27  -39.43
I just sped up ir.cc a little with my patch to cp-gimplify.c (which was committed)
Reference: http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01247.html

Also, my patch to remove a number of calls to is_gimple_reg speeds up optimizations:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01284.html
Comment 66 Andrew Pinski 2004-11-18 21:12:43 UTC
Hmm, with the mainline on PPC-darwin for ir.ii at -O0 we are faster than both 3.3 and 3.1.
3.1:
51.260u 2.110s 0:56.27 94.8%    0+0k 0+7io 0pf+0w
3.3:
46.000u 3.600s 0:50.91 97.4%    0+0k 0+7io 0pf+0w
mainline:
39.730u 5.270s 0:48.27 93.2%    0+0k 0+8io 0pf+0w

Even at -O1 we are faster than 3.3:
mainline:
70.860u 5.010s 1:18.76 96.3%    0+0k 0+11io 0pf+0w
3.3:
72.650u 13.250s 1:29.99 95.4%   0+0k 0+7io 0pf+0w
For -O2 we are only 1 second slower than 3.3:
mainline:
99.720u 5.510s 1:54.78 91.6%    0+0k 0+13io 0pf+0w
3.3:
98.610u 38.800s 2:25.59 94.3%   0+0k 0+15io 0pf+0w

Could you check again on your platform?
Comment 67 Karel Gardas 2004-11-19 11:14:57 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120


I've tested 3.4.2, 4.0.0 (20041026) and 4.0.0 (20041118) with the following
results:

3.4.2:

c++  -I../include  -time -O0 -Wall   -DPIC -fPIC  -c ir.cc -o ir.pic.o
# cc1plus 46.98 0.53
# as 4.62 0.22

peak memory consumed: 99MB

4.0.0 (20041026):

c++  -I../include  -time -O0 -Wall   -DPIC -fPIC  -c ir.cc -o ir.pic.o
# cc1plus 67.13 2.05
# as 5.98 0.30

peak memory consumed: 243MB

4.0.0 (20041118):

c++  -I../include  -time -O0 -Wall   -DPIC -fPIC  -c ir.cc -o ir.pic.o
# cc1plus 66.47 1.97
# as 5.84 0.27

peak memory consumed: 243MB


so there are still both compile-time and memory usage regressions present
on mainline.

The reason why you see a speed-up in comparison with 3.1/3.3 is that
3.4.2 is a much faster compiler (at least from the point of view of the
MICO sources).
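
To put the three runs side by side, here is a rough sketch that expresses each 4.0.0 snapshot relative to 3.4.2. It assumes the first number on each "# cc1plus" line is user time in seconds; the dictionary layout and key names are purely illustrative.

```python
# Figures copied from the three runs above.
# Assumption: the first number after "# cc1plus" is user time in seconds.
runs = {
    "3.4.2": {"cc1plus_user_s": 46.98, "peak_mb": 99},
    "4.0.0 (20041026)": {"cc1plus_user_s": 67.13, "peak_mb": 243},
    "4.0.0 (20041118)": {"cc1plus_user_s": 66.47, "peak_mb": 243},
}

baseline = runs["3.4.2"]
ratios = {}
for name, figures in runs.items():
    ratios[name] = {
        "time": figures["cc1plus_user_s"] / baseline["cc1plus_user_s"],
        "memory": figures["peak_mb"] / baseline["peak_mb"],
    }
    print(f"{name}: {ratios[name]['time']:.2f}x time, "
          f"{ratios[name]['memory']:.2f}x peak memory")
```

On these figures both 4.0.0 snapshots need roughly 1.4x the cc1plus time and about 2.5x the peak memory of 3.4.2.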

Cheers,
Karel

Comment 68 Steven Bosscher 2004-11-24 23:22:58 UTC
Created attachment 7601 [details]
Top 10 functions for all preprocessed mico files at -O2

The attachment is a file with the top 10 from gprof profiles.
The base compiler is GCC 3.3 (SUSE), the profiling compiler is
"GNU C++ version 4.0.0 20041124 (experimental) (i686-pc-linux-gnu)"

If anyone wants to see a complete gprof profile, ping me.
Comment 69 Andrew Pinski 2004-11-25 00:31:18 UTC
Created attachment 7602 [details]
profile report using shark

This is a run of 4 compilations of current.cc.ii at -O0.
Comment 70 Karel Gardas 2004-11-29 19:56:55 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120


I've updated the comparison table for the 4.0.0 20041126 compiler version. You
can find it here: http://gcc.gnu.org/ml/gcc/2004-11/msg01157.html

Cheers,
Karel

Comment 71 Jeffrey A. Law 2004-11-29 20:05:44 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
	C++ compile-time regression in 4.0-tree-ssa 040120

On Mon, 2004-11-29 at 19:56 +0000, kgardas at objectsecurity dot com
wrote:
> ------- Additional Comments From kgardas at objectsecurity dot com  2004-11-29 19:56 -------
> Subject: Re:  [4.0 Regression] [tree-ssa] Many
>  C++ compile-time regression in 4.0-tree-ssa 040120
> 
> 
> I've updated comparison table for 4.0.0 20041126 compiler version. You can
> find it here: http://gcc.gnu.org/ml/gcc/2004-11/msg01157.html
BTW, if I'm reading that table correctly, overall the compile time 
performance of mainline is actually on-par or better than 3.4 at
-O0, -O1 and -O2 for this test.  That's not to diminish the need to
work on ir.cc, but things appear to be heading the right direction.

jeff


Comment 72 Karel Gardas 2004-11-29 21:04:39 UTC
Subject: Re:  [4.0 Regression] [tree-ssa] Many
 C++ compile-time regression in 4.0-tree-ssa 040120

On Mon, 29 Nov 2004, law at redhat dot com wrote:

> > I've updated comparison table for 4.0.0 20041126 compiler version. You can
> > find it here: http://gcc.gnu.org/ml/gcc/2004-11/msg01157.html
> BTW, if I'm reading that table correctly, overall the compile time
> performance of mainline is actually on-par or better than 3.4 at
> -O0, -O1 and -O2 for this test.

Yes, you are 100% right.

Karel

Comment 73 Andrew Pinski 2004-12-13 03:03:48 UTC
I noticed that for ir.ii, there is some compile time spent in GC, which means we have a memory problem.
I have a patch which should help a little with the memory problem, but not that much.
Comment 74 Andrew Pinski 2004-12-13 06:38:06 UTC
Note for ir.ii at -O0, we spend more time in local alloc and global alloc with the mainline than with 3.3.2:
2.41 vs 3.86 and 3.74 vs 6.07, so someone who knows local alloc and global alloc might want to look
into this.  This is on powerpc-darwin, by the way; on x86 there might be a different problem.  Someone
should do a -ftime-report with both the mainline and 3.4.x to see if this is also true on x86.
Comment 75 Andrew Pinski 2004-12-13 06:59:39 UTC
For -O1, integration is slower in the mainline compared with 3.3.2, 2.46 vs 1.51.
global alloc is also slower: 3.21 vs 2.38.
Speeding those up will help.

This is again on powerpc-darwin.  The reason why I thought 3.3.2 was much slower than the mainline was
because the GC limits were low for 3.3.2 on darwin.
Comment 76 Karel Gardas 2004-12-28 21:03:46 UTC
Hello,

New comparison is here:
http://gcc.gnu.org/ml/gcc/2004-12/msg01157.html

Cheers,
Karel
Comment 77 Steven Bosscher 2005-01-01 19:54:21 UTC
Created attachment 7858 [details]
A patch to turn off local-alloc, which buys 5% for ir.cc

Turning off local-alloc as in the attached patch makes compiling
ir.cc 5% faster for me on powerpc-linux (from 30s to 28.5s).

It seems like a good idea anyway to turn off most of local-alloc,
turning it off improves SPEC scores too.  I'm not sure why gcc
still has it at all...
Comment 78 Steven Bosscher 2005-01-26 10:20:58 UTC
Bah, I hate profiles for "cc1plus -O2 ir.ii" without peaks: 
 
CPU: P4 / Xeon with 2 hyper-threads, speed 3194.17 MHz (estimated) 
Counted GLOBAL_POWER_EVENTS events (time during which processor is not 
stopped) with a unit mask of 0x01 (mandatory) count 100000 
samples  %        symbol name 
78641     5.2991  ggc_alloc_stat 
28267     1.9047  ggc_set_mark 
26230     1.7675  splay_tree_splay_helper 
25018     1.6858  walk_tree 
24322     1.6389  cgraph_node_for_asm 
20428     1.3765  gt_ggc_mx_lang_tree_node 
19586     1.3198  htab_find_slot_with_hash 
16006     1.0785  compute_immediate_uses 
15133     1.0197  get_stmt_operands 
14481     0.9758  constrain_operands 
13414     0.9039  insert_aux 
13308     0.8967  decl_assembler_name_equal 
12795     0.8622  find_reloads 
12052     0.8121  decl_assembler_name 
11986     0.8077  cse_insn 
11743     0.7913  record_reg_classes 
11707     0.7889  bitmap_set_bit 
11630     0.7837  ix86_decompose_address 
11610     0.7823  mark_set_1 
11538     0.7775  optimize_stmt 
11201     0.7548  iterative_hash_expr 
10615     0.7153  cp_walk_subtrees 
10235     0.6897  rewrite_stmt 
9892      0.6666  for_each_rtx_1 
9816      0.6614  get_expr_operands 
9813      0.6612  invalidate 
9302      0.6268  pointer_set_insert 
9293      0.6262  mark_def_sites 
8570      0.5775  reg_scan_mark_refs 
8503      0.5730  propagate_necessity 
8424      0.5676  is_gimple_reg 
8322      0.5608  compute_may_aliases 
 
No single problem to focus on... 
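
One way to quantify how flat this profile is: add up the percentages of the listed symbols. The sketch below does that for the rows copied from the listing. Treating the ggc_* and gt_ggc_* symbols as garbage-collector work is an assumption based on their names, and the helper name is made up.

```python
# Rows copied from the oprofile listing above: samples, percent, symbol.
PROFILE = """\
78641     5.2991  ggc_alloc_stat
28267     1.9047  ggc_set_mark
26230     1.7675  splay_tree_splay_helper
25018     1.6858  walk_tree
24322     1.6389  cgraph_node_for_asm
20428     1.3765  gt_ggc_mx_lang_tree_node
19586     1.3198  htab_find_slot_with_hash
16006     1.0785  compute_immediate_uses
15133     1.0197  get_stmt_operands
14481     0.9758  constrain_operands
13414     0.9039  insert_aux
13308     0.8967  decl_assembler_name_equal
12795     0.8622  find_reloads
12052     0.8121  decl_assembler_name
11986     0.8077  cse_insn
11743     0.7913  record_reg_classes
11707     0.7889  bitmap_set_bit
11630     0.7837  ix86_decompose_address
11610     0.7823  mark_set_1
11538     0.7775  optimize_stmt
11201     0.7548  iterative_hash_expr
10615     0.7153  cp_walk_subtrees
10235     0.6897  rewrite_stmt
9892      0.6666  for_each_rtx_1
9816      0.6614  get_expr_operands
9813      0.6612  invalidate
9302      0.6268  pointer_set_insert
9293      0.6262  mark_def_sites
8570      0.5775  reg_scan_mark_refs
8503      0.5730  propagate_necessity
8424      0.5676  is_gimple_reg
8322      0.5608  compute_may_aliases
"""

def parse_profile(text):
    """Return (symbol, percent) tuples from 'samples percent symbol' rows."""
    parsed = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            parsed.append((fields[2], float(fields[1])))
    return parsed

rows = parse_profile(PROFILE)
listed_share = sum(pct for _, pct in rows)
# Assumption: symbols with these prefixes are garbage-collector routines.
gc_share = sum(pct for sym, pct in rows if sym.startswith(("ggc_", "gt_ggc_")))
print(f"{len(rows)} symbols cover {listed_share:.2f}% of samples")
print(f"GC-looking symbols among them: {gc_share:.2f}%")
```

The 32 listed symbols together cover only about a third of all samples, and the three GC-looking ones add up to somewhat under 9%, which fits the "no peaks" observation.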
Comment 79 Richard Biener 2005-01-26 10:24:56 UTC
Subject: Re:  [4.0 Regression] Many C++ compile-time
 regressions for MICO's ORB code

> Bah, I hate profiles for "cc1plus -O2 ir.ii" without peaks:
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 3194.17 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not
> stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %        symbol name
> 25018     1.6858  walk_tree
> 24322     1.6389  cgraph_node_for_asm
> 19586     1.3198  htab_find_slot_with_hash

Do you have numbers on whether we are memory-bandwidth limited here?  If
not, we might micro-optimize hash table access somewhat more.

Comment 80 Karel Gardas 2005-01-26 10:24:59 UTC
Subject: Re:  [4.0 Regression] Many C++ compile-time
 regressions for MICO's ORB code

On Wed, 26 Jan 2005, steven at gcc dot gnu dot org wrote:

>
> ------- Additional Comments From steven at gcc dot gnu dot org  2005-01-26 10:20 -------
> Bah, I hate profiles for "cc1plus -O2 ir.ii" without peaks:

True. If I may add something, I would recommend looking at why ir.cc
regresses so much in memory consumption in comparison with 3.4.x. If you
solve this, perhaps the compile-time regressions will go away too.

Thanks,
Karel

Comment 81 Karel Gardas 2005-01-26 10:46:25 UTC
Subject: Re:  [4.0 Regression] Many C++ compile-time
 regressions for MICO's ORB code


Just to note something about 4.0.0 and 3.4.2 memory usage while compiling
ir.cc.

3.4.2: it quickly grows to 90MB RAM, then stays there for a long
time and then goes quickly to 99MB RAM where it finishes -- i.e. the majority
of the time is spent with ~90MB or less of consumed memory

4.0.0: in comparison with 3.4.2, it grows to 243MB RAM, stays
there for some time (not the majority, but let's say 1/3 of the compilation
certainly), then it goes back to 200MB RAM consumed and then it finishes.
Hard to tell the average memory usage here, but I think it is about 200MB.

My 4.0.0 here is quite old (20041228), but I guess the picture is still the
same.

Thanks,
Karel

Comment 82 Steven Bosscher 2005-01-26 11:36:39 UTC
It would be a Good Thing to look at the hash function.  The number of
collisions per search is extremely high:

String pool
entries         128928
identifiers     128928 (100.00%)
slots           262144
bytes           1846k (142k overhead)
table size      2048k
coll/search     0.8518
ins/search      0.2747
avg. entry      14.66 bytes (+/- 17.60)
longest entry   830
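
As a quick consistency check on these statistics, the following sketch derives the table's load factor and the average entry size from the reported totals. It assumes the "k" suffix means 1024 bytes; the variable names are illustrative only.

```python
# Figures copied from the string pool statistics above.
entries = 128928
slots = 262144
pool_kbytes = 1846  # "bytes 1846k"; assumption: k means 1024 bytes

load_factor = entries / slots
avg_entry_bytes = pool_kbytes * 1024 / entries

print(f"load factor:     {load_factor:.4f}")
print(f"avg entry bytes: {avg_entry_bytes:.2f}")
```

The table is a little under half full, and the derived average agrees with the reported "avg. entry 14.66 bytes", which suggests the 1024-byte reading of "k" is the right one.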


There is also still a lot of memory allocated at the end of the compilation:

Memory still allocated at the end of the compilation process
Size   Allocated        Used    Overhead
8           4096         200         120
16          4264k       1211k         91k
64            29M         10M        476k
128         3920k       1472k         53k
256         1240k        519k         16k
512         4084k       2026k         55k
1024         488k        390k       6832
2048        2628k       1998k         35k
4096        1160k       1160k         15k
8192         376k        368k       2632
16384        304k        288k       1064
32768        160k        128k        280
65536        704k        640k        616
131072        384k        384k        168
262144        512k        512k        112
524288        512k        512k         56
112           26M         19M        373k
208           63M         43M        883k
48            27M         14M        443k
32            18M         10M        337k
80            13M         13M        186k
Total        199M        122M       2982k
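
To see where the gap between allocated and used memory is largest, here is a small sketch over the rows of the table that are reported in megabytes. Ranking by "allocated minus used" is just one way to read the table, not something the report itself prescribes.

```python
# (size class in bytes, allocated MB, used MB), copied from the rows of the
# table above that are given in megabytes.
rows_mb = [
    (64, 29, 10),
    (112, 26, 19),
    (208, 63, 43),
    (48, 27, 14),
    (32, 18, 10),
    (80, 13, 13),
]

# Rank size classes by how much allocated memory is not in use.
slack = sorted(
    ((size, allocated - used) for size, allocated, used in rows_mb),
    key=lambda item: item[1],
    reverse=True,
)
for size, unused_mb in slack:
    print(f"size {size:>4}: {unused_mb}M allocated but unused")
```

By this measure the 208-byte and 64-byte size classes stand out, with about 20M and 19M allocated but not in use.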

Note especially the 43MB.  All of that is in the et-forest alloc-pools.
Perhaps we should allocate/free them per function.

Finally, we allocate a lot of SSA_NAMEs, and varrays are problematic as
always:
source location                        Garbage            Freed              Leak             Overhead           Times
varray.c:170 (varray_grow)             39485908: 3.3%     280747780:47.6%    229448: 0.2%     80866528:32.0%     552682
tree-ssanames.c:197 (make_ssa_name)    94292264: 7.9%             0: 0.0%         0: 0.0%      8572024: 3.4%    1071503
Comment 83 Karel Gardas 2005-01-31 09:31:05 UTC
Hello,

New timings for the MICO ORB sources are here:
http://gcc.gnu.org/ml/gcc/2005-01/msg01714.html

Cheers,
Karel
Comment 84 arend.bayer@web.de 2005-02-01 13:39:39 UTC
Karel, ir.ii does not compile since Mark Mitchell's patch to disallow floating
point literals in constant expressions went in. I think if you could
regenerate the preprocessed source, it should work again.
Comment 85 Karel Gardas 2005-03-02 20:09:11 UTC
New results measured for MICO compiled with 4.0.0 20050301 are posted here:
http://gcc.gnu.org/ml/gcc/2005-03/msg00132.html

Cheers,
Karel
Comment 86 Giovanni Bajo 2005-03-02 21:32:13 UTC
I had a quick look at this and I can't find anything that is not already
fixed, especially after Karel's last results. Also, having a bug with 85
comments is a good way to make developers run, so let's close this as fixed as
well. If anyone on the CC list believes there is something mentioned here that
still needs fixing, it is better to create a new bug.