GCC Bugzilla – Bug 13776
[4.0/4.1 Regression] Many C++ compile-time regressions for MICO's ORB code
Last modified: 2005-03-02 21:32:13 UTC
Hello, there are many C++ compile-time regressions in the tree-ssa branch compared with gcc-3_4-branch. I tested it on MICO's ORB core sources and sent a more detailed report to the gcc developer mailing list: http://gcc.gnu.org/ml/gcc/2004-01/msg01516.html If you are curious, you can download a tarball of the preprocessed files here: http://www.mico.org/~karel/orb-ii-gcc35_040120.tar.bz2 Cheers, Karel
*** This bug has been marked as a duplicate of 13775 ***
Sorry, I don't understand -- this bug report is about a regression in 3.5-tree-ssa, while 13775 is about a regression in 3.4.0. I thought they should be separate bug reports for different sets of people (working on different branches). Should I reopen the bug in this case? Thanks, Karel
I think Wolfgang's rationale is that the problem is compilation speed, and fixing that problem will fix both bugs. Not sure I agree though...
Hmm, well, fixing 13775 might also fix 13777, but certainly not this problem, which is a regression in tree-ssa. So I am reopening this bug, especially since I know that the tree-ssa developers are curious to see such regressions.
I am wondering how much of this is due to the current work that was done after the last merge into the tree-ssa.
My bad, I misread it. Sorry W.
Measurements made in comparison with 3.4.0 040114.
Subject: [Fwd: [tree-ssa] 20% compile time regression vs. 3.4]

Adding to PR notes. More related C++ compile time regressions.

Diego.

-----Forwarded Message-----
From: Richard Guenther <rguenth@tat.physik.uni-tuebingen.de>
To: gcc@gcc.gnu.org
Subject: [tree-ssa] 20% compile time regression vs. 3.4
Date: Wed, 03 Mar 2004 15:46:01 +0100

Hi!

I thought it was time for another 3.4 vs. tree-ssa compile-time comparison. For -O2 compile-time we regressed quite a bit (20%) with the main problem areas are (first 3.4, second tree-ssa):

garbage collection    :  12.19 ( 7%) usr   0.00 ( 0%) sys  12.50 ( 7%) wall
garbage collection    :  17.26 ( 8%) usr   0.02 ( 0%) sys  17.45 ( 8%) wall

tree-ssa uses about double amount of memory

parser                :  14.59 ( 9%) usr   1.26 (27%) sys  16.41 ( 9%) wall
parser                :  18.29 ( 8%) usr   1.42 (27%) sys  19.94 ( 9%) wall

I cannot make any sense out of this - are there significant changes to the parser!? Maybe that-much larger libstdc++?

integration           :  17.86 (11%) usr   0.29 ( 6%) sys  18.34 (10%) wall
integration           :  21.62 (10%) usr   0.18 ( 3%) sys  22.19 (10%) wall

probably different inlining choices

and finally some tree-ssa optimizer numbers stick out

tree gimplify         :   3.39 ( 2%) usr   0.04 ( 1%) sys   3.48 ( 1%) wall
tree eh               :   2.71 ( 1%) usr   0.01 ( 0%) sys   2.77 ( 1%) wall
tree CFG construction :   1.69 ( 1%) usr   0.12 ( 2%) sys   1.87 ( 1%) wall
tree CFG cleanup      :   2.89 ( 1%) usr   0.02 ( 0%) sys   2.98 ( 1%) wall
tree PTA              :   0.49 ( 0%) usr   0.03 ( 1%) sys   0.52 ( 0%) wall
tree alias analysis   :   0.71 ( 0%) usr   0.01 ( 0%) sys   0.75 ( 0%) wall
tree PHI insertion    :   2.14 ( 1%) usr   0.04 ( 1%) sys   2.25 ( 1%) wall
tree SSA rewrite      :   2.94 ( 1%) usr   0.01 ( 0%) sys   3.03 ( 1%) wall
tree SSA other        :   3.77 ( 2%) usr   0.33 ( 6%) sys   4.17 ( 2%) wall
tree operand scan     :   2.95 ( 1%) usr   0.46 ( 8%) sys   3.51 ( 2%) wall
dominator optimization:  14.06 ( 6%) usr   0.20 ( 4%) sys  14.60 ( 6%) wall
tree SRA              :   0.29 ( 0%) usr   0.00 ( 0%) sys   0.31 ( 0%) wall
tree CCP              :   2.29 ( 1%) usr   0.00 ( 0%) sys   2.39 ( 1%) wall
tree split crit edges :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.28 ( 0%) wall
tree PRE              :   6.11 ( 3%) usr   0.06 ( 1%) sys   6.40 ( 3%) wall
tree linearize phis   :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
tree forward propagate:   1.37 ( 1%) usr   0.00 ( 0%) sys   1.42 ( 1%) wall
tree conservative DCE :   2.71 ( 1%) usr   0.02 ( 0%) sys   2.80 ( 1%) wall
tree aggressive DCE   :   1.40 ( 1%) usr   0.00 ( 0%) sys   1.45 ( 1%) wall
tree DSE              :   3.30 ( 2%) usr   0.03 ( 1%) sys   3.42 ( 1%) wall
tree copy headers     :   1.80 ( 1%) usr   0.01 ( 0%) sys   1.84 ( 1%) wall
tree SSA to normal    :   3.18 ( 1%) usr   0.13 ( 2%) sys   3.39 ( 1%) wall
tree NRV optimization :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
tree rename SSA copies:   0.83 ( 0%) usr   0.03 ( 1%) sys   0.88 ( 0%) wall

namely DOM (again) and PRE.

This is with the famous tramp3d-v2.cpp testcase you can find at http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/tramp3d-v2.cpp.gz

g++-ssa (GCC) 3.5-tree-ssa 20040303 (merged 20040227)
g++ (GCC) 3.4.0 20040301 (prerelease)

compiled with -O2 -c tramp3d-v2.cpp -Dleafify=fooblah -ftime-report to disable leafify effects. The 3.4 compiler was profiledbootstrapped while the ssa one was only bootstrapped. Of course checking was disabled.

Thanks,
Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
Testcase for the last report can be found attached to PR14408.
*** Bug 14408 has been marked as a duplicate of this bug. ***
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

Compilation times in mode that matters to me (leafify enabled) degraded half an order of magnitude:

g++-ssa (GCC) 3.5-tree-ssa 20040311 (merged 20040307)

bellatrix:/tmp$ g++-ssa -O2 -o tramp3d-v2 tramp3d-v2.cpp -static -ftime-report

Execution times (seconds)
garbage collection    :  38.48 ( 3%) usr   0.15 ( 1%) sys  38.90 ( 3%) wall
callgraph construction:   1.49 ( 0%) usr   0.00 ( 0%) sys   1.49 ( 0%) wall
callgraph optimization:   1.54 ( 0%) usr   0.08 ( 0%) sys   1.63 ( 0%) wall
cfg construction      :   1.35 ( 0%) usr   0.17 ( 1%) sys   1.52 ( 0%) wall
cfg cleanup           :   5.68 ( 0%) usr   0.03 ( 0%) sys   5.74 ( 0%) wall
trivially dead code   :   5.11 ( 0%) usr   0.01 ( 0%) sys   5.16 ( 0%) wall
life analysis         :   9.41 ( 1%) usr   0.05 ( 0%) sys   9.59 ( 1%) wall
life info update      :   5.78 ( 0%) usr   0.00 ( 0%) sys   5.81 ( 0%) wall
alias analysis        :   7.50 ( 1%) usr   0.03 ( 0%) sys   7.58 ( 1%) wall
register scan         :   3.92 ( 0%) usr   0.01 ( 0%) sys   3.95 ( 0%) wall
rebuild jump labels   :   1.37 ( 0%) usr   0.00 ( 0%) sys   1.41 ( 0%) wall
preprocessing         :   0.51 ( 0%) usr   0.10 ( 0%) sys   0.64 ( 0%) wall
parser                :  18.60 ( 2%) usr   1.47 ( 7%) sys  20.14 ( 2%) wall
name lookup           :   6.55 ( 1%) usr   1.53 ( 7%) sys   8.10 ( 1%) wall
integration           :  67.76 ( 6%) usr   1.46 ( 7%) sys  69.59 ( 6%) wall
tree gimplify         :   3.44 ( 0%) usr   0.04 ( 0%) sys   3.50 ( 0%) wall
tree eh               :   7.64 ( 1%) usr   0.25 ( 1%) sys   7.96 ( 1%) wall
tree CFG construction :   4.54 ( 0%) usr   0.53 ( 3%) sys   5.07 ( 0%) wall
tree CFG cleanup      :   9.67 ( 1%) usr   0.08 ( 0%) sys   9.81 ( 1%) wall
tree PTA              :   1.26 ( 0%) usr   0.05 ( 0%) sys   1.31 ( 0%) wall
tree alias analysis   :   1.37 ( 0%) usr   0.01 ( 0%) sys   1.38 ( 0%) wall
tree PHI insertion    :  74.91 ( 6%) usr   0.24 ( 1%) sys  75.62 ( 6%) wall
tree SSA rewrite      :   7.46 ( 1%) usr   0.21 ( 1%) sys   7.72 ( 1%) wall
tree SSA other        :  10.67 ( 1%) usr   0.79 ( 4%) sys  11.58 ( 1%) wall
tree operand scan     :   6.77 ( 1%) usr   0.61 ( 3%) sys   7.41 ( 1%) wall
dominator optimization:  46.56 ( 4%) usr   1.54 ( 8%) sys  48.39 ( 4%) wall
tree SRA              :   0.79 ( 0%) usr   0.02 ( 0%) sys   0.83 ( 0%) wall
tree CCP              :   4.88 ( 0%) usr   0.02 ( 0%) sys   4.97 ( 0%) wall
tree split crit edges :   0.64 ( 0%) usr   0.06 ( 0%) sys   0.70 ( 0%) wall
tree PRE              : 583.13 (49%) usr   6.18 (30%) sys 592.99 (48%) wall
tree linearize phis   :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall
tree forward propagate:   3.51 ( 0%) usr   0.00 ( 0%) sys   3.53 ( 0%) wall
tree conservative DCE :   6.95 ( 1%) usr   0.08 ( 0%) sys   7.05 ( 1%) wall
tree aggressive DCE   :   2.89 ( 0%) usr   0.03 ( 0%) sys   2.93 ( 0%) wall
tree DSE              :   6.33 ( 1%) usr   0.19 ( 1%) sys   6.56 ( 1%) wall
tree copy headers     :   5.00 ( 0%) usr   0.05 ( 0%) sys   5.09 ( 0%) wall
tree SSA to normal    :   9.60 ( 1%) usr   0.42 ( 2%) sys  10.09 ( 1%) wall
tree rename SSA copies:   2.11 ( 0%) usr   0.05 ( 0%) sys   2.17 ( 0%) wall
dominance frontiers   :   0.86 ( 0%) usr   0.00 ( 0%) sys   0.89 ( 0%) wall
control dependences   :   0.49 ( 0%) usr   0.00 ( 0%) sys   0.51 ( 0%) wall
expand                :  42.02 ( 3%) usr   1.38 ( 7%) sys  43.63 ( 4%) wall
varconst              :   0.82 ( 0%) usr   0.01 ( 0%) sys   0.83 ( 0%) wall
jump                  :   8.24 ( 1%) usr   0.36 ( 2%) sys   8.64 ( 1%) wall
CSE                   :  14.22 ( 1%) usr   0.08 ( 0%) sys  14.39 ( 1%) wall
global CSE            :  67.80 ( 6%) usr   0.84 ( 4%) sys  69.06 ( 6%) wall
loop analysis         :  11.90 ( 1%) usr   0.02 ( 0%) sys  12.02 ( 1%) wall
bypass jumps          :   2.48 ( 0%) usr   0.12 ( 1%) sys   2.60 ( 0%) wall
CSE 2                 :   6.33 ( 1%) usr   0.04 ( 0%) sys   6.38 ( 1%) wall
branch prediction     :   8.78 ( 1%) usr   0.03 ( 0%) sys   8.88 ( 1%) wall
flow analysis         :   0.28 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall
combiner              :   6.32 ( 1%) usr   0.08 ( 0%) sys   6.46 ( 1%) wall
if-conversion         :   1.72 ( 0%) usr   0.02 ( 0%) sys   1.74 ( 0%) wall
regmove               :   2.47 ( 0%) usr   0.00 ( 0%) sys   2.48 ( 0%) wall
local alloc           :   6.16 ( 1%) usr   0.03 ( 0%) sys   6.25 ( 1%) wall
global alloc          :  13.97 ( 1%) usr   0.21 ( 1%) sys  14.22 ( 1%) wall
reload CSE regs       :   5.61 ( 0%) usr   0.07 ( 0%) sys   5.75 ( 0%) wall
flow 2                :   1.31 ( 0%) usr   0.07 ( 0%) sys   1.41 ( 0%) wall
if-conversion 2       :   0.90 ( 0%) usr   0.00 ( 0%) sys   0.91 ( 0%) wall
peephole 2            :   0.94 ( 0%) usr   0.02 ( 0%) sys   0.97 ( 0%) wall
rename registers      :   1.66 ( 0%) usr   0.07 ( 0%) sys   1.75 ( 0%) wall
scheduling 2          :   8.51 ( 1%) usr   0.13 ( 1%) sys   8.67 ( 1%) wall
machine dep reorg     :   1.74 ( 0%) usr   0.00 ( 0%) sys   1.76 ( 0%) wall
reorder blocks        :   1.06 ( 0%) usr   0.04 ( 0%) sys   1.10 ( 0%) wall
shorten branches      :   1.87 ( 0%) usr   0.05 ( 0%) sys   1.94 ( 0%) wall
reg stack             :   0.36 ( 0%) usr   0.00 ( 0%) sys   0.36 ( 0%) wall
final                 :   2.55 ( 0%) usr   0.18 ( 1%) sys   2.74 ( 0%) wall
symout                :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
rest of compilation   :   4.92 ( 0%) usr   0.07 ( 0%) sys   4.99 ( 0%) wall
TOTAL                 :1201.61            20.47          1229.69

Look at the PRE times!!! Also, the resulting binary segfaults and as such is miscompiled (for both leafify enabled and disabled compilation). Ugh.

For reference, the leafify patch still sits at http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/leafify-ssa-2

Building an instrumented compiler now.

Richard.
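For readers who want to rank the passes in a report like the one above, here is a minimal sketch of a parser for -ftime-report rows. The regular expression and the helper name `parse_time_report` are illustrative assumptions, not part of GCC; the three sample rows and the TOTAL value are copied from the report in this comment.

```python
import re

# One row looks like: "<pass name> : <usr> (<pct>%) usr <sys> (<pct>%) sys <wall> (<pct>%) wall"
ROW = re.compile(
    r"([A-Za-z][A-Za-z0-9 -]*?)\s*:\s*"
    r"(\d+\.\d+) \(\s*\d+%\) usr\s+"
    r"(\d+\.\d+) \(\s*\d+%\) sys\s+"
    r"(\d+\.\d+) \(\s*\d+%\) wall"
)

def parse_time_report(text):
    """Return {pass name: (usr, sys, wall)} for every row found in text."""
    return {
        name.strip(): (float(usr), float(sys_), float(wall))
        for name, usr, sys_, wall in ROW.findall(text)
    }

# Three rows copied from the report above, joined the way the flattened text has them.
sample = (
    "tree split crit edges : 0.64 ( 0%) usr 0.06 ( 0%) sys 0.70 ( 0%) wall "
    "tree PRE : 583.13 (49%) usr 6.18 (30%) sys 592.99 (48%) wall "
    "tree linearize phis : 0.08 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall"
)
report = parse_time_report(sample)

total_usr = 1201.61  # TOTAL usr seconds from the report above
pre_share = report["tree PRE"][0] / total_usr
print(f"tree PRE: {pre_share:.1%} of user time")
```

The computed share is just under half of the user time, which agrees with the rounded 49% printed in the report.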
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

Richard Guenther wrote:
> Compilation times in mode that matters to me (leafify enabled) degraded
> half an order of magnitude:
>
> g++-ssa (GCC) 3.5-tree-ssa 20040311 (merged 20040307)
>
> bellatrix:/tmp$ g++-ssa -O2 -o tramp3d-v2 tramp3d-v2.cpp -static
> -ftime-report

instrumented compiler gives:

Flat profile:

Each sample counts as 0.01 seconds.
   %  cumulative     self               self    total
  time   seconds  seconds     calls  Ks/call  Ks/call  name
 15.51    183.22   183.22    726080     0.00     0.00  process_left_occs_and_kills
  8.55    284.25   101.03     16644     0.00     0.00  create_and_insert_occ_in_preorder_dt_order
  6.72    363.67    79.42    184430     0.00     0.00  compute_global_livein
  6.54    440.93    77.26     16644     0.00     0.00  rename_1
  3.13    477.86    36.93     16644     0.00     0.00  clear_all_eref_arrays
  3.05    513.86    36.00     16644     0.00     0.00  compute_down_safety
  2.49    543.32    29.46 152057062     0.00     0.00  expr_lexically_eq
  2.09    568.00    24.68    201843     0.00     0.00  cgraph_remove_node
  1.89    590.33    22.33    432753     0.00     0.00  alloc_page
  1.70    610.44    20.11                              eref_compare
  1.70    630.47    20.03    158882     0.00     0.00  compute_transp
  1.46    647.70    17.23    416131     0.00     0.00  cgraph_remove_edge
  1.42    664.43    16.73  14808480     0.00     0.00  gt_ggc_mx_lang_tree_node
  1.13    677.82    13.39 226466163     0.00     0.00  ggc_set_mark
  1.07    690.43    12.61      1482     0.00     0.00  collect_expressions
  0.90    701.05    10.62  15658237     0.00     0.00  walk_tree

i.e. PRE seems to do something very stupid?

Richard.
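The Ks/call columns in this profile all round to 0.00, so the per-call cost has to be recovered from the self seconds and the call counts. The following sketch does that arithmetic for a few rows copied from the flat profile; the helper name `ms_per_call` is made up for illustration.

```python
# (function, self seconds, calls) copied from the flat profile above
profile = [
    ("process_left_occs_and_kills", 183.22, 726080),
    ("create_and_insert_occ_in_preorder_dt_order", 101.03, 16644),
    ("compute_global_livein", 79.42, 184430),
    ("rename_1", 77.26, 16644),
    ("expr_lexically_eq", 29.46, 152057062),
]

def ms_per_call(self_seconds, calls):
    """Average self time per call, in milliseconds."""
    return self_seconds / calls * 1000.0

for name, secs, calls in profile:
    print(f"{name:45s} {ms_per_call(secs, calls):10.5f} ms/call")
```

Read this way, the functions called 16644 times cost milliseconds per call, while expr_lexically_eq is cheap per call and expensive only because it is called about 152 million times.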
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> instrumented compiler gives:
>
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>    %  cumulative     self               self    total
>   time   seconds  seconds     calls  Ks/call  Ks/call  name
>  15.51    183.22   183.22    726080     0.00     0.00  process_left_occs_and_kills

This one is O(n^2) or O(n^3) in the number of vuses. Known problem. A fix is really complicated.

>   8.55    284.25   101.03     16644     0.00     0.00  create_and_insert_occ_in_preorder_dt_order

Hmmmm. It is attempting to PRE 16644 things. How many basic blocks do you have? We shouldn't end up with trying to PRE that many expressions, since we only try to PRE things that occur at least twice.

>   6.72    363.67    79.42    184430     0.00     0.00  compute_global_livein
>   6.54    440.93    77.26     16644     0.00     0.00  rename_1
>   3.13    477.86    36.93     16644     0.00     0.00  clear_all_eref_arrays
>   3.05    513.86    36.00     16644     0.00     0.00  compute_down_safety
>   2.49    543.32    29.46 152057062     0.00     0.00  expr_lexically_eq
>   2.09    568.00    24.68    201843     0.00     0.00  cgraph_remove_node
>   1.89    590.33    22.33    432753     0.00     0.00  alloc_page
>   1.70    610.44    20.11                              eref_compare
>   1.70    630.47    20.03    158882     0.00     0.00  compute_transp
>   1.46    647.70    17.23    416131     0.00     0.00  cgraph_remove_edge
>   1.42    664.43    16.73  14808480     0.00     0.00  gt_ggc_mx_lang_tree_node
>   1.13    677.82    13.39 226466163     0.00     0.00  ggc_set_mark
>   1.07    690.43    12.61      1482     0.00     0.00  collect_expressions
>   0.90    701.05    10.62  15658237     0.00     0.00  walk_tree
>
> i.e. PRE seems to do something very stupid?

You must have an incredibly large number of basic blocks or something, or a very weird flowgraph. How many BB's are we talking about?

I can't fix the algorithmic properties of the SSAPRE algorithm we use, which is what you are running into, i'm betting.

I'm working on a new PRE implementation that is O(n^2) memory usage in the number of phi nodes, but should be a bit faster overall.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote: > I can't fix the algorithmic properties of the SSAPRE algorithm we use, > which is what you are running into, i'm betting. > Could we add thresholds to back away from overly complicated functions? Diego.
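One way to picture the threshold Diego asks about is a size guard in front of an expensive pass. The sketch below is only an illustration of that idea: `MAX_BLOCKS`, `run_passes`, and the function names and sizes are all hypothetical, and nothing here corresponds to an actual GCC option or parameter.

```python
MAX_BLOCKS = 2000  # hypothetical cutoff; the thread does not propose a value

def run_passes(function_sizes, expensive_pass, max_blocks=MAX_BLOCKS):
    """Apply expensive_pass to functions at or under the cutoff; report the rest as skipped."""
    optimized, skipped = [], []
    for name, n_basic_blocks in function_sizes.items():
        if n_basic_blocks > max_blocks:
            skipped.append(name)      # back away from overly complicated functions
        else:
            expensive_pass(name)
            optimized.append(name)
    return optimized, skipped

# Made-up function names and basic-block counts, for illustration only.
sizes = {"small_kernel": 40, "leafified_kernel": 8000}
optimized, skipped = run_passes(sizes, expensive_pass=lambda name: None)
print("optimized:", optimized)
print("skipped:", skipped)
```

A guard like this trades optimization quality on the largest functions for bounded compile time, and it still requires knowing which property of the function (basic blocks, expressions, something else) predicts the cost.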
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 > You must have an incredibly large number of basic blocks or something, > or a very weird flowgraph. > How many BB's are we talking about? > > I can't fix the algorithmic properties of the SSAPRE algorithm we use, > which is what you are running into, i'm betting. > > I'm working on a new PRE implementation that is O(n^2) memory usage in > the number of phi nodes, but should be a bit faster overall. > Regardless, i'll see if i can find a machine with enough memory to look at these.
(In reply to comment #14) > Subject: Re: [tree-ssa] Many C++ compile-time regression in > 3.5-tree-ssa 040120 > > On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote: > > > I can't fix the algorithmic properties of the SSAPRE algorithm we use, > > which is what you are running into, i'm betting. > > > Could we add thresholds to back away from overly complicated functions? > > > Diego. > > I need to know what exactly the properties of these functions are, it's unclear. As i said, i'm working on it.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 dberlin at gcc dot gnu dot org wrote: > ------- Additional Comments From dberlin at gcc dot gnu dot org 2004-03-13 02:10 ------- > (In reply to comment #14) > >>Subject: Re: [tree-ssa] Many C++ compile-time regression in >> 3.5-tree-ssa 040120 >> >>On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote: >> >> >>>I can't fix the algorithmic properties of the SSAPRE algorithm we use, >>>which is what you are running into, i'm betting. >>> >> >>Could we add thresholds to back away from overly complicated functions? >> >> >>Diego. >> >> > > > > I need to know what exactly the properties of these functions are, it's unclear. > As i said, i'm working on it. Remember you need to patch the compiler to support __attribute__((leafify)) to trigger the problem with the tramp3d-v2.cpp testcase. I suspect the huge number of basic blocks comes from inlining as I suspect at least one new basic block is inserted per inlined function, no? So with a lot of C++ abstraction inside a leafified function you get a lot of basic blocks. But I suppose a lot of them could be eliminated easily? Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 dnovillo at redhat dot com wrote: > ------- Additional Comments From dnovillo at redhat dot com 2004-03-13 02:08 ------- > Subject: Re: [tree-ssa] Many C++ compile-time regression in > 3.5-tree-ssa 040120 > > On Fri, 2004-03-12 at 21:02, dberlin at dberlin dot org wrote: > > >>I can't fix the algorithmic properties of the SSAPRE algorithm we use, >>which is what you are running into, i'm betting. >> > > Could we add thresholds to back away from overly complicated functions? Or just "split" them up using sort of windowing? It looks clearly wrong to not limit a O(n^2) or O(n^3) algorithm. Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

On Mar 13, 2004, at 6:46 AM, rguenth at tat dot physik dot uni-tuebingen dot de wrote:

>> Could we add thresholds to back away from overly complicated
>> functions?
>
> Or just "split" them up using sort of windowing? It looks clearly
> wrong to not limit a O(n^2) or O(n^3) algorithm.

It's only collecting expressions that is O(n^2). The other parts of the algorithm just have a large constant.

Also, it *is* splitting up the function. It performs PRE one expression at a time. We can't perform it one basic block at a time or anything with the current algorithm (and it wouldn't make sense to, because you can't find the optimal insertion points).
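To make the quadratic collection step concrete, here is a generic sketch that groups lexically identical expressions in two ways: by comparing each new expression against every group seen so far, and by hashing. It is not GCC's SSAPRE code and the thread does not show how collect_expressions works internally; the tuple representation and both function names are assumptions. The "at least twice" filter follows the rule mentioned earlier in the thread.

```python
from collections import Counter

def collect_pairwise(exprs):
    """Group equal expressions by comparing against every existing group (quadratic worst case)."""
    buckets = []        # each entry is [representative, occurrence count]
    comparisons = 0
    for e in exprs:
        for bucket in buckets:
            comparisons += 1
            if bucket[0] == e:
                bucket[1] += 1
                break
        else:
            buckets.append([e, 1])
    candidates = [rep for rep, n in buckets if n >= 2]
    return candidates, comparisons

def collect_hashed(exprs):
    """Group equal expressions with a hash table (expected linear time)."""
    counts = Counter(exprs)
    return [e for e, n in counts.items() if n >= 2]

# Expressions as (operator, operand, operand) tuples; purely illustrative.
exprs = [("+", "a", "b"), ("*", "c", "d"), ("+", "a", "b"),
         ("-", "e", "f"), ("*", "c", "d")]

candidates, comparisons = collect_pairwise(exprs)
print(candidates, comparisons)
print(collect_hashed(exprs))
```

Both variants return the same candidates; only the number of equality tests differs, which is the kind of cost the 152 million expr_lexically_eq calls in the profile point at.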
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 > > and a lot more of int d farther away? Also, how are bb's marked? I see > <bb 0>: but no more, and some gotos reference <bb 18> and <bb 16> > (with a label, too)? > > Can I get summaries somehow here? Or just dump one interesting > function rather than all of the program? > > Also, how do I dump some stuff about the PRE pass? Specifying > -fdump-tree-pre just dumps the trees after PRE with no information > about the PRE pass itself. -fdump-tree-pre-stats-details. But i already know what it is going to show in this case, based on the profile. I just need other properties of the functions, which i'm attempting to get.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

Note that the reported times are a huge regression compared to g++ (GCC) 3.5-tree-ssa 20040209 (merged 20040126) which shows

Execution times (seconds)
garbage collection    :  26.06 ( 7%) usr   0.00 ( 0%) sys  26.06 ( 6%) wall
callgraph construction:   1.29 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
callgraph optimization:   1.39 ( 0%) usr   0.05 ( 1%) sys   1.44 ( 0%) wall
cfg construction      :   1.71 ( 0%) usr   0.02 ( 0%) sys   1.73 ( 0%) wall
cfg cleanup           :   3.49 ( 1%) usr   0.02 ( 0%) sys   3.51 ( 1%) wall
trivially dead code   :   2.85 ( 1%) usr   0.01 ( 0%) sys   2.86 ( 1%) wall
life analysis         :   5.88 ( 1%) usr   0.00 ( 0%) sys   5.88 ( 1%) wall
life info update      :   2.81 ( 1%) usr   0.00 ( 0%) sys   2.81 ( 1%) wall
alias analysis        :   4.46 ( 1%) usr   0.02 ( 0%) sys   4.48 ( 1%) wall
register scan         :   1.96 ( 0%) usr   0.00 ( 0%) sys   1.96 ( 0%) wall
rebuild jump labels   :   0.88 ( 0%) usr   0.01 ( 0%) sys   0.89 ( 0%) wall
preprocessing         :   0.61 ( 0%) usr   0.15 ( 3%) sys   0.76 ( 0%) wall
parser                :  19.47 ( 5%) usr   1.10 (23%) sys  21.16 ( 5%) wall
name lookup           :  12.05 ( 3%) usr   1.54 (33%) sys  13.75 ( 3%) wall
integration           :  47.73 (12%) usr   0.14 ( 3%) sys  47.87 (12%) wall
tree gimplify         :   3.05 ( 1%) usr   0.06 ( 1%) sys   3.19 ( 1%) wall
tree eh               :   5.32 ( 1%) usr   0.01 ( 0%) sys   5.34 ( 1%) wall
tree CFG construction :   2.74 ( 1%) usr   0.08 ( 2%) sys   2.82 ( 1%) wall
tree CFG cleanup      :   6.10 ( 2%) usr   0.00 ( 0%) sys   6.10 ( 2%) wall
tree alias analysis   :   1.11 ( 0%) usr   0.00 ( 0%) sys   1.11 ( 0%) wall
tree PHI insertion    :  17.62 ( 4%) usr   0.01 ( 0%) sys  17.63 ( 4%) wall
tree SSA rewrite      :   5.90 ( 1%) usr   0.02 ( 0%) sys   5.92 ( 1%) wall
tree SSA other        :  10.18 ( 3%) usr   0.04 ( 1%) sys  10.22 ( 3%) wall
dominator optimization:  31.18 ( 8%) usr   0.25 ( 5%) sys  31.43 ( 8%) wall
tree SRA              :   0.42 ( 0%) usr   0.00 ( 0%) sys   0.42 ( 0%) wall
tree CCP              :   6.99 ( 2%) usr   0.05 ( 1%) sys   7.04 ( 2%) wall
tree split crit edges :   0.53 ( 0%) usr   0.01 ( 0%) sys   0.54 ( 0%) wall
tree PRE              :  67.53 (17%) usr   0.08 ( 2%) sys  67.92 (17%) wall
tree conservative DCE :   5.12 ( 1%) usr   0.01 ( 0%) sys   5.13 ( 1%) wall
tree aggressive DCE   :   2.41 ( 1%) usr   0.00 ( 0%) sys   2.41 ( 1%) wall
tree SSA to normal    :   5.69 ( 1%) usr   0.17 ( 4%) sys   5.86 ( 1%) wall
dominance frontiers   :   0.65 ( 0%) usr   0.00 ( 0%) sys   0.65 ( 0%) wall
control dependences   :   0.35 ( 0%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall
expand                :  20.80 ( 5%) usr   0.07 ( 1%) sys  20.88 ( 5%) wall
varconst              :   0.81 ( 0%) usr   0.04 ( 1%) sys   0.85 ( 0%) wall
jump                  :   1.72 ( 0%) usr   0.13 ( 3%) sys   1.86 ( 0%) wall
CSE                   :   8.43 ( 2%) usr   0.00 ( 0%) sys   8.43 ( 2%) wall
global CSE            :  10.58 ( 3%) usr   0.15 ( 3%) sys  10.74 ( 3%) wall
loop analysis         :   2.59 ( 1%) usr   0.01 ( 0%) sys   2.60 ( 1%) wall
bypass jumps          :   1.95 ( 0%) usr   0.03 ( 1%) sys   1.98 ( 0%) wall
CSE 2                 :   3.57 ( 1%) usr   0.00 ( 0%) sys   3.57 ( 1%) wall
branch prediction     :   4.66 ( 1%) usr   0.01 ( 0%) sys   4.69 ( 1%) wall
flow analysis         :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall
combiner              :   3.53 ( 1%) usr   0.00 ( 0%) sys   3.53 ( 1%) wall
if-conversion         :   0.92 ( 0%) usr   0.00 ( 0%) sys   0.92 ( 0%) wall
regmove               :   1.29 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
local alloc           :   3.36 ( 1%) usr   0.01 ( 0%) sys   3.37 ( 1%) wall
global alloc          :   7.97 ( 2%) usr   0.13 ( 3%) sys   8.10 ( 2%) wall
reload CSE regs       :   3.79 ( 1%) usr   0.00 ( 0%) sys   3.79 ( 1%) wall
flow 2                :   0.78 ( 0%) usr   0.00 ( 0%) sys   0.78 ( 0%) wall
if-conversion 2       :   0.46 ( 0%) usr   0.00 ( 0%) sys   0.46 ( 0%) wall
peephole 2            :   0.83 ( 0%) usr   0.01 ( 0%) sys   0.84 ( 0%) wall
rename registers      :   1.16 ( 0%) usr   0.05 ( 1%) sys   1.21 ( 0%) wall
scheduling 2          :   4.62 ( 1%) usr   0.06 ( 1%) sys   4.68 ( 1%) wall
reorder blocks        :   0.73 ( 0%) usr   0.00 ( 0%) sys   0.73 ( 0%) wall
shorten branches      :   1.16 ( 0%) usr   0.02 ( 0%) sys   1.18 ( 0%) wall
reg stack             :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall
final                 :   1.79 ( 0%) usr   0.13 ( 3%) sys   1.92 ( 0%) wall
symout                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
rest of compilation   :   2.76 ( 1%) usr   0.02 ( 0%) sys   2.78 ( 1%) wall
TOTAL                 : 396.23             4.72           402.15

So apparently PRE got a factor of 10 slower!?

Richard.
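The size of the regression between the two snapshots can be checked directly from the usr columns. The sketch below copies a handful of values from this 20040209 report and from the 20040311 report earlier in the thread and computes the slowdown factor for each; the selection of passes is arbitrary and the dictionary layout is only for illustration.

```python
# usr seconds as (20040209 snapshot, 20040311 snapshot), copied from the two reports
usr = {
    "tree PRE": (67.53, 583.13),
    "tree PHI insertion": (17.62, 74.91),
    "global CSE": (10.58, 67.80),
    "integration": (47.73, 67.76),
    "TOTAL": (396.23, 1201.61),
}

slowdown = {name: new / old for name, (old, new) in usr.items()}

for name, factor in sorted(slowdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {factor:5.2f}x")
```

On these numbers PRE is between 8 and 9 times slower and the total is about 3 times slower, so "a factor of 10" is a rounded-up reading, and PRE is not the only pass that regressed.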
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 dberlin at dberlin dot org wrote: > ------- Additional Comments From dberlin at dberlin dot org 2004-03-13 16:00 ------- > Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 > > >>and a lot more of int d farther away? Also, how are bb's marked? I see >><bb 0>: but no more, and some gotos reference <bb 18> and <bb 16> >>(with a label, too)? >> >>Can I get summaries somehow here? Or just dump one interesting >>function rather than all of the program? >> >>Also, how do I dump some stuff about the PRE pass? Specifying >>-fdump-tree-pre just dumps the trees after PRE with no information >>about the PRE pass itself. > > > -fdump-tree-pre-stats-details. But i already know what it is going to > show in this case, based on the profile. > I just need other properties of the functions, which i'm attempting to > get. I also see we're running PRE before DCE - the functions probably contain a lot of dead code - would it be sensible and profitable to move the first DCE pass before PRE? Can this be specified on the command line or where would I need to change the source to do this? Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> So appearantly PRE got a factor of 10 slower!?

Highly unlikely. There haven't been any PRE changes in between the two compilers. Something else changed, like inlining or something.

You are likely inlining *way* too much again or something.

--Dan
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> I also see we're running PRE before DCE - the functions probably
> contain a lot of dead code - would it be sensible and profitable to
> move the first DCE pass before PRE?

No we aren't. We run 3 DCE passes before PRE.

NEXT_PASS (pass_build_cfg);
...
NEXT_PASS (pass_dce);
...
NEXT_PASS (DUP_PASS (pass_dce));
...
NEXT_PASS (DUP_PASS (pass_dce));
NEXT_PASS (pass_split_crit_edges);
NEXT_PASS (pass_pre);

> Can this be specified on the command line or
> where would I need to change the source to do this?
>
> Richard.
>
> --
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13776
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> So appearantly PRE got a factor of 10 slower!?

Note that the other functions got a factor of 3-5 slower too. As I said, PRE just has a larger constant, so it's more noticeable. This tells me something else important changed, probably in cgraph or something.

There is little i can do, a lot of the portions wasting time are already O(n) (compute_down_safety for example). The only things to do are:

1. reduce the number of expressions we PRE,
2. give up PRE entirely on such functions, or
3. change PRE algorithms.

I'm actually working on 3 and 2, rather than 1. Number 1 is tricky: we already give up on expressions that occur once, which makes us lose some load motion. Number 2 requires figuring out what properties of this function make it such a pain in the ass, which is what i'm doing. Number 3 is being worked on in the background; i'm waiting for Steven to get back to get more work done.
There are about 100 functions here with more than a couple thousand bb's. PRE takes about 2-3 seconds for each of these functions, which means i have to microoptimize it in order to get rid of the cumulative time effect. A lot of it is simply iterating over large lists looking for certain types of nodes (like EPHIs), where the lists are O(n_basic_blocks) and we only need to look at 10 entries or so. This doesn't matter when the numbers are close, but when you have 8000 bb's to walk 20 times, instead of walking 40 entries 20 times, it matters.
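The cost difference described here can be counted with a toy model: walking every basic block on each pass versus building a list of the interesting blocks once and walking only that. The figures 8000, 40 and 20 come from the comment above; the flag list and both function names are assumptions for illustration, and whether such an index can be kept valid while PRE transforms the function is not something this sketch addresses.

```python
def walk_all_blocks(has_ephi, walks):
    """Scan every basic block on each walk; return (hits, blocks visited)."""
    visited = hits = 0
    for _ in range(walks):
        for flag in has_ephi:
            visited += 1
            if flag:
                hits += 1
    return hits, visited

def walk_indexed(has_ephi, walks):
    """Build the list of interesting blocks once, then walk only that list."""
    index = [i for i, flag in enumerate(has_ephi) if flag]  # single O(n) pass
    visited = len(has_ephi)                                  # cost of building the index
    hits = 0
    for _ in range(walks):
        for _block in index:
            visited += 1
            hits += 1
    return hits, visited

n_blocks, n_interesting, walks = 8000, 40, 20
# Mark every 200th block as carrying a node of interest, giving 40 such blocks.
has_ephi = [i % (n_blocks // n_interesting) == 0 for i in range(n_blocks)]

print(walk_all_blocks(has_ephi, walks))
print(walk_indexed(has_ephi, walks))
```

Both walks find the same nodes; the full scan touches 160000 blocks, while the indexed version touches 8800 including the one-time index build.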
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

dberlin at gcc dot gnu dot org wrote:
> ------- Additional Comments From dberlin at gcc dot gnu dot org 2004-03-14 04:47 -------
> There are about 100 functions here with > a couple thousand bb's.
> PRE takes about 2-3 seconds for each of these functions.
> Which means i have to microoptimize it in order to get rid of the cumulative time effect.
> A lot of is it simply iterating over large lists looking for certain types of nodes (like EPHiS), where the
> lists are O(n_basic_blocks), and we only need to look at 10 entries or so. This doesn't matter when the
> numbers are close, but when you have 8000 bb's to walk 20 times, instead of walking 40 entries 20
> times, it matters.

Yes. I suppose simply storing those nodes separately does not work, nor does using a hash table for storing them, no?

Another way would be to reduce the number of bb's somehow? I cannot think of how 8000 bb's can accumulate in one of my math kernels other than by inlining and maybe loop header copying. Can't we merge some bb's before doing PRE?

Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

dberlin at gcc dot gnu dot org wrote:
> ------- Additional Comments From dberlin at gcc dot gnu dot org 2004-03-14 04:47 -------
> There are about 100 functions here with > a couple thousand bb's.
> PRE takes about 2-3 seconds for each of these functions.
> Which means i have to microoptimize it in order to get rid of the cumulative time effect.
> A lot of is it simply iterating over large lists looking for certain types of nodes (like EPHiS), where the
> lists are O(n_basic_blocks), and we only need to look at 10 entries or so. This doesn't matter when the
> numbers are close, but when you have 8000 bb's to walk 20 times, instead of walking 40 entries 20
> times, it matters.

The nice thing is, that with -fno-exceptions the results look a _lot_ better:

Execution times (seconds)
garbage collection    :  21.13 ( 7%) usr   0.01 ( 0%) sys  21.20 ( 7%) wall
callgraph construction:   1.45 ( 0%) usr   0.01 ( 0%) sys   1.46 ( 0%) wall
callgraph optimization:   1.51 ( 0%) usr   0.09 ( 1%) sys   1.61 ( 1%) wall
cfg construction      :   0.52 ( 0%) usr   0.05 ( 1%) sys   0.57 ( 0%) wall
cfg cleanup           :   1.67 ( 1%) usr   0.00 ( 0%) sys   1.67 ( 1%) wall
trivially dead code   :   2.27 ( 1%) usr   0.01 ( 0%) sys   2.28 ( 1%) wall
life analysis         :   5.01 ( 2%) usr   0.00 ( 0%) sys   5.02 ( 2%) wall
life info update      :   3.11 ( 1%) usr   0.00 ( 0%) sys   3.17 ( 1%) wall
alias analysis        :   4.02 ( 1%) usr   0.01 ( 0%) sys   4.03 ( 1%) wall
register scan         :   1.97 ( 1%) usr   0.00 ( 0%) sys   1.97 ( 1%) wall
rebuild jump labels   :   0.54 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall
preprocessing         :   0.69 ( 0%) usr   0.20 ( 3%) sys   1.72 ( 1%) wall
parser                :  18.39 ( 6%) usr   1.03 (16%) sys  19.44 ( 6%) wall
name lookup           :   6.74 ( 2%) usr   1.43 (23%) sys   8.18 ( 3%) wall
integration           :  58.53 (19%) usr   0.43 ( 7%) sys  58.99 (19%) wall
tree gimplify         :   3.43 ( 1%) usr   0.05 ( 1%) sys   3.48 ( 1%) wall
tree eh               :   0.76 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall
tree CFG construction :   1.54 ( 1%) usr   0.13 ( 2%) sys   1.67 ( 1%) wall
tree CFG cleanup      :   1.84 ( 1%) usr   0.01 ( 0%) sys   1.85 ( 1%) wall
tree PTA              :   0.68 ( 0%) usr   0.00 ( 0%) sys   0.68 ( 0%) wall
tree alias analysis   :   1.07 ( 0%) usr   0.01 ( 0%) sys   1.08 ( 0%) wall
tree PHI insertion    :   1.37 ( 0%) usr   0.06 ( 1%) sys   1.43 ( 0%) wall
tree SSA rewrite      :   3.53 ( 1%) usr   0.06 ( 1%) sys   3.59 ( 1%) wall
tree SSA other        :   4.69 ( 2%) usr   0.41 ( 7%) sys   5.12 ( 2%) wall
tree operand scan     :   3.57 ( 1%) usr   0.27 ( 4%) sys   3.85 ( 1%) wall
dominator optimization:  16.32 ( 5%) usr   0.52 ( 8%) sys  16.84 ( 5%) wall
tree SRA              :   0.43 ( 0%) usr   0.00 ( 0%) sys   0.43 ( 0%) wall
tree CCP              :   1.51 ( 0%) usr   0.01 ( 0%) sys   1.52 ( 0%) wall
tree split crit edges :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall
tree PRE              :  17.34 ( 6%) usr   0.05 ( 1%) sys  17.40 ( 6%) wall
tree linearize phis   :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall
tree forward propagate:   1.01 ( 0%) usr   0.00 ( 0%) sys   1.01 ( 0%) wall
tree conservative DCE :   2.54 ( 1%) usr   0.01 ( 0%) sys   2.55 ( 1%) wall
tree aggressive DCE   :   0.83 ( 0%) usr   0.00 ( 0%) sys   0.83 ( 0%) wall
tree DSE              :   1.86 ( 1%) usr   0.07 ( 1%) sys   1.93 ( 1%) wall
tree copy headers     :   1.39 ( 0%) usr   0.01 ( 0%) sys   1.40 ( 0%) wall
tree SSA to normal    :   3.01 ( 1%) usr   0.04 ( 1%) sys   3.05 ( 1%) wall
tree rename SSA copies:   0.69 ( 0%) usr   0.07 ( 1%) sys   0.77 ( 0%) wall
dominance frontiers   :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall
control dependences   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall
expand                :  31.02 (10%) usr   0.24 ( 4%) sys  31.41 (10%) wall
varconst              :   0.94 ( 0%) usr   0.01 ( 0%) sys   0.99 ( 0%) wall
jump                  :   1.77 ( 1%) usr   0.14 ( 2%) sys   1.97 ( 1%) wall
CSE                   :   9.85 ( 3%) usr   0.03 ( 0%) sys   9.90 ( 3%) wall
global CSE            :  14.32 ( 5%) usr   0.17 ( 3%) sys  14.49 ( 5%) wall
loop analysis         :   4.19 ( 1%) usr   0.01 ( 0%) sys   4.21 ( 1%) wall
bypass jumps          :   1.19 ( 0%) usr   0.01 ( 0%) sys   1.20 ( 0%) wall
CSE 2                 :   4.24 ( 1%) usr   0.00 ( 0%) sys   4.24 ( 1%) wall
branch prediction     :   1.49 ( 0%) usr   0.03 ( 0%) sys   1.54 ( 0%) wall
flow analysis         :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall
combiner              :   3.80 ( 1%) usr   0.01 ( 0%) sys   3.82 ( 1%) wall
if-conversion         :   0.61 ( 0%) usr   0.01 ( 0%) sys   0.63 ( 0%) wall
regmove               :   2.17 ( 1%) usr   0.00 ( 0%) sys   2.20 ( 1%) wall
local alloc           :   3.22 ( 1%) usr   0.03 ( 0%) sys   3.25 ( 1%) wall
global alloc          :   7.58 ( 3%) usr   0.21 ( 3%) sys   7.79 ( 3%) wall
reload CSE regs       :   3.25 ( 1%) usr   0.02 ( 0%) sys   3.27 ( 1%) wall
flow 2                :   0.65 ( 0%) usr   0.00 ( 0%) sys   0.65 ( 0%) wall
if-conversion 2       :   0.30 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall
peephole 2            :   0.62 ( 0%) usr   0.02 ( 0%) sys   0.64 ( 0%) wall
rename registers      :   0.97 ( 0%) usr   0.04 ( 1%) sys   1.01 ( 0%) wall
scheduling 2          :   5.57 ( 2%) usr   0.04 ( 1%) sys   5.66 ( 2%) wall
machine dep reorg     :   1.21 ( 0%) usr   0.00 ( 0%) sys   1.21 ( 0%) wall
reorder blocks        :   0.78 ( 0%) usr   0.00 ( 0%) sys   0.80 ( 0%) wall
shorten branches      :   0.82 ( 0%) usr   0.02 ( 0%) sys   0.84 ( 0%) wall
reg stack             :   0.20 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall
final                 :   1.41 ( 0%) usr   0.14 ( 2%) sys   1.56 ( 1%) wall
rest of compilation   :   2.63 ( 1%) usr   0.04 ( 1%) sys   2.67 ( 1%) wall
TOTAL                 : 302.38             6.29           310.24

So the question is, where is the difference and whether it needs to be there ;)

Richard.
Subject: Bug 13776

On Mar 14, 2004, at 8:35 AM, Richard Guenther wrote:

> Daniel Berlin wrote:
>> This adds a DOM pass in between split critical edges and PRE, and
>> works for me on i686 and powerpc
>> Tell me if it helps
>
> It made things worse in total, even PRE degraded some, but that may be
> in the noise.
>
> Richard.

I don't even get close to these numbers.
I've got your leafify patch installed (the one linked from the bug report)
Even at -O2, on a checking enabled compiler, with tramp3d-v2 from the bug report, with the following sizes:

[root@dberlin dberlin]# ls -trl tramp3d-v2.ii
-rw-r--r-- 1 root root 2962361 Feb 5 10:27 tramp3d-v2.ii
generated from
[root@dberlin dberlin]# ls -l tramp3d-v2.cpp
-rw-r--r-- 1 dberlin dberlin 1952077 Feb 5 10:14 tramp3d-v2.cpp

I get (without any changes to PRE):
[root@dberlin gcc]# ./cc1plus -O2 ~dberlin/tramp3d-v2.ii
...
Execution times (seconds)
garbage collection    :  46.23 (15%) usr   0.27 ( 3%) sys  46.66 (15%) wall
callgraph construction:   0.68 ( 0%) usr   0.01 ( 0%) sys   0.72 ( 0%) wall
callgraph optimization:   0.80 ( 0%) usr   0.07 ( 1%) sys   0.92 ( 0%) wall
cfg construction      :   0.46 ( 0%) usr   0.04 ( 0%) sys   0.50 ( 0%) wall
cfg cleanup           :   1.82 ( 1%) usr   0.02 ( 0%) sys   1.84 ( 1%) wall
CFG verifier          :   8.07 ( 3%) usr   0.03 ( 0%) sys   8.15 ( 3%) wall
trivially dead code   :   1.28 ( 0%) usr   0.00 ( 0%) sys   1.29 ( 0%) wall
life analysis         :   2.96 ( 1%) usr   0.01 ( 0%) sys   2.97 ( 1%) wall
life info update      :   1.52 ( 0%) usr   0.01 ( 0%) sys   1.56 ( 0%) wall
alias analysis        :   2.64 ( 1%) usr   0.01 ( 0%) sys   2.66 ( 1%) wall
register scan         :   1.23 ( 0%) usr   0.02 ( 0%) sys   1.25 ( 0%) wall
rebuild jump labels   :   0.38 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall
preprocessing         :   0.29 ( 0%) usr   0.17 ( 2%) sys   0.46 ( 0%) wall
parser                :  13.65 ( 4%) usr   1.27 (16%) sys  20.56 ( 6%) wall
name lookup           :   4.99 ( 2%) usr   2.00 (25%) sys   7.07 ( 2%) wall
integration           :  28.17 ( 9%) usr   0.19 ( 2%) sys  28.57 ( 9%) wall
tree gimplify         :   2.08 ( 1%) usr   0.05 ( 1%) sys   2.19 ( 1%) wall
tree eh               :   2.86 ( 1%) usr   0.08 ( 1%) sys   2.96 ( 1%) wall
tree CFG construction :   1.60 ( 1%) usr   0.09 ( 1%) sys   1.71 ( 1%) wall
tree CFG cleanup      :   3.99 ( 1%) usr   0.04 ( 0%) sys   4.04 ( 1%) wall
tree PTA              :   0.47 ( 0%) usr   0.01 ( 0%) sys   0.49 ( 0%) wall
tree alias analysis   :   0.61 ( 0%) usr   0.00 ( 0%) sys   0.61 ( 0%) wall
tree PHI insertion    :   9.15 ( 3%) usr   0.07 ( 1%) sys   9.26 ( 3%) wall
tree SSA rewrite      :   3.30 ( 1%) usr   0.01 ( 0%) sys   3.32 ( 1%) wall
tree SSA other        :   3.63 ( 1%) usr   0.51 ( 6%) sys   4.20 ( 1%) wall
tree operand scan     :   3.62 ( 1%) usr   0.59 ( 7%) sys   4.22 ( 1%) wall
dominator optimization:  15.57 ( 5%) usr   0.46 ( 6%) sys  16.09 ( 5%) wall
tree SRA              :   0.31 ( 0%) usr   0.01 ( 0%) sys   0.32 ( 0%) wall
tree CCP              :   1.56 ( 1%) usr   0.02 ( 0%) sys   1.58 ( 0%) wall
tree split crit edges :   0.57 ( 0%) usr   0.03 ( 0%) sys   0.61 ( 0%) wall
tree PRE              :  34.92 ( 9%) usr   0.14 ( 2%) sys  35.20 ( 9%) wall
tree linearize phis   :   0.03 ( 0%) usr   0.02 ( 0%) sys   0.05 ( 0%) wall
tree forward propagate:   1.12 ( 0%) usr   0.02 ( 0%) sys   1.14 ( 0%) wall
tree conservative DCE :   3.02 ( 1%) usr   0.03 ( 0%) sys   3.06 ( 1%) wall
tree aggressive DCE   :   0.78 ( 0%) usr   0.01 ( 0%) sys   0.79 ( 0%) wall
tree DSE              :   2.18 ( 1%) usr   0.01 ( 0%) sys   2.20 ( 1%) wall
tree copy headers     :   2.15 ( 1%) usr   0.02 ( 0%) sys   2.19 ( 1%) wall
tree SSA to normal    :   2.42 ( 1%) usr   0.13 ( 2%) sys   2.61 ( 1%) wall
tree NRV optimization :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
tree rename SSA copies:   0.71 ( 0%) usr   0.04 ( 0%) sys   0.75 ( 0%) wall
tree SSA verifier     :  25.23 ( 8%) usr   0.23 ( 3%) sys  25.52 ( 8%) wall
tree STMT verifier    :   3.72 ( 1%) usr   0.03 ( 0%) sys   3.76 ( 1%) wall
callgraph verifier    :   7.79 ( 3%) usr   0.25 ( 3%) sys   8.09 ( 3%) wall
dominance frontiers   :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall
control dependences   :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall
expand                :  16.03 ( 5%) usr   0.19 ( 2%) sys  16.41 ( 5%) wall
varconst              :   0.66 ( 0%) usr   0.05 ( 1%) sys   1.06 ( 0%) wall
jump                  :   1.17 ( 0%) usr   0.15 ( 2%) sys   1.41 ( 0%) wall
CSE                   :   8.76 ( 3%) usr   0.05 ( 1%) sys   8.84 ( 3%) wall
global CSE            :   5.01 ( 2%) usr   0.13 ( 2%) sys   5.15 ( 2%) wall
loop analysis         :   1.21 ( 0%) usr   0.01 ( 0%) sys   1.24 ( 0%) wall
bypass jumps          :   0.94 ( 0%) usr   0.00 ( 0%) sys   0.94 ( 0%) wall
CSE 2                 :   3.59 ( 1%) usr   0.02 ( 0%) sys   3.78 ( 1%) wall
branch prediction     :   2.25 ( 1%) usr   0.01 ( 0%) sys   2.31 ( 1%) wall
flow analysis         :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall
combiner              :   2.58 ( 1%) usr   0.03 ( 0%) sys   2.64 ( 1%) wall
if-conversion         :   0.57 ( 0%) usr   0.00 ( 0%) sys   0.57 ( 0%) wall
regmove               :   0.85 ( 0%) usr   0.00 ( 0%) sys   0.86 ( 0%) wall
local alloc           :   1.80 ( 1%) usr   0.01 ( 0%) sys   1.84 ( 1%) wall
global alloc          :   5.34 ( 2%) usr   0.10 ( 1%) sys   5.50 ( 2%) wall
reload CSE regs       :   2.24 ( 1%) usr   0.00 ( 0%) sys   2.25 ( 1%) wall
flow 2                :   0.33 ( 0%) usr   0.00 ( 0%) sys   0.34 ( 0%) wall
if-conversion 2       :   0.35 ( 0%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall
peephole 2            :   0.38 ( 0%) usr   0.00 ( 0%) sys   0.39 ( 0%) wall
rename registers      :   1.43 ( 0%) usr   0.04 ( 0%) sys   1.52 ( 0%) wall
scheduling 2          :   2.28 ( 1%) usr   0.08 ( 1%) sys   2.38 ( 1%) wall
reorder blocks        :   0.49 ( 0%) usr   0.01 ( 0%) sys   0.50 ( 0%) wall
shorten branches      :   0.70 ( 0%) usr   0.01 ( 0%) sys   0.71 ( 0%) wall
reg stack             :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall
final                 :   1.03 ( 0%) usr   0.14 ( 2%) sys   1.38 ( 0%) wall
symout                :   0.02 ( 0%) usr   0.03 ( 0%) sys   0.06 ( 0%) wall
rest of compilation   :   1.48 ( 0%) usr   0.04 ( 0%) sys   1.54 ( 0%) wall
TOTAL                 : 310.60             8.12           327.62
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --disable-checking to disable checks.

With my changes to PRE, i get the same numbers, except PRE is at 28 seconds instead of 36.

I certainly get *nowhere close* to 600 seconds in PRE, or the numbers you get overall.
I can't fix a problem i can't reproduce, i can only take stabs at it.
Can someone else please verify his numbers so i know whether it's my test setup or his?
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

dberlin at dberlin dot org wrote:
> ------- Additional Comments From dberlin at dberlin dot org 2004-03-14 15:38 -------
> Subject: Bug 13776
>
> On Mar 14, 2004, at 8:35 AM, Richard Guenther wrote:
>
>> Daniel Berlin wrote:
>>> This adds a DOM pass in between split critical edges and PRE, and
>>> works for me on i686 and powerpc
>>> Tell me if it helps
>>
>> It made things worse in total, even PRE degraded some, but that may be
>> in the noise.
>>
>> Richard.
>
> I don't even get close to these numbers.
> I've got your leafify patch installed (the one linked from the bug
> report)
> Even at -O2, on a checking enabled compiler, with tramp3d-v2 from the
> bug report, with the following sizes:
>
> [root@dberlin dberlin]# ls -trl tramp3d-v2.ii
> -rw-r--r-- 1 root root 2962361 Feb 5 10:27 tramp3d-v2.ii
> generated from
> [root@dberlin dberlin]# ls -l tramp3d-v2.cpp
> -rw-r--r-- 1 dberlin dberlin 1952077 Feb 5 10:14 tramp3d-v2.cpp

That's the correct one.

> I get (without any changes to PRE):
> [root@dberlin gcc]# ./cc1plus -O2 ~dberlin/tramp3d-v2.ii
> ...
> Execution times (seconds)
> [...]
> TOTAL : 310.60 8.12 327.62
> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --disable-checking to disable checks.
>
> With my changes to PRE, i get the same numbers, except PRE is at 28
> seconds instead of 36.
>
> I certainly get *nowhere close* to 600 seconds in PRE, or the numbers
> you get overall.
> I can't fix a problem i can't reproduce, i can only take stabs at it.
> Can someone else please verify his numbers so i know whether it's my
> test setup or his?

I even have checking disabled. GC time seems to be identical, parsing is 13.5s vs 18.4s - the first big difference is integration, which suggests that leafifying is not enabled? Maybe the patch applied "wrong", I attached a complete diff of my local changes.

Anyway, I'm running on a 1GHz Athlon with 1GB of ram, compiler is bootstrapped with checking disabled.

Richard.
Index: gcc/c-common.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/c-common.c,v
retrieving revision 1.344.2.63
diff -u -u -r1.344.2.63 c-common.c
--- gcc/c-common.c 2 Mar 2004 18:41:21 -0000 1.344.2.63
+++ gcc/c-common.c 14 Mar 2004 17:51:26 -0000
@@ -746,6 +746,7 @@
 static tree handle_noinline_attribute (tree *, tree, tree, int, bool *);
 static tree handle_always_inline_attribute (tree *, tree, tree, int, bool *);
+static tree handle_leafify_attribute (tree *, tree, tree, int, bool *);
 static tree handle_used_attribute (tree *, tree, tree, int, bool *);
 static tree handle_unused_attribute (tree *, tree, tree, int, bool *);
 static tree handle_const_attribute (tree *, tree, tree, int, bool *);
@@ -807,6 +808,8 @@
                               handle_noinline_attribute },
   { "always_inline", 0, 0, true, false, false,
                               handle_always_inline_attribute },
+  { "leafify", 0, 0, true, false, false,
+                              handle_leafify_attribute },
   { "used", 0, 0, true, false, false,
                               handle_used_attribute },
   { "unused", 0, 0, false, false, false,
@@ -4458,6 +4461,29 @@
   return NULL_TREE;
 }
+
+/* Handle a "leafify" attribute; arguments as in
+   struct attribute_spec.handler.  */
+
+static tree
+handle_leafify_attribute (tree *node, tree name,
+                          tree args ATTRIBUTE_UNUSED,
+                          int flags ATTRIBUTE_UNUSED, bool *no_add_attrs)
+{
+  if (TREE_CODE (*node) == FUNCTION_DECL)
+    {
+      /* Do nothing else, just set the attribute.  We'll get at
+         it later with lookup_attribute.  */
+    }
+  else
+    {
+      warning ("`%s' attribute ignored", IDENTIFIER_POINTER (name));
+      *no_add_attrs = true;
+    }
+
+  return NULL_TREE;
+}
+
 /* Handle a "used" attribute; arguments as in
    struct attribute_spec.handler.  */
Index: gcc/cgraphunit.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cgraphunit.c,v
retrieving revision 1.1.4.39
diff -u -u -r1.1.4.39 cgraphunit.c
--- gcc/cgraphunit.c 4 Mar 2004 15:38:34 -0000 1.1.4.39
+++ gcc/cgraphunit.c 14 Mar 2004 17:51:26 -0000
@@ -1045,7 +1045,7 @@
   else
     e->callee->global.inlined_to = e->caller;
-  /* Recursivly clone all bodies.  */
+  /* Recursivly clone all inlined bodies.  */
   for (e = e->callee->callees; e; e = e->next_callee)
     if (!e->inline_failed)
       cgraph_clone_inlined_nodes (e, duplicate);
@@ -1192,7 +1192,7 @@
     recursive = what->decl == to->global.inlined_to->decl;
   else
     recursive = what->decl == to->decl;
-  /* Marking recursive function inlinine has sane semantic and thus we should
+  /* Marking recursive function inline has sane semantic and thus we should
      not warn on it.  */
   if (recursive && reason)
     *reason = (what->local.disregard_inline_limits
@@ -1440,6 +1440,67 @@
   free (heap_node);
 }
+/* Find callgraph nodes closing a circle in the graph.  The
+   resulting hashtab can be used to avoid walking the circles.
+   Uses the cgraph nodes ->aux field which needs to be zero
+   before and will be zero after operation.  */
+
+static void
+cgraph_find_cycles (struct cgraph_node *node, htab_t cycles)
+{
+  struct cgraph_edge *e;
+
+  if (node->aux)
+    {
+      void **slot;
+      slot = htab_find_slot (cycles, node, INSERT);
+      if (!*slot)
+        {
+          if (cgraph_dump_file)
+            fprintf (cgraph_dump_file, "Cycle contains %s\n", cgraph_node_name (node));
+          *slot = node;
+        }
+      return;
+    }
+
+  node->aux = node;
+  for (e = node->callees; e; e = e->next_callee)
+    {
+      cgraph_find_cycles (e->callee, cycles);
+    }
+  node->aux = 0;
+}
+
+/* Leafify the cgraph node.  We have to be careful in recursing
+   as to not run endlessly in circles of the callgraph.
+   We do so by using a hashtab of cycle entering nodes as generated
+   by cgraph_find_cycles.  */
+
+static void
+cgraph_leafify_node (struct cgraph_node *node, htab_t cycles)
+{
+  struct cgraph_edge *e;
+
+  for (e = node->callees; e; e = e->next_callee)
+    {
+      /* Inline call, if possible, and recurse.  Be sure we are not
+         entering callgraph circles here.  */
+      if (e->inline_failed
+          && e->callee->local.inlinable
+          && !cgraph_recursive_inlining_p (node, e->callee,
+                                           &e->inline_failed)
+          && !htab_find (cycles, e->callee))
+        {
+          if (cgraph_dump_file)
+            fprintf (cgraph_dump_file, " inlining %s", cgraph_node_name (e->callee));
+          cgraph_mark_inline_edge (e);
+          cgraph_leafify_node (e->callee, cycles);
+        }
+      else if (cgraph_dump_file)
+        fprintf (cgraph_dump_file, " !inlining %s", cgraph_node_name (e->callee));
+    }
+}
+
 /* Decide on the inlining.  We do so in the topological order to avoid
    expenses on updating datastructures.  */
@@ -1477,6 +1538,24 @@
       struct cgraph_edge *e;
       node = order[i];
+
+      /* Handle nodes to be leafified, but don't update overall unit size.  */
+      if (lookup_attribute ("leafify", DECL_ATTRIBUTES (node->decl)) != NULL)
+        {
+          int old_overall_insns = overall_insns;
+          htab_t cycles;
+          if (cgraph_dump_file)
+            fprintf (cgraph_dump_file,
+                     "Leafifying %s\n", cgraph_node_name (node));
+          cycles = htab_create (7, htab_hash_pointer, htab_eq_pointer, NULL);
+          cgraph_find_cycles (node, cycles);
+          cgraph_leafify_node (node, cycles);
+          htab_delete (cycles);
+          overall_insns = old_overall_insns;
+          /* We don't need to consider always_inline functions inside the leafified
+             function anymore.  */
+          continue;
+        }
       for (e = node->callees; e; e = e->next_callee)
         if (e->callee->local.disregard_inline_limits)
Index: gcc/doc/extend.texi
===================================================================
RCS file: /cvs/gcc/gcc/gcc/doc/extend.texi,v
retrieving revision 1.82.2.36
diff -u -u -r1.82.2.36 extend.texi
--- gcc/doc/extend.texi 2 Mar 2004 18:42:50 -0000 1.82.2.36
+++ gcc/doc/extend.texi 14 Mar 2004 17:51:30 -0000
@@ -1893,7 +1893,7 @@
 attributes when making a declaration.  This keyword is followed by an
 attribute specification inside double parentheses.  The following
 attributes are currently defined for functions on all targets:
-@code{noreturn}, @code{noinline}, @code{always_inline},
+@code{noreturn}, @code{noinline}, @code{always_inline}, @code{leafify},
 @code{pure}, @code{const}, @code{nothrow}, @code{format}, @code{format_arg},
 @code{no_instrument_function}, @code{section}, @code{constructor},
 @code{destructor}, @code{used},
@@ -1969,6 +1969,14 @@
 Generally, functions are not inlined unless optimization is specified.
 For functions declared inline, this attribute inlines the function even
 if no optimization level was specified.
+
+@cindex @code{leafify} function attribute
+@item leafify
+Generally, inlining into a function is limited.  For a function marked with
+this attribute, every call inside this function will be inlined, if possible.
+Whether the function itself is considered for inlining depends on its size and
+the current inlining parameters.  The @code{leafify} attribute only works
+reliably in unit-at-a-time mode.
 @cindex @code{pure} function attribute
 @item pure
Index: libstdc++-v3/include/c_std/std_cmath.h
===================================================================
RCS file: /cvs/gcc/gcc/libstdc++-v3/include/c_std/std_cmath.h,v
retrieving revision 1.5.6.7
diff -u -u -r1.5.6.7 std_cmath.h
--- libstdc++-v3/include/c_std/std_cmath.h 3 Jan 2004 23:05:32 -0000 1.5.6.7
+++ libstdc++-v3/include/c_std/std_cmath.h 14 Mar 2004 17:51:55 -0000
@@ -330,9 +330,31 @@
   { return __builtin_modfl(__x, __iptr); }
   template<typename _Tp>
-    inline _Tp
+    inline _Tp __attribute__((always_inline))
     __pow_helper(_Tp __x, int __n)
     {
+      if (__builtin_constant_p(__n))
+        switch (__n) {
+        case -1:
+          return _Tp(1)/__x;
+        case 0:
+          return _Tp(1);
+        case 1:
+          return __x;
+        case 2:
+          return __x*__x;
+#if ! __OPTIMIZE_SIZE__
+        case -2:
+          return _Tp(1)/(__x*__x);
+        case 3:
+          return __x*__x*__x;
+        case 4:
+          {
+            _Tp __y = __x*__x;
+            return __y*__y;
+          }
+#endif
+        }
       return __n < 0
         ? _Tp(1)/__cmath_power(__x, -__n)
         : __cmath_power(__x, __n);
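For readers who want to see the cycle handling in the patch in isolation, here is a minimal standalone sketch. It is not GCC code: `struct node`, `add_call`, `find_cycles` and `count_inlined` are hypothetical stand-ins, a plain `in_cycle` flag replaces the patch's hash table, and the inlinability and recursive-inlining checks from `cgraph_leafify_node` are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4   /* hypothetical fixed fan-out, for the sketch only */

/* Hypothetical stand-in for a callgraph node; not GCC's struct cgraph_node.  */
struct node {
  const char *name;
  struct node *callees[MAX_CALLEES];
  size_t n_callees;
  int on_stack;   /* like ->aux in the patch: nonzero while being walked */
  int in_cycle;   /* like membership in the patch's "cycles" hash table */
};

/* Record a call edge CALLER -> CALLEE.  */
void add_call (struct node *caller, struct node *callee)
{
  assert (caller->n_callees < MAX_CALLEES);
  caller->callees[caller->n_callees++] = callee;
}

/* Mark every node that closes a cycle on some call path starting at NODE.
   A node reached again while it is still on the walk stack closes a cycle.  */
void find_cycles (struct node *node)
{
  if (node->on_stack)
    {
      node->in_cycle = 1;
      return;
    }
  node->on_stack = 1;
  for (size_t i = 0; i < node->n_callees; i++)
    find_cycles (node->callees[i]);
  node->on_stack = 0;
}

/* Count the call edges a leafify-style walk from NODE would inline:
   skip callees marked by find_cycles, recurse into the rest.  */
int count_inlined (const struct node *node)
{
  int n = 0;
  for (size_t i = 0; i < node->n_callees; i++)
    {
      const struct node *callee = node->callees[i];
      if (!callee->in_cycle)
        n += 1 + count_inlined (callee);
    }
  return n;
}
```

With a graph where `top` calls `a` and `b`, `a` and `c` call each other, and `b` calls `d`, `find_cycles` marks only `a`, and the walk then inlines `b` and `d` while leaving the `a`/`c` cycle alone. Like the patch, the marking walk has no visited set, so it can revisit shared callees many times on large graphs.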
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 dberlin at dberlin dot org wrote: > ------- Additional Comments From dberlin at dberlin dot org 2004-03-14 15:38 ------- > Subject: Bug 13776 > With my changes to PRE, i get the same numbers, except PRE is at 28 > seconds instead of 36. > > I certainly get *nowhere close* to 600 seconds in PRE, or the numbers > you get overall. > I can't fix a problem i can't reproduce, i can only take stabs at it. > Can someone else please verify his numbers so i know whether it's my > test setup or his? A way to check if leafify is working correctly is to look at the assembler generated for f.i. _ZN14MultiArgKernelI9MultiArg5I5FieldI22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEEd9BrickViewES9_S9_S9_S9_E15EvaluateLocLoopIN6Forgas5VXUpdILi3EEELi3EEE3runEv it should be straight-line code without calls. Note that without -funroll-loops or -fpeel-loops the code contains a lot of explicit 3-times rolling loops, so it's more "easy" to look at it with -funroll-loops enabled. Richard.
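The `leafify` attribute exists only in the patch attached to this bug. GCC releases document a `flatten` function attribute with nearly the same wording, so the sketch below uses it as a stand-in; treating the two as equivalent is an assumption. The function names (`square`, `norm2`, `kinetic_energy`, `self_check`) are made up for illustration.

```c
#include <assert.h>

static double square (double x)
{
  return x * x;
}

static double norm2 (double x, double y, double z)
{
  return square (x) + square (y) + square (z);
}

/* flatten asks the compiler to inline every call in this body, if possible,
   which is how the patch describes leafify.  */
__attribute__((flatten))
double kinetic_energy (double m, double vx, double vy, double vz)
{
  return 0.5 * m * norm2 (vx, vy, vz);
}

/* Small self-check, kept outside the flattened function so that the
   assert does not show up as a call in kinetic_energy's assembly.  */
void self_check (void)
{
  assert (kinetic_energy (2.0, 1.0, 2.0, 2.0) == 9.0);
}
```

Compiling this with something like `gcc -O2 -S` and reading the assembly for `kinetic_energy` should show no call instructions if the flattening took effect, which is the same check suggested above for the mangled `run` function.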
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> A way to check if leafify is working correctly is to look at the
> assembler generated for f.i.
>
> _ZN14MultiArgKernelI9MultiArg5I5FieldI22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEEd9BrickViewES9_S9_S9_S9_E15EvaluateLocLoopIN6Forgas5VXUpdILi3EEELi3EEE3runEv
>
> it should be straight-line code without calls.

It is.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 > > I even have checking disabled. GC time seems to be identical, parsing > is 13.5s vs 18.4s - the first big difference is integration, which > suggests that leafifying is not enabled? As I showed in the next comment, the leafified functions have no function calls. > Maybe the patch applied > "wrong", I attached a complete diff of my local changes. I have exactly these changes installed. (I verified it by hand and by comparing the applied diffs). What platform are you doing this on?
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 dberlin at dberlin dot org wrote: > ------- Additional Comments From dberlin at dberlin dot org 2004-03-14 22:23 ------- > Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 > > >>I even have checking disabled. GC time seems to be identical, parsing >>is 13.5s vs 18.4s - the first big difference is integration, which >>suggests that leafifying is not enabled? > > > As I showed in the next comment, the leafified functions have no > function calls. > > >> Maybe the patch applied >>"wrong", I attached a complete diff of my local changes. > > > I have exactly these changes installed. > (I verified it by hand and by comparing the applied diffs). Ok. > > What platform are you doing this on? On ia32, I'm trying to bootstrap on ia64 now. I'm configuring with --enable-languages="c,c++" --enable-threads=posix --enable-__cxa_atexit --disable-libunwind-exceptions --disable-mudflap --disable-checking Richard.
(In reply to comment #34) > > What platform are you doing this on? > > On ia32, I'm trying to bootstrap on ia64 now. I'm configuring with > --enable-languages="c,c++" --enable-threads=posix --enable-__cxa_atexit > --disable-libunwind-exceptions --disable-mudflap --disable-checking > Hmmm. I reconfigured with exactly those flags, and re-bootstrapped, and now i get the same numbers you do. Memory usage was also way up. However, after that, i just ran configure, then bootstrapped, then get the numbers i posted. Can you just run configure without any options at all, bootstrap, and see what numbers you get?
these are my numbers when configured with just --disable-checking (with the leafify patch, etc)

Execution times (seconds)
garbage collection    :  21.30 ( 9%) usr   0.12 ( 1%) sys  22.05 ( 8%) wall
callgraph construction:   0.73 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall
callgraph optimization:   0.73 ( 0%) usr   0.03 ( 0%) sys   0.78 ( 0%) wall
cfg construction      :   0.54 ( 0%) usr   0.04 ( 0%) sys   0.58 ( 0%) wall
cfg cleanup           :   2.08 ( 1%) usr   0.05 ( 1%) sys   2.17 ( 1%) wall
trivially dead code   :   1.45 ( 1%) usr   0.01 ( 0%) sys   1.48 ( 1%) wall
life analysis         :   4.52 ( 2%) usr   0.01 ( 0%) sys   4.64 ( 2%) wall
life info update      :   2.23 ( 1%) usr   0.01 ( 0%) sys   2.26 ( 1%) wall
alias analysis        :   2.66 ( 1%) usr   0.03 ( 0%) sys   2.86 ( 1%) wall
register scan         :   1.73 ( 1%) usr   0.00 ( 0%) sys   1.73 ( 1%) wall
rebuild jump labels   :   0.52 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall
preprocessing         :   0.63 ( 0%) usr   0.16 ( 2%) sys   0.80 ( 0%) wall
parser                :  13.73 ( 6%) usr   1.55 (19%) sys  20.68 ( 8%) wall
name lookup           :   5.70 ( 2%) usr   2.05 (25%) sys   7.89 ( 3%) wall
integration           :  27.48 (11%) usr   0.21 ( 3%) sys  28.53 (11%) wall
tree gimplify         :   1.96 ( 1%) usr   0.02 ( 0%) sys   2.02 ( 1%) wall
tree eh               :   3.06 ( 1%) usr   0.13 ( 2%) sys   3.35 ( 1%) wall
tree CFG construction :   1.65 ( 1%) usr   0.07 ( 1%) sys   1.80 ( 1%) wall
tree CFG cleanup      :   3.53 ( 1%) usr   0.03 ( 0%) sys   3.76 ( 1%) wall
tree PTA              :   0.64 ( 0%) usr   0.00 ( 0%) sys   0.64 ( 0%) wall
tree alias analysis   :   0.70 ( 0%) usr   0.00 ( 0%) sys   0.72 ( 0%) wall
tree PHI insertion    :  11.00 ( 5%) usr   0.07 ( 1%) sys  11.31 ( 4%) wall
tree SSA rewrite      :   3.34 ( 1%) usr   0.06 ( 1%) sys   3.55 ( 1%) wall
tree SSA other        :   4.79 ( 2%) usr   0.64 ( 8%) sys   5.57 ( 2%) wall
tree operand scan     :   4.10 ( 2%) usr   0.63 ( 8%) sys   4.80 ( 2%) wall
dominator optimization:  14.61 ( 6%) usr   0.54 ( 7%) sys  15.46 ( 6%) wall
tree SRA              :   0.27 ( 0%) usr   0.02 ( 0%) sys   0.29 ( 0%) wall
tree CCP              :   1.58 ( 1%) usr   0.02 ( 0%) sys   1.65 ( 1%) wall
tree split crit edges :   0.22 ( 0%) usr   0.00 ( 0%) sys   0.22 ( 0%) wall
tree PRE              :  26.66 (11%) usr   0.17 ( 2%) sys  27.40 (10%) wall
tree linearize phis   :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall
tree forward propagate:   1.25 ( 1%) usr   0.01 ( 0%) sys   1.28 ( 0%) wall
tree conservative DCE :   2.54 ( 1%) usr   0.05 ( 1%) sys   2.70 ( 1%) wall
tree aggressive DCE   :   1.09 ( 0%) usr   0.01 ( 0%) sys   1.10 ( 0%) wall
tree DSE              :   2.52 ( 1%) usr   0.01 ( 0%) sys   2.64 ( 1%) wall
tree copy headers     :   2.22 ( 1%) usr   0.06 ( 1%) sys   2.32 ( 1%) wall
tree SSA to normal    :   2.74 ( 1%) usr   0.15 ( 2%) sys   2.90 ( 1%) wall
tree rename SSA copies:   0.59 ( 0%) usr   0.03 ( 0%) sys   0.66 ( 0%) wall
dominance frontiers   :   0.42 ( 0%) usr   0.00 ( 0%) sys   0.42 ( 0%) wall
control dependences   :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall
expand                :  15.77 ( 6%) usr   0.26 ( 3%) sys  16.61 ( 6%) wall
varconst              :   0.54 ( 0%) usr   0.03 ( 0%) sys   0.89 ( 0%) wall
jump                  :   1.16 ( 0%) usr   0.14 ( 2%) sys   1.37 ( 1%) wall
CSE                   :   7.87 ( 3%) usr   0.04 ( 0%) sys   8.19 ( 3%) wall
global CSE            :   6.11 ( 3%) usr   0.09 ( 1%) sys   6.30 ( 2%) wall
loop analysis         :   1.41 ( 1%) usr   0.00 ( 0%) sys   1.41 ( 1%) wall
bypass jumps          :   1.10 ( 0%) usr   0.00 ( 0%) sys   1.12 ( 0%) wall
CSE 2                 :   3.16 ( 1%) usr   0.02 ( 0%) sys   3.20 ( 1%) wall
branch prediction     :   2.52 ( 1%) usr   0.08 ( 1%) sys   2.73 ( 1%) wall
flow analysis         :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall
combiner              :   3.49 ( 1%) usr   0.01 ( 0%) sys   3.62 ( 1%) wall
if-conversion         :   0.70 ( 0%) usr   0.01 ( 0%) sys   0.74 ( 0%) wall
regmove               :   1.01 ( 0%) usr   0.01 ( 0%) sys   1.04 ( 0%) wall
mode switching        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
local alloc           :   2.88 ( 1%) usr   0.02 ( 0%) sys   2.97 ( 1%) wall
global alloc          :   6.36 ( 3%) usr   0.17 ( 2%) sys   6.91 ( 3%) wall
reload CSE regs       :   2.86 ( 1%) usr   0.00 ( 0%) sys   3.21 ( 1%) wall
flow 2                :   0.52 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall
if-conversion 2       :   0.39 ( 0%) usr   0.00 ( 0%) sys   0.40 ( 0%) wall
peephole 2            :   0.51 ( 0%) usr   0.02 ( 0%) sys   0.54 ( 0%) wall
rename registers      :   0.73 ( 0%) usr   0.05 ( 1%) sys   0.79 ( 0%) wall
scheduling 2          :   2.85 ( 1%) usr   0.05 ( 1%) sys   3.02 ( 1%) wall
reorder blocks        :   0.28 ( 0%) usr   0.01 ( 0%) sys   0.30 ( 0%) wall
shorten branches      :   0.54 ( 0%) usr   0.02 ( 0%) sys   0.56 ( 0%) wall
reg stack             :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall
final                 :   1.10 ( 0%) usr   0.13 ( 2%) sys   1.43 ( 1%) wall
symout                :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall
rest of compilation   :   1.83 ( 1%) usr   0.01 ( 0%) sys   1.87 ( 1%) wall
TOTAL                 : 243.59             8.18           264.59
I think this one:

integration           :  27.48 (11%) usr   0.21 ( 3%) sys  28.53 (11%) wall

is caused by GIMPLE having more trees to copy, so maybe doing inlining later on (i.e. after the first DCE happens) will help, but the inliner then needs to be a BB inliner.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 On Mon, 14 Mar 2004, dberlin at gcc dot gnu dot org wrote: > ------- Additional Comments From dberlin at gcc dot gnu dot org 2004-03-14 23:14 ------- > these are my numbers when configured with just --disable-checking (with the leafify patch, etc) The results with just --disable-checking are the same. Humm. --disable-libunwind-exceptions should make no difference for me, as I don't have libunwind installed - maybe it's making the difference for you? Confused, Richard. -- Richard Guenther <richard dot guenther at uni-tuebingen dot de> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
While profiling the build of libstdc++, I noticed that comptypes was not being tail-called/sibcalled because its return type is bool, so this depends on PR 14440.
I noticed that the bsi functions were not being optimized that well because bsi is a struct which contains structs, so I am marking this as depending on PR 13953, which is about SRA optimizing structs containing structs.
I am attaching a C example where tree-ssa is slower:

[zhivago2:~/src/testspeed] pinskia% time ~/gcc-tree-ssa/bin/gcc fold-const.i -S
18.640u 1.480s 0:21.38 94.1% 0+0k 0+5io 0pf+0w
[zhivago2:~/src/testspeed] pinskia% time ~/fsf-clean-nocheck/bin/gcc fold-const.i -S
9.060u 0.540s 0:09.93 96.6% 0+0k 0+4io 0pf+0w
Created attachment 6011: C example

Here is the C example. It is fold-const.c from a cross compiler from powerpc-apple-darwin to powerpc64-apple-darwin.
I set up a nightly tester on ia64-linux that does a bootstrap for c,c++, builds the tramp3d-v3.cpp testcase, and does a performance check on the resulting binary. Stats can be viewed at http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/monitor-summary.html Testing is done with an unpatched tree-ssa branch (i.e. w/o leafify). The summary plot is updated manually and so can lag behind if I forget to update it.
C is also slower, here's the top of the oprofile on amd64 for "-fno-tree-pre -O3" on a subset of Diego Novillo's cc1-i-files.

vma      samples  %        symbol name
00730fa0 117920   10.1391  htab_find_slot_with_hash
00731350 53286     4.5817  iterative_hash
004802b0 22184     1.9074  bitmap_bit_p
006a3e20 20801     1.7885  ggc_alloc_stat
006717e0 19669     1.6912  for_each_rtx
006c5590 18536     1.5938  walk_tree
00730d00 16933     1.4559  find_empty_slot_for_expand
0064d5d0 16794     1.4440  constrain_operands
006579f0 16467     1.4159  reg_scan_mark_refs
00701db0 13922     1.1971  reg_is_remote_constant_p
00402b60 12999     1.1177  yyparse
004af330 12958     1.1142  cse_insn
00501050 12339     1.0609  mark_set_1
00671d00 12320     1.0593  note_stores
00523570 11714     1.0072  compute_transp
004a9270 10726     0.9223  count_reg_usage
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org 2004-03-31 19:32 -------
> C is also slower, here's the top of the oprofile on amd64 for
> "-fno-tree-pre -O3" on a subset of Diego Novillo's cc1-i-files.
>
> vma      samples  %        symbol name
> 00730fa0 117920   10.1391  htab_find_slot_with_hash

We have a lot of pointer hashing in gcc now and I see the above, too. We can possibly micro-optimize the pointer hashing by introducing a "specialization" of the libiberty hashfn for pointers where we can inline both the hashing function and the comparison function. It will introduce some code duplication, though (if this only was using C++ and templates...).

Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

"rguenth at tat dot physik dot uni-tuebingen dot de" <gcc-bugzilla@gcc.gnu.org> writes:

> We have a lot of pointer hashing in gcc now and I see the above, too.
> We can possibly micro-optimize the pointer hashing by introducing a
> "specialization" of the libiberty hashfn for pointers where we can
> inline both the hashing function and the comparison function. It will
> introduce some code duplication, though (if this only was using C++ and
> templates...).

Something I've wanted to do for a long time is do poor-man's templates on hashtab.[ch] with macros. But I never seem to get sufficient round tuits.

zw
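To make the "poor-man's templates" idea concrete, here is a minimal sketch of a macro that stamps out a typed hash set, with the hash and equality operations substituted textually so the compiler can inline them. It is not a proposal for hashtab.[ch]: `DEFINE_HASHSET`, `HASHSET_SLOTS`, `INT_HASH`, `INT_EQ` and the fixed-capacity linear probing are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSET_SLOTS 64u   /* power of two; hypothetical fixed capacity */

/* Expands to a set type `struct NAME` plus NAME_probe, NAME_insert and
   NAME_contains, specialized for TYPE with HASH and EQ pasted in.  */
#define DEFINE_HASHSET(NAME, TYPE, HASH, EQ)                               \
  struct NAME {                                                            \
    TYPE slot[HASHSET_SLOTS];                                              \
    unsigned char used[HASHSET_SLOTS];                                     \
    size_t count;                                                          \
  };                                                                       \
                                                                           \
  /* Index of the slot holding KEY, or of the empty slot it would take. */ \
  static inline size_t NAME##_probe (const struct NAME *s, TYPE key)       \
  {                                                                        \
    size_t i = (size_t) HASH (key) & (HASHSET_SLOTS - 1);                  \
    while (s->used[i] && !EQ (s->slot[i], key))                            \
      i = (i + 1) & (HASHSET_SLOTS - 1);                                   \
    return i;                                                              \
  }                                                                        \
                                                                           \
  /* Returns 1 if KEY was newly inserted, 0 if it was already present. */  \
  static inline int NAME##_insert (struct NAME *s, TYPE key)               \
  {                                                                        \
    size_t i = NAME##_probe (s, key);                                      \
    if (s->used[i])                                                        \
      return 0;                                                            \
    /* Keep one slot free so that probing always terminates. */            \
    assert (s->count < HASHSET_SLOTS - 1);                                 \
    s->used[i] = 1;                                                        \
    s->slot[i] = key;                                                      \
    s->count++;                                                            \
    return 1;                                                              \
  }                                                                        \
                                                                           \
  static inline int NAME##_contains (const struct NAME *s, TYPE key)       \
  {                                                                        \
    return s->used[NAME##_probe (s, key)] != 0;                            \
  }

/* One instantiation: a set of int with a multiplicative hash.  */
#define INT_HASH(k) ((unsigned) (k) * 2654435761u)
#define INT_EQ(a, b) ((a) == (b))
DEFINE_HASHSET (int_set, int, INT_HASH, INT_EQ)
```

Each instantiation duplicates the table code, which is the cost mentioned in the quoted message; in exchange there are no indirect calls through hash and equality function pointers.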
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

zack at codesourcery dot com wrote:
> ------- Additional Comments From zack at codesourcery dot com 2004-03-31 19:53 -------
> Subject: Re: [tree-ssa] Many C++ compile-time regression in
> 3.5-tree-ssa 040120
>
> "rguenth at tat dot physik dot uni-tuebingen dot de" <gcc-bugzilla@gcc.gnu.org> writes:
>
>> We have a lot of pointer hashing in gcc now and I see the above, too.
>> We can possibly micro-optimize the pointer hashing by introducing a
>> "specialization" of the libiberty hashfn for pointers where we can
>> inline both the hashing function and the comparison function. It will
>> introduce some code duplication, though (if this only was using C++ and
>> templates...).
>
> Something I've wanted to do for a long time is do poor-man's templates
> on hashtab.[ch] with macros. But I never seem to get sufficient round
> tuits.

I think it would pay for pointer hashing only, as this is the main use. I did some experiments some time ago with a stripped down pointer-only hash just replacing the walk_tree hashtab and it still was #1 in the profile with little change in time (but I didn't measure overall performance change).

Richard.
I agree that a special pointer hasher would be nice. Should be easy, just duplicate the code of iterative_hash in hashtab.c and specialize it for void *. But that doesn't reduce the number of find_slot calls. It's not like the tables are sparse and we're getting tons of collisions. We just use the hash table that much, and we should be looking into ways for speeding it up.
I did some profiling of iterative_hash on tree-ssa. Not immediately related to this PR, perhaps, but part of the problem.

  %   cumulative   self               self     total
 time   seconds   seconds     calls  s/call   s/call  name
 2.75      1.66      1.66   2329935    0.00     0.00  iterative_hash
 2.29      3.04      1.38    235027    0.00     0.00  walk_tree
 2.04      4.27      1.23   1419091    0.00     0.00  ggc_alloc_stat
 1.87      5.40      1.13   1020397    0.00     0.00  htab_find_slot_with_hash
 1.74      6.45      1.05   1295674    0.00     0.00  mark_set_1
 1.67      7.46      1.01    396445    0.00     0.00  iterative_hash_expr
 1.64      8.45      0.99   2947490    0.00     0.00  bitmap_bit_p
 1.59      9.41      0.96    321482    0.00     0.00  for_each_rtx
 1.54     10.34      0.93   1566242    0.00     0.00  bitmap_set_bit
 1.42     11.20      0.86    770792    0.00     0.00  et_splay

Right now, this function seems to be used only on the tree-ssa branch, and mostly in the tree optimizers via iterative_hash_expr:

-----------------------------------------------
                              1423028             iterative_hash_expr [35]
                0.00    0.00       40/396445      pre_expression [433]
                0.00    0.00      162/396445      process_delayed_rename [971]
                0.03    0.04    10126/396445      gimple_tree_hash [516]
                0.39    0.67   151915/396445      avail_expr_hash [71]
                0.60    1.03   234202/396445      true_false_expr_hash [52]
[35]     4.6    1.01    1.74   396445+1423028 iterative_hash_expr [35]
                1.65    0.00  2308918/2329935     iterative_hash [53]
                0.06    0.00   383567/1028690     first_rtl_op [321]
                0.03    0.00   546018/635717      commutative_tree_code [699]
                              1423028             iterative_hash_expr [35]
-----------------------------------------------
                0.00    0.00      919/2329935     build_type_attribute_variant <cycle 12> [1420]
                0.00    0.00      940/2329935     build_array_type [1299]
                0.00    0.00     4814/2329935     build_function_type <cycle 12> [671]
                0.01    0.00    14344/2329935     type_hash_list [900]
                1.65    0.00  2308918/2329935     iterative_hash_expr [35]
[53]     2.8    1.66    0.00  2329935         iterative_hash [53]
-----------------------------------------------

So ~95% of all iterative_hash_expr calls are from DOM, which could use a little help in terms of compilation speed: ~12% for this particular test case pt.i.
I also did some coverage testing on iterative_hash:

        -:  794:hashval_t iterative_hash (k_in, length, initval)
        -:  795:     const PTR k_in;               /* the key */
        -:  796:     register size_t  length;      /* the length of the key */
        -:  797:     register hashval_t  initval;  /* the previous hash, or an arbitrary value */
 13721488:  798:{
 13721488:  799:  register const unsigned char *k = (const unsigned char *)k_in;
 13721488:  800:  register hashval_t a,b,c,len;
        -:  801:
        -:  802:  /* Set up the internal state */
 13721488:  803:  len = length;
 13721488:  804:  a = b = 0x9e3779b9;  /* the golden ratio; an arbitrary value */
 13721488:  805:  c = initval;           /* the previous hash value */
        -:  806:
        -:  807:  /*---------------------------------------- handle most of the key */
        -:  808:#ifndef WORDS_BIGENDIAN
        -:  809:  /* On a little-endian machine, if the data is 4-byte aligned we can hash
        -:  810:     by word for better speed.  This gives nondeterministic results on
        -:  811:     big-endian machines.  */
 13721488:  812:  if (sizeof (hashval_t) == 4 && (((size_t)k)&3) == 0)
branch  0 taken 0%
 13724520:  813:    while (len >= 12)    /* aligned */
branch  0 taken 1%
branch  1 taken 100%
        -:  814:      {
     3032:  815:        a += *(hashval_t *)(k+0);
     3032:  816:        b += *(hashval_t *)(k+4);
     3032:  817:        c += *(hashval_t *)(k+8);
     3032:  818:        mix(a,b,c);
     3032:  819:        k += 12; len -= 12;
branch  0 taken 100%
        -:  820:      }
        -:  821:  else /* unaligned */
        -:  822:#endif
    #####:  823:    while (len >= 12)
branch  0 never executed
branch  1 never executed
        -:  824:      {
    #####:  825:        a += (k[0] +((hashval_t)k[1]<<8) +((hashval_t)k[2]<<16) +((hashval_t)k[3]<<24));
    #####:  826:        b += (k[4] +((hashval_t)k[5]<<8) +((hashval_t)k[6]<<16) +((hashval_t)k[7]<<24));
    #####:  827:        c += (k[8] +((hashval_t)k[9]<<8) +((hashval_t)k[10]<<16)+((hashval_t)k[11]<<24));
    #####:  828:        mix(a,b,c);
    #####:  829:        k += 12; len -= 12;
branch  0 never executed
        -:  830:      }
        -:  831:
        -:  832:  /*------------------------------------- handle the last 11 bytes */
 13721488:  833:  c += length;
 13721488:  834:  switch(len)              /* all the case statements fall through */
branch  0 taken 0%
branch  1 taken 0%
branch  2 taken 0%
branch  3 taken 0%
branch  4 taken 0%
branch  5 taken 1%
branch  6 taken 0%
branch  7 taken 1%
branch  8 taken 99%
branch  9 taken 1%
branch 10 taken 1%
branch 11 taken 0%
branch 12 taken 1%
        -:  835:    {
    #####:  836:    case 11: c+=((hashval_t)k[10]<<24);
    #####:  837:    case 10: c+=((hashval_t)k[9]<<16);
    #####:  838:    case 9 : c+=((hashval_t)k[8]<<8);
        -:  839:      /* the first byte of c is reserved for the length */
    #####:  840:    case 8 : b+=((hashval_t)k[7]<<24);
      129:  841:    case 7 : b+=((hashval_t)k[6]<<16);
      129:  842:    case 6 : b+=((hashval_t)k[5]<<8);
      181:  843:    case 5 : b+=k[4];
 13719971:  844:    case 4 : a+=((hashval_t)k[3]<<24);
 13719977:  845:    case 3 : a+=((hashval_t)k[2]<<16);
 13719979:  846:    case 2 : a+=((hashval_t)k[1]<<8);
 13719979:  847:    case 1 : a+=k[0];
        -:  848:      /* case 0: nothing left to add */
        -:  849:    }
 13721488:  850:  mix(a,b,c);
        -:  851:  /*-------------------------------------------- report the result */
 13721488:  852:  return c;
        -:  853:}

So it seems that a specialized version for 4 byte objects really would help here. (Xeon is 32bit, so the 8 byte case is important for 64bit targets??)
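To make the 4-byte observation concrete, here is a sketch of what such a specialization could look like next to a byte-wise generic routine shaped like the listing above. The `mix` step is Bob Jenkins' lookup2 mixing step written from memory for 32-bit values, so treat its exact shifts as an assumption; `generic_hash`, `load_le32` and `hash_word` are names invented for this example, and the tail loop is an equivalent rewrite of the fall-through switch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t hashval_t;  /* assumption: 32-bit hash values */

/* lookup2-style mixing step (assumed; all arithmetic is mod 2^32).  */
#define mix(a, b, c) \
  do \
    { \
      a -= b; a -= c; a ^= (c >> 13); \
      b -= c; b -= a; b ^= (a << 8); \
      c -= a; c -= b; c ^= (b >> 13); \
      a -= b; a -= c; a ^= (c >> 12); \
      b -= c; b -= a; b ^= (a << 16); \
      c -= a; c -= b; c ^= (b >> 5); \
      a -= b; a -= c; a ^= (c >> 3); \
      b -= c; b -= a; b ^= (a << 10); \
      c -= a; c -= b; c ^= (b >> 15); \
    } \
  while (0)

/* Assemble four key bytes into a word, least significant byte first.  */
static hashval_t
load_le32 (const unsigned char *k)
{
  return k[0] + ((hashval_t) k[1] << 8) + ((hashval_t) k[2] << 16)
         + ((hashval_t) k[3] << 24);
}

/* Generic byte-wise routine with the same structure as the listing.  */
static hashval_t
generic_hash (const void *k_in, size_t length, hashval_t initval)
{
  const unsigned char *k = k_in;
  hashval_t a, b, c;
  size_t i, len = length;

  assert (k_in != NULL || length == 0);
  a = b = 0x9e3779b9;
  c = initval;
  while (len >= 12)
    {
      a += load_le32 (k);
      b += load_le32 (k + 4);
      c += load_le32 (k + 8);
      mix (a, b, c);
      k += 12;
      len -= 12;
    }
  c += (hashval_t) length;
  /* Last 0..11 bytes; the low byte of c is reserved for the length.  */
  for (i = 0; i < len; i++)
    {
      hashval_t byte = k[i];
      if (i < 4)
        a += byte << (8 * i);
      else if (i < 8)
        b += byte << (8 * (i - 4));
      else
        c += byte << (8 * (i - 7));
    }
  mix (a, b, c);
  return c;
}

/* Specialization for a key that is exactly one 32-bit word: no loop,
   no switch, a single mix.  */
static inline hashval_t
hash_word (hashval_t word, hashval_t initval)
{
  hashval_t a = 0x9e3779b9, b = 0x9e3779b9, c = initval;
  c += 4;     /* the key length */
  a += word;  /* the four key bytes */
  mix (a, b, c);
  return c;
}
```

When the word is laid out least significant byte first, `hash_word` returns the same value as `generic_hash` on those four bytes, so a caller hashing 32-bit pointers or integers could switch without changing bucket assignments. An 8-byte variant for 64-bit targets would add the second word into `b` in the same way.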
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 Again the automatic tester at http://www.tat.physik.uni-tuebingen.de/~rguenth/gcc/monitor-summary.html caught some compile time regressions for tree-ssa. While bootstrap time didn't change (much), tramp3d-v3 compile time got a hit between Wednesday and Thursday, same for runtime. You'll also note that mainline runtime was improving a lot yesterday. There aren't that much changes on tree-ssa right now, so I suspect changes causing the regression be 2004-04-07 Diego Novillo <dnovillo@redhat.com> * gimplify.c (gimplify_call_expr): Remove argument POST_P. Update all callers. Don't use POST_P when gimplifying the call expression. (the tree is updated at 3am CEST, incident happened with the update on Thursday) Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 On Sat, 2004-04-10 at 10:04, rguenth at tat dot physik dot uni-tuebingen dot de wrote: > There aren't that much changes on tree-ssa right now, so I suspect > changes causing the regression be > > 2004-04-07 Diego Novillo <dnovillo@redhat.com> > > * gimplify.c (gimplify_call_expr): Remove argument POST_P. > Update all callers. > Don't use POST_P when gimplifying the call expression. > Hmm, odd. This is a correctness fix. Side effects in function call arguments must occur before the actual call takes place. What may be happening here is that we are getting fewer commoning opportunities for call-clobbered variables. Before, foo (a++) would expand to: foo (a); a = a + 1; But now, it expands to: t = a; a = a + 1; foo (t); If 'a' is call-clobbered, the second form will not allow us to common out 'a + 1' because of the clobbering of 'a' by the call to foo. However, it is a bit surprising that this would cause a significant decline in compile time. Would you have a pre-patched cc1plus binary to compare dump files? Thanks. Diego.
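A small self-contained illustration of why the new expansion is the correct one. It assumes `a` is a global that the callee can read, which is the call-clobbered case discussed above; the function names are made up for the example. C places a sequence point after the arguments are evaluated and before the call, so the callee must already see the incremented value.

```c
#include <assert.h>

static int a;                 /* global, so foo can observe it */
static int a_seen_by_callee;  /* value of a when foo ran */

static void
foo (int arg)
{
  (void) arg;
  a_seen_by_callee = a;
}

/* What the programmer wrote.  */
static int
source_form (void)
{
  a = 5;
  foo (a++);
  return a_seen_by_callee;
}

/* The new expansion: side effect first, then the call.  */
static int
new_lowering (void)
{
  int t;
  a = 5;
  t = a;
  a = a + 1;
  foo (t);
  return a_seen_by_callee;
}

/* The old expansion: the call sees the stale value of a.  */
static int
old_lowering (void)
{
  a = 5;
  foo (a);
  a = a + 1;
  return a_seen_by_callee;
}

/* The new lowering agrees with the source; the old one does not.  */
static void
check_lowerings (void)
{
  assert (source_form () == new_lowering ());
  assert (source_form () != old_lowering ());
}
```

Here the callee sees 6 in the source form and in the new lowering, and 5 in the old one.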
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 dnovillo at redhat dot com wrote: > ------- Additional Comments From dnovillo at redhat dot com 2004-04-10 14:58 ------- > Subject: Re: [tree-ssa] Many C++ compile-time regression in > 3.5-tree-ssa 040120 > > On Sat, 2004-04-10 at 10:04, rguenth at tat dot physik dot uni-tuebingen > dot de wrote: > > >>There aren't that much changes on tree-ssa right now, so I suspect >>changes causing the regression be >> >>2004-04-07 Diego Novillo <dnovillo@redhat.com> >> >> * gimplify.c (gimplify_call_expr): Remove argument POST_P. >> Update all callers. >> Don't use POST_P when gimplifying the call expression. >> > > Hmm, odd. This is a correctness fix. Side effects in function call > arguments must occur before the actual call takes place. > > What may be happening here is that we are getting fewer commoning > opportunities for call-clobbered variables. Before, foo (a++) would > expand to: > > foo (a); > a = a + 1; > > But now, it expands to: > > t = a; > a = a + 1; > foo (t); > > If 'a' is call-clobbered, the second form will not allow us to common > out 'a + 1' because of the clobbering of 'a' by the call to foo. > > However, it is a bit surprising that this would cause a significant > decline in compile time. Would you have a pre-patched cc1plus binary to > compare dump files? Yes, I have cc1plus binaries from all days lying around (though with checking disabled). Just tell me what to do. Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 > However, it is a bit surprising that this would cause a significant > decline in compile time. Would you have a pre-patched cc1plus binary to > compare dump files? Ok, I tried to just diff tree-optimized dumps, but noise is papering over the differences (temps are differently numbered). At least, before the compile time increase the dump had 1003736 lines, and after it now has 1048682 lines. So there is a difference. Richard.
Subject: Re: [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120

> However, it is a bit surprising that this would cause a significant
> decline in compile time. Would you have a pre-patched cc1plus binary to
> compare dump files?

Cutting off the numbers from the vars and killing <Dxxx> reveals:

@@ -256,6 +256,7 @@ virtual Smarts::Runnable::~Runnable() (t
 {
   bool T.;
   int T.;
+  int (*__vtbl_ptr_type) () * T.;

 <bb 0>:
   this->_vptr.Runnable = &_ZTVN6Smarts8RunnableE[2];

(and similar in all destructors)

 int fillLocStorage(int, Loc<Dim>&, const T1&) [with int Dim = 3, T1 = Loc<3>] (currIndex, loc, a)
 {
+  int currIndex.;
   int ;
   int d;
   int T.;
@@ -1581,6 +1503,7 @@ int fillLocStorage(int, Loc<Dim>&, const
   struct Domain<1,DomainTraits<Loc<1> > > * T.;
   struct Loc<1> * T.;
   struct Loc<1> & T.;
+  int currIndex.;
   struct Domain<3,DomainTraits<Loc<3> > > * loc.;
   int retval.;
   int retval.;
@@ -1595,13 +1518,17 @@ int fillLocStorage(int, Loc<Dim>&, const
   i = 0;

 <L0>:;
+  currIndex. = currIndex + 1;
   *(int &)(struct Domain<1,DomainTraits<Loc<1> > > *)(struct Loc<1> *)(struct Loc<1> &)((struct Loc<1> *)((long unsigned int)currIndex * 4) + (struct Loc<1> *)(struct UninitializedVector<Loc<1>,3,int> *)(struct Domain<3,DomainTraits<Loc<3> > > *)loc) = ((struct DomainBase<DomainTraits<Loc<1> > > *)(struct Domain<1,DomainTraits<Loc<1> > > *)(struct Loc<1> &)((struct Loc<1> *)((long unsigned int)i * 4) + (struct Loc<1> *)(struct UninitializedVector<Loc<1>,3,int> *)(struct Domain<3,DomainTraits<Loc<3> > > *)a))->domain_m;
-  currIndex = currIndex + 1;
   i = i + 1;
-  if (i <= 2) goto <L0>; else goto <L10>;
+  if (i <= 2) goto <L13>; else goto <L10>;
+
+<L13>:;
+  currIndex = currIndex.;
+  goto <bb 1> (<L0>);

 <L10>:;
-  return currIndex;
+  return currIndex.;
 }

Looks like DOM is now missing some optimization, then. There is also a lot of re-ordering of functions in the diff, and noise (label number changes, bb number changes).
The dump files are huge (both around 50MB uncompressed), if you want to download them, I can put them to an accessible location.
Karel, all the main optimization issues that we spotted looking at the MICO regressions are supposed to be fixed now. It would be very cool if you could prepare an updated performance comparison table between 3.4.0 and today's mainline, so that we can check how mainline is doing now. Thanks
Subject: Re: [3.5 Regression] [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 Giovani, I have done comparison of 3.4.0, 3.4.1RC1 and trunk from 2004-06-30 and posted all results here: http://gcc.gnu.org/ml/gcc/2004-07/msg00391.html Cheers, Karel
Karel, would you mind posting an updated table using a recent mainline? Thanks.
Subject: Re: [3.5 Regression] [tree-ssa] Many C++ compile-time regression in 3.5-tree-ssa 040120 Hi, updated table for gcc3.4.1 and main trunk 2004-08-30 is here: http://gcc.gnu.org/ml/gcc/2004-08/msg01594.html Cheers, Karel
Can you post the new results again? A huge amount has changed since August 31, and there have been some compile-time improvements in that time.
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 Sure! Here we go: http://gcc.gnu.org/ml/gcc/2004-10/msg00952.html and the results are really promising, although some interesting regressions are still present. Cheers, Karel
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 And http://gcc.gnu.org/ml/gcc/2004-10/msg00955.html
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 In recent testing ir.cc seems to be a big culprit. It is attached preprocessed by 4.0.0-041024 for your experiments. Cheers, Karel
Created attachment 7408 [details] ir.ii.bz2
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 Updated table with GCC 3.4.2 and 4.0.0-041024 results is available here: http://gcc.gnu.org/ml/gcc/2004-10/msg00952.html -- still some regressions mainly on -O1 and -O2. Cheers, Karel
ir.cc 47.17 69.26 -31.89 72.42 129.49 -44.07 100.1 165.27 -39.43

I just sped up ir.cc a little with my patch to cp-gimplify.c (which was committed). Reference: http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01247.html

Also, my patch to remove a number of calls to is_gimple_reg speeds up optimizations: http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01284.html
Hmm, with the mainline on PPC-darwin for ir.ii at -O0 we are faster than both 3.3 and 3.1.

3.1:      51.260u 2.110s 0:56.27 94.8% 0+0k 0+7io 0pf+0w
3.3:      46.000u 3.600s 0:50.91 97.4% 0+0k 0+7io 0pf+0w
mainline: 39.730u 5.270s 0:48.27 93.2% 0+0k 0+8io 0pf+0w

Even at -O1 we are faster than 3.3:

mainline: 70.860u 5.010s 1:18.76 96.3% 0+0k 0+11io 0pf+0w
3.3:      72.650u 13.250s 1:29.99 95.4% 0+0k 0+7io 0pf+0w

For -O2 we are only 1 second slower than 3.3:

mainline: 99.720u 5.510s 1:54.78 91.6% 0+0k 0+13io 0pf+0w
3.3:      98.610u 38.800s 2:25.59 94.3% 0+0k 0+15io 0pf+0w

Could you check again on your platform?
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120

I've tested 3.4.2, 4.0.0 (20041026) and 4.0.0 (20041118) with the following results:

3.4.2:
c++ -I../include -time -O0 -Wall -DPIC -fPIC -c ir.cc -o ir.pic.o
# cc1plus 46.98 0.53
# as 4.62 0.22
peak memory consumed: 99MB

4.0.0 (20041026):
c++ -I../include -time -O0 -Wall -DPIC -fPIC -c ir.cc -o ir.pic.o
# cc1plus 67.13 2.05
# as 5.98 0.30
peak memory consumed: 243MB

4.0.0 (20041118):
c++ -I../include -time -O0 -Wall -DPIC -fPIC -c ir.cc -o ir.pic.o
# cc1plus 66.47 1.97
# as 5.84 0.27
peak memory consumed: 243MB

So both compile-time and memory-usage regressions are still present on mainline. The reason you see a speed-up in comparison with 3.1/3.3 is that 3.4.2 is a much faster compiler (at least from the MICO sources' point of view). Cheers, Karel
Created attachment 7601 [details] Top 10 functions for all preprocessed mico files at -O2 The attachment is a file with the top 10 from gprof profiles. The base compiler is GCC 3.3 (SUSE), the profiling compiler is "GNU C++ version 4.0.0 20041124 (experimental) (i686-pc-linux-gnu)" If anyone wants to see a complete gprof profile, ping me.
Created attachment 7602 [details] profile report using shark This is a run of 4 compilation of current.cc.ii at -O0.
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 I've updated comparison table for 4.0.0 20041126 compiler version. You can find it here: http://gcc.gnu.org/ml/gcc/2004-11/msg01157.html Cheers, Karel
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 On Mon, 2004-11-29 at 19:56 +0000, kgardas at objectsecurity dot com wrote: > ------- Additional Comments From kgardas at objectsecurity dot com 2004-11-29 19:56 ------- > Subject: Re: [4.0 Regression] [tree-ssa] Many > C++ compile-time regression in 4.0-tree-ssa 040120 > > > I've updated comparison table for 4.0.0 20041126 compiler version. You can > find it here: http://gcc.gnu.org/ml/gcc/2004-11/msg01157.html BTW, if I'm reading that table correctly, overall the compile time performance of mainline is actually on-par or better than 3.4 at -O0, -O1 and -O2 for this test. That's not to diminish the need to work on ir.cc, but things appear to be heading the right direction. jeff
Subject: Re: [4.0 Regression] [tree-ssa] Many C++ compile-time regression in 4.0-tree-ssa 040120 On Mon, 29 Nov 2004, law at redhat dot com wrote: > > I've updated comparison table for 4.0.0 20041126 compiler version. You can > > find it here: http://gcc.gnu.org/ml/gcc/2004-11/msg01157.html > BTW, if I'm reading that table correctly, overall the compile time > performance of mainline is actually on-par or better than 3.4 at > -O0, -O1 and -O2 for this test. Yes, you are 100% right. Karel
I noticed that for ir.ii there is some compile time spent in GC, which means we have a memory problem. I have a patch which should help a little with the memory problem, but not that much.
Note that for ir.ii at -O0 we spend more time in local alloc and global alloc with the mainline than with 3.3.2: 2.41 vs 3.86 and 3.74 vs 6.07, so someone who knows local alloc and global alloc might want to look into this. This is on powerpc-darwin, by the way; on x86 there might be a different problem. Someone should do a -ftime-report with both the mainline and 3.4.x to see if this is also true on x86.
For -O1, integration is slower in the mainline compared with 3.3.2, 2.46 vs 1.51. global alloc is also slower: 3.21 vs 2.38. Speeding those up will help. This again on powerpc-darwin. The reason why I thought 3.3.2 was much slower than the mainline was because the GC limits were low for 3.3.2 on darwin.
Hello, New comparison is here: http://gcc.gnu.org/ml/gcc/2004-12/msg01157.html Cheers, Karel
Created attachment 7858 [details] A patch to turn off local-alloc, which buys 5% for ir.cc Turning off local-alloc as in the attached patch makes compiling ir.cc 5% faster for me on powerpc-linux (from 30s to 28.5s). It seems like a good idea anyway to turn off most of local-alloc; turning it off improves SPEC scores too. I'm not sure why gcc still has it at all...
Bah, I hate profiles for "cc1plus -O2 ir.ii" without peaks:

CPU: P4 / Xeon with 2 hyper-threads, speed 3194.17 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
78641    5.2991   ggc_alloc_stat
28267    1.9047   ggc_set_mark
26230    1.7675   splay_tree_splay_helper
25018    1.6858   walk_tree
24322    1.6389   cgraph_node_for_asm
20428    1.3765   gt_ggc_mx_lang_tree_node
19586    1.3198   htab_find_slot_with_hash
16006    1.0785   compute_immediate_uses
15133    1.0197   get_stmt_operands
14481    0.9758   constrain_operands
13414    0.9039   insert_aux
13308    0.8967   decl_assembler_name_equal
12795    0.8622   find_reloads
12052    0.8121   decl_assembler_name
11986    0.8077   cse_insn
11743    0.7913   record_reg_classes
11707    0.7889   bitmap_set_bit
11630    0.7837   ix86_decompose_address
11610    0.7823   mark_set_1
11538    0.7775   optimize_stmt
11201    0.7548   iterative_hash_expr
10615    0.7153   cp_walk_subtrees
10235    0.6897   rewrite_stmt
9892     0.6666   for_each_rtx_1
9816     0.6614   get_expr_operands
9813     0.6612   invalidate
9302     0.6268   pointer_set_insert
9293     0.6262   mark_def_sites
8570     0.5775   reg_scan_mark_refs
8503     0.5730   propagate_necessity
8424     0.5676   is_gimple_reg
8322     0.5608   compute_may_aliases

No single problem to focus on...
Subject: Re: [4.0 Regression] Many C++ compile-time regressions for MICO's ORB code > Bah, I hate profiles for "cc1plus -O2 ir.ii" without peaks: > > CPU: P4 / Xeon with 2 hyper-threads, speed 3194.17 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not > stopped) with a unit mask of 0x01 (mandatory) count 100000 > samples % symbol name > 25018 1.6858 walk_tree > 24322 1.6389 cgraph_node_for_asm > 19586 1.3198 htab_find_slot_with_hash Do you have numbers on whether we are memory-bandwidth limited here? If not, we might micro-optimize hash table access somewhat more.
Subject: Re: [4.0 Regression] Many C++ compile-time regressions for MICO's ORB code On Wed, 26 Jan 2005, steven at gcc dot gnu dot org wrote: > > ------- Additional Comments From steven at gcc dot gnu dot org 2005-01-26 10:20 ------- > Bah, I hate profiles for "cc1plus -O2 ir.ii" without peaks: True. If I may add something, I would recommend looking at why ir.cc regresses so much in memory consumption in comparison with 3.4.x. If you solve this, perhaps the compile-time regressions will go away too. Thanks, Karel
Subject: Re: [4.0 Regression] Many C++ compile-time regressions for MICO's ORB code Just to note something about 4.0.0 and 3.4.2 memory usage while compiling ir.cc. 3.4.2: it quickly grows to 90MB RAM, stays there for a long time and then quickly goes to 99MB RAM, where it finishes, i.e. the majority of the time is spent with ~90MB or less of memory consumed. 4.0.0: in comparison with 3.4.2, it grows to 243MB RAM, stays there for some time (not the majority, but let's say 1/3 of the compilation certainly), then goes back to 200MB RAM consumed and finishes. Hard to tell the average memory usage here, but I think it is about 200MB. My 4.0.0 here is quite old (20041228), but I guess the picture is still the same. Thanks, Karel
It would be a Good Thing to look at the hash function. The number of collisions per search is extremely high:

String pool
entries        128928
identifiers    128928 (100.00%)
slots          262144
bytes          1846k (142k overhead)
table size     2048k
coll/search    0.8518
ins/search     0.2747
avg. entry     14.66 bytes (+/- 17.60)
longest entry  830

There is also still a lot of memory allocated at the end of the compilation:

Memory still allocated at the end of the compilation process
Size     Allocated  Used    Overhead
8        4096       200     120
16       4264k      1211k   91k
64       29M        10M     476k
128      3920k      1472k   53k
256      1240k      519k    16k
512      4084k      2026k   55k
1024     488k       390k    6832
2048     2628k      1998k   35k
4096     1160k      1160k   15k
8192     376k       368k    2632
16384    304k       288k    1064
32768    160k       128k    280
65536    704k       640k    616
131072   384k       384k    168
262144   512k       512k    112
524288   512k       512k    56
112      26M        19M     373k
208      63M        43M     883k
48       27M        14M     443k
32       18M        10M     337k
80       13M        13M     186k
Total    199M       122M    2982k

Note especially the 43MB. All of that is in the et-forest alloc-pools. Perhaps we should allocate/free them per function.

Finally, we allocate a lot of SSA_NAMEs, and varrays are problematic as always:

source location                      Garbage          Freed             Leak           Overhead         Times
varray.c:170 (varray_grow)           39485908: 3.3%   280747780:47.6%   229448: 0.2%   80866528:32.0%   552682
tree-ssanames.c:197 (make_ssa_name)  94292264: 7.9%   0: 0.0%           0: 0.0%        8572024: 3.4%    1071503
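As a sketch of the "allocate/free them per function" idea, assuming a simple block-based pool that is filled while one function is being processed and released wholesale afterwards. This is not the interface of GCC's alloc-pool.c; the `fnpool_*` names and the block layout are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header of one malloc'd block; the union keeps the element storage
   that follows it suitably aligned.  */
union fnpool_block
{
  union fnpool_block *next;
  max_align_t align;
};

struct fnpool
{
  size_t elt_size;        /* rounded up to the alignment unit */
  size_t elts_per_block;
  size_t remaining;       /* free elements left in the newest block */
  size_t n_blocks;        /* blocks currently held */
  char *next_free;
  union fnpool_block *blocks;
};

static void
fnpool_init (struct fnpool *p, size_t elt_size, size_t elts_per_block)
{
  size_t unit = sizeof (max_align_t);
  assert (elt_size > 0 && elts_per_block > 0);
  p->elt_size = (elt_size + unit - 1) / unit * unit;
  p->elts_per_block = elts_per_block;
  p->remaining = 0;
  p->n_blocks = 0;
  p->next_free = NULL;
  p->blocks = NULL;
}

/* Hand out one element, grabbing a new block when the current one is
   exhausted.  */
static void *
fnpool_alloc (struct fnpool *p)
{
  void *elt;
  if (p->remaining == 0)
    {
      union fnpool_block *b
        = malloc (sizeof *b + p->elt_size * p->elts_per_block);
      assert (b != NULL);
      b->next = p->blocks;
      p->blocks = b;
      p->next_free = (char *) (b + 1);
      p->remaining = p->elts_per_block;
      p->n_blocks++;
    }
  elt = p->next_free;
  p->next_free += p->elt_size;
  p->remaining--;
  return elt;
}

/* Give every block back; meant to be called once per function, after
   the data structures built for that function are dead.  */
static void
fnpool_release (struct fnpool *p)
{
  while (p->blocks != NULL)
    {
      union fnpool_block *next = p->blocks->next;
      free (p->blocks);
      p->blocks = next;
    }
  p->next_free = NULL;
  p->remaining = 0;
  p->n_blocks = 0;
}
```

With this shape the memory held by the pool is bounded by the largest single function rather than by the whole translation unit, at the price of re-growing the pool for each function.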
Hello, new timings for MICO ORB sources are here: http://gcc.gnu.org/ml/gcc/2005-01/msg01714.html Cheers, Karel
Karel, ir.ii does not compile since Mark Mitchell's patch to disallow floating point literals in constant expressions went in. If you could regenerate the preprocessed source, I think it should work again.
New results measured for MICO compiled with 4.0.0 20050301 are posted here: http://gcc.gnu.org/ml/gcc/2005-03/msg00132.html Cheers, Karel
I gave a quick look at this and I can't find anything that is not already fixed, especially after Karel's last results. Also having a bug with 85 comments is a good way to make developers run, so let's close this as fixed as well. If anyone in CC list believes there is something still to fix mentioned here, it is better to create a new bug.