We have a significant performance problem with optimizing C++ compilations. I did some new tests using the example from PR 3083, and here are the timings with --disable-checking (last updated 2003-04-14):

               -O0           -O1           -O2           -O3
 GCC 3.0.4    27.95         44.52         56.57         56.48
 3.2-branch   29.87  +7%    54.28 +22%    70.95 +25%    75.29 +33%
 3.3-branch   29.09  +4%    57.11 +30%    78.99 +40%    81.61 +44%
 mainline     27.06         56.09         77.77         82.02

That is, basically PR 3083 still applies, even if not as drastically as originally.

Dan Nicolaescu and Kaveh Ghazi did some further analyses:
  http://gcc.gnu.org/ml/gcc/2002-12/msg00516.html
  http://gcc.gnu.org/ml/gcc/2003-04/msg00251.html
  http://gcc.gnu.org/ml/gcc/2003-04/msg00252.html

Release: gcc version 3.3 20021025 (experimental)

Environment: i386-unknown-freebsd4.6, Pentium III/1.0

How-To-Repeat:
  for GCC 3.0-3.3: \time gcc -c -O3 generate.ii
  for GCC 3.4    : \time gcc -c -O3 generate-3.4.ii
(sources last updated 2003-01-30)
State-Changed-From-To: open->analyzed State-Changed-Why: Confirmed.
From: Steven Bosscher <s.bosscher@student.tudelft.nl>
To: pfeifer@dbai.tuwien.ac.at, gcc-gnats@gcc.gnu.org, gcc-bugs@gcc.gnu.org, nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Cc:
Subject: Re: optimization/8361: [3.3/3.4 regression] C++ compile-time performance regression
Date: Sat, 15 Mar 2003 15:29:28 +0100

http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=8361

Hi Gerald,

Could you produce compile times for this PR and see how we're doing on 3.3 and mainline? Quite a few speedup patches have gone in lately (most notably the GC limits patches), so the numbers really are way outdated now...

Greetz
Steven

P.S. Welcome back ;-)
From: Steven Bosscher <s.bosscher@student.tudelft.nl>
To: pfeifer@dbai.tuwien.ac.at, gcc-gnats@gcc.gnu.org, gcc-bugs@gcc.gnu.org, nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Cc:
Subject: Re: optimization/8361: [3.3/3.4 regression] C++ compile-time performance regression
Date: Sun, 23 Mar 2003 10:10:03 +0100

http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=8361

Janis did some more timing:
http://gcc.gnu.org/ml/gcc/2003-03/msg01425.html
From: Gerald Pfeifer <pfeifer@dbai.tuwien.ac.at>
To: gcc-gnats@gcc.gnu.org, gcc-prs@gcc.gnu.org, gcc-bugs@gcc.gnu.org
Cc: Steven Bosscher <s.bosscher@student.tudelft.nl>
Subject: Re: optimization/8361: [3.3/3.4 regression] C++ compile-time performance regression
Date: Mon, 24 Mar 2003 15:18:53 +0100 (CET)

I have now tested Mark's fix for PR 8361 and while that does not affect non-optimizing performance, the difference for -O3 is noticeable.

Below are the results of the 3.3-branch from 03/15 versus 03/24 (without explicit --disable-checking):

  % \time /files/pfeifer/gcc-3.3-0315/bin/g++ -c generate.ii
       28.50 real   27.57 user   0.75 sys
  % \time /files/pfeifer/gcc-3.3-0324/bin/g++ -c generate.ii
       28.70 real   27.42 user   0.69 sys

No difference for non-optimizing compilation...

  % \time /files/pfeifer/gcc-3.3-0315/bin/g++ -O3 -c generate.ii
      109.55 real  105.85 user   3.06 sys
  % \time /files/pfeifer/gcc-3.3-0324/bin/g++ -O3 -c generate.ii
       81.28 real   77.41 user   3.55 sys

...but a nice speedup for -O3!

Gerald
From: Andrew Pinski <pinskia@physics.uc.edu>
To: pfeifer@dbai.tuwien.ac.at, gcc-gnats@gcc.gnu.org, gcc-bugs@gcc.gnu.org, nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Cc: Andrew Pinski <pinskia@physics.uc.edu>
Subject: Re: optimization/8361: [3.3/3.4 regression] C++ compile-time performance regression
Date: Thu, 1 May 2003 07:26:47 -0400

Is there any way I can get the non-preprocessed code? The preprocessed code produces errors in the system headers.

http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=8361

Thanks,
Andrew Pinski
For powerpc-apple-darwin and 3.4, I needed to edit the generate-3.4.ii file and change size_t to be unsigned long from unsigned int.
This bug can be helped by fixing bug 10944 <http://gcc.gnu.org/PR10944>.
It looks like gcc is walking the tree too much, which slows down the compilation. I will look into when it is doing this, i.e. during inlining or at another time.
I found one of the problems with walking the trees too much: finish_function calls calls_setjmp_p even when inlining is turned off, which is not needed. This was added with this patch:

1999-12-05  Mark Mitchell  <mark@codesourcery.com>

	* decl.c (init_decl_processing): Set flag_inline_trees if
	!flag_no_inline.
	* cp-tree.h (calls_setjmp_p): Declare.
	* decl.c (finish_function): Mark functions that call setjmp as
	uninlinable.
	* optimize.c (calls_setjmp_r): New function.
	(calls_setjmp_p): Likewise.

There is an easy fix for this one: do not call calls_setjmp_p unless flag_inline_trees is non-zero.
Subject: Re: [3.3/3.4 regression] C++ compile-time performance regression

Andrew,

If you are right about all those tree walks, check out my fix for 1687 (3.3 branch only, 3.4 is in the works): http://gcc.gnu.org/PR1687. The idea is to simply use walk_tree_without_duplicates. Our front ends tend to produce horribly convoluted trees that make walk_tree walks really slow sometimes.

Gr.
Steven
I can do better than using walk_tree_without_duplicates when not optimizing: I do not have to walk the tree at all if no inlining is requested (this is just for the -O0 case), which means 3.4 might be faster than 3.0.4 (the version tested in the description). A patch is in the works; I will test it tonight.
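To illustrate why visiting each shared node only once can matter so much, here is a minimal sketch in Python. It is not GCC's C code: the Node class, the two walkers and build_shared_chain are hypothetical stand-ins for a tree with shared subtrees, for walk_tree and for walk_tree_without_duplicates. The real GCC functions take callbacks and differ in detail.

```python
class Node:
    """Hypothetical tree node; a child may be shared between parents."""

    def __init__(self, *children):
        self.children = children


def walk(node, visit):
    """Naive walk: a shared subtree is walked again for every parent."""
    visit(node)
    for child in node.children:
        walk(child, visit)


def walk_without_duplicates(node, visit, seen=None):
    """Walk that remembers visited nodes, so each node is visited once."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return
    seen.add(id(node))
    visit(node)
    for child in node.children:
        walk_without_duplicates(child, visit, seen)


def build_shared_chain(depth):
    """Build a DAG in which every level references the level below twice."""
    node = Node()
    for _ in range(depth):
        node = Node(node, node)
    return node


def count_visits(walker, root):
    """Count how many times the walker calls its visit callback."""
    counter = [0]

    def visit(_node):
        counter[0] += 1

    walker(root, visit)
    return counter[0]


root = build_shared_chain(10)
print("naive walk:        ", count_visits(walk, root))
print("without duplicates:", count_visits(walk_without_duplicates, root))
```

On this artificial 11-node structure the naive walk grows exponentially with depth while the deduplicating walk stays linear. Skipping the walk entirely, as proposed for the -O0 case, avoids even the linear cost.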
Subject: Re: [3.3/3.4 regression] C++ compile-time performance regression

pinskia at physics dot uc dot edu wrote:

>PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* gcc-bugs@gcc.gnu.org.
>
>http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8361
>
>
>pinskia at physics dot uc dot edu changed:
>
>           What    |Removed                     |Added
>----------------------------------------------------------------------------
>  BugsThisDependsOn|                            |1687
>

No it does not.
As shown in http://gcc.gnu.org/ml/gcc/2003-06/msg01596.html, a significant number of the insns generated for generate-3.4 (about 1/3 of the total) are NOTE_INSN_DELETED. I am not sure if this is a regression or not, but generating fewer useless insns should help somewhat, so this might be an interesting data point.
With the new version of the CHUD tools (beta version) which give backtraces from the top-down, I see that most of the time is spent in for_each_template_parm and the related functions for -O0 (not taking GC into account).
Yeah yeah yeah, so g++ has been slowing down since, well, forever. But no way this will be fixed for 3.3.1, or 3.3.2 for that matter. For 3.4 we're doing much better already and we still have a few months to find some speed-ups. [ Damn I wish Apple had a reputation for delivering what they promise. Then we would have a 6x faster GCC soon :-) ]
Target milestone moved back again at the request of Gerald.
Created attachment 4415: Version for 3.4 and later compilers
Created attachment 4416: Version for pre 3.4 era compilers
Postponed, yet again -- until GCC 3.3.2 at least. Nathan is working on a major improvement to type-comparison and template-matching performance, but it requires the elimination of a GNU extension. We've now agreed to eliminate that extension (default arguments on function types), but that means we have to deprecate it in GCC 3.4 and remove it in GCC 3.5, unless people are willing to move up the removal to GCC 3.4.
Subject: Re: [3.3/3.4 regression] C++ compile-time performance regression

"mmitchel at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org> writes:

| Nathan is working on a major improvement to type-comparison and
| template-matching performance, but it requires the elimination of a GNU
| extension. We've now agreed to eliminate that extension (default arguments on
| function types), but that means we have to deprecate it in GCC 3.4 and remove it
| in GCC 3.5, unless people are willing to move up the removal to GCC 3.4.

That deprecation was raised ages ago. I vote for its removal in GCC-3.4.

-- Gaby
I vote for its removal in 3.4 since it fixes PR 4205, PR 4908 as nobody knew of the extension.
Given that the new parser in GCC 3.4 will "break" (note the quotes!) many/most C++ applications one way or the other anyway, removing such a language extension from G++ seems okay in the 3.4 timeframe (and even more so if it really blocks important improvements).
I'm not a maintainer, but if asked I'd vote for abandoning the extension as well. I'm pretty sure more people would think of a bug in the compiler than an intentional feature if they encountered it in real life. And, yes, just as Gerald said: 3.4 is _the_ time to get rid of cruft in the C++ compiler. W.
Well, here we go postponing this PR yet again... This time until GCC 3.4.
Zdenek's new dominator interface helps, see: http://gcc.gnu.org/ml/gcc-patches/2003-12/msg02164.html
Some improvements lately made by Jan.
Some additional benchmark data (which will soon be outdated, and for the better it seems, thanks to the work Jan is doing): http://gcc.gnu.org/ml/gcc/2004-01/msg00657.html
This PR just keeps hanging around. How sad. But no more work will be done on this before 3.4.0, so I've postponed it until 3.4.1.
Postponed until GCC 3.4.2.
For powerpc-apple-darwin I posted two patches which help at -O0, where the compile time goes from 18.0 seconds to 15.3 seconds:
<http://gcc.gnu.org/ml/gcc-patches/2004-06/msg02029.html>
<http://gcc.gnu.org/ml/gcc-patches/2004-06/msg02031.html>
Note that these patches solve problems specific to darwin and only help there.
Postponed until GCC 3.4.3.
This PR is unlikely to be closed ever, but some fresh numbers ought to be taken for mainline. Unfortunately I don't have even a fraction of the compilers in the PR description here (only 3.3.4-debian and mainline), so no, I'm not volunteering to do it. :-) Paolo
The updated testcase doesn't compile on i686-pc-linux-gnu, with what look to be target-independent errors. Here are the first few:

/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: type/value mismatch at argument 1 in template parameter list for 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: expected a type, got 'std::iterator_traits<_Iterator>::iterator_category'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: type/value mismatch at argument 2 in template parameter list for 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: expected a type, got 'std::iterator_traits<_Iterator>::value_type'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: type/value mismatch at argument 3 in template parameter list for 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: expected a type, got 'std::iterator_traits<_Iterator>::difference_type'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: type/value mismatch at argument 4 in template parameter list for 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: expected a type, got 'std::iterator_traits<_Iterator>::pointer'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: type/value mismatch at argument 5 in template parameter list for 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator'
/sw/gcc-3.0.4/include/g++-v3/bits/stl_iterator.h:452: error: expected a type, got 'std::iterator_traits<_Iterator>::reference'

What's up?
Is there anything left to do wrt. the testcases? I saw that Nathan made some (description-only?) changes.
No, Nathan just got confused on which attachment to take.
I'm not sure how interesting it is to keep this PR open. I'll be postponing it every time we get to a release for the foreseeable future.
GCC 3.4 (CVS today) takes 35s usr on my machine. GCC 4.0 (CVS today) takes 46s usr on the same machine. The difference is entirely in DOM, into-SSA and SSA-other, which is really also into-SSA:

                           usr    sys   wall
 dominator optimization   3.16   0.02   3.26
 tree SSA rewrite         3.24   0.01   3.27
 tree SSA other           3.47   0.09   3.40

Per-pass and cumulative time spent (top 10 only):

 integration              1.09    2.30%    48.88%
 tree PHI insertion       1.21    2.56%    51.44%
 loop invariant motion    1.30    2.75%    54.18%
 global alloc             1.30    2.75%    56.93%
 CSE                      1.72    3.63%    60.56%
 parser                   3.05    6.44%    67.00%
 dominator optimization   3.16    6.68%    73.68%
 tree SSA rewrite         3.24    6.84%    80.52%
 tree SSA other           3.47    7.33%    87.85%
 expand                   5.75   12.15%   100.00%

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call   s/call  name
 1.82      8.19     8.19   13878865    0.00     0.00  is_gimple_reg
 1.50     14.95     6.76       6594    0.00     0.00  synth_mult
 1.28     20.69     5.74   12785589    0.00     0.00  ggc_alloc_stat
 1.27     26.38     5.69    3433257    0.00     0.00  free_df_for_stmt
 1.25     32.01     5.63   16868123    0.00     0.00  bitmap_set_bit
 1.19     37.35     5.34    4846931    0.00     0.00  get_stmt_operands
 1.17     42.59     5.24      62034    0.00     0.00  alloc_page
 1.15     47.75     5.16       3559    0.00     0.01  compute_immediate_uses
 0.99     52.18     4.43    6408238    0.00     0.00  htab_find_slot_with_hash
 0.98     56.60     4.42    2104725    0.00     0.00  compute_immediate_uses_for_phi
 0.93     60.76     4.16     821051    0.00     0.00  gt_ggc_mx_lang_tree_node
 0.91     64.83     4.07    7802758    0.00     0.00  register_new_def
 0.90     68.87     4.04     951728    0.00     0.00  rewrite_stmt
 0.88     72.82     3.95   30035694    0.00     0.00  bitmap_bit_p
 0.84     76.61     3.79     574332    0.00     0.00  cse_insn
 0.81     80.26     3.65     196671    0.00     0.00  compute_global_livein
 0.81     83.91     3.65     177070    0.00     0.00  insert_phi_nodes_for
 0.81     87.54     3.63    2697441    0.00     0.00  for_each_rtx
 0.81     91.16     3.62    1079773    0.00     0.00  check_phi_redundancy

which is a different way of saying "all over the map" :-(
I noticed today that my patch for PR 18507 also helps this testcase.
Here are the current results for 3.3.2 vs the mainline:

             -O0     -O1     -O2     -O3
 3.3.2     28.93   42.81   61.13   58.140
 mainline  11.06   43.18   54.86   58.35

So we are faster at -O0 but slightly slower at the optimization levels, and if we trust the numbers for 3.0.4 compared to 3.3, we are still 30% slower than 3.0.4 except at -O0.
(In reply to comment #39)
> Here is the current results for 3.3.2 vs the mainline:

Now I am getting results where -O3 is faster than -O2; that is not right.
Gerald, do you think you can find some cycles to see where we stand? I'm very curious how we do for this file, and for the rest of your test suite. (It'd be nice if you could compare mainline with some other official FSF build (3.3, 3.4), because our system compilers are profiledbootstrapped, so that gives a skewed picture...)
I am now getting results which say that at -O1 we are now faster than 3.3.2. Could someone test to make sure that they get results close to mine?
(In reply to comment #39)
> Here is the current results for 3.3.2 vs the mainline:
>              -O0     -O1     -O2     -O3
>  3.3.2     28.93   42.81   61.13   58.140
>  mainline  11.06   43.18   54.86   58.35

And more current results for the mainline on powerpc-darwin:

             -O0     -O1     -O2     -O3
 mainline  11.09   30.55   39.09   38.74

So it looks like this is really fixed: we are 40% faster than 3.3.2 at -O1 on this testcase, 56% faster at -O2 and 50% faster at -O3 (which means we have caught back up to and passed 3.0.4's numbers, if the numbers in comment #0 scale the same on powerpc). Someone should really do timings on x86 to make sure that they give about the same as powerpc.
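For anyone checking the arithmetic, the "N% faster" figures above are consistent with computing old time divided by new time, minus one. The sketch below is only an illustration in Python; the function name percent_faster is made up for this example, and the timings are the ones quoted in this comment.

```python
def percent_faster(old_seconds, new_seconds):
    """Speedup of the new time over the old one: (old / new - 1) in percent."""
    return (old_seconds / new_seconds - 1.0) * 100.0


# Compile times in seconds, copied from the two tables in this comment.
gcc_3_3_2 = {"-O1": 42.81, "-O2": 61.13, "-O3": 58.14}
mainline = {"-O1": 30.55, "-O2": 39.09, "-O3": 38.74}

speedups = {
    level: round(percent_faster(gcc_3_3_2[level], mainline[level]))
    for level in mainline
}

for level, pct in speedups.items():
    print(f"{level}: {pct}% faster than 3.3.2")
```

Note that this is a different measure from the "+N%" slowdown columns used elsewhere in this PR, which divide the new time by the baseline instead.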
Here are the timings for i686-pc-linux-gnu:

        3.0.5   3.2.3   3.3.6   3.4.4   4.0.0   4.0.0/3.0.5
 -O0     24.5    26.0    22.4    20.5    16.9       -31%
 -O1     41.8    48.3    42.8    37.3    44.8        +7%
 -O2     53.4    64.9    59.0    61.6    55.9        +5%
 -O3     54.5    68.8    62.8    64.8    57.2        +5%

The compilers are:
 3.0.5 20030502 (prerelease)
 3.2.3
 3.3.6 20050116 (prerelease)
 3.4.4 20050116 (prerelease)
 4.0.0 20050116 (experimental)

All compilers were compiled by GNU C version 3.3.6 20050116 (prerelease).
Please don't close this bug, ever! It's GCC nostalgia. ;-)
Can someone do the timings again on x86? I think we are now faster at -O1 than previous versions, and faster at all other optimization levels too. On ppc-darwin we sped up by about 3% (-O2/-O3) to 16% (-O1) between the 15th and now.
I will do timings with a bunch of gcc3.x compilers and gcc4.0.
All compilers were bootstrapped, with the following flags: "--disable-{nls,checking} --enable-languages=c,c++"

Below, gcc40 is CVS HEAD. This was on a 1.6GHz Opteron, with -m32. The machine has 4GB of memory so garbage collection times are zero, which may account for some of the rather unexpected results. For gcc34 and gcc40 I used generate-3.4.ii.bz2 (attachment 3) and for the other two I used the latest generate.ii.bz2 (attachment 4).

         gcc32     gcc33     gcc34     gcc40
 -O0   16.439s   16.172s   15.223s    6.674s
 -O1   30.265s   25.115s   20.678s   20.305s
 -O2   42.678s   34.908s   34.526s   27.418s
 -O3   47.469s   47.538s   35.706s   27.896s

I'll try to get numbers on a 32-bit machine (i686) as well.
Similar numbers on a 1.4GHz Xeon (i686):

         gcc32     gcc33     gcc34     gcc40
 -O0   18.865s   15.107s   13.286s   10.193s
 -O1   33.511s   30.096s   24.693s   23.543s
 -O2   46.527s   42.657s   42.618s   33.549s
 -O3   49.537s   43.887s   44.056s   33.917s
Considering the numbers from #44, #48, and #49, I think we can conclude that we are back to the compile times GCC 3.0 used to have. It should be noted that we have a significantly larger memory footprint though, and some of the speedups (especially from GCC 3.2 to GCC 3.3) came from smaller hacks to the GC system (collect less often, etc.). But then, most people just use the compiler with -O[0123] and no fancy --params and similar hacks, so from a user POV this bug really is fixed, mostly. I'm not sure if it is useful to keep this bug open any longer.
If you want to compare how the memory footprint has affected performance, use these flags in 3.3 and later:

  --param ggc-min-expand=30 --param ggc-min-heapsize=4096

Those are the hardcoded values that 3.2 uses to tune how often the collector runs. I would be interested to see how later versions behave when supplied these flags; this will simulate how fast we compile on memory-constrained boxes relative to 3.2.

Another, perhaps more interesting, test (but one which will take slightly more effort for you) would be to see how raising these values in 3.2 will affect performance. Some distros (RH?) did in fact raise them in their releases, so users may be comparing their cranked distro gcc-3.2 to our FSF releases. Of course, since these values are hardcoded in 3.2, you'd have to rebuild that compiler. However, I think an apples-to-apples comparison is in order before closing this PR.
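A small sketch of how such a comparison run could be scripted follows. It only builds the command line. The helper name compile_command is hypothetical, the compiler name g++ and the default -O3 are assumptions for illustration, and generate.ii is the test file from the description. The two --param values are the ones quoted above.

```python
import shlex

# GC tuning values quoted above as the ones hardcoded in GCC 3.2.
GCC_3_2_GC_PARAMS = {"ggc-min-expand": 30, "ggc-min-heapsize": 4096}


def compile_command(compiler, source, opt_level="-O3", params=None):
    """Build the argv list for one compile, adding a --param pair per entry."""
    argv = [compiler, "-c", opt_level]
    for name, value in (params or {}).items():
        argv += ["--param", f"{name}={value}"]
    argv.append(source)
    return argv


# Command with the 3.2-style GC settings, and the default one to compare with.
constrained = compile_command("g++", "generate.ii", params=GCC_3_2_GC_PARAMS)
default = compile_command("g++", "generate.ii")

print(shlex.join(constrained))
print(shlex.join(default))
```

Each list can then be timed the same way as the other measurements in this PR, for example by prefixing it with \time in a shell.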
I had done extensive benchmarks around New Year, based on Steven's request in comment #41. Unfortunately I lost most of that data directly before posting it here and couldn't repeat everything, but coincidentally I could save exactly those parts that Steven did not check now. ;-)

CVS refers to the state in early January. The following are for the full application which generate.ii is only one part of, albeit a representative one.

First the time to build with -O3 and the resulting binary size:

         | stripped | build time
 --------+----------+------------
 2.95    |  4577588 | 170.78 real
 3.2.3   |  4106176 | 219.70 real
 3.3 CVS |  1073280 | 209.02 real
 3.4 CVS |  1079120 | 189.82 real
 4.0 CVS |  1081776 | 164.86 real

Then some benchmark results for the binaries; times in seconds, smaller is better:

                        |  2.95  | 3.2.3  | 3.3 CVS | 3.4 CVS | 4.0 CVS |
 -----------------------+--------+--------+---------+---------+---------+
          STRATCOMP2-ALL|  17.96 | 127.44 |   89.51 |   21.02 |   20.47 |
         STRATCOMP-BRAVE|  77.09 |  78.33 |   77.70 |   83.33 |   82.83 |
                   2QBF1|  11.68 |  13.72 |   13.45 |   13.75 |   12.31 |
              PRIMEIMPL2|   7.52 |   8.05 |    7.21 |    7.00 |    7.42 |
                ANCESTOR|  70.44 |  69.91 |   71.22 |   67.36 |   61.36 |
           3COL-SIMPLEX1|   3.67 |   3.81 |    3.86 |    3.77 |    3.52 |
             3COL-LADDER|  77.99 |  81.11 |   81.72 |   73.23 |   71.58 |
           3COL-N-LADDER|   1.68 |   2.82 |    2.76 |    1.81 |    1.81 |
            3COL-RANDOM1|   8.38 |   8.33 |    7.84 |    8.13 |    8.61 |
              HP-RANDOM1|   6.52 |   7.29 |    7.19 |    7.90 |    7.65 |
           HAMCYCLE-FREE|  68.46 |  88.72 |   82.77 |   64.63 |   66.40 |
                 DECOMP2|   7.75 |   8.48 |    8.98 |    9.87 |    8.80 |
            BW-P5-Esra-a|  34.76 |  36.23 |   35.20 |   31.39 |   31.41 |
            BW-P8-nopush|  90.17 |  89.79 |   88.17 |   81.97 |   83.51 |
           BW-P6-pushbin|  60.23 |  62.86 |   61.34 |   59.09 |   59.94 |
         BW-P7-nopushbin|  84.94 |  87.46 |   83.80 |   79.93 |   81.23 |
                  3SAT-1|  23.91 |  24.91 |   22.55 |   22.23 |   23.19 |
       3SAT-1-CONSTRAINT|  13.97 |  14.76 |   13.51 |   13.37 |   14.15 |
            HANOI-Towers| 737.91 | 632.95 |  636.27 |  680.56 |  661.77 |
         RAMSEY(3,7)!=21|  68.93 |  73.92 |   71.77 |   74.71 |   73.59 |
 RAMSEY(3,7)!=21, normal|  83.92 |  84.02 |   83.32 |   81.23 |   79.21 |
         RAMSEY(4,6)!=25|  92.53 |  99.69 |   95.06 |   96.33 |   90.40 |
         RAMSEY(4,6)!=26| 130.68 | 142.55 |  134.61 |  134.75 |  124.73 |
                 CRISTAL|   5.75 |   5.98 |    5.67 |    5.56 |    5.29 |
                 HANOI-K|1176.06 |1289.65 | 1252.41 | 1154.43 | 1082.85 |
               21-QUEENS|   7.09 |   7.12 |    6.30 |    6.30 |    6.31 |
       MSTDir[V=13,A=40]|  14.34 |  13.02 |   12.34 |   11.50 |   11.69 |
       MSTDir[V=15,A=40]|  14.20 |  12.98 |   12.43 |   11.47 |   11.65 |
     MSTUndir[V=13,A=40]|   7.18 |   7.07 |    6.53 |    6.14 |    6.34 |
     MSTUndir[V=15,A=40]| 116.86 | 113.12 |  104.71 |   99.37 |  103.56 |
          TIMETABLING_4C| 137.64 | 140.79 |  138.66 |  173.87 |  165.50 |
      SCHOOL_TIMETABLING| 328.57 |      - |       - |  329.02 |  310.30 |

So, in terms of build time and binary size we are fine, and benchmark performance is also nicely improved on average (with some regressions, though). As for whether we can close this now, I'll just refer to comment #32 and comment #45 (and Kaveh's note on memory usage).
We have regressed since the last time someone reported on this one:

  -O0    -O1    -O2    -O3
 11.1   41.7   55.6   65.9

For -O3, the following passes stand out for compile time:

 tree PTA              : 4.04 ( 6%) usr 2.11 ( 1%) sys 4.45 ( 5%) wall   9319 kB ( 2%) ggc
 tree alias analysis   : 5.34 ( 7%) usr 1.42 ( 9%) sys 7.07 ( 8%) wall  11463 kB ( 2%) ggc
 parser                : 4.48 ( 6%) usr 2.16 (14%) sys 7.11 ( 8%) wall  95214 kB (18%) ggc
 tree operand scan     : 4.28 ( 6%) usr 2.86 (19%) sys 7.41 ( 8%) wall  22145 kB ( 4%) ggc
 dominator optimization: 3.60 ( 5%) usr 0.21 ( 1%) sys 4.02 ( 4%) wall  16448 kB ( 3%) ggc
 expand                : 3.13 ( 4%) usr 0.27 ( 2%) sys 3.53 ( 4%) wall  34210 kB ( 6%) ggc

For memory usage:

 integration           : 2.70 ( 4%) usr 0.30 ( 2%) sys 3.24 ( 4%) wall 124856 kB (24%) ggc
 parser                : 4.48 ( 6%) usr 2.16 (14%) sys 7.11 ( 8%) wall  95214 kB (18%) ggc

At -O0, compile time:

 parser                : 4.55 (33%) usr 2.00 (29%) sys 6.75 (31%) wall  94454 kB (50%) ggc
 name lookup           : 1.82 (13%) usr 2.98 (43%) sys 5.02 (23%) wall  17923 kB ( 9%) ggc
 expand                : 1.57 (11%) usr 0.40 ( 6%) sys 2.04 ( 9%) wall  33674 kB (18%) ggc
 global alloc          : 1.22 ( 9%) usr 0.06 ( 1%) sys 1.36 ( 6%) wall   8858 kB ( 5%) ggc

For memory usage, just the parser.
At -O1:

 parser                : 4.23 ( 9%) usr 2.23 (17%) sys 6.94 (11%) wall  94371 kB (22%) ggc
 integration           : 2.46 ( 5%) usr 0.29 ( 2%) sys 2.70 ( 4%) wall 104683 kB (25%) ggc
 tree PTA              : 3.48 ( 7%) usr 0.09 ( 1%) sys 3.76 ( 6%) wall   8378 kB ( 2%) ggc
 tree alias analysis   : 3.22 ( 7%) usr 1.23 ( 9%) sys 4.69 ( 7%) wall   6203 kB ( 1%) ggc
 tree SSA incremental  : 2.52 ( 5%) usr 0.30 ( 2%) sys 3.06 ( 5%) wall   3278 kB ( 1%) ggc
 tree operand scan     : 3.56 ( 7%) usr 2.32 (18%) sys 6.40 (10%) wall  18232 kB ( 4%) ggc

Memory usage:

 integration           : 2.46 ( 5%) usr 0.29 ( 2%) sys 2.70 ( 4%) wall 104683 kB (25%) ggc
 parser                : 4.23 ( 9%) usr 2.23 (17%) sys 6.94 (11%) wall  94371 kB (22%) ggc

-O2:

 expand                : 2.90 ( 5%) usr 0.24 ( 2%) sys 3.02 ( 4%) wall  31476 kB ( 7%) ggc
 tree SSA incremental  : 2.67 ( 4%) usr 0.38 ( 3%) sys 3.30 ( 4%) wall   6252 kB ( 1%) ggc
 tree operand scan     : 3.76 ( 6%) usr 2.49 (18%) sys 6.05 ( 8%) wall  19509 kB ( 4%) ggc
 dominator optimization: 2.91 ( 5%) usr 0.13 ( 1%) sys 3.14 ( 4%) wall  14117 kB ( 3%) ggc
 tree PTA              : 3.46 ( 6%) usr 0.15 ( 1%) sys 3.79 ( 5%) wall   8394 kB ( 2%) ggc
 tree alias analysis   : 3.97 ( 6%) usr 1.40 (10%) sys 5.65 ( 7%) wall  10165 kB ( 2%) ggc
 parser                : 4.41 ( 7%) usr 2.34 (17%) sys 7.21 ( 9%) wall  94371 kB (20%) ggc
 integration           : 2.48 ( 4%) usr 0.23 ( 2%) sys 2.70 ( 3%) wall 104710 kB (22%) ggc

Memory usage:

 parser                : 4.41 ( 7%) usr 2.34 (17%) sys 7.21 ( 9%) wall  94371 kB (20%) ggc
 integration           : 2.48 ( 4%) usr 0.23 ( 2%) sys 2.70 ( 3%) wall 104710 kB (22%) ggc
Current numbers for 4.0.0 vs 4.1.0:

pc64:~/src/pr8361> time ~/onetest.release/bin/gcc pr8361.ii -S -m32 -O1
21.137u 0.399s 0:21.89 98.3% 0+0k 0+0io 3pf+0w
pc64:~/src/pr8361> time gcc-4.0 pr8361.ii -S -m32 -O1
14.059u 0.269s 0:14.46 98.9% 0+0k 0+0io 2pf+0w

This on x86_64-pc-linux-gnu.

-ftime-report for 4.1.0:

Execution times (seconds)
 garbage collection    : 0.35 ( 2%) usr 0.01 ( 1%) sys 0.37 ( 2%) wall      0 kB ( 0%) ggc
 callgraph construction: 0.13 ( 1%) usr 0.01 ( 1%) sys 0.18 ( 1%) wall   4538 kB ( 1%) ggc
 callgraph optimization: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall   1193 kB ( 0%) ggc
 ipa reference         : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall    273 kB ( 0%) ggc
 ipa pure const        : 0.04 ( 0%) usr 0.01 ( 1%) sys 0.04 ( 0%) wall      0 kB ( 0%) ggc
 cfg construction      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall   1607 kB ( 0%) ggc
 cfg cleanup           : 0.14 ( 1%) usr 0.01 ( 1%) sys 0.13 ( 1%) wall    103 kB ( 0%) ggc
 trivially dead code   : 0.10 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall      0 kB ( 0%) ggc
 life analysis         : 0.52 ( 2%) usr 0.00 ( 0%) sys 0.52 ( 2%) wall   3245 kB ( 0%) ggc
 life info update      : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    292 kB ( 0%) ggc
 alias analysis        : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall   2150 kB ( 0%) ggc
 register scan         : 0.16 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall    211 kB ( 0%) ggc
 rebuild jump labels   : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall      1 kB ( 0%) ggc
 preprocessing         : 0.13 ( 1%) usr 0.15 ( 8%) sys 0.28 ( 1%) wall    591 kB ( 0%) ggc
 parser                : 1.80 ( 8%) usr 0.42 (23%) sys 2.35 (10%) wall 154459 kB (23%) ggc
 name lookup           : 0.57 ( 3%) usr 0.46 (25%) sys 0.97 ( 4%) wall  31048 kB ( 5%) ggc
 inline heuristics     : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall   7605 kB ( 1%) ggc
 integration           : 1.14 ( 5%) usr 0.01 ( 1%) sys 1.14 ( 5%) wall 162853 kB (24%) ggc
 tree gimplify         : 0.30 ( 1%) usr 0.02 ( 1%) sys 0.28 ( 1%) wall  14133 kB ( 2%) ggc
 tree eh               : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall   1795 kB ( 0%) ggc
 tree CFG construction : 0.02 ( 0%) usr 0.01 ( 1%) sys 0.04 ( 0%) wall  11718 kB ( 2%) ggc
 tree CFG cleanup      : 0.49 ( 2%) usr 0.00 ( 0%) sys 0.68 ( 3%) wall   3669 kB ( 1%) ggc
 tree copy propagation : 0.60 ( 3%) usr 0.00 ( 0%) sys 0.63 ( 3%) wall   1441 kB ( 0%) ggc
 tree store copy prop  : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.10 ( 0%) wall    181 kB ( 0%) ggc
 tree find ref. vars   : 0.25 ( 1%) usr 0.01 ( 1%) sys 0.29 ( 1%) wall  22675 kB ( 3%) ggc
 tree PTA              : 1.61 ( 7%) usr 0.02 ( 1%) sys 1.72 ( 7%) wall  10266 kB ( 2%) ggc
 tree alias analysis   : 1.05 ( 5%) usr 0.15 ( 8%) sys 1.23 ( 5%) wall  11045 kB ( 2%) ggc
 tree PHI insertion    : 0.29 ( 1%) usr 0.00 ( 0%) sys 0.29 ( 1%) wall  16546 kB ( 2%) ggc
 tree SSA rewrite      : 0.65 ( 3%) usr 0.01 ( 1%) sys 0.76 ( 3%) wall  30896 kB ( 5%) ggc
 tree SSA other        : 0.15 ( 1%) usr 0.06 ( 3%) sys 0.20 ( 1%) wall    580 kB ( 0%) ggc
 tree SSA incremental  : 1.58 ( 7%) usr 0.01 ( 1%) sys 1.34 ( 6%) wall   6475 kB ( 1%) ggc
 tree operand scan     : 1.15 ( 5%) usr 0.25 (14%) sys 1.47 ( 6%) wall  15753 kB ( 2%) ggc
 dominator optimization: 0.80 ( 4%) usr 0.04 ( 2%) sys 0.84 ( 4%) wall  14884 kB ( 2%) ggc
 tree SRA              : 0.22 ( 1%) usr 0.02 ( 1%) sys 0.20 ( 1%) wall  11416 kB ( 2%) ggc
 tree STORE-CCP        : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    165 kB ( 0%) ggc
 tree CCP              : 0.21 ( 1%) usr 0.00 ( 0%) sys 0.19 ( 1%) wall    601 kB ( 0%) ggc
 tree split crit edges : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall   6441 kB ( 1%) ggc
 tree reassociation    : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall      1 kB ( 0%) ggc
 tree FRE              : 0.51 ( 2%) usr 0.02 ( 1%) sys 0.53 ( 2%) wall  16049 kB ( 2%) ggc
 tree code sinking     : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall     54 kB ( 0%) ggc
 tree linearize phis   : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall     16 kB ( 0%) ggc
 tree forward propagate: 0.04 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall   3515 kB ( 1%) ggc
 tree conservative DCE : 0.39 ( 2%) usr 0.00 ( 0%) sys 0.49 ( 2%) wall      0 kB ( 0%) ggc
 tree aggressive DCE   : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall      0 kB ( 0%) ggc
 tree DSE              : 0.11 ( 1%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall     85 kB ( 0%) ggc
 PHI merge             : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    966 kB ( 0%) ggc
 tree loop bounds      : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall   1796 kB ( 0%) ggc
 loop invariant motion : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall    158 kB ( 0%) ggc
 tree canonical iv     : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    955 kB ( 0%) ggc
 scev constant prop    : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall    721 kB ( 0%) ggc
 complete unrolling    : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall   1340 kB ( 0%) ggc
 tree iv optimization  : 0.15 ( 1%) usr 0.01 ( 1%) sys 0.18 ( 1%) wall   7715 kB ( 1%) ggc
 tree loop init        : 0.11 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall      6 kB ( 0%) ggc
 tree copy headers     : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall   6478 kB ( 1%) ggc
 tree SSA uncprop      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall      0 kB ( 0%) ggc
 tree SSA to normal    : 0.31 ( 1%) usr 0.00 ( 0%) sys 0.41 ( 2%) wall   3411 kB ( 1%) ggc
 tree rename SSA copies: 0.10 ( 0%) usr 0.01 ( 1%) sys 0.16 ( 1%) wall      0 kB ( 0%) ggc
 dominance frontiers   : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall      0 kB ( 0%) ggc
 expand                : 1.28 ( 6%) usr 0.02 ( 1%) sys 1.12 ( 5%) wall  43499 kB ( 7%) ggc
 varconst              : 0.08 ( 0%) usr 0.02 ( 1%) sys 0.05 ( 0%) wall    403 kB ( 0%) ggc
 jump                  : 0.05 ( 0%) usr 0.01 ( 1%) sys 0.06 ( 0%) wall   1203 kB ( 0%) ggc
 CSE                   : 0.22 ( 1%) usr 0.00 ( 0%) sys 0.20 ( 1%) wall    647 kB ( 0%) ggc
 loop analysis         : 0.10 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall   1936 kB ( 0%) ggc
 branch prediction     : 0.20 ( 1%) usr 0.01 ( 1%) sys 0.18 ( 1%) wall   1979 kB ( 0%) ggc
 flow analysis         : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall      2 kB ( 0%) ggc
 combiner              : 0.46 ( 2%) usr 0.00 ( 0%) sys 0.50 ( 2%) wall   5390 kB ( 1%) ggc
 if-conversion         : 0.08 ( 0%) usr 0.01 ( 1%) sys 0.06 ( 0%) wall    308 kB ( 0%) ggc
 local alloc           : 0.25 ( 1%) usr 0.01 ( 1%) sys 0.26 ( 1%) wall   1622 kB ( 0%) ggc
 global alloc          : 0.85 ( 4%) usr 0.01 ( 1%) sys 0.78 ( 3%) wall   9331 kB ( 1%) ggc
 reload CSE regs       : 0.17 ( 1%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall   2917 kB ( 0%) ggc
 flow 2                : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall   1058 kB ( 0%) ggc
 if-conversion 2       : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall     37 kB ( 0%) ggc
 rename registers      : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.15 ( 1%) wall     21 kB ( 0%) ggc
 machine dep reorg     : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.19 ( 1%) wall     86 kB ( 0%) ggc
 shorten branches      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall      0 kB ( 0%) ggc
 final                 : 0.24 ( 1%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall   1199 kB ( 0%) ggc
 symout                : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    180 kB ( 0%) ggc
 TOTAL                 : 21.70          1.81           23.79           667733 kB

for 4.0.0:

Execution times (seconds)
 garbage collection    : 0.32 ( 2%) usr 0.00 ( 0%) sys 0.32 ( 2%) wall
 callgraph construction: 0.08 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall
 callgraph optimization: 0.04 ( 0%) usr 0.01 ( 1%) sys 0.04 ( 0%) wall
 cfg construction      : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall
 cfg cleanup           : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall
 trivially dead code   : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.09 ( 1%) wall
 life analysis         : 0.43 ( 3%) usr 0.00 ( 0%) sys 0.44 ( 3%) wall
 life info update      : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall
 alias analysis        : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall
 register scan         : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall
 rebuild jump labels   : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall
 preprocessing         : 0.11 ( 1%) usr 0.15 ( 9%) sys 0.22 ( 1%) wall
 parser                : 2.00 (14%) usr 0.46 (29%) sys 2.20 (13%) wall
 name lookup           : 0.49 ( 3%) usr 0.44 (28%) sys 1.12 ( 7%) wall
 integration           : 0.67 ( 5%) usr 0.03 ( 2%) sys 0.77 ( 5%) wall
 tree gimplify         : 0.23 ( 2%) usr 0.01 ( 1%) sys 0.35 ( 2%) wall
 tree eh               : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall
 tree CFG construction : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.11 ( 1%) wall
 tree CFG cleanup      : 0.22 ( 2%) usr 0.00 ( 0%) sys 0.15 ( 1%) wall
 tree find referenced vars: 0.23 ( 2%) usr 0.00 ( 0%) sys 0.20 ( 1%) wall
 tree PTA              : 0.37 ( 3%) usr 0.01 ( 1%) sys 0.45 ( 3%) wall
 tree alias analysis   : 0.51 ( 3%) usr 0.00 ( 0%) sys 0.56 ( 3%) wall
 tree PHI insertion    : 0.17 ( 1%) usr 0.00 ( 0%) sys 0.24 ( 1%) wall
 tree SSA rewrite      : 0.54 ( 4%) usr 0.01 ( 1%) sys 0.47 ( 3%) wall
 tree SSA other        : 0.58 ( 4%) usr 0.16 (10%) sys 0.85 ( 5%) wall
 tree operand scan     : 0.51 ( 3%) usr 0.21 (13%) sys 0.68 ( 4%) wall
 dominator optimization: 0.76 ( 5%) usr 0.05 ( 3%) sys 0.66 ( 4%) wall
 tree SRA              : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall
 tree CCP              : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall
 tree split crit edges : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall
 tree remove redundant PHIs: 0.26 ( 2%) usr 0.00 ( 0%) sys 0.33 ( 2%) wall
 tree linearize phis   : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
 tree forward propagate: 0.15 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall
 tree conservative DCE : 0.26 ( 2%) usr 0.00 ( 0%) sys 0.26 ( 2%) wall
 tree aggressive DCE   : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall
 tree DSE              : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall
 PHI merge             : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
 tree record loop bounds: 0.05 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall
 loop invariant motion : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall
 tree canonical iv creation: 0.03 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
 tree iv optimization  : 0.22 ( 2%) usr 0.01 ( 1%) sys 0.13 ( 1%) wall
 tree loop init        : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall
 tree loop fini        : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
 tree copy headers     : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall
 tree SSA to normal    : 0.33 ( 2%) usr 0.00 ( 0%) sys 0.30 ( 2%) wall
 tree NRV optimization : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
 tree rename SSA copies: 0.11 ( 1%) usr 0.00 ( 0%) sys 0.10 ( 1%) wall
 dominance frontiers   : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
 expand                : 0.97 ( 7%) usr 0.01 ( 1%) sys 1.06 ( 6%) wall
 varconst              : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.11 ( 1%) wall
 jump                  : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall
 CSE                   : 0.27 ( 2%) usr 0.00 ( 0%) sys 0.22 ( 1%) wall
 loop analysis         : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.09 ( 1%) wall
 branch prediction     : 0.18 ( 1%) usr 0.00 ( 0%) sys 0.19 ( 1%) wall
 flow analysis         : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall
 combiner              : 0.39 ( 3%) usr 0.00 ( 0%) sys 0.35 ( 2%) wall
 if-conversion         : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall
 local alloc           : 0.22 ( 2%) usr 0.00 ( 0%) sys 0.20 ( 1%) wall
 global alloc          : 0.64 ( 4%) usr 0.01 ( 1%) sys 0.66 ( 4%) wall
 reload CSE regs       : 0.21 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall
 flow 2                : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall
 if-conversion 2       : 0.07 ( 0%) usr 0.01 ( 1%) sys 0.02 ( 0%) wall
 rename registers      : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall
 machine dep reorg     : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.11 ( 1%) wall
 shorten branches      : 0.10 ( 1%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall
 final                 : 0.18 ( 1%) usr 0.02 ( 1%) sys 0.24 ( 1%) wall
 symout                : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
 rest of compilation   : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall
 TOTAL                 : 14.62          1.60           16.43
14.630u 1.630s 0:16.46 98.7% 0+0k 0+0io 0pf+0w
Even the -fno-inline case slowed down too:

pc64:~/src/pr8361> time ~/onetest.release/bin/gcc pr8361.ii -S -m32 -O1 -fno-inline
11.171u 0.359s 0:11.66 98.7% 0+0k 0+0io 0pf+0w
pc64:~/src/pr8361> time gcc-4.0 pr8361.ii -S -m32 -O1 -fno-inline
9.578u 0.295s 0:10.02 98.4% 0+0k 0+0io 0pf+0w

Interesting part of 4.1 time report:

combiner              :   0.36 ( 3%) usr   0.00 ( 0%) sys   0.31 ( 2%) wall    1933 kB ( 0%) ggc
local alloc           :   0.33 ( 3%) usr   0.00 ( 0%) sys   0.31 ( 2%) wall    3654 kB ( 1%) ggc
global alloc          :   0.85 ( 8%) usr   0.00 ( 0%) sys   0.75 ( 6%) wall   12187 kB ( 3%) ggc
tree operand scan     :   0.25 ( 2%) usr   0.10 ( 7%) sys   0.29 ( 2%) wall    9315 kB ( 2%) ggc
dominator optimization:   0.35 ( 3%) usr   0.01 ( 1%) sys   0.34 ( 3%) wall    3938 kB ( 1%) ggc
tree PTA              :   0.46 ( 4%) usr   0.02 ( 1%) sys   0.53 ( 4%) wall   19358 kB ( 5%) ggc
tree alias analysis   :   0.25 ( 2%) usr   0.04 ( 3%) sys   0.27 ( 2%) wall    9734 kB ( 2%) ggc
tree SSA rewrite      :   0.19 ( 2%) usr   0.00 ( 0%) sys   0.24 ( 2%) wall   10314 kB ( 3%) ggc
tree SSA incremental  :   0.32 ( 3%) usr   0.00 ( 0%) sys   0.26 ( 2%) wall    2956 kB ( 1%) ggc
parser                :   1.71 (15%) usr   0.42 (28%) sys   2.07 (16%) wall  154459 kB (39%) ggc
life analysis         :   0.57 ( 5%) usr   0.01 ( 1%) sys   0.51 ( 4%) wall    3127 kB ( 1%) ggc

Corresponding 4.0 time report:

combiner              :   0.31 ( 3%) usr   0.01 ( 1%) sys   0.28 ( 2%) wall
local alloc           :   0.35 ( 4%) usr   0.00 ( 0%) sys   0.47 ( 4%) wall
global alloc          :   0.74 ( 7%) usr   0.01 ( 1%) sys   0.96 ( 8%) wall
tree operand scan     :   0.18 ( 2%) usr   0.07 ( 5%) sys   0.23 ( 2%) wall
dominator optimization:   0.42 ( 4%) usr   0.01 ( 1%) sys   0.31 ( 3%) wall
tree PTA              :   0.14 ( 1%) usr   0.01 ( 1%) sys   0.11 ( 1%) wall
tree alias analysis   :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.07 ( 1%) wall
tree SSA rewrite      :   0.17 ( 2%) usr   0.00 ( 0%) sys   0.18 ( 2%) wall
parser                :   1.71 (17%) usr   0.41 (32%) sys   2.22 (20%) wall
life analysis         :   0.43 ( 4%) usr   0.00 ( 0%) sys   0.61 ( 5%) wall
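For anyone who wants to repeat this comparison mechanically, here is a minimal sketch (not part of the original report) that extracts the usr column from two time reports and lists the per-pass differences. It assumes rows in the "name : N.NN ( P%) usr ..." layout shown above; `usr_times` and `usr_deltas` are hypothetical helper names, and only three rows from each report are reproduced as sample input.

```python
import re

# One row of a time report, e.g.
# "tree PTA : 0.46 ( 4%) usr 0.02 ( 1%) sys 0.53 ( 4%) wall"
# Only the pass name and the usr seconds are captured.
ROW = re.compile(r"^\s*(?P<name>.+?)\s*:\s*(?P<usr>\d+\.\d+)\s*\(\s*\d+%\)\s*usr")

def usr_times(report):
    """Map pass name -> usr seconds for every row that matches ROW."""
    times = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            times[m.group("name")] = float(m.group("usr"))
    return times

def usr_deltas(new_report, old_report):
    """Return (pass, new - old) pairs for passes in both reports, largest first."""
    new, old = usr_times(new_report), usr_times(old_report)
    deltas = [(name, round(new[name] - old[name], 2)) for name in new if name in old]
    return sorted(deltas, key=lambda d: d[1], reverse=True)

# Sample rows copied from the 4.1 and 4.0 excerpts above.
REPORT_41 = """\
tree PTA : 0.46 ( 4%) usr 0.02 ( 1%) sys 0.53 ( 4%) wall 19358 kB ( 5%) ggc
tree alias analysis : 0.25 ( 2%) usr 0.04 ( 3%) sys 0.27 ( 2%) wall 9734 kB ( 2%) ggc
life analysis : 0.57 ( 5%) usr 0.01 ( 1%) sys 0.51 ( 4%) wall 3127 kB ( 1%) ggc
"""
REPORT_40 = """\
tree PTA : 0.14 ( 1%) usr 0.01 ( 1%) sys 0.11 ( 1%) wall
tree alias analysis : 0.05 ( 1%) usr 0.00 ( 0%) sys 0.07 ( 1%) wall
life analysis : 0.43 ( 4%) usr 0.00 ( 0%) sys 0.61 ( 5%) wall
"""

for name, delta in usr_deltas(REPORT_41, REPORT_40):
    print(f"{name:24s} {delta:+.2f}s usr")
```

Passes that appear in only one report (such as "tree SSA incremental", which has no 4.0 row above) are skipped by this sketch rather than treated as zero.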
Is this PR really a 4.0 regression? The timings which I see in the comments suggest that 4.0 is just as fast as earlier releases. That is, the PR may have become a 4.1 regression, but I don't see that it is a 4.0 regression.
A semi recent 4.1 (the 10th) gives:

tree PTA              :   1.60 ( 6%) usr   0.02 ( 1%) sys   1.73 ( 6%) wall   10338 kB ( 1%) ggc
tree alias analysis   :   1.32 ( 5%) usr   0.19 (10%) sys   1.48 ( 5%) wall   18910 kB ( 3%) ggc

while 4.0 gave:

tree PTA              :   0.50 ( 2%) usr   0.00 ( 0%) sys   0.48 ( 2%) wall
tree alias analysis   :   0.73 ( 3%) usr   0.00 ( 0%) sys   0.76 ( 3%) wall

So this is definitely a 4.1 regression.
Subject: Re: [3.4/4.0/4.1 regression] C++ compile-time performance regression

On Thu, 2005-10-13 at 03:34 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Comment #57 from pinskia at gcc dot gnu dot org 2005-10-13 03:34 -------
> A semi recent 4.1 (the 10th) gives:
> tree PTA              :   1.60 ( 6%) usr   0.02 ( 1%) sys   1.73 ( 6%) wall   10338 kB ( 1%) ggc
> tree alias analysis   :   1.32 ( 5%) usr   0.19 (10%) sys   1.48 ( 5%) wall   18910 kB ( 3%) ggc
>
> while 4.0 gave:
> tree PTA              :   0.50 ( 2%) usr   0.00 ( 0%) sys   0.48 ( 2%) wall
> tree alias analysis   :   0.73 ( 3%) usr   0.00 ( 0%) sys   0.76 ( 3%) wall
>
> So this is definitely a 4.1 regression.

I'm pretty sure we run PTA more times in 4.1 than in 4.0. Maybe I'm wrong. Can you oprofile this and give me some kind of hotspot to look into in PTA?
I'm going to mark this as just a 4.1 regression. As far as I can see, 4.0 was OK. And there is zero chance that we are going to address any of these issues in 3.4, except perhaps coincidentally.
I'd like to fix this for 4.1, but not at the expense of destabilizing things, or losing performance.
(In reply to comment #60)
> I'd like to fix this for 4.1, but not at the expense of destabilizing things,
> or losing performance.

Does this contradict what you wrote in PR 15678?

> However, in the meanwhile, I've downgraded this to P4. A small compile-time
> increase isn't going to block the upcoming releases.

This really is a small increase, about 2-3 seconds.
Compile times for generate-3.4.ii
All compilers bootstrapped, with checking disabled.

Flags: -O2

GCC 4.0 (release branch today):
real    0m22.795s    0m22.727s    0m22.760s
user    0m22.481s    0m22.297s    0m22.357s
sys     0m0.316s     0m0.412s     0m0.404s

GCC 4.1 (release branch today):
real    0m29.888s    0m28.450s    0m28.420s
user    0m28.154s    0m27.906s    0m27.894s
sys     0m0.496s     0m0.544s     0m0.524s

GCC 4.2 (trunk today):
real    0m33.715s    0m31.524s    0m31.483s
user    0m31.466s    0m31.034s    0m31.022s
sys     0m0.424s     0m0.492s     0m0.460s

Flags: -O3

GCC 4.0 (release branch today):
real    0m24.412s    0m25.000s    0m24.771s
user    0m23.921s    0m24.430s    0m24.210s
sys     0m0.368s     0m0.408s     0m0.420s

GCC 4.1 (release branch today):
real    0m33.260s    0m33.140s    0m33.188s
user    0m32.602s    0m32.522s    0m32.554s
sys     0m0.556s     0m0.544s     0m0.600s

GCC 4.2 (trunk today):
real    0m36.544s    0m36.614s    0m36.492s
user    0m35.950s    0m35.942s    0m35.994s
sys     0m0.544s     0m0.600s     0m0.464s

Significant compile time sinks in GCC 4.1 that don't appear in GCC 4.0:

tree PTA              :   2.31 ( 7%) usr
tree SSA incremental  :   2.14 ( 6%) usr
expand                :   1.71 ( 5%) usr

The same passes cost the most time in GCC 4.2. The expand cost has increased. The other two are not new; they just run very often or didn't have their own time vars before. The overall problem seems to be that we just run too many passes too often; nothing really stands out.
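To put the -O2 numbers above in relative terms, here is a small sketch that averages the three user-time runs per compiler and expresses each as a slowdown against GCC 4.0. The figures are copied from the table; the names `USER_O2` and `slowdown_vs` are only illustrative, and a plain mean over three runs is an assumption about how one might summarise them.

```python
from statistics import mean

# User times in seconds at -O2, three runs per compiler, copied from above.
USER_O2 = {
    "4.0": [22.481, 22.297, 22.357],
    "4.1": [28.154, 27.906, 27.894],
    "4.2": [31.466, 31.034, 31.022],
}

def slowdown_vs(baseline, times):
    """Percent change of each version's mean user time relative to `baseline`."""
    base = mean(times[baseline])
    return {version: (mean(runs) / base - 1.0) * 100.0
            for version, runs in times.items()}

SLOWDOWN = slowdown_vs("4.0", USER_O2)
for version, pct in SLOWDOWN.items():
    print(f"GCC {version}: {pct:+.1f}% user time vs 4.0")
```

With these inputs the 4.1 branch comes out about 25% slower than 4.0 and the trunk about 39% slower, which matches the impression given by the raw timings.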
Subject: Re: [4.1/4.2 regression] C++ compile-time performance regression

> Flags: -O3
>
> GCC 4.0 (release branch today):
> real    0m24.412s    0m25.000s    0m24.771s
> user    0m23.921s    0m24.430s    0m24.210s
> sys     0m0.368s     0m0.408s     0m0.420s
>
> GCC 4.1 (release branch today):
> real    0m33.260s    0m33.140s    0m33.188s
> user    0m32.602s    0m32.522s    0m32.554s
> sys     0m0.556s     0m0.544s     0m0.600s
>
> GCC 4.2 (trunk today):
> real    0m36.544s    0m36.614s    0m36.492s
> user    0m35.950s    0m35.942s    0m35.994s
> sys     0m0.544s     0m0.600s     0m0.464s
>
> Significant compile time sinks in GCC 4.1 that don't appear in GCC 4.0:
> tree PTA              :   2.31 ( 7%) usr
> tree SSA incremental  :   2.14 ( 6%) usr
> expand                :   1.71 ( 5%) usr

So, could you do me a favor if you get a chance, and change the macro DONT_PROPAGATE_WITH_ANYTHING to 1 in tree-ssa-structalias.c, and see if it speeds it up at all?
DONT_PROPAGATE_WITH_ANYTHING only exists on the trunk. With that flag, the timings are:

Flags: -O2
GCC 4.2 (trunk today):
real    0m31.704s
user    0m31.094s
sys     0m0.584s

Flags: -O3
GCC 4.2 (trunk today):
real    0m36.206s
user    0m35.718s
sys     0m0.484s

So no, it doesn't help. Again, the problem seems to be more that we just run so many passes, not that one or two specific passes are to blame for most of the compile time.
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Does anyone have new numbers for this? Richard G.'s recent memory patches also have an effect on compile time; I noticed between 7% and 10% on at least CSiBE.
New timings. These were taken on the same box as those of comment #62 and comment #64 (Intel x86_64 3.20GHz, 1GB RAM). Times are usr times.

Invocation: time g++ -S -fpermissive -Ox -m64 generate-3.4.ii
GC params for cc1plus: --param ggc-min-expand=98 --param ggc-min-heapsize=127550

version    ID                -O2          -O3
GCC 3.4    3.4.6             0m23.673s    0m24.362s
GCC 4.0    4.0.4 20060725    0m23.009s    0m23.849s
GCC 4.1    4.1.2 20060725    0m24.018s    0m25.294s
GCC 4.2    4.2.0 20060724    0m25.214s    0m26.242s
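As a rough check on how far the gap has closed, the sketch below converts the usr stamps from the table above to seconds and reports each compiler's change relative to GCC 3.4.6. The choice of 3.4 as the baseline and the helper names `seconds` and `percent_vs` are assumptions made for illustration only.

```python
import re

def seconds(stamp):
    """Convert a shell `time` stamp such as '0m23.673s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# usr times copied from the table above: (-O2, -O3) per compiler.
USR = {
    "3.4": ("0m23.673s", "0m24.362s"),
    "4.0": ("0m23.009s", "0m23.849s"),
    "4.1": ("0m24.018s", "0m25.294s"),
    "4.2": ("0m25.214s", "0m26.242s"),
}

def percent_vs(baseline, table):
    """Percent change per optimization level relative to `baseline`."""
    base = [seconds(s) for s in table[baseline]]
    return {
        version: tuple((seconds(s) / b - 1.0) * 100.0
                       for s, b in zip(stamps, base))
        for version, stamps in table.items()
    }

CHANGE = percent_vs("3.4", USR)
for version, (o2, o3) in CHANGE.items():
    print(f"GCC {version}: -O2 {o2:+.1f}%  -O3 {o3:+.1f}%")
```

On these numbers GCC 4.0 is slightly faster than 3.4.6 at both levels, and 4.1 and 4.2 are within single-digit percentages of it.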
Re. comment #68, I should have added that all compilers were built with "gcc (GCC) 4.0.2 20050901 (prerelease) (SUSE Linux)" with CFLAGS="-O2 -g".
Based on my numbers of comment #69, I'm declaring this fixed once more.