Bug 18687

Summary: >200% compile time regression compared to gcc-3.4
Product: gcc Reporter: Serge Belyshev <belyshev>
Component: tree-optimizationAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: andreasmeier80, dimhen, esigra, gcc-bugs, hubicka, pinskia, sjames, spark, steven, vermontgcc, vmakarov
Priority: P2 Keywords: compile-time-hog
Version: 4.0.0   
Target Milestone: ---   
Host: Target:
Build: Known to work: 3.4.4
Known to fail: 4.0.0, 4.0.4, 4.6.0 Last reconfirmed: 2009-12-28 19:51:37
Bug Depends on: 15678    
Bug Blocks: 47344, 22325    
Attachments: 3338 bytes
9535 bytes
perl script (463 bytes)
test case (325 KB)
testcase (304 KB)
profile for cc1 r155486 on x86_64, options -O2, for infcodes100.c
profile for cc1 r155486 on x86_64, options -O2, for hashes100.c

Description Serge Belyshev 2004-11-27 04:30:07 UTC
These two files taken from linux kernel (hashes.c) and zlib (infcodes.c)

hashes.c:

        3.4.4   4.0.0   delta

-O0     0.045   0.049    9%
-O1     0.067   0.144  115%
-O2     0.108   0.177   64%

infcodes.c:

        3.4.4   4.0.0   delta

-O0     0.080   0.086    7%
-O1     0.120   0.223   86%
-O2     0.196   0.295   51%
Comment 1 Serge Belyshev 2004-11-27 04:31:29 UTC
Created attachment 7616 [details]
3338 bytes
Comment 2 Serge Belyshev 2004-11-27 04:32:27 UTC
Created attachment 7617 [details]
9535 bytes
Comment 3 Andrew Pinski 2004-11-27 15:47:33 UTC
Hmm, These are really short compile times to begin with.  Almost anything this short could be 
considered just error but then again, the compile time for me on ppc-darwin was about the same time 
as 3.3 as 4.0.0.
Comment 4 Serge Belyshev 2004-11-27 18:44:29 UTC
Created attachment 7618 [details]
perl script (463 bytes)

Use this script to get more accurate timings. (install
http://search.cpan.org/~jhi/BSD-Resource-1.24/ if you want to get even more
accurate timings).
Comment 5 Andrew Pinski 2004-11-30 13:09:22 UTC
Confirmed, how many new builtins do we have between 3.4.x and 4.0.0, this will most likely account for 
the -O0 time.
Comment 6 Andrew Pinski 2004-11-30 13:10:19 UTC
In both of my profiles for -O0 and -O1, I see that is_attribute_p is always high up there.
Comment 7 Andrew Pinski 2004-11-30 14:32:50 UTC
At -O2 and -O3 also I see is_attribute_p is always taking 2% of the time.
At -O0:
- is_attribute_p -  6%
- decode_reg_name - 2.1%
at -O1:
- is_attribute_p -  2.5%
- decode_reg_name 0.8%
at -O2:
 - is_attribute_p -  1.9%
- decode_reg_name  - 0.5%
at -O3 (this was a bigger sample than the obove ones):
- is_attribute_p -  1.8%
- decode_reg_name - 0.6%
Comment 8 Andrew Pinski 2004-11-30 14:57:38 UTC
I have a fix for decode_reg_name and there is a branch with the fix for is_attribute_p, maybe that 
should be merged in ...
Comment 9 Andrew Pinski 2004-11-30 15:17:22 UTC
The patch to fix decode_reg_name is here: <http://gcc.gnu.org/ml/gcc-patches/2004-11/
msg02683.html>, I don't know how much this buys us but should buys us at -O0 about 2%.
Comment 10 Andrew Pinski 2004-12-02 14:06:16 UTC
Can you try again after my patch to varasm.c, I cannot get any accrete results on these tests.
Comment 11 Andrew Pinski 2004-12-02 15:27:34 UTC
Patch to speedup lookup_attribute here: <http://gcc.gnu.org/ml/gcc-patches/2004-12/
msg00164.html>, this should help but not all 6% at -O0.

Note for -O0, we are slowing down because there are more attributes and more builtins functions for 
4.0.0 compared with 3.4.x.
Comment 12 Serge Belyshev 2004-12-03 00:24:07 UTC
        3.4.4   4.0.0   delta

hashes.c:
-O0     0.0515  0.0526    2%           
-O1     0.078	0.157   101%
-O2     0.123	0.198    61%

infcodes.c:
-O0     0.0907  0.0933    3%  
-O1     0.137 	0.248    81%
-O2     0.224 	0.328    46%

Average speedup (not only -O0) is 6%.
Comment 13 Serge Belyshev 2004-12-22 21:47:07 UTC
Created attachment 7805 [details]
test case (325 KB)
Comment 14 Serge Belyshev 2004-12-22 21:48:03 UTC
Created attachment 7806 [details]
testcase (304 KB)
Comment 15 Serge Belyshev 2004-12-22 21:50:22 UTC
Hope these two larger test cases will help profiling.

		3.4.4		4.0.0		delta

hashes100.c:

-O0		 3.7		 3.8		  3%
-O1		 6.4		14.5		126%
-O2		11.3		18.7		 65%
-O3		11.9		19.0		 60%


infcodes100.c:

-O0		 7.4		 7.6		  3%
-O1		12.6		25.4		101%
-O2		21.4		33.8		 58%
-O3		22.4		34.7		 55%
Comment 16 Steven Bosscher 2005-01-28 00:55:00 UTC
Did anyone try to profile these already? 
 
Comment 17 Steven Bosscher 2005-01-30 22:53:47 UTC
Profile for hashes100.c on an x86-64 with "-O2 -m32": 
 
CPU: Hammer, speed 1394.98 MHz (estimated) 
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 4000 
Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 
(No unit mask) count 1000 
samples  %        samples  %        image name       symbol name 
27235     2.3397  158       0.5518  cc1              constrain_operands 
21936     1.8845  53        0.1851  cc1              record_reg_classes 
21735     1.8672  399       1.3936  cc1              reg_scan_mark_refs 
20966     1.8012  1179      4.1179  libc.so.6        __GI_memset 
20786     1.7857  418       1.4600  cc1              ggc_alloc_stat 
20199     1.7353  168       0.5868  cc1              fold 
18374     1.5785  589       2.0572  libc.so.6        _int_malloc 
16467     1.4147  177       0.6182  cc1              iterative_hash_expr 
15562     1.3369  324       1.1316  cc1              htab_find_slot_with_hash 
15080     1.2955  842       2.9409  cc1              note_stores 
13709     1.1777  78        0.2724  cc1              find_reloads 
13054     1.1214  45        0.1572  cc1              nonzero_bits1 
12297     1.0564  84        0.2934  cc1              mark_used_regs 
11402     0.9795  68        0.2375  cc1              refers_to_regno_p 
11335     0.9738  95        0.3318  cc1              cse_insn 
11088     0.9526  71        0.2480  cc1              ix86_decompose_address 
10401     0.8935  70        0.2445  cc1              for_each_rtx_1 
9731      0.8360  354       1.2364  cc1              count_reg_usage 
9176      0.7883  202       0.7055  cc1              bitmap_set_bit 
8578      0.7369  284       0.9919  cc1              bitmap_bit_p 
8137      0.6990  208       0.7265  cc1              extract_insn 
8019      0.6889  30        0.1048  cc1             
reload_cse_simplify_operands 
7755      0.6662  541       1.8896  cc1              pool_alloc 
7341      0.6307  143       0.4995  cc1              et_splay 
6992      0.6007  319       1.1142  cc1              init_alias_analysis 
 
 
Comment 18 Steven Bosscher 2005-01-30 22:55:40 UTC
The profile for infcodes100.c is essentially the same. 
 
Comment 19 Steven Bosscher 2005-01-30 23:42:21 UTC
IVopts takes a lot of time (in several runs): 
 tree iv optimization  :   1.12 ( 9%) usr   0.16 (23%) sys   1.34 (10%) wall 
 
Comment 20 Zdenek Dvorak 2005-02-08 00:29:23 UTC
Out of the time spent in ivopts:

-- this patch saves some 25%:
   http://gcc.gnu.org/ml/gcc-patches/2004-10/msg00136.html

-- another 25% are caused by inefficient use of # of iterations analysis;
   I am thinking about solution

-- yet another 25% are spent in scalar evolutions analysis; I am working on
   speeding this up
Comment 21 Zdenek Dvorak 2005-02-08 16:43:42 UTC
Patch (improving ivopts performance on the testcase by some 40%):

http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00307.html
Comment 22 GCC Commits 2005-02-10 00:33:06 UTC
Subject: Bug 18687

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rakdver@gcc.gnu.org	2005-02-10 00:32:49

Modified files:
	gcc            : ChangeLog tree-flow.h tree-ssa-loop-ivcanon.c 
	                 tree-ssa-loop-ivopts.c tree-ssa-loop-niter.c 
	                 tree.c tree.h 

Log message:
	PR tree-optimization/18687
	* tree-flow.h (find_loop_niter): Declare.
	* tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables):
	Try using scev even for loops with more than one exit.
	* tree-ssa-loop-ivopts.c (struct loop_data): Removed niter field.
	(struct ivopts_data): Added niters field.
	(struct nfe_cache_elt): New.
	(nfe_hash, nfe_eq, niter_for_exit, niter_for_single_dom_exit): New
	functions.
	(tree_ssa_iv_optimize_init): Initialize niters cache.
	(determine_number_of_iterations): Removed.
	(find_induction_variables): Do not call determine_number_of_iterations.
	Access niters for single exit through niter_for_single_dom_exit.
	(add_iv_outer_candidates): Access niters for single exit through
	niter_for_single_dom_exit.
	(may_eliminate_iv): Take data argument.  Use niter_for_exit.  Do not use
	number_of_iterations_cond.
	(iv_period): New function.
	(determine_use_iv_cost_condition): Pass data to may_eliminate_iv.
	(may_replace_final_value): Take data argument.  Use
	niter_for_single_dom_exit.
	(determine_use_iv_cost_outer): Pass data to may_replace_final_value.
	(rewrite_use_compare): Pass data to may_eliminate_iv.
	(rewrite_use_outer): Pass data to may_replace_final_value.
	(free_loop_data): Clean up the niters cache.
	(tree_ssa_iv_optimize_finalize): Free the niters cache.
	(tree_ssa_iv_optimize_loop): Do not call loop_commit_inserts.
	* tree-ssa-loop-niter.c (find_loop_niter): New function.
	(find_loop_niter_by_eval): Use tree_int_cst_lt.
	(num_ending_zeros): Moved to tree.c.
	* tree.h (num_ending_zeros): Declare.
	* tree.c (num_ending_zeros): Moved from tree.c.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7427&r2=2.7428
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-flow.h.diff?cvsroot=gcc&r1=2.80&r2=2.81
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-ivcanon.c.diff?cvsroot=gcc&r1=2.5&r2=2.6
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-ivopts.c.diff?cvsroot=gcc&r1=2.44&r2=2.45
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-niter.c.diff?cvsroot=gcc&r1=2.18&r2=2.19
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree.c.diff?cvsroot=gcc&r1=1.463&r2=1.464
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree.h.diff?cvsroot=gcc&r1=1.687&r2=1.688

Comment 23 Andrew Pinski 2005-02-10 00:51:15 UTC
(In reply to comment #22)
So does someone do more testings again?
Comment 24 Serge Belyshev 2005-02-10 15:55:21 UTC
                3.4.4           4.0.0           delta

hashes100.c:

-O0              3.66            3.81             4%
-O1              6.43           11.47            78%
-O2             11.40           17.89            57%
-O3             12.00           18.35            53%


infcodes100.c:

-O0              7.24            7.66             6%
-O1             12.43           22.24            79%
-O2             21.28           31.54            48%
-O3             22.37           32.02            43%
Comment 25 Paolo Bonzini 2005-08-17 13:06:20 UTC
what about times on 4.1?
Comment 26 Serge Belyshev 2005-08-17 16:10:25 UTC
        3.4.5   4.0.2   4.1.0           3.4->4.0 (%)    4.0->4.1 (%)

hashes100.c:

O0      1.56    1.59    1.57            2.2            -1.4
O1      2.54    4.23    4.49           66.              6.1
O2      4.43    6.29    6.56           42.              4.4
O3      4.74    6.33    6.66           34.              5.1

infcodes100.c:

O0      3.21    3.69    3.67           15.             -0.64
O1      4.68    7.06    7.72           51.              9.3
O2      8.01    9.84   10.86           23.             10.
O3      8.45    9.94   10.95           18.             10.


Note this is a different machine than in my previous comments (2GHz k8 with
1024K of L2 cache in 64bit mode, previous was 1GHz k7 with 64K of L2).
Comment 27 Mark Mitchell 2005-10-31 01:59:42 UTC
I'm going to mark as P4.  I'd like to see it fixed, but I'm not sure there's much we can do for 4.1.
Comment 28 Mark Mitchell 2006-02-24 00:25:27 UTC
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Comment 29 Mark Mitchell 2006-05-25 02:32:16 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 30 Andrew Pinski 2006-07-05 09:14:56 UTC
Can you do timings on these again on the mainline since it looks like Richard G.'s memory patches also improved compile time for C at least on the CSiBE benchmark.
Comment 31 Steven Bosscher 2006-09-03 11:05:23 UTC
"real" times for hashes100.c (x86_64-linux, Intel Xeon 3.2 GHz, 1GB RAM):

        3.4.6      4.0.4      4.1.2      4.2-svn20060724
-O0     0m1.618s   0m1.762s   0m1.661s   0m1.645s
-O1     0m2.743s   0m4.646s   0m4.984s   0m4.936s
-O2     0m4.686s   0m6.814s   0m7.140s   0m7.603s


"real" times for infcodes100.c:

        3.4.6      4.0.4      4.1.2      4.2-svn20060724
-O0     0m3.040s   0m3.643s   0m3.555s   0m3.575s
-O1     0m4.989s   0m7.694s   0m8.809s   0m8.943s
-O2     0m8.375s   0m10.622s  0m12.136s  0m13.285s
Comment 32 Steven Bosscher 2006-09-03 11:37:37 UTC
Just to be sure that between 7/24 and today we didn't speed up significantly:

"real" times for hashes100.c (x86_64-linux, Intel Xeon 3.2 GHz, 1GB RAM):

        3.4.6      4.2-svn20060903    delta
-O0     0m1.618s   0m1.634s           +1%
-O1     0m2.743s   0m5.175s           +88%
-O2     0m4.686s   0m7.719s           +65%

"real" times for infcodes100.c:
        3.4        4.2-svn20060903    delta
-O0     0m3.040s   0m3.526s           +16%
-O1     0m4.989s   0m8.871s           +77%
-O2     0m8.375s   0m13.334s          +59%


Given these numbers, I would stick with gcc3 if I were a kernel developer.
Comment 33 Steven Bosscher 2006-09-03 11:41:25 UTC
FWIW, the oprofile for both test cases is basically flat, nothing stands out. We just do _so_ much more work (many more passes without removing anything) and that hurts apparently (not surprising of course).
Comment 34 Richard Biener 2006-09-03 13:22:36 UTC
FYI, the profile (-O2) looks like

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  2.04      0.62     0.62  5210670     0.00     0.00  htab_find_slot_with_hash
  2.01      1.23     0.61   135824     0.00     0.00  find_reloads
  1.78      1.77     0.54  1181450     0.00     0.00  constrain_operands
  1.71      2.29     0.52  1667653     0.00     0.00  walk_tree
  1.64      2.79     0.50   113076     0.00     0.00  record_reg_classes
  1.18      3.15     0.36  1374013     0.00     0.00  for_each_rtx_1
  1.15      3.50     0.35  2188275     0.00     0.00  iterative_hash_expr
  1.02      3.81     0.31 13803214     0.00     0.00  bitmap_bit_p
  0.99      4.11     0.30   146294     0.00     0.00  reload_cse_simplify_operands
  0.92      4.39     0.28  8238693     0.00     0.00  bitmap_set_bit
  0.89      4.66     0.27 13524760     0.00     0.00  is_gimple_min_invariant
  0.85      4.92     0.26  1944894     0.00     0.00  extract_insn
  0.76      5.15     0.23  3242786     0.00     0.00  note_stores
  0.76      5.38     0.23  1848575     0.00     0.00  mark_set_1
  0.76      5.61     0.23  1073359     0.00     0.00  fold_binary
  0.69      5.82     0.21  3530376     0.00     0.00  ix86_decompose_address
  0.66      6.02     0.20  2801807     0.00     0.00  is_gimple_reg
  0.66      6.22     0.20      101     0.00     0.02  reload
  0.62      6.41     0.19  6596295     0.00     0.00  find_reg_note
  0.62      6.60     0.19  3748843     0.00     0.00  ggc_alloc_stat
  0.62      6.79     0.19  1543768     0.00     0.00  force_fit_type
  0.59      6.97     0.18  5160059     0.00     0.00  pool_alloc
  0.59      7.15     0.18   937999     0.00     0.00  make_node_stat
  0.59      7.33     0.18     1915     0.00     0.00  cleanup_cfg
  0.56      7.50     0.17  2132681     0.00     0.00  mark_all_vars_used_1
  0.56      7.67     0.17  1015399     0.00     0.00  get_expr_operands
  0.56      7.84     0.17   263419     0.00     0.00  cse_insn
  0.53      8.00     0.16  2162651     0.00     0.00  cselib_lookup
  0.53      8.16     0.16  1176845     0.00     0.00  mark_used_regs
  0.53      8.32     0.16      101     0.00     0.01  reload_as_needed
  0.49      8.47     0.15  1748131     0.00     0.00  operand_equal_p
  0.49      8.62     0.15  1160005     0.00     0.00  propagate_one_insn
  0.49      8.77     0.15  1086424     0.00     0.00  et_splay
  0.49      8.92     0.15  1030222     0.00     0.00  rtx_cost
  0.49      9.07     0.15   523805     0.00     0.00  count_reg_usage
  0.46      9.21     0.14  3222872     0.00     0.00  memory_operand
  0.46      9.35     0.14  1651283     0.00     0.00  htab_find_with_hash
  0.46      9.49     0.14   693604     0.00     0.00  rewrite_update_stmt
  0.46      9.63     0.14   457856     0.00     0.00  mul_double

we're hashing SCEV data a lot:

                0.01    0.07  285600/2683522     set_instantiated_value [395]
                0.01    0.09  398929/2683522     build_int_cst_wide [214]
                0.02    0.20  864500/2683522     find_var_scev_info [184]
[68]     2.3    0.06    0.63 2683522         htab_find_slot [68]
                0.34    0.16 2683522/11019933     htab_find_slot_with_hash <cycle 26> [53]
                0.03    0.00 1150100/1603800     hash_scev_info [589]


                0.06    0.03  125694/1181450     reload_cse_simplify_operands [94]
                0.09    0.05  201473/1181450     find_matches [200]
                0.32    0.16  698486/1181450     extract_constrain_insn_cached [67]
[61]     2.7    0.54    0.28 1181450+14400   constrain_operands [61]
                0.01    0.15  806498/873917      strict_memory_address_p [236]
                0.05    0.00  272106/402044      operands_match_p [415]


                0.05    0.04  100911/1588625     create_ssa_var_map [329]
                0.07    0.06  144529/1588625     walk_tree_without_duplicates [268]
                0.14    0.13  292781/1588625     count_uses_and_derefs [149]
                0.21    0.20  450984/1588625     remove_unused_locals [111]
[44]     3.9    0.52    0.68 1667653+3246362 walk_tree <cycle 20> [44]
                0.06    0.18  337362/337362      find_used_portions [189]
                0.08    0.12 1659486/1659486     pointer_set_insert [209]
                0.00    0.06   97327/97327       scan_for_refs [487]
                0.05    0.00 1267779/1267779     count_ptr_derefs [525]


                0.01    0.56  143418/143418      regclass [71]
[80]     1.9    0.01    0.56  143418         scan_one_insn [80]
                0.50    0.04  113076/113076      record_reg_classes [85]
                0.01    0.01  100486/1944894     extract_insn [121]


                0.00    0.07  209960/1775362     x86_extended_reg_mentioned_p [364]
                0.02    0.26  747868/1775362     approx_reg_cost [165]
[70]     2.2    0.04    0.62 1775362         for_each_rtx [70]
                0.36    0.20 1374013/1374013     for_each_rtx_1 [82]
                0.01    0.03  195736/659311      check_dependence [310]


                0.03    0.01  202137/2188275     vn_add_with_vuses [301]
                0.06    0.02  385100/2188275     phi_translate [110]
                0.07    0.03  453965/2188275     vn_lookup_with_vuses [197]
                0.11    0.04  687546/2188275     initialize_hash_element [235]
[96]     1.6    0.35    0.14 2188275+2621356 iterative_hash_expr [96]
                0.11    0.00 1329048/1332689     iterative_hash [318]
                0.03    0.00 1131815/1907772     commutative_tree_code [523]


                0.01    0.00  508000/13803214     build_tree_conflict_graph [618]
                0.01    0.00  541400/13803214     propagate_one_insn [31]
                0.02    0.00  793747/13803214     rewrite_update_stmt [215]
                0.02    0.00  997246/13803214     mark_block_for_update [547]
                0.03    0.00 1120737/13803214     set_contains_value [338]
                0.03    0.00 1314587/13803214     prepare_block_for_update [138]
                0.03    0.00 1484822/13803214     mark_set_1 [87]
                0.06    0.00 2575352/13803214     mark_used_reg [195]
[146]    1.0    0.31    0.00 13803214         bitmap_bit_p [146]


               0.01    0.88     101/101         rest_of_handle_postreload [58]
[57]     2.9    0.01    0.88     101         reload_cse_regs [57]
                0.02    0.84     202/202         reload_cse_regs_1 [59]

                0.02    0.84     202/202         reload_cse_regs [57]
[59]     2.8    0.02    0.84     202         reload_cse_regs_1 [59]
                0.30    0.20  146294/146294      reload_cse_simplify_operands [94]


               0.01    0.00  240636/8238693     set_def_block [398]
                0.01    0.00  318100/8238693     compute_dominance_frontiers [480]
                0.02    0.00  451869/8238693     update_alias_info [118]
                0.03    0.00  837338/8238693     insert_into_set [176]
                0.03    0.00 1004991/8238693     mark_set_1 [87]
                0.04    0.00 1218828/8238693     mark_used_reg [195]
[166]    0.9    0.28    0.00 8238693         bitmap_set_bit [166]


                0.01    0.00  623522/13524760     value_insert_into_set [280]
                0.03    0.00 1279640/13524760     set_contains_value [338]
                0.03    0.00 1478781/13524760     insert_into_set [176]
                0.10    0.00 4785141/13524760     get_value_handle [203]
[170]    0.9    0.27    0.00 13524760         is_gimple_min_invariant [170]
Comment 35 Steven Bosscher 2006-09-03 17:28:24 UTC
Even if we did not hash SCEV data a lot, it would not buy you >50%.
Comment 36 Jan Hubicka 2007-10-09 18:57:31 UTC
Startup times was optimized a bit in 4.3.

hashes.c, 100 compilations:
              GCC-3.3 GCC-3.4   mainline mainline&profiledbootstrap
-O0             3.2s    3.8s      3.6s        3.3s 
-O1             5s      5.6s      8.1s        7.4s  
-O2             7.7s    7.6s      12.4s       11.1s

So at least for -O0 we are back to original speed.
Comment 37 Steven Bosscher 2007-11-05 22:23:07 UTC
If linux kernel hacker complaints were taken seriously, this bug would be P1 ;-)
Comment 38 Joseph S. Myers 2008-07-04 16:46:53 UTC
Closing 4.1 branch.
Comment 39 Richard Biener 2009-01-31 21:36:04 UTC
      4.0.4   4.1.2   4.2.4   4.3.3   4.4.0
-O0:   1.64    1.58    1.91    1.86    2.20
-O1:   4.12    4.60   10.14    5.31    5.43
-O2:   6.10    6.60   13.34    8.40    8.57
-Os:   4.82    5.38   10.92    6.54    6.77
-O3:   6.10    9.80   18.12   12.81   13.42

my 4.2.4 build is with checking enabled.  I do not have a
x86_64 build from the 3.3 or 3.4 series available.

At -O0 with 4.3 the DF framework likely adds compile-time, with 4.4 it is
the (partial) tuple transition and maybe SSA at -O0, maybe IRA.  Comparing
optimized build-times is likely unfair as we add optimization capabilities.
Comment 40 Steven Bosscher 2009-01-31 22:22:13 UTC
Comparing optimized build times *is* very, very fair as we add optimization capabilities that help nothing for 99% of the code (cselim, bswap optimization, cestore, etc.), and with a passes pipeline that unconditionally executes many passes repeatedly for good SPEC scores.
Comment 41 Richard Biener 2009-02-01 11:08:24 UTC
Ok, let's say then comparing -O[23s] compile-times is unfair as we never
stated they are optimized for compile-time but they explicitly contain passes
that may usually _not_ help.  -O1 may be a different story, but I do not
remember when we last tried to find a reasonable set of optimizations for it ;)
(I guess we could do with early optimizations only, no IPA, maybe one
memory CSE and cleanups after it, no tree loop opts, no repeating everything).

Oh, and you are welcome to try to remove redundant passes.  I tried for 4.4
with mixed success (our testsuite shows that we have many corner-cases that
are only optimized if you run the correct passes in the right order).
Comment 42 Joseph S. Myers 2009-03-31 16:42:13 UTC
Closing 4.2 branch.
Comment 43 Richard Biener 2009-08-04 12:26:10 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 44 Steven Bosscher 2009-12-28 19:51:37 UTC
hashes100.c on x86_64:
	3.4.6	4.2.4	4.3.3	4.4.2	4.5.0
-O0	1.37	1.4	1.59	1.91	1.84
-O1	2.07	4	4.44	4.68	4.89
-O2	3.57	5.96	7.08	7.48	7.6
-O3	3.78	8.73	10.85	11.55	11.8
					
	3.4.6	4.2.4	4.3.3	4.4.2	4.5.0
-O0	100%	102%	116%	139%	134%
-O1	100%	193%	214%	226%	236%
-O2	100%	167%	198%	210%	213%
-O3	100%	231%	287%	306%	312%
					
					
					
infcodes100.c on x86_64:
	3.4.6	4.2.4	4.3.3	4.4.2	4.5.0
-O0	2.74	3.2	3.86	4.39	4.79
-O1	3.85	7.81	7.69	8.39	8.16
-O2	6.35	11.81	12.8	13.18	14.71
-O3	6.72	11.9	13.91	14.11	15.95
					
	3.4.6	4.2.4	4.3.3	4.4.2	4.5.0
-O0	100%	117%	141%	160%	175%
-O1	100%	203%	200%	218%	212%
-O2	100%	186%	202%	208%	232%
-O3	100%	177%	207%	210%	237%


These are all best-of-three-runs timings. All compilers built with default release settings (--enable-checking=release, except 3.4.6, which was built with checking disabled by default).

Even relative to gcc 4.3, gcc 4.5 is ~15% slower again. Bravo! :-(
Comment 45 Steven Bosscher 2009-12-28 20:28:17 UTC
Profile for cc1 for SVN r155486 looks like this (all items with >0.5% time):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  3.79      0.29     0.29  3718720     0.00     0.00  operand_equal_p
  2.87      0.51     0.22  6543125     0.00     0.00  htab_find_slot_with_hash
  2.15      0.68     0.17 18227570     0.00     0.00  bitmap_set_bit
  2.09      0.84     0.16     1614     0.00     0.00  df_note_compute
  1.83      0.98     0.14                             _fini
  1.63      1.10     0.13  3266034     0.00     0.00  mem_attrs_htab_eq
  1.44      1.21     0.11 23919785     0.00     0.00  bitmap_bit_p
  1.44      1.32     0.11   118624     0.00     0.00  free_ira_costs
  1.31      1.42     0.10  4998409     0.00     0.00  bitmap_ior_into
  1.31      1.52     0.10     7015     0.00     0.00  df_worklist_dataflow
  1.24      1.62     0.10  1319725     0.00     0.00  bitmap_copy
  0.91      1.69     0.07  2951661     0.00     0.00  note_stores
  0.91      1.76     0.07  1709658     0.00     0.00  bitmap_ior_and_compl
  0.91      1.83     0.07  1639154     0.00     0.00  et_splay
  0.78      1.89     0.06  1552007     0.00     0.00  rtx_alloc_stat
  0.78      1.95     0.06  6404265     0.00     0.00  bitmap_clear
  0.78      2.01     0.06  4110147     0.00     0.00  bitmap_elt_insert_after
  0.78      2.07     0.06   598552     0.00     0.00  constrain_operands
  0.78      2.13     0.06    71808     0.00     0.00  df_lr_bb_local_compute
  0.78      2.19     0.06      909     0.00     0.00  substitute_and_fold
  0.65      2.24     0.05  5297243     0.00     0.00  mark_all_vars_used_1
  0.65      2.29     0.05  1652326     0.00     0.00  htab_find_with_hash
  0.65      2.34     0.05     1008     0.00     0.00  init_alias_analysis
  0.52      2.38     0.04  7167584     0.00     0.00  bitmap_elt_clear_from
  0.52      2.42     0.04  4744012     0.00     0.00  walk_tree_1
  0.52      2.46     0.04  4472070     0.00     0.00  is_gimple_reg
  0.52      2.50     0.04  1727617     0.00     0.00  gsi_start_phis
  0.52      2.54     0.04  1054932     0.00     0.00  for_each_rtx_1
  0.52      2.58     0.04   824795     0.00     0.00  fold_binary_loc
  0.52      2.62     0.04   733470     0.00     0.00  invalid_mode_change_p
  0.52      2.66     0.04   396082     0.00     0.00  count_reg_usage
  0.52      2.70     0.04   238716     0.00     0.00  df_chain_create
  0.52      2.74     0.04   210927     0.00     0.00  cse_insn
  0.52      2.78     0.04   106118     0.00     0.00  find_reloads
  0.52      2.82     0.04    41200     0.00     0.00  dfs_enumerate_from
  0.52      2.86     0.04      202     0.00     0.00  run_scc_vn
Comment 46 Steven Bosscher 2009-12-28 20:35:14 UTC
Same thing for hashes100.c (profile in comment #45 is for infcodes100.c), in both cases cc1 r155486 at -O2 on x86_64):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  3.59      0.15     0.15  9725444     0.00     0.00  bitmap_set_bit
  2.72      0.26     0.11  5517400     0.00     0.00  htab_find_slot_with_hash
  1.98      0.34     0.08                             _fini
  1.73      0.41     0.07 14841100     0.00     0.00  bitmap_bit_p
  1.73      0.48     0.07  1336700     0.00     0.00  nonzero_bits1
  1.24      0.53     0.05  3776000     0.00     0.00  get_expr_value_id
  1.24      0.58     0.05  1248000     0.00     0.00  cselib_lookup
  1.24      0.63     0.05    88002     0.00     0.00  free_ira_costs
  1.24      0.68     0.05    13300     0.00     0.00  df_worklist_dataflow
  0.99      0.72     0.04  2893600     0.00     0.00  is_gimple_min_invariant
  0.99      0.76     0.04   761910     0.00     0.00  extract_insn
  0.99      0.80     0.04   739736     0.00     0.00  fold_binary_loc
  0.99      0.84     0.04   577000     0.00     0.00  for_each_rtx_1
  0.99      0.88     0.04     4300     0.00     0.00  remove_unused_locals
  0.74      0.91     0.03  5120044     0.00     0.00  tree_strip_nop_conversions
  0.74      0.94     0.03  5048648     0.00     0.00  pool_alloc
  0.74      0.97     0.03  3330406     0.00     0.00  bitmap_clear
  0.74      1.00     0.03   961752     0.00     0.00  rtx_alloc_stat
  0.74      1.03     0.03   825000     0.00     0.00  loop_preheader_edge
  0.74      1.06     0.03   643822     0.00     0.00  tree_code_size
  0.74      1.09     0.03   413900     0.00     0.00  simplify_binary_operation
  0.74      1.12     0.03   331100     0.00     0.00  walk_stmt_load_store_addr_ops
  0.74      1.15     0.03   100200     0.00     0.00  cse_insn
  0.74      1.18     0.03    88700     0.00     0.00  make_compound_operation
  0.74      1.21     0.03    79600     0.00     0.00  add_control_edge
  0.74      1.24     0.03    35500     0.00     0.00  df_lr_bb_local_compute
  0.74      1.27     0.03      300     0.00     0.00  find_costs_and_classes
Comment 47 Steven Bosscher 2009-12-28 20:38:47 UTC
Created attachment 19402 [details]
profile for cc1 r155486 on x86_64, options -O2, for infcodes100.c
Comment 48 Steven Bosscher 2009-12-28 20:39:45 UTC
Created attachment 19403 [details]
profile for cc1 r155486 on x86_64, options -O2, for hashes100.c
Comment 49 Steven Bosscher 2009-12-28 20:46:09 UTC
For hashes100.c, combine+IRA+expand+tree-PRE accounts for 1/3 of the total compile time. For infcodes100.c, the profile is more flat, but IRA+expand still account for 1/4 of the total compile time.

Why is IRA so slow?
Comment 50 Richard Biener 2010-01-04 23:11:28 UTC
ira_traverse_loop_tree and its callee process_bb_node_for_cost accounts for
nearly as much time as reload takes on hashes.c.

Btw, I find valgrind --tool=callgrind together with kcachegrind a better tool for benchmarking this kind of stuff - obviously on the smaller testcases.  It has
deterministic timing at least.
Comment 51 Richard Biener 2010-05-22 18:10:11 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 52 Jan Hubicka 2011-01-08 16:02:07 UTC
hashes oprofile is as follows:
samples  %        app name                 symbol name
8237      2.9617  libc-2.11.1.so           _int_malloc
6877      2.4727  cc1                      bitmap_set_bit
5820      2.0927  cc1                      htab_find_slot_with_hash
4238      1.5238  cc1                      operand_equal_p
4233      1.5220  cc1                      record_reg_classes.constprop.9
4069      1.4631  cc1                      bitmap_bit_p
3780      1.3592  libc-2.11.1.so           memset
3399      1.2222  cc1                      nonzero_bits1
2992      1.0758  cc1                      ggc_internal_alloc_stat
2516      0.9047  libc-2.11.1.so           _int_free
2454      0.8824  as                       /usr/bin/as
2244      0.8069  cc1                      fold_binary_loc
2229      0.8015  libc-2.11.1.so           malloc
1909      0.6864  libc-2.11.1.so           free
1857      0.6677  cc1                      constrain_operands.part.12
1696      0.6098  cc1                      is_gimple_reg
1688      0.6069  cc1                      et_splay
1667      0.5994  cc1                      note_stores
1632      0.5868  cc1                      walk_tree_1
1585      0.5699  cc1                      for_each_rtx_1
1577      0.5670  cc1                      pool_alloc
1572      0.5652  cc1                      tree_strip_nop_conversions

and the other testcase:
samples  %        image name               app name                 symbol name
5195      5.1031  cc1                      cc1                      operand_equal_p
2848      2.7976  cc1                      cc1                      bitmap_set_bit
2080      2.0432  libc-2.11.1.so           libc-2.11.1.so           _int_malloc
1879      1.8458  cc1                      cc1                      htab_find_slot_with_hash
1432      1.4067  cc1                      cc1                      bitmap_bit_p
1298      1.2750  cc1                      cc1                      mem_attrs_htab_eq
1290      1.2672  cc1                      cc1                      record_reg_classes.constprop.9
1063      1.0442  cc1                      cc1                      walk_tree_1
1010      0.9921  cc1                      cc1                      tree_strip_nop_conversions
1004      0.9862  cc1                      cc1                      bitmap_ior_into
963       0.9460  libc-2.11.1.so           libc-2.11.1.so           memset
934       0.9175  cc1                      cc1                      df_worklist_dataflow
882       0.8664  cc1                      cc1                      mark_all_vars_used_1
806       0.7917  as                       as                       /usr/bin/as
795       0.7809  opreport                 opreport                 /usr/bin/opreport
793       0.7790  cc1                      cc1                      ggc_internal_alloc_stat
784       0.7701  cc1                      cc1                      is_gimple_reg
732       0.7190  cc1                      cc1                      constrain_operands.part.12
701       0.6886  cc1                      cc1                      bitmap_ior_and_compl
672       0.6601  cc1                      cc1                      et_splay
650       0.6385  libc-2.11.1.so           libc-2.11.1.so           _int_free
612       0.6012  cc1                      cc1                      for_each_rtx_1
608       0.5972  libc-2.11.1.so           libc-2.11.1.so           memcpy
603       0.5923  libc-2.11.1.so           libc-2.11.1.so           malloc
596       0.5855  cc1                      cc1                      bitmap_elt_insert_after
581       0.5707  cc1                      cc1                      ix86_decompose_address
570       0.5599  cc1                      cc1                      bitmap_copy
568       0.5580  cc1                      cc1                      cse_insn

so apparently not much of surprise.
Comment 53 Jan Hubicka 2011-01-08 16:10:38 UTC
Concerning Steven's comment, for infcode the time report is:
callgraph optimization:   0.07 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     139 kB ( 0%) ggc
 cfg cleanup           :   0.22 ( 2%) usr   0.00 ( 0%) sys   0.19 ( 2%) wall    2591 kB ( 1%) ggc
 trivially dead code   :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall       0 kB ( 0%) ggc
 df scan insns         :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%) wall      11 kB ( 0%) ggc
 df reaching defs      :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.14 ( 1%) wall       0 kB ( 0%) ggc
 df live regs          :   0.49 ( 4%) usr   0.00 ( 0%) sys   0.43 ( 4%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.37 ( 3%) usr   0.00 ( 0%) sys   0.44 ( 4%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.16 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 1%) wall    1805 kB ( 1%) ggc
 register information  :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.14 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 1%) wall    8010 kB ( 4%) ggc
 alias stmt walking    :   0.09 ( 1%) usr   0.02 ( 7%) sys   0.09 ( 1%) wall     268 kB ( 0%) ggc
 parser                :   0.16 ( 1%) usr   0.02 ( 7%) sys   0.20 ( 2%) wall   11320 kB ( 6%) ggc
 tree gimplify         :   0.15 ( 1%) usr   0.02 ( 7%) sys   0.15 ( 1%) wall   16123 kB ( 8%) ggc
 tree CFG cleanup      :   0.18 ( 2%) usr   0.01 ( 3%) sys   0.13 ( 1%) wall     881 kB ( 0%) ggc
 tree VRP              :   0.43 ( 4%) usr   0.02 ( 7%) sys   0.38 ( 3%) wall   20850 kB (10%) ggc
 tree copy propagation :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.14 ( 1%) wall    1638 kB ( 1%) ggc
 tree PTA              :   0.35 ( 3%) usr   0.00 ( 0%) sys   0.35 ( 3%) wall     331 kB ( 0%) ggc
 tree SSA rewrite      :   0.11 ( 1%) usr   0.01 ( 3%) sys   0.13 ( 1%) wall    6269 kB ( 3%) ggc
 tree SSA incremental  :   0.17 ( 1%) usr   0.01 ( 3%) sys   0.16 ( 1%) wall    1321 kB ( 1%) ggc
 dominator optimization:   0.18 ( 2%) usr   0.00 ( 0%) sys   0.17 ( 1%) wall    5507 kB ( 3%) ggc
 tree CCP              :   0.19 ( 2%) usr   0.00 ( 0%) sys   0.10 ( 1%) wall    1065 kB ( 1%) ggc
 tree reassociation    :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    1273 kB ( 1%) ggc
 tree PRE              :   0.52 ( 5%) usr   0.02 ( 7%) sys   0.65 ( 6%) wall    6675 kB ( 3%) ggc
 tree FRE              :   0.23 ( 2%) usr   0.02 ( 7%) sys   0.24 ( 2%) wall     801 kB ( 0%) ggc
 tree forward propagate:   0.06 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall     751 kB ( 0%) ggc
 tree aggressive DCE   :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.16 ( 1%) wall    2770 kB ( 1%) ggc
 tree DSE              :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      34 kB ( 0%) ggc
 scev constant prop    :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall    4368 kB ( 2%) ggc
 tree iv optimization  :   0.22 ( 2%) usr   0.02 ( 7%) sys   0.27 ( 2%) wall   18289 kB ( 9%) ggc
 dominance computation :   0.07 ( 1%) usr   0.01 ( 3%) sys   0.08 ( 1%) wall       0 kB ( 0%) ggc
 out of ssa            :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 1%) wall       3 kB ( 0%) ggc
 expand                :   1.08 ( 9%) usr   0.00 ( 0%) sys   1.08 ( 9%) wall   17470 kB ( 9%) ggc
 forward prop          :   0.16 ( 1%) usr   0.00 ( 0%) sys   0.14 ( 1%) wall    2050 kB ( 1%) ggc
 CSE                   :   0.36 ( 3%) usr   0.00 ( 0%) sys   0.30 ( 3%) wall     173 kB ( 0%) ggc
 dead store elim1      :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.11 ( 1%) wall    1638 kB ( 1%) ggc
 dead store elim2      :   0.11 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 1%) wall    1696 kB ( 1%) ggc
 loop invariant motion :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.10 ( 1%) wall       0 kB ( 0%) ggc
 CPROP                 :   0.19 ( 2%) usr   0.00 ( 0%) sys   0.22 ( 2%) wall    2907 kB ( 1%) ggc
 PRE                   :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.09 ( 1%) wall      89 kB ( 0%) ggc
 CSE 2                 :   0.28 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall     118 kB ( 0%) ggc
 combiner              :   0.36 ( 3%) usr   0.00 ( 0%) sys   0.42 ( 4%) wall    7042 kB ( 3%) ggc
 integrated RA         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.52 ( 4%) wall    2972 kB ( 1%) ggc
 reload                :   0.30 ( 3%) usr   0.00 ( 0%) sys   0.34 ( 3%) wall     833 kB ( 0%) ggc
 reload CSE regs       :   0.27 ( 2%) usr   0.00 ( 0%) sys   0.29 ( 2%) wall    3270 kB ( 2%) ggc
 hard reg cprop        :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.07 ( 1%) wall      21 kB ( 0%) ggc
 scheduling 2          :   0.38 ( 3%) usr   0.00 ( 0%) sys   0.40 ( 3%) wall     211 kB ( 0%) ggc
 reorder blocks        :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%) wall    2178 kB ( 1%) ggc
 final                 :   0.16 ( 1%) usr   0.01 ( 3%) sys   0.15 ( 1%) wall      99 kB ( 0%) ggc
 rest of compilation   :   0.27 ( 2%) usr   0.01 ( 3%) sys   0.23 ( 2%) wall     968 kB ( 0%) ggc
 remove unused locals  :   0.34 ( 3%) usr   0.00 ( 0%) sys   0.27 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                 :  11.46             0.30            11.77             201495 kB

so IRA, expand, VRP and DF is still slow despite the fix in accounting vars.

For hashes:
 ipa pure const        :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 2%) wall    4617 kB ( 3%) ggc
 cfg cleanup           :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 2%) wall     860 kB ( 1%) ggc
 df scan insns         :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 1%) wall      32 kB ( 0%) ggc
 df reaching defs      :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 1%) wall       0 kB ( 0%) ggc
 df live regs          :   0.15 ( 2%) usr   0.00 ( 0%) sys   0.11 ( 2%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.04 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.15 ( 2%) usr   0.00 ( 0%) sys   0.12 ( 2%) wall    1856 kB ( 1%) ggc
 alias analysis        :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.16 ( 2%) wall    8517 kB ( 5%) ggc
 parser                :   0.07 ( 1%) usr   0.03 (12%) sys   0.11 ( 2%) wall    5992 kB ( 4%) ggc
 tree gimplify         :   0.08 ( 1%) usr   0.01 ( 4%) sys   0.07 ( 1%) wall    9748 kB ( 6%) ggc
 tree CFG cleanup      :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 2%) wall     489 kB ( 0%) ggc
 tree VRP              :   0.19 ( 3%) usr   0.01 ( 4%) sys   0.21 ( 3%) wall    8356 kB ( 5%) ggc
 tree copy propagation :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.09 ( 1%) wall     607 kB ( 0%) ggc
 tree PTA              :   0.29 ( 4%) usr   0.00 ( 0%) sys   0.29 ( 4%) wall     235 kB ( 0%) ggc
 tree SSA rewrite      :   0.05 ( 1%) usr   0.01 ( 4%) sys   0.04 ( 1%) wall    3867 kB ( 2%) ggc
 tree SSA incremental  :   0.06 ( 1%) usr   0.02 ( 8%) sys   0.03 ( 0%) wall     796 kB ( 0%) ggc
 dominator optimization:   0.07 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 1%) wall    2538 kB ( 2%) ggc
 tree CCP              :   0.19 ( 3%) usr   0.00 ( 0%) sys   0.22 ( 3%) wall    1652 kB ( 1%) ggc
 tree PRE              :   0.37 ( 5%) usr   0.02 ( 8%) sys   0.40 ( 6%) wall   12227 kB ( 7%) ggc
 tree FRE              :   0.15 ( 2%) usr   0.00 ( 0%) sys   0.15 ( 2%) wall    1216 kB ( 1%) ggc
 tree code sinking     :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall     358 kB ( 0%) ggc
 tree aggressive DCE   :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 2%) wall    3531 kB ( 2%) ggc
 tree loop invariant motion:   0.04 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 complete unrolling    :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall    2857 kB ( 2%) ggc
 tree iv optimization  :   0.47 ( 7%) usr   0.02 ( 8%) sys   0.45 ( 6%) wall   17694 kB (11%) ggc
 dominance computation :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa            :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      10 kB ( 0%) ggc
 expand vars           :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    6220 kB ( 4%) ggc
 expand                :   0.23 ( 3%) usr   0.00 ( 0%) sys   0.24 ( 3%) wall   12120 kB ( 7%) ggc
 forward prop          :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    1821 kB ( 1%) ggc
 CSE                   :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.16 ( 2%) wall      34 kB ( 0%) ggc
 dead code elimination :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall     829 kB ( 0%) ggc
 CPROP                 :   0.23 ( 3%) usr   0.01 ( 4%) sys   0.18 ( 2%) wall    2085 kB ( 1%) ggc
 PRE                   :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 1%) wall       0 kB ( 0%) ggc
 CSE 2                 :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall      40 kB ( 0%) ggc
 branch prediction     :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall    2878 kB ( 2%) ggc
 combiner              :   0.65 ( 9%) usr   0.01 ( 4%) sys   0.69 (10%) wall   14078 kB ( 8%) ggc
 regmove               :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 integrated RA         :   0.51 ( 7%) usr   0.00 ( 0%) sys   0.50 ( 7%) wall    5209 kB ( 3%) ggc
 reload                :   0.19 ( 3%) usr   0.00 ( 0%) sys   0.26 ( 4%) wall     369 kB ( 0%) ggc
 reload CSE regs       :   0.20 ( 3%) usr   0.00 ( 0%) sys   0.15 ( 2%) wall    2212 kB ( 1%) ggc
 scheduling 2          :   0.21 ( 3%) usr   0.01 ( 4%) sys   0.16 ( 2%) wall     141 kB ( 0%) ggc
 machine dep reorg     :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 1%) wall      38 kB ( 0%) ggc
 reorder blocks        :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 1%) wall     340 kB ( 0%) ggc
 final                 :   0.04 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 1%) wall     167 kB ( 0%) ggc
 rest of compilation   :   0.10 ( 1%) usr   0.02 ( 8%) sys   0.11 ( 2%) wall    1437 kB ( 1%) ggc
 remove unused locals  :   0.07 ( 1%) usr   0.01 ( 4%) sys   0.10 ( 1%) wall       0 kB ( 0%) ggc
 unaccounted todo      :   0.05 ( 1%) usr   0.01 ( 4%) sys   0.06 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 :   6.94             0.26             7.21             166488 kB

Here we now have ivopts, combiner and IRA showing top.
Steven, would be possible to have comparsion with 4.6 included?
Comment 54 Richard Biener 2011-01-18 14:21:28 UTC
PR47344 now tracks the "regression" property of this bug.
Comment 55 Steven Bosscher 2012-05-27 23:11:06 UTC
Some new timings:

hashes.c -O0:
	3.4.6	4.2.4	4.3.3	4.4.7	4.6.3
user	0.02076	0.02037	0.01972	0.02291	0.02286
sys	0.00383	0.00483	0.00468	0.00583	0.00568

hashes.c -O1:
	3.4.6	4.2.4	4.3.3	4.4.7	4.6.3
user	0.0278	0.0472	0.0459	0.0539	0.0572
sys	0.0046	0.0055	0.0066	0.0075	0.0082

hashes.c -O2:
	3.4.6	4.2.4	4.3.3	4.4.7	4.6.3
user	0.0452	0.0700	0.0744	0.0773	0.0842
sys	0.0047	0.0065	0.0060	0.0083	0.0086


infcodes.c -O0:
	3.4.6	4.2.4	4.3.3	4.4.7	4.6.3
user	0.0374	0.0396	0.0421	0.0499	0.0490
sys	0.0046	0.0059	0.0065	0.0064	0.0071

infcodes.c -O1:
	3.4.6	4.2.4	4.3.3	4.4.7	4.6.3
user	0.0481	0.0886	0.0821	0.0950	0.0849
sys	0.0046	0.0074	0.0066	0.0092	0.0070

infcodes.c -O2:
	3.4.6	4.2.4	4.3.3	4.4.7	4.6.3
user	0.0745	0.1303	0.1306	0.1389	0.1354
sys	0.0054	0.0096	0.0096	0.0088	0.0091
Comment 56 Richard Biener 2023-02-06 12:03:13 UTC
GCC 12 now takes sub-second time for the larger testcases and with optimization it's now early VRP taking most of the time.