These two files taken from linux kernel (hashes.c) and zlib (infcodes.c) hashes.c: 3.4.4 4.0.0 delta -O0 0.045 0.049 9% -O1 0.067 0.144 115% -O2 0.108 0.177 64% infcodes.c: 3.4.4 4.0.0 delta -O0 0.080 0.086 7% -O1 0.120 0.223 86% -O2 0.196 0.295 51%
Created attachment 7616 [details] 3338 bytes
Created attachment 7617 [details] 9535 bytes
Hmm, These are really short compile times to begin with. Almost anything this short could be considered just error but then again, the compile time for me on ppc-darwin was about the same time as 3.3 as 4.0.0.
Created attachment 7618 [details] perl script (463 bytes) Use this script to get more accurate timings. (install http://search.cpan.org/~jhi/BSD-Resource-1.24/ if you want to get even more accurate timings).
Confirmed, how many new builtins do we have between 3.4.x and 4.0.0, this will most likely account for the -O0 time.
In both of my profiles for -O0 and -O1, I see that is_attribute_p is always high up there.
At -O2 and -O3 also I see is_attribute_p is always taking 2% of the time. At -O0: - is_attribute_p - 6% - decode_reg_name - 2.1% at -O1: - is_attribute_p - 2.5% - decode_reg_name 0.8% at -O2: - is_attribute_p - 1.9% - decode_reg_name - 0.5% at -O3 (this was a bigger sample than the obove ones): - is_attribute_p - 1.8% - decode_reg_name - 0.6%
I have a fix for decode_reg_name and there is a branch with the fix for is_attribute_p, maybe that should be merged in ...
The patch to fix decode_reg_name is here: <http://gcc.gnu.org/ml/gcc-patches/2004-11/ msg02683.html>, I don't know how much this buys us but should buys us at -O0 about 2%.
Can you try again after my patch to varasm.c, I cannot get any accrete results on these tests.
Patch to speedup lookup_attribute here: <http://gcc.gnu.org/ml/gcc-patches/2004-12/ msg00164.html>, this should help but not all 6% at -O0. Note for -O0, we are slowing down because there are more attributes and more builtins functions for 4.0.0 compared with 3.4.x.
3.4.4 4.0.0 delta hashes.c: -O0 0.0515 0.0526 2% -O1 0.078 0.157 101% -O2 0.123 0.198 61% infcodes.c: -O0 0.0907 0.0933 3% -O1 0.137 0.248 81% -O2 0.224 0.328 46% Average speedup (not only -O0) is 6%.
Created attachment 7805 [details] test case (325 KB)
Created attachment 7806 [details] testcase (304 KB)
Hope these two larger test cases will help profiling. 3.4.4 4.0.0 delta hashes100.c: -O0 3.7 3.8 3% -O1 6.4 14.5 126% -O2 11.3 18.7 65% -O3 11.9 19.0 60% infcodes100.c: -O0 7.4 7.6 3% -O1 12.6 25.4 101% -O2 21.4 33.8 58% -O3 22.4 34.7 55%
Did anyone try to profile these already?
Profile for hashes100.c on an x86-64 with "-O2 -m32": CPU: Hammer, speed 1394.98 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 4000 Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 (No unit mask) count 1000 samples % samples % image name symbol name 27235 2.3397 158 0.5518 cc1 constrain_operands 21936 1.8845 53 0.1851 cc1 record_reg_classes 21735 1.8672 399 1.3936 cc1 reg_scan_mark_refs 20966 1.8012 1179 4.1179 libc.so.6 __GI_memset 20786 1.7857 418 1.4600 cc1 ggc_alloc_stat 20199 1.7353 168 0.5868 cc1 fold 18374 1.5785 589 2.0572 libc.so.6 _int_malloc 16467 1.4147 177 0.6182 cc1 iterative_hash_expr 15562 1.3369 324 1.1316 cc1 htab_find_slot_with_hash 15080 1.2955 842 2.9409 cc1 note_stores 13709 1.1777 78 0.2724 cc1 find_reloads 13054 1.1214 45 0.1572 cc1 nonzero_bits1 12297 1.0564 84 0.2934 cc1 mark_used_regs 11402 0.9795 68 0.2375 cc1 refers_to_regno_p 11335 0.9738 95 0.3318 cc1 cse_insn 11088 0.9526 71 0.2480 cc1 ix86_decompose_address 10401 0.8935 70 0.2445 cc1 for_each_rtx_1 9731 0.8360 354 1.2364 cc1 count_reg_usage 9176 0.7883 202 0.7055 cc1 bitmap_set_bit 8578 0.7369 284 0.9919 cc1 bitmap_bit_p 8137 0.6990 208 0.7265 cc1 extract_insn 8019 0.6889 30 0.1048 cc1 reload_cse_simplify_operands 7755 0.6662 541 1.8896 cc1 pool_alloc 7341 0.6307 143 0.4995 cc1 et_splay 6992 0.6007 319 1.1142 cc1 init_alias_analysis
The profile for infcodes100.c is essentially the same.
IVopts takes a lot of time (in several runs): tree iv optimization : 1.12 ( 9%) usr 0.16 (23%) sys 1.34 (10%) wall
Out of the time spent in ivopts: -- this patch saves some 25%: http://gcc.gnu.org/ml/gcc-patches/2004-10/msg00136.html -- another 25% are caused by inefficient use of # of iterations analysis; I am thinking about solution -- yet another 25% are spent in scalar evolutions analysis; I am working on speeding this up
Patch (improving ivopts performance on the testcase by some 40%): http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00307.html
Subject: Bug 18687 CVSROOT: /cvs/gcc Module name: gcc Changes by: rakdver@gcc.gnu.org 2005-02-10 00:32:49 Modified files: gcc : ChangeLog tree-flow.h tree-ssa-loop-ivcanon.c tree-ssa-loop-ivopts.c tree-ssa-loop-niter.c tree.c tree.h Log message: PR tree-optimization/18687 * tree-flow.h (find_loop_niter): Declare. * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): Try using scev even for loops with more than one exit. * tree-ssa-loop-ivopts.c (struct loop_data): Removed niter field. (struct ivopts_data): Added niters field. (struct nfe_cache_elt): New. (nfe_hash, nfe_eq, niter_for_exit, niter_for_single_dom_exit): New functions. (tree_ssa_iv_optimize_init): Initialize niters cache. (determine_number_of_iterations): Removed. (find_induction_variables): Do not call determine_number_of_iterations. Access niters for single exit through niter_for_single_dom_exit. (add_iv_outer_candidates): Access niters for single exit through niter_for_single_dom_exit. (may_eliminate_iv): Take data argument. Use niter_for_exit. Do not use number_of_iterations_cond. (iv_period): New function. (determine_use_iv_cost_condition): Pass data to may_eliminate_iv. (may_replace_final_value): Take data argument. Use niter_for_single_dom_exit. (determine_use_iv_cost_outer): Pass data to may_replace_final_value. (rewrite_use_compare): Pass data to may_eliminate_iv. (rewrite_use_outer): Pass data to may_replace_final_value. (free_loop_data): Clean up the niters cache. (tree_ssa_iv_optimize_finalize): Free the niters cache. (tree_ssa_iv_optimize_loop): Do not call loop_commit_inserts. * tree-ssa-loop-niter.c (find_loop_niter): New function. (find_loop_niter_by_eval): Use tree_int_cst_lt. (num_ending_zeros): Moved to tree.c. * tree.h (num_ending_zeros): Declare. * tree.c (num_ending_zeros): Moved from tree.c. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7427&r2=2.7428 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-flow.h.diff?cvsroot=gcc&r1=2.80&r2=2.81 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-ivcanon.c.diff?cvsroot=gcc&r1=2.5&r2=2.6 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-ivopts.c.diff?cvsroot=gcc&r1=2.44&r2=2.45 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-niter.c.diff?cvsroot=gcc&r1=2.18&r2=2.19 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree.c.diff?cvsroot=gcc&r1=1.463&r2=1.464 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree.h.diff?cvsroot=gcc&r1=1.687&r2=1.688
(In reply to comment #22) So does someone do more testings again?
3.4.4 4.0.0 delta hashes100.c: -O0 3.66 3.81 4% -O1 6.43 11.47 78% -O2 11.40 17.89 57% -O3 12.00 18.35 53% infcodes100.c: -O0 7.24 7.66 6% -O1 12.43 22.24 79% -O2 21.28 31.54 48% -O3 22.37 32.02 43%
what about times on 4.1?
3.4.5 4.0.2 4.1.0 3.4->4.0 (%) 4.0->4.1 (%) hashes100.c: O0 1.56 1.59 1.57 2.2 -1.4 O1 2.54 4.23 4.49 66. 6.1 O2 4.43 6.29 6.56 42. 4.4 O3 4.74 6.33 6.66 34. 5.1 infcodes100.c: O0 3.21 3.69 3.67 15. -0.64 O1 4.68 7.06 7.72 51. 9.3 O2 8.01 9.84 10.86 23. 10. O3 8.45 9.94 10.95 18. 10. Note this is a different machine than in my previous comments (2GHz k8 with 1024K of L2 cache in 64bit mode, previous was 1GHz k7 with 64K of L2).
I'm going to mark as P4. I'd like to see it fixed, but I'm not sure there's much we can do for 4.1.
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Can you do timings on these again on the mainline since it looks like Richard G.'s memory patches also improved compile time for C at least on the CSiBE benchmark.
"real" times for hashes100.c (x86_64-linux, Intel Xeon 3.2 GHz, 1GB RAM): 3.4.6 4.0.4 4.1.2 4.2-svn20060724 -O0 0m1.618s 0m1.762s 0m1.661s 0m1.645s -O1 0m2.743s 0m4.646s 0m4.984s 0m4.936s -O2 0m4.686s 0m6.814s 0m7.140s 0m7.603s "real" times for infcodes100.c: 3.4.6 4.0.4 4.1.2 4.2-svn20060724 -O0 0m3.040s 0m3.643s 0m3.555s 0m3.575s -O1 0m4.989s 0m7.694s 0m8.809s 0m8.943s -O2 0m8.375s 0m10.622s 0m12.136s 0m13.285s
Just to be sure that between 7/24 and today we didn't speed up significantly: "real" times for hashes100.c (x86_64-linux, Intel Xeon 3.2 GHz, 1GB RAM): 3.4.6 4.2-svn20060903 delta -O0 0m1.618s 0m1.634s +1% -O1 0m2.743s 0m5.175s +88% -O2 0m4.686s 0m7.719s +65% "real" times for infcodes100.c: 3.4 4.2-svn20060903 delta -O0 0m3.040s 0m3.526s +16% -O1 0m4.989s 0m8.871s +77% -O2 0m8.375s 0m13.334s +59% Given these numbers, I would stick with gcc3 if I were a kernel developer.
FWIW, the oprofile for both test cases is basically flat, nothing stands out. We just do _so_ much more work (many more passes without removing anything) and that hurts apparently (not surprising of course).
FYI, the profile (-O2) looks like Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 2.04 0.62 0.62 5210670 0.00 0.00 htab_find_slot_with_hash 2.01 1.23 0.61 135824 0.00 0.00 find_reloads 1.78 1.77 0.54 1181450 0.00 0.00 constrain_operands 1.71 2.29 0.52 1667653 0.00 0.00 walk_tree 1.64 2.79 0.50 113076 0.00 0.00 record_reg_classes 1.18 3.15 0.36 1374013 0.00 0.00 for_each_rtx_1 1.15 3.50 0.35 2188275 0.00 0.00 iterative_hash_expr 1.02 3.81 0.31 13803214 0.00 0.00 bitmap_bit_p 0.99 4.11 0.30 146294 0.00 0.00 reload_cse_simplify_operands 0.92 4.39 0.28 8238693 0.00 0.00 bitmap_set_bit 0.89 4.66 0.27 13524760 0.00 0.00 is_gimple_min_invariant 0.85 4.92 0.26 1944894 0.00 0.00 extract_insn 0.76 5.15 0.23 3242786 0.00 0.00 note_stores 0.76 5.38 0.23 1848575 0.00 0.00 mark_set_1 0.76 5.61 0.23 1073359 0.00 0.00 fold_binary 0.69 5.82 0.21 3530376 0.00 0.00 ix86_decompose_address 0.66 6.02 0.20 2801807 0.00 0.00 is_gimple_reg 0.66 6.22 0.20 101 0.00 0.02 reload 0.62 6.41 0.19 6596295 0.00 0.00 find_reg_note 0.62 6.60 0.19 3748843 0.00 0.00 ggc_alloc_stat 0.62 6.79 0.19 1543768 0.00 0.00 force_fit_type 0.59 6.97 0.18 5160059 0.00 0.00 pool_alloc 0.59 7.15 0.18 937999 0.00 0.00 make_node_stat 0.59 7.33 0.18 1915 0.00 0.00 cleanup_cfg 0.56 7.50 0.17 2132681 0.00 0.00 mark_all_vars_used_1 0.56 7.67 0.17 1015399 0.00 0.00 get_expr_operands 0.56 7.84 0.17 263419 0.00 0.00 cse_insn 0.53 8.00 0.16 2162651 0.00 0.00 cselib_lookup 0.53 8.16 0.16 1176845 0.00 0.00 mark_used_regs 0.53 8.32 0.16 101 0.00 0.01 reload_as_needed 0.49 8.47 0.15 1748131 0.00 0.00 operand_equal_p 0.49 8.62 0.15 1160005 0.00 0.00 propagate_one_insn 0.49 8.77 0.15 1086424 0.00 0.00 et_splay 0.49 8.92 0.15 1030222 0.00 0.00 rtx_cost 0.49 9.07 0.15 523805 0.00 0.00 count_reg_usage 0.46 9.21 0.14 3222872 0.00 0.00 memory_operand 0.46 9.35 0.14 1651283 0.00 0.00 htab_find_with_hash 0.46 9.49 0.14 693604 0.00 0.00 rewrite_update_stmt 0.46 9.63 0.14 457856 0.00 0.00 mul_double we're hashing SCEV data a lot: 0.01 0.07 285600/2683522 set_instantiated_value [395] 0.01 0.09 398929/2683522 build_int_cst_wide [214] 0.02 0.20 864500/2683522 find_var_scev_info [184] [68] 2.3 0.06 0.63 2683522 htab_find_slot [68] 0.34 0.16 2683522/11019933 htab_find_slot_with_hash <cycle 26> [53] 0.03 0.00 1150100/1603800 hash_scev_info [589] 0.06 0.03 125694/1181450 reload_cse_simplify_operands [94] 0.09 0.05 201473/1181450 find_matches [200] 0.32 0.16 698486/1181450 extract_constrain_insn_cached [67] [61] 2.7 0.54 0.28 1181450+14400 constrain_operands [61] 0.01 0.15 806498/873917 strict_memory_address_p [236] 0.05 0.00 272106/402044 operands_match_p [415] 0.05 0.04 100911/1588625 create_ssa_var_map [329] 0.07 0.06 144529/1588625 walk_tree_without_duplicates [268] 0.14 0.13 292781/1588625 count_uses_and_derefs [149] 0.21 0.20 450984/1588625 remove_unused_locals [111] [44] 3.9 0.52 0.68 1667653+3246362 walk_tree <cycle 20> [44] 0.06 0.18 337362/337362 find_used_portions [189] 0.08 0.12 1659486/1659486 pointer_set_insert [209] 0.00 0.06 97327/97327 scan_for_refs [487] 0.05 0.00 1267779/1267779 count_ptr_derefs [525] 0.01 0.56 143418/143418 regclass [71] [80] 1.9 0.01 0.56 143418 scan_one_insn [80] 0.50 0.04 113076/113076 record_reg_classes [85] 0.01 0.01 100486/1944894 extract_insn [121] 0.00 0.07 209960/1775362 x86_extended_reg_mentioned_p [364] 0.02 0.26 747868/1775362 approx_reg_cost [165] [70] 2.2 0.04 0.62 1775362 for_each_rtx [70] 0.36 0.20 1374013/1374013 for_each_rtx_1 [82] 0.01 0.03 195736/659311 check_dependence [310] 0.03 0.01 202137/2188275 vn_add_with_vuses [301] 0.06 0.02 385100/2188275 phi_translate [110] 0.07 0.03 453965/2188275 vn_lookup_with_vuses [197] 0.11 0.04 687546/2188275 initialize_hash_element [235] [96] 1.6 0.35 0.14 2188275+2621356 iterative_hash_expr [96] 0.11 0.00 1329048/1332689 iterative_hash [318] 0.03 0.00 1131815/1907772 commutative_tree_code [523] 0.01 0.00 508000/13803214 build_tree_conflict_graph [618] 0.01 0.00 541400/13803214 propagate_one_insn [31] 0.02 0.00 793747/13803214 rewrite_update_stmt [215] 0.02 0.00 997246/13803214 mark_block_for_update [547] 0.03 0.00 1120737/13803214 set_contains_value [338] 0.03 0.00 1314587/13803214 prepare_block_for_update [138] 0.03 0.00 1484822/13803214 mark_set_1 [87] 0.06 0.00 2575352/13803214 mark_used_reg [195] [146] 1.0 0.31 0.00 13803214 bitmap_bit_p [146] 0.01 0.88 101/101 rest_of_handle_postreload [58] [57] 2.9 0.01 0.88 101 reload_cse_regs [57] 0.02 0.84 202/202 reload_cse_regs_1 [59] 0.02 0.84 202/202 reload_cse_regs [57] [59] 2.8 0.02 0.84 202 reload_cse_regs_1 [59] 0.30 0.20 146294/146294 reload_cse_simplify_operands [94] 0.01 0.00 240636/8238693 set_def_block [398] 0.01 0.00 318100/8238693 compute_dominance_frontiers [480] 0.02 0.00 451869/8238693 update_alias_info [118] 0.03 0.00 837338/8238693 insert_into_set [176] 0.03 0.00 1004991/8238693 mark_set_1 [87] 0.04 0.00 1218828/8238693 mark_used_reg [195] [166] 0.9 0.28 0.00 8238693 bitmap_set_bit [166] 0.01 0.00 623522/13524760 value_insert_into_set [280] 0.03 0.00 1279640/13524760 set_contains_value [338] 0.03 0.00 1478781/13524760 insert_into_set [176] 0.10 0.00 4785141/13524760 get_value_handle [203] [170] 0.9 0.27 0.00 13524760 is_gimple_min_invariant [170]
Even if we did not hash SCEV data a lot, it would not buy you >50%.
Startup times was optimized a bit in 4.3. hashes.c, 100 compilations: GCC-3.3 GCC-3.4 mainline mainline&profiledbootstrap -O0 3.2s 3.8s 3.6s 3.3s -O1 5s 5.6s 8.1s 7.4s -O2 7.7s 7.6s 12.4s 11.1s So at least for -O0 we are back to original speed.
If linux kernel hacker complaints were taken seriously, this bug would be P1 ;-)
Closing 4.1 branch.
4.0.4 4.1.2 4.2.4 4.3.3 4.4.0 -O0: 1.64 1.58 1.91 1.86 2.20 -O1: 4.12 4.60 10.14 5.31 5.43 -O2: 6.10 6.60 13.34 8.40 8.57 -Os: 4.82 5.38 10.92 6.54 6.77 -O3: 6.10 9.80 18.12 12.81 13.42 my 4.2.4 build is with checking enabled. I do not have a x86_64 build from the 3.3 or 3.4 series available. At -O0 with 4.3 the DF framework likely adds compile-time, with 4.4 it is the (partial) tuple transition and maybe SSA at -O0, maybe IRA. Comparing optimized build-times is likely unfair as we add optimization capabilities.
Comparing optimized build times *is* very, very fair as we add optimization capabilities that help nothing for 99% of the code (cselim, bswap optimization, cestore, etc.), and with a passes pipeline that unconditionally executes many passes repeatedly for good SPEC scores.
Ok, let's say then comparing -O[23s] compile-times is unfair as we never stated they are optimized for compile-time but they explicitly contain passes that may usually _not_ help. -O1 may be a different story, but I do not remember when we last tried to find a reasonable set of optimizations for it ;) (I guess we could do with early optimizations only, no IPA, maybe one memory CSE and cleanups after it, no tree loop opts, no repeating everything). Oh, and you are welcome to try to remove redundant passes. I tried for 4.4 with mixed success (our testsuite shows that we have many corner-cases that are only optimized if you run the correct passes in the right order).
Closing 4.2 branch.
GCC 4.3.4 is being released, adjusting target milestone.
hashes100.c on x86_64: 3.4.6 4.2.4 4.3.3 4.4.2 4.5.0 -O0 1.37 1.4 1.59 1.91 1.84 -O1 2.07 4 4.44 4.68 4.89 -O2 3.57 5.96 7.08 7.48 7.6 -O3 3.78 8.73 10.85 11.55 11.8 3.4.6 4.2.4 4.3.3 4.4.2 4.5.0 -O0 100% 102% 116% 139% 134% -O1 100% 193% 214% 226% 236% -O2 100% 167% 198% 210% 213% -O3 100% 231% 287% 306% 312% infcodes100.c on x86_64: 3.4.6 4.2.4 4.3.3 4.4.2 4.5.0 -O0 2.74 3.2 3.86 4.39 4.79 -O1 3.85 7.81 7.69 8.39 8.16 -O2 6.35 11.81 12.8 13.18 14.71 -O3 6.72 11.9 13.91 14.11 15.95 3.4.6 4.2.4 4.3.3 4.4.2 4.5.0 -O0 100% 117% 141% 160% 175% -O1 100% 203% 200% 218% 212% -O2 100% 186% 202% 208% 232% -O3 100% 177% 207% 210% 237% These are all best-of-three-runs timings. All compilers built with default release settings (--enable-checking=release, except 3.4.6, which was built with checking disabled by default). Even relative to gcc 4.3, gcc 4.5 is ~15% slower again. Bravo! :-(
Profile for cc1 for SVN r155486 looks like this (all items with >0.5% time): Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 3.79 0.29 0.29 3718720 0.00 0.00 operand_equal_p 2.87 0.51 0.22 6543125 0.00 0.00 htab_find_slot_with_hash 2.15 0.68 0.17 18227570 0.00 0.00 bitmap_set_bit 2.09 0.84 0.16 1614 0.00 0.00 df_note_compute 1.83 0.98 0.14 _fini 1.63 1.10 0.13 3266034 0.00 0.00 mem_attrs_htab_eq 1.44 1.21 0.11 23919785 0.00 0.00 bitmap_bit_p 1.44 1.32 0.11 118624 0.00 0.00 free_ira_costs 1.31 1.42 0.10 4998409 0.00 0.00 bitmap_ior_into 1.31 1.52 0.10 7015 0.00 0.00 df_worklist_dataflow 1.24 1.62 0.10 1319725 0.00 0.00 bitmap_copy 0.91 1.69 0.07 2951661 0.00 0.00 note_stores 0.91 1.76 0.07 1709658 0.00 0.00 bitmap_ior_and_compl 0.91 1.83 0.07 1639154 0.00 0.00 et_splay 0.78 1.89 0.06 1552007 0.00 0.00 rtx_alloc_stat 0.78 1.95 0.06 6404265 0.00 0.00 bitmap_clear 0.78 2.01 0.06 4110147 0.00 0.00 bitmap_elt_insert_after 0.78 2.07 0.06 598552 0.00 0.00 constrain_operands 0.78 2.13 0.06 71808 0.00 0.00 df_lr_bb_local_compute 0.78 2.19 0.06 909 0.00 0.00 substitute_and_fold 0.65 2.24 0.05 5297243 0.00 0.00 mark_all_vars_used_1 0.65 2.29 0.05 1652326 0.00 0.00 htab_find_with_hash 0.65 2.34 0.05 1008 0.00 0.00 init_alias_analysis 0.52 2.38 0.04 7167584 0.00 0.00 bitmap_elt_clear_from 0.52 2.42 0.04 4744012 0.00 0.00 walk_tree_1 0.52 2.46 0.04 4472070 0.00 0.00 is_gimple_reg 0.52 2.50 0.04 1727617 0.00 0.00 gsi_start_phis 0.52 2.54 0.04 1054932 0.00 0.00 for_each_rtx_1 0.52 2.58 0.04 824795 0.00 0.00 fold_binary_loc 0.52 2.62 0.04 733470 0.00 0.00 invalid_mode_change_p 0.52 2.66 0.04 396082 0.00 0.00 count_reg_usage 0.52 2.70 0.04 238716 0.00 0.00 df_chain_create 0.52 2.74 0.04 210927 0.00 0.00 cse_insn 0.52 2.78 0.04 106118 0.00 0.00 find_reloads 0.52 2.82 0.04 41200 0.00 0.00 dfs_enumerate_from 0.52 2.86 0.04 202 0.00 0.00 run_scc_vn
Same thing for hashes100.c (profile in comment #45 is for infcodes100.c), in both cases cc1 r155486 at -O2 on x86_64): Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 3.59 0.15 0.15 9725444 0.00 0.00 bitmap_set_bit 2.72 0.26 0.11 5517400 0.00 0.00 htab_find_slot_with_hash 1.98 0.34 0.08 _fini 1.73 0.41 0.07 14841100 0.00 0.00 bitmap_bit_p 1.73 0.48 0.07 1336700 0.00 0.00 nonzero_bits1 1.24 0.53 0.05 3776000 0.00 0.00 get_expr_value_id 1.24 0.58 0.05 1248000 0.00 0.00 cselib_lookup 1.24 0.63 0.05 88002 0.00 0.00 free_ira_costs 1.24 0.68 0.05 13300 0.00 0.00 df_worklist_dataflow 0.99 0.72 0.04 2893600 0.00 0.00 is_gimple_min_invariant 0.99 0.76 0.04 761910 0.00 0.00 extract_insn 0.99 0.80 0.04 739736 0.00 0.00 fold_binary_loc 0.99 0.84 0.04 577000 0.00 0.00 for_each_rtx_1 0.99 0.88 0.04 4300 0.00 0.00 remove_unused_locals 0.74 0.91 0.03 5120044 0.00 0.00 tree_strip_nop_conversions 0.74 0.94 0.03 5048648 0.00 0.00 pool_alloc 0.74 0.97 0.03 3330406 0.00 0.00 bitmap_clear 0.74 1.00 0.03 961752 0.00 0.00 rtx_alloc_stat 0.74 1.03 0.03 825000 0.00 0.00 loop_preheader_edge 0.74 1.06 0.03 643822 0.00 0.00 tree_code_size 0.74 1.09 0.03 413900 0.00 0.00 simplify_binary_operation 0.74 1.12 0.03 331100 0.00 0.00 walk_stmt_load_store_addr_ops 0.74 1.15 0.03 100200 0.00 0.00 cse_insn 0.74 1.18 0.03 88700 0.00 0.00 make_compound_operation 0.74 1.21 0.03 79600 0.00 0.00 add_control_edge 0.74 1.24 0.03 35500 0.00 0.00 df_lr_bb_local_compute 0.74 1.27 0.03 300 0.00 0.00 find_costs_and_classes
Created attachment 19402 [details] profile for cc1 r155486 on x86_64, options -O2, for infcodes100.c
Created attachment 19403 [details] profile for cc1 r155486 on x86_64, options -O2, for hashes100.c
For hashes100.c, combine+IRA+expand+tree-PRE accounts for 1/3 of the total compile time. For infcodes100.c, the profile is more flat, but IRA+expand still account for 1/4 of the total compile time. Why is IRA so slow?
ira_traverse_loop_tree and its callee process_bb_node_for_cost accounts for nearly as much time as reload takes on hashes.c. Btw, I find valgrind --tool=callgrind together with kcachegrind a better tool for benchmarking this kind of stuff - obviously on the smaller testcases. It has deterministic timing at least.
GCC 4.3.5 is being released, adjusting target milestone.
hashes oprofile is as follows: samples % app name symbol name 8237 2.9617 libc-2.11.1.so _int_malloc 6877 2.4727 cc1 bitmap_set_bit 5820 2.0927 cc1 htab_find_slot_with_hash 4238 1.5238 cc1 operand_equal_p 4233 1.5220 cc1 record_reg_classes.constprop.9 4069 1.4631 cc1 bitmap_bit_p 3780 1.3592 libc-2.11.1.so memset 3399 1.2222 cc1 nonzero_bits1 2992 1.0758 cc1 ggc_internal_alloc_stat 2516 0.9047 libc-2.11.1.so _int_free 2454 0.8824 as /usr/bin/as 2244 0.8069 cc1 fold_binary_loc 2229 0.8015 libc-2.11.1.so malloc 1909 0.6864 libc-2.11.1.so free 1857 0.6677 cc1 constrain_operands.part.12 1696 0.6098 cc1 is_gimple_reg 1688 0.6069 cc1 et_splay 1667 0.5994 cc1 note_stores 1632 0.5868 cc1 walk_tree_1 1585 0.5699 cc1 for_each_rtx_1 1577 0.5670 cc1 pool_alloc 1572 0.5652 cc1 tree_strip_nop_conversions and the other testcase: samples % image name app name symbol name 5195 5.1031 cc1 cc1 operand_equal_p 2848 2.7976 cc1 cc1 bitmap_set_bit 2080 2.0432 libc-2.11.1.so libc-2.11.1.so _int_malloc 1879 1.8458 cc1 cc1 htab_find_slot_with_hash 1432 1.4067 cc1 cc1 bitmap_bit_p 1298 1.2750 cc1 cc1 mem_attrs_htab_eq 1290 1.2672 cc1 cc1 record_reg_classes.constprop.9 1063 1.0442 cc1 cc1 walk_tree_1 1010 0.9921 cc1 cc1 tree_strip_nop_conversions 1004 0.9862 cc1 cc1 bitmap_ior_into 963 0.9460 libc-2.11.1.so libc-2.11.1.so memset 934 0.9175 cc1 cc1 df_worklist_dataflow 882 0.8664 cc1 cc1 mark_all_vars_used_1 806 0.7917 as as /usr/bin/as 795 0.7809 opreport opreport /usr/bin/opreport 793 0.7790 cc1 cc1 ggc_internal_alloc_stat 784 0.7701 cc1 cc1 is_gimple_reg 732 0.7190 cc1 cc1 constrain_operands.part.12 701 0.6886 cc1 cc1 bitmap_ior_and_compl 672 0.6601 cc1 cc1 et_splay 650 0.6385 libc-2.11.1.so libc-2.11.1.so _int_free 612 0.6012 cc1 cc1 for_each_rtx_1 608 0.5972 libc-2.11.1.so libc-2.11.1.so memcpy 603 0.5923 libc-2.11.1.so libc-2.11.1.so malloc 596 0.5855 cc1 cc1 bitmap_elt_insert_after 581 0.5707 cc1 cc1 ix86_decompose_address 570 0.5599 cc1 cc1 bitmap_copy 568 0.5580 cc1 cc1 cse_insn so apparently not much of surprise.
Concerning Steven's comment, for infcode the time report is: callgraph optimization: 0.07 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 139 kB ( 0%) ggc cfg cleanup : 0.22 ( 2%) usr 0.00 ( 0%) sys 0.19 ( 2%) wall 2591 kB ( 1%) ggc trivially dead code : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 0 kB ( 0%) ggc df scan insns : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 1%) wall 11 kB ( 0%) ggc df reaching defs : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall 0 kB ( 0%) ggc df live regs : 0.49 ( 4%) usr 0.00 ( 0%) sys 0.43 ( 4%) wall 0 kB ( 0%) ggc df live&initialized regs: 0.37 ( 3%) usr 0.00 ( 0%) sys 0.44 ( 4%) wall 0 kB ( 0%) ggc df use-def / def-use chains: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 0.16 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall 1805 kB ( 1%) ggc register information : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 0 kB ( 0%) ggc alias analysis : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall 8010 kB ( 4%) ggc alias stmt walking : 0.09 ( 1%) usr 0.02 ( 7%) sys 0.09 ( 1%) wall 268 kB ( 0%) ggc parser : 0.16 ( 1%) usr 0.02 ( 7%) sys 0.20 ( 2%) wall 11320 kB ( 6%) ggc tree gimplify : 0.15 ( 1%) usr 0.02 ( 7%) sys 0.15 ( 1%) wall 16123 kB ( 8%) ggc tree CFG cleanup : 0.18 ( 2%) usr 0.01 ( 3%) sys 0.13 ( 1%) wall 881 kB ( 0%) ggc tree VRP : 0.43 ( 4%) usr 0.02 ( 7%) sys 0.38 ( 3%) wall 20850 kB (10%) ggc tree copy propagation : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall 1638 kB ( 1%) ggc tree PTA : 0.35 ( 3%) usr 0.00 ( 0%) sys 0.35 ( 3%) wall 331 kB ( 0%) ggc tree SSA rewrite : 0.11 ( 1%) usr 0.01 ( 3%) sys 0.13 ( 1%) wall 6269 kB ( 3%) ggc tree SSA incremental : 0.17 ( 1%) usr 0.01 ( 3%) sys 0.16 ( 1%) wall 1321 kB ( 1%) ggc dominator optimization: 0.18 ( 2%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall 5507 kB ( 3%) ggc tree CCP : 0.19 ( 2%) usr 0.00 ( 0%) sys 0.10 ( 1%) wall 1065 kB ( 1%) ggc tree reassociation : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 1273 kB ( 1%) ggc tree PRE : 0.52 ( 5%) usr 0.02 ( 7%) sys 0.65 ( 6%) wall 6675 kB ( 3%) ggc tree FRE : 0.23 ( 2%) usr 0.02 ( 7%) sys 0.24 ( 2%) wall 801 kB ( 0%) ggc tree forward propagate: 0.06 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 751 kB ( 0%) ggc tree aggressive DCE : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.16 ( 1%) wall 2770 kB ( 1%) ggc tree DSE : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 34 kB ( 0%) ggc scev constant prop : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 4368 kB ( 2%) ggc tree iv optimization : 0.22 ( 2%) usr 0.02 ( 7%) sys 0.27 ( 2%) wall 18289 kB ( 9%) ggc dominance computation : 0.07 ( 1%) usr 0.01 ( 3%) sys 0.08 ( 1%) wall 0 kB ( 0%) ggc out of ssa : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall 3 kB ( 0%) ggc expand : 1.08 ( 9%) usr 0.00 ( 0%) sys 1.08 ( 9%) wall 17470 kB ( 9%) ggc forward prop : 0.16 ( 1%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall 2050 kB ( 1%) ggc CSE : 0.36 ( 3%) usr 0.00 ( 0%) sys 0.30 ( 3%) wall 173 kB ( 0%) ggc dead store elim1 : 0.10 ( 1%) usr 0.00 ( 0%) sys 0.11 ( 1%) wall 1638 kB ( 1%) ggc dead store elim2 : 0.11 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall 1696 kB ( 1%) ggc loop invariant motion : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.10 ( 1%) wall 0 kB ( 0%) ggc CPROP : 0.19 ( 2%) usr 0.00 ( 0%) sys 0.22 ( 2%) wall 2907 kB ( 1%) ggc PRE : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.09 ( 1%) wall 89 kB ( 0%) ggc CSE 2 : 0.28 ( 2%) usr 0.00 ( 0%) sys 0.20 ( 2%) wall 118 kB ( 0%) ggc combiner : 0.36 ( 3%) usr 0.00 ( 0%) sys 0.42 ( 4%) wall 7042 kB ( 3%) ggc integrated RA : 0.56 ( 5%) usr 0.00 ( 0%) sys 0.52 ( 4%) wall 2972 kB ( 1%) ggc reload : 0.30 ( 3%) usr 0.00 ( 0%) sys 0.34 ( 3%) wall 833 kB ( 0%) ggc reload CSE regs : 0.27 ( 2%) usr 0.00 ( 0%) sys 0.29 ( 2%) wall 3270 kB ( 2%) ggc hard reg cprop : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.07 ( 1%) wall 21 kB ( 0%) ggc scheduling 2 : 0.38 ( 3%) usr 0.00 ( 0%) sys 0.40 ( 3%) wall 211 kB ( 0%) ggc reorder blocks : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 1%) wall 2178 kB ( 1%) ggc final : 0.16 ( 1%) usr 0.01 ( 3%) sys 0.15 ( 1%) wall 99 kB ( 0%) ggc rest of compilation : 0.27 ( 2%) usr 0.01 ( 3%) sys 0.23 ( 2%) wall 968 kB ( 0%) ggc remove unused locals : 0.34 ( 3%) usr 0.00 ( 0%) sys 0.27 ( 2%) wall 0 kB ( 0%) ggc TOTAL : 11.46 0.30 11.77 201495 kB so IRA, expand, VRP and DF is still slow despite the fix in accounting vars. For hashes: ipa pure const : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 2%) wall 4617 kB ( 3%) ggc cfg cleanup : 0.10 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 2%) wall 860 kB ( 1%) ggc df scan insns : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 1%) wall 32 kB ( 0%) ggc df reaching defs : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 1%) wall 0 kB ( 0%) ggc df live regs : 0.15 ( 2%) usr 0.00 ( 0%) sys 0.11 ( 2%) wall 0 kB ( 0%) ggc df live&initialized regs: 0.04 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 0.15 ( 2%) usr 0.00 ( 0%) sys 0.12 ( 2%) wall 1856 kB ( 1%) ggc alias analysis : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.16 ( 2%) wall 8517 kB ( 5%) ggc parser : 0.07 ( 1%) usr 0.03 (12%) sys 0.11 ( 2%) wall 5992 kB ( 4%) ggc tree gimplify : 0.08 ( 1%) usr 0.01 ( 4%) sys 0.07 ( 1%) wall 9748 kB ( 6%) ggc tree CFG cleanup : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 2%) wall 489 kB ( 0%) ggc tree VRP : 0.19 ( 3%) usr 0.01 ( 4%) sys 0.21 ( 3%) wall 8356 kB ( 5%) ggc tree copy propagation : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.09 ( 1%) wall 607 kB ( 0%) ggc tree PTA : 0.29 ( 4%) usr 0.00 ( 0%) sys 0.29 ( 4%) wall 235 kB ( 0%) ggc tree SSA rewrite : 0.05 ( 1%) usr 0.01 ( 4%) sys 0.04 ( 1%) wall 3867 kB ( 2%) ggc tree SSA incremental : 0.06 ( 1%) usr 0.02 ( 8%) sys 0.03 ( 0%) wall 796 kB ( 0%) ggc dominator optimization: 0.07 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 1%) wall 2538 kB ( 2%) ggc tree CCP : 0.19 ( 3%) usr 0.00 ( 0%) sys 0.22 ( 3%) wall 1652 kB ( 1%) ggc tree PRE : 0.37 ( 5%) usr 0.02 ( 8%) sys 0.40 ( 6%) wall 12227 kB ( 7%) ggc tree FRE : 0.15 ( 2%) usr 0.00 ( 0%) sys 0.15 ( 2%) wall 1216 kB ( 1%) ggc tree code sinking : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 358 kB ( 0%) ggc tree aggressive DCE : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 2%) wall 3531 kB ( 2%) ggc tree loop invariant motion: 0.04 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc complete unrolling : 0.08 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 2857 kB ( 2%) ggc tree iv optimization : 0.47 ( 7%) usr 0.02 ( 8%) sys 0.45 ( 6%) wall 17694 kB (11%) ggc dominance computation : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc out of ssa : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 10 kB ( 0%) ggc expand vars : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 6220 kB ( 4%) ggc expand : 0.23 ( 3%) usr 0.00 ( 0%) sys 0.24 ( 3%) wall 12120 kB ( 7%) ggc forward prop : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 1821 kB ( 1%) ggc CSE : 0.10 ( 1%) usr 0.00 ( 0%) sys 0.16 ( 2%) wall 34 kB ( 0%) ggc dead code elimination : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc dead store elim1 : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 829 kB ( 0%) ggc CPROP : 0.23 ( 3%) usr 0.01 ( 4%) sys 0.18 ( 2%) wall 2085 kB ( 1%) ggc PRE : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 1%) wall 0 kB ( 0%) ggc CSE 2 : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 40 kB ( 0%) ggc branch prediction : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 2878 kB ( 2%) ggc combiner : 0.65 ( 9%) usr 0.01 ( 4%) sys 0.69 (10%) wall 14078 kB ( 8%) ggc regmove : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc integrated RA : 0.51 ( 7%) usr 0.00 ( 0%) sys 0.50 ( 7%) wall 5209 kB ( 3%) ggc reload : 0.19 ( 3%) usr 0.00 ( 0%) sys 0.26 ( 4%) wall 369 kB ( 0%) ggc reload CSE regs : 0.20 ( 3%) usr 0.00 ( 0%) sys 0.15 ( 2%) wall 2212 kB ( 1%) ggc scheduling 2 : 0.21 ( 3%) usr 0.01 ( 4%) sys 0.16 ( 2%) wall 141 kB ( 0%) ggc machine dep reorg : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 1%) wall 38 kB ( 0%) ggc reorder blocks : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 1%) wall 340 kB ( 0%) ggc final : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 1%) wall 167 kB ( 0%) ggc rest of compilation : 0.10 ( 1%) usr 0.02 ( 8%) sys 0.11 ( 2%) wall 1437 kB ( 1%) ggc remove unused locals : 0.07 ( 1%) usr 0.01 ( 4%) sys 0.10 ( 1%) wall 0 kB ( 0%) ggc unaccounted todo : 0.05 ( 1%) usr 0.01 ( 4%) sys 0.06 ( 1%) wall 0 kB ( 0%) ggc TOTAL : 6.94 0.26 7.21 166488 kB Here we now have ivopts, combiner and IRA showing top. Steven, would be possible to have comparsion with 4.6 included?
PR47344 now tracks the "regression" property of this bug.
Some new timings: hashes.c -O0: 3.4.6 4.2.4 4.3.3 4.4.7 4.6.3 user 0.02076 0.02037 0.01972 0.02291 0.02286 sys 0.00383 0.00483 0.00468 0.00583 0.00568 hashes.c -O1: 3.4.6 4.2.4 4.3.3 4.4.7 4.6.3 user 0.0278 0.0472 0.0459 0.0539 0.0572 sys 0.0046 0.0055 0.0066 0.0075 0.0082 hashes.c -O2: 3.4.6 4.2.4 4.3.3 4.4.7 4.6.3 user 0.0452 0.0700 0.0744 0.0773 0.0842 sys 0.0047 0.0065 0.0060 0.0083 0.0086 infcodes.c -O0: 3.4.6 4.2.4 4.3.3 4.4.7 4.6.3 user 0.0374 0.0396 0.0421 0.0499 0.0490 sys 0.0046 0.0059 0.0065 0.0064 0.0071 infcodes.c -O1: 3.4.6 4.2.4 4.3.3 4.4.7 4.6.3 user 0.0481 0.0886 0.0821 0.0950 0.0849 sys 0.0046 0.0074 0.0066 0.0092 0.0070 infcodes.c -O2: 3.4.6 4.2.4 4.3.3 4.4.7 4.6.3 user 0.0745 0.1303 0.1306 0.1389 0.1354 sys 0.0054 0.0096 0.0096 0.0088 0.0091
GCC 12 now takes sub-second time for the larger testcases and with optimization it's now early VRP taking most of the time.