Summary: | cannot compile big function | ||
---|---|---|---|
Product: | gcc | Reporter: | Klaus Grue <grue> |
Component: | middle-end | Assignee: | Richard Biener <rguenth> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | gcc-bugs, hubicka, lucier, pawel_sikora, rguenth, steven, vmakarov, zadeck |
Priority: | P2 | Keywords: | compile-time-hog, memory-hog |
Version: | 4.3.1 | ||
Target Milestone: | --- | ||
Host: | x86_64-suse-linux | Target: | x86_64-suse-linux |
Build: | x86_64-suse-linux | Known to work: | 4.2.4, 6.0 |
Known to fail: | 4.3.2, 4.6.0, 4.7.2, 4.8.0, 5.3.0 | Last reconfirmed: | 2008-09-18 21:11:40 |
Bug Depends on: | |||
Bug Blocks: | 47344 | ||
Attachments: |
The source mentioned in the bug description
The statistics mentioned in the bug description detailed memory stats with checking enabled and long statistic counters memory and time statistics for compiling lgwam.c |
Description
Klaus Grue
2008-09-09 18:34:11 UTC
Created attachment 16281 [details]
The source mentioned in the bug description
Created attachment 16282 [details]
The statistics mentioned in the bug description
It indeed uses about 1.4GB ram and takes 26s (I think that's reasonable) to build the testcase. It's the DF initialize pass that requires this much memory. The function in question has 31671 basic blocks. dfinish doesn't seem to free it btw. (gdb) print *df $56 = {problems_in_order = {0x1b25920, 0x1b259a0, 0x0, 0x0, 0x0, 0x0}, problems_by_index = {0x1b25920, 0x1b259a0, 0x0, 0x0, 0x0, 0x0}, num_problems_defined = 2, blocks_to_analyze = 0x0, analyze_subset = 0 '\0', redo_entry_and_exit = 0 '\0', def_info = {refs = 0x0, begin = 0x19ee470, count = 0x38e3e2e0, refs_size = 0, table_size = 0, total_size = 7482051, ref_order = DF_REF_ORDER_NO_TABLE}, use_info = {refs = 0x0, begin = 0x38ef7fe0, count = 0x38fb1ce0, refs_size = 0, table_size = 0, total_size = 644313, ref_order = DF_REF_ORDER_NO_TABLE}, def_regs = 0x2f766c50, use_regs = 0x38b56f20, eq_use_regs = 0x38cca900, regs_size = 190267, regs_inited = 152214, insns = 0x1d48fb0, insns_size = 792538, hardware_regs_used = 0x2ef1e640, regular_block_artificial_uses = 0x2ef1e660, eh_block_artificial_uses = 0x2ef1e680, entry_block_defs = 0x2ef1e6a0, exit_block_uses = 0x2ef1e6c0, insns_to_delete = 0x2f928b40, insns_to_rescan = 0x2f928b60, insns_to_notes_rescan = 0x2f928b80, postorder = 0x30d2c60, postorder_inverted = 0x30f1b50, n_blocks = 31671, n_blocks_inverted = 31671, hard_regs_live_count = 0x1940610, ref_order = 39346414, changeable_flags = 0} Kenny, any idea where the memory leaks? On the trunk compile-time sky-rocketed there in building the cgraph. Honza, what's going on there - it doesn't seem to even finish ;) With ENABLE_CHECKING we spend ages checking that there are no duplicated edges in callgraph. Will fix it shortly. Now we spend eons in walking lexical blocks, since add_lexical_block called by inliner walks till end of the chain. Is the order of blocks important at all? (i.e. we don't insert them to proper place anyway, just as last block, we can perhaps as well insert them as first?) Honza #0 0x000000000099c80d in ira_allocno_live_ranges_intersect_p (a1=0x33da8ae0, a2=0x34395120) at ../../gcc/ira-conflicts.c:535 #1 0x000000000099ff0e in coalesced_allocno_conflict_p (a1=0x34e221b0, a2=0x33da8ae0, reload_p=1 '\001') at ../../gcc/ira-color.c:1436 #2 0x00000000009a2904 in ira_sort_regnos_for_alter_reg (pseudo_regnos=0x5eb441f0, n=<value optimized out>, reg_max_ref_width=0x580ffb00) at ../../gcc/ira-color.c:2278 #3 0x00000000005e4f86 in reload (first=0x2aaaacf99680, global=1) at ../../gcc/reload1.c:907 #4 0x000000000099472c in rest_of_handle_ira () at ../../gcc/ira.c:1867 #5 0x000000000059c211 in execute_one_pass (pass=0xdf3020) at ../../gcc/passes.c:1279 #6 0x000000000059c445 in execute_pass_list (pass=0xdf3020) at ../../gcc/passes.c:1327 #7 0x000000000059c45d in execute_pass_list (pass=0xdee140) at ../../gcc/passes.c:1328 #8 0x000000000066bc07 in tree_rest_of_compilation (fndecl=0x2aaaab9ae200) at ../../gcc/tree-optimize.c:418 OK, one hour later (at -O2) we spent all the time in IRA/reload. Vladimir, could you please take a look? Honza And at -O0 we still need about 2GB (because of dataflow leaks?), otherwise we seem fine compilation time wise garbage collection : 0.92 ( 3%) usr 0.00 ( 0%) sys 0.93 ( 3%) wall 0 kB ( 0%) ggc callgraph construction: 0.48 ( 1%) usr 0.03 ( 2%) sys 0.53 ( 2%) wall 22055 kB ( 4%) ggc callgraph optimization: 0.44 ( 1%) usr 0.02 ( 1%) sys 0.48 ( 1%) wall 16397 kB ( 3%) ggc cfg cleanup : 0.10 ( 0%) usr 0.00 ( 0%) sys 0.10 ( 0%) wall 3 kB ( 0%) ggc trivially dead code : 0.42 ( 1%) usr 0.00 ( 0%) sys 0.44 ( 1%) wall 0 kB ( 0%) ggc df live regs : 0.80 ( 2%) usr 0.01 ( 1%) sys 0.79 ( 2%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 1.57 ( 5%) usr 0.01 ( 1%) sys 1.59 ( 5%) wall 16931 kB ( 3%) ggc register information : 0.68 ( 2%) usr 0.00 ( 0%) sys 0.68 ( 2%) wall 0 kB ( 0%) ggc alias analysis : 0.27 ( 1%) usr 0.00 ( 0%) sys 0.26 ( 1%) wall 4290 kB ( 1%) ggc rebuild jump labels : 0.36 ( 1%) usr 0.00 ( 0%) sys 0.37 ( 1%) wall 0 kB ( 0%) ggc preprocessing : 0.32 ( 1%) usr 0.16 (10%) sys 0.41 ( 1%) wall 25859 kB ( 5%) ggc lexical analysis : 0.15 ( 0%) usr 0.33 (21%) sys 0.56 ( 2%) wall 0 kB ( 0%) ggc parser : 0.48 ( 1%) usr 0.17 (11%) sys 0.64 ( 2%) wall 52995 kB (10%) ggc inline heuristics : 0.28 ( 1%) usr 0.00 ( 0%) sys 0.27 ( 1%) wall 0 kB ( 0%) ggc tree gimplify : 0.89 ( 3%) usr 0.06 ( 4%) sys 0.96 ( 3%) wall 125511 kB (25%) ggc tree eh : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc tree CFG construction : 0.18 ( 1%) usr 0.01 ( 1%) sys 0.18 ( 1%) wall 12213 kB ( 2%) ggc tree CFG cleanup : 0.16 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc tree find ref. vars : 0.15 ( 0%) usr 0.01 ( 1%) sys 0.15 ( 0%) wall 14419 kB ( 3%) ggc tree PHI insertion : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 150 kB ( 0%) ggc tree SSA rewrite : 0.08 ( 0%) usr 0.01 ( 1%) sys 0.10 ( 0%) wall 18660 kB ( 4%) ggc tree SSA other : 0.22 ( 1%) usr 0.06 ( 4%) sys 0.21 ( 1%) wall 512 kB ( 0%) ggc tree operand scan : 0.10 ( 0%) usr 0.02 ( 1%) sys 0.17 ( 1%) wall 9402 kB ( 2%) ggc tree SSA to normal : 0.22 ( 1%) usr 0.00 ( 0%) sys 0.23 ( 1%) wall 14 kB ( 0%) ggc dominance frontiers : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc dominance computation : 0.19 ( 1%) usr 0.01 ( 1%) sys 0.17 ( 1%) wall 0 kB ( 0%) ggc expand : 5.47 (17%) usr 0.40 (26%) sys 5.82 (17%) wall 171648 kB (34%) ggc jump : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc integrated RA : 5.63 (17%) usr 0.10 ( 6%) sys 5.82 (17%) wall 5190 kB ( 1%) ggc reload : 9.42 (29%) usr 0.13 ( 8%) sys 9.55 (28%) wall 8842 kB ( 2%) ggc thread pro- & epilogue: 1.01 ( 3%) usr 0.00 ( 0%) sys 1.02 ( 3%) wall 594 kB ( 0%) ggc reg stack : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc final : 1.22 ( 4%) usr 0.01 ( 1%) sys 1.25 ( 4%) wall 1639 kB ( 0%) ggc TOTAL : 32.33 1.55 33.95 508532 kB There are 66K allocnos and 3M live ranges in function testsuite. Fixing the problem in IRA will take some time. I hope the patch will be ready in a few days. Subject: Bug 37448 Author: hubicka Date: Thu Sep 11 12:34:53 2008 New Revision: 140281 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140281 Log: PR middle-end/37448 * tree-inline.c (add_lexical_block): Replace with ... (prepend_lexical_block): ... prepend at begginig. (remap_blocks): Use it and reverse later. (expand_call_inline): Use prepend_lexical_block. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-inline.c Subject: Bug 37448 Author: hubicka Date: Thu Sep 11 12:38:57 2008 New Revision: 140284 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140284 Log: PR middle-end/37448 * cgraph.c (cgraph_create_edge): Use !cgraph_edge for sanity check. Modified: trunk/gcc/ChangeLog trunk/gcc/cgraph.c Subject: Bug 37448 Author: hjl Date: Sat Sep 13 15:57:26 2008 New Revision: 140345 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140345 Log: 2008-09-13 Vladimir Makarov <vmakarov@redhat.com> PR middle-end/37448 * ira-int.h (IRA_ALLOCNO_TEMP): Rename to ALLOCNO_TEMP. (ira_compress_allocno_live_ranges): New prototype. * ira-color.c: Rename IRA_ALLOCNO_TEMP to ALLOCNO_TEMP. (coalesced_allocnos_living_at_program_points): New. (coalesced_allocnos_live_at_points_p, set_coalesced_allocnos_live_points): New functions. (coalesce_spill_slots): Rewrite. * ira-lives.c (remove_some_program_points_and_update_live_ranges, ira_compress_allocno_live_ranges): New functions. * ira-build.c (ira_flattening): Call ira_compress_allocno_live_ranges. (ira_build): Ditto. Modified: branches/ira-merge/gcc/ChangeLog.ira branches/ira-merge/gcc/ira-build.c branches/ira-merge/gcc/ira-color.c branches/ira-merge/gcc/ira-int.h branches/ira-merge/gcc/ira-lives.c In the attached statistics file you find where this allocation overflowed Alloc-pool Kind Pools Allocated Peak Leak ------------------------------------------------------------- df_scan_ref pool 868 -1403330584 607042176 0 which is similar to what happens in 26854, so perhaps this PR could also be improved with factored use-def chains. Created attachment 16351 [details] detailed memory stats with checking enabled and long statistic counters I reran the test problem with checking enabled and this patch, which changes the memory statistic counters from int to long, installed: http://gcc.gnu.org/ml/gcc-patches/2008-09/msg01270.html So we see that 5,272,694,920 bytes are allocated in alloc_pools, and 331,198,680 btytes are allocated in bitmaps. We seem to make nice progress on the testcase ;) Conversion to FUD would probably help here, but looking at the DF dumps, the testcase don't look like your average testcase showing DU/UD chain explosion. The chains seem relatively short at least in manual inspection through first part of dump. Yet we produe about 40 DU/UD records live at once for every instruction ever emit, so it seems we are doing some wrong still. If someone more familiar with dataflow branch could look into why we have so many DU/UDs, it would be great. Honza Subject: Bug 37448 Author: hubicka Date: Thu Sep 18 17:28:40 2008 New Revision: 140463 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140463 Log: PR middle-end/37448 * ipa-reference.c (ipa_reference_local_vars_info_d, ipa_reference_global_vars_info_d, ipa_reference_local_vars_info_t, ipa_reference_global_vars_info_t, ipa_reference_vars_info_t): Move here from ipa-reference.h (node_duplication_hook_holder, node_removal_hook_holder): New. (get_reference_vars_info_from_cgraph): Rename to ... (get_reference_vars_info): ... this one, use cgraph uids. (get_local_reference_vars_info, get_global_reference_vars_info): Use cgraph instead of decl. (ipa_reference_get_read_local, ipa_reference_get_written_local): Remove. (ipa_reference_get_read_global, ipa_reference_get_not_read_global ipa_reference_get_written_global, ipa_reference_get_not_written_global): Use cgraph argument. (check_call): Simplify avail check. (scan_stmt_for_static_refs): Update. (propagate_bits): Update. (merge_callee_local_info): Remove. (init_function_info): Use cgraph nodes. (clean_function_local_data): Break out from ... (clean_function): ... here. (copy_local_bitmap, copy_global_bitmap): New functions. (duplicate_node_data, remove_node_data): New functions. (generate_summary): Register hooks; use visibility instead of master clones. (propafate): Use cgraph nodes; copy bitmap to each node in cycle. * ipa-reference.h (ipa_reference_local_vars_info_d, ipa_reference_global_vars_info_d, ipa_reference_local_vars_info_t, ipa_reference_global_vars_info_t, ipa_reference_vars_info_t): Move to ipa-reference.c (ipa_reference_get_read_local, ipa_reference_get_written_local): Remove. (ipa_reference_get_read_global, ipa_reference_get_written_global, ipa_reference_get_not_read_global, ipa_reference_get_not_written_global): Update prototype. * ipa-pure-const.c (funct_state_vec): Turn into VECtor. (init_state): Remove. (node_duplication_hook_holder, node_removal_hook_holder): New. (get_function_state, set_function_state): Use VECtor. (analyze_function): Check body availability. (add_new_function): Likewise. (duplicate_node_data, remove_node_data): New. (generate_summary): Register hooks; do not care about clones. (propafate): Do not care about clones; recursive functions are not looping. * ipa-utils.c (searchc, ipa_utils_reduced_inorder): Do not skip clones. * ipa-prop.c (edge_removal_hook_holder, node_removal_hook_holder, * edge_duplication_hook_holder, node_duplication_hook_holder): Make static. * tree-flow.h (function_ann_d): Remove reference_vars_info. * tree-ssa-opreands.c (add_call_clobber_ops, add_call_read_ops): Update call of ipa-reference accesors. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-prop.c trunk/gcc/ipa-pure-const.c trunk/gcc/ipa-reference.c trunk/gcc/ipa-reference.h trunk/gcc/ipa-utils.c trunk/gcc/tree-ssa-operands.c Subject: Bug 37448 Author: hubicka Date: Thu Sep 18 18:16:45 2008 New Revision: 140465 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140465 Log: PR middle-end/37448 * ipa-reference.c (ipa_reference_local_vars_info_d, ipa_reference_global_vars_info_d, ipa_reference_local_vars_info_t, ipa_reference_global_vars_info_t, ipa_reference_vars_info_t): Move here from ipa-reference.h (node_duplication_hook_holder, node_removal_hook_holder): New. (get_reference_vars_info_from_cgraph): Rename to ... (get_reference_vars_info): ... this one, use cgraph uids. (get_local_reference_vars_info, get_global_reference_vars_info): Use cgraph instead of decl. (ipa_reference_get_read_local, ipa_reference_get_written_local): Remove. (ipa_reference_get_read_global, ipa_reference_get_not_read_global ipa_reference_get_written_global, ipa_reference_get_not_written_global): Use cgraph argument. (check_call): Simplify avail check. (scan_stmt_for_static_refs): Update. (propagate_bits): Update. (merge_callee_local_info): Remove. (init_function_info): Use cgraph nodes. (clean_function_local_data): Break out from ... (clean_function): ... here. (copy_local_bitmap, copy_global_bitmap): New functions. (duplicate_node_data, remove_node_data): New functions. (generate_summary): Register hooks; use visibility instead of master clones. (propafate): Use cgraph nodes; copy bitmap to each node in cycle. * ipa-reference.h (ipa_reference_local_vars_info_d, ipa_reference_global_vars_info_d, ipa_reference_local_vars_info_t, ipa_reference_global_vars_info_t, ipa_reference_vars_info_t): Move to ipa-reference.c (ipa_reference_get_read_local, ipa_reference_get_written_local): Remove. (ipa_reference_get_read_global, ipa_reference_get_written_global, ipa_reference_get_not_read_global, ipa_reference_get_not_written_global): Update prototype. * ipa-pure-const.c (funct_state_vec): Turn into VECtor. (init_state): Remove. (node_duplication_hook_holder, node_removal_hook_holder): New. (get_function_state, set_function_state): Use VECtor. (analyze_function): Check body availability. (add_new_function): Likewise. (duplicate_node_data, remove_node_data): New. (generate_summary): Register hooks; do not care about clones. (propafate): Do not care about clones; recursive functions are not looping. * ipa-utils.c (searchc, ipa_utils_reduced_inorder): Do not skip clones. * ipa-prop.c (edge_removal_hook_holder, node_removal_hook_holder, * edge_duplication_hook_holder, node_duplication_hook_holder): Make static. * tree-flow.h (function_ann_d): Remove reference_vars_info. * tree-ssa-opreands.c (add_call_clobber_ops, add_call_read_ops): Update call of ipa-reference accesors. Modified: trunk/gcc/tree-flow.h Hi, some stats on instruction counts and references Kenny asked about: At -O2: jh@gcc17:~/gcc-baseline/build-ada/prev-gcc$ grep "(insn" e.i.157r.outof_cfglayout | wc -l 307811 jh@gcc17:~/gcc-baseline/build-ada/prev-gcc$ grep "(call_insn" e.i.157r.outof_cfglayout | wc -l 157016 jh@gcc17:~/gcc-baseline/build-ada/prev-gcc$ grep "(jump_insn" e.i.157r.outof_cfglayout | wc -l 18650 at -O0: jh@gcc17:~/gcc-baseline/build-ada/prev-gcc/O0$ grep "(insn" e.i.157r.outof_cfglayout | wc -l 411876 jh@gcc17:~/gcc-baseline/build-ada/prev-gcc/O0$ grep "(call_insn" e.i.157r.outof_cfglayout | wc -l 166471 jh@gcc17:~/gcc-baseline/build-ada/prev-gcc/O0$ grep "(jump_insn" e.i.157r.outof_cfglayout | wc -l 18218 I am not sure if I can grep for references in some of the RTL dumps... Honza Subject: Bug 37448 Author: vmakarov Date: Fri Sep 26 00:14:30 2008 New Revision: 140674 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140674 Log: 2008-09-25 Vladimir Makarov <vmakarov@redhat.com> PR middle-end/37448 * ira-int.h (IRA_ALLOCNO_TEMP): Rename to ALLOCNO_TEMP. (ira_compress_allocno_live_ranges): New prototype. * ira-color.c: Rename IRA_ALLOCNO_TEMP to ALLOCNO_TEMP. (coalesced_allocnos_living_at_program_points): New. (coalesced_allocnos_live_at_points_p, set_coalesced_allocnos_live_points): New functions. (coalesce_spill_slots): Rewrite. * ira-lives.c (remove_some_program_points_and_update_live_ranges, ira_compress_allocno_live_ranges): New functions. * ira-build.c (ira_flattening): Call ira_compress_allocno_live_ranges. (ira_build): Ditto. Modified: trunk/gcc/ChangeLog trunk/gcc/ira-build.c trunk/gcc/ira-color.c trunk/gcc/ira-int.h trunk/gcc/ira-lives.c The testcase now compiles at -O2 in resonable time. Checking enabled compiler: CFG verifier : 15.02 ( 5%) usr 0.01 ( 0%) sys 15.13 ( 5%) wall 0 kB ( 0%) ggc df reaching defs : 7.57 ( 2%) usr 0.27 ( 7%) sys 7.81 ( 2%) wall 0 kB ( 0%) ggc df live regs : 13.13 ( 4%) usr 0.00 ( 0%) sys 13.16 ( 4%) wall 0 kB ( 0%) ggc df live&initialized regs: 3.62 ( 1%) usr 0.00 ( 0%) sys 3.60 ( 1%) wall 0 kB ( 0%) ggc df use-def / def-use chains: 3.83 ( 1%) usr 0.03 ( 1%) sys 3.85 ( 1%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 8.56 ( 3%) usr 0.01 ( 0%) sys 8.53 ( 3%) wall 23492 kB ( 3%) ggc tree SSA verifier : 22.27 ( 7%) usr 0.04 ( 1%) sys 22.39 ( 7%) wall 0 kB ( 0%) ggc tree STMT verifier : 27.32 ( 9%) usr 0.42 (11%) sys 27.65 ( 9%) wall 0 kB ( 0%) ggc expand : 22.57 ( 7%) usr 0.10 ( 3%) sys 22.74 ( 7%) wall 142836 kB (16%) ggc integrated RA : 72.71 (24%) usr 0.00 ( 0%) sys 74.09 (23%) wall 5404 kB ( 1%) ggc reload : 17.30 ( 6%) usr 0.00 ( 0%) sys 20.38 ( 6%) wall 26179 kB ( 3%) ggc TOTAL : 306.65 3.67 316.44 908200 kB So we do resonable job here. I wonder where those 7% of expansion time goes to. We still need over 2GB RAM because of dataflow info. I am going to test how much difference Kenny's patch make. I've just tested Kenny's df scan ref union patch. We still have over 2.2GB footprint that is not that much different from unpatched compiler. allocpool statistics now give some more useful data: df_scan_ref base pool 868 -1871497632 553770408 0 copies 434 214464 134512 0 df_scan_ref artificial pool 868 77289168 11231352 0 df_scan_ref regular pool 868 737848848 72493272 0 So largest one is df_scan_ref base, not artifical defs as was suspected. About 0.5GB comming to that storage. Other large pool (now at -O2) is: allocno live ranges 434 167266736 167062320 0 And bitmaps: df-problems.c:311 (df_rd_alloc) 110920 1071636320 1071611720 1071611720 0 df-problems.c:312 (df_rd_alloc) 110920 543723720 534523840 534523840 0 df-problems.c:539 (df_rd_transfer_functio 52865 541674400 537236800 537236760 101861 tree-ssa-pre.c:584 (bitmap_set_new) 547137 1368222720 1366138880 1366138880 784222 tree-ssa-pre.c:585 (bitmap_set_new) 547137 951660760 949608960 949608960 1467881 and GGC garbage: emit-rtl.c:3343 (make_insn_raw) 33796400: 4.4% 0: 0.0% 88: 0.0% 3072408: 3.1% 384051 tree-inline.c:3600 (copy_tree_r) 36731592: 4.8% 0: 0.0% 0: 0.0% 2702392: 2.7% 478660 lists.c:143 (alloc_EXPR_LIST) 37188960: 4.8% 0: 0.0% 0: 0.0% 7437792: 7.5% 929724 tree-ssanames.c:141 (make_ssa_name_fn) 52532760: 6.8% 0: 0.0% 3480: 0.0% 3502416: 3.5% 437802 With Kenny's updated df patch and updated statistics I now get: Alloc-pool Kind Pools Elt size Allocated (elts) Peak (elts) Leak (elts) -------------------------------------------------------------------------------------------------------------- elt_loc_list 40 3863 110676080( 2766902) 13178600( 329465) 0( 0) df_scan ref base 80 868 -1871497632( -23393720) 553609440( 6920118) 0( 0) df_scan ref artificial 88 868 77289168( 878286) 11224840( 127555) 0( 0) df_scan ref regular 88 868 737848848( 8384646) 72474072( 823569) 0( 0) About 17 refs live for every instruction. Will check memory footprint after return from California. I do not believe honza. My measurements at -O0 on x86-42 are about 15 refs per insn. This is based on the following stats. (These can be reproduced using a patch that i am about to submit). ;; total ref usage 8428419{7601408d,827011u,0e} in 570685{406804 regular + 163881 call} insns. This yields about 15 refs per insn. While this number is large, it is reasonable considering that slightly less than 30% of the insns are call instructions. Call instructions have a lot of clobbers. It is possible that some mechanism could be devised to share these refs, but this will mess up things like building chains so it is certainly not something that is going to be easy to do. The df patch that i have submitted makes modest progress on reducing the size of df-refs. Hopefully bonzini will finish reviewing this soon. I should also point out that honza's alloc pool stats were completely bogus. I have submitted a patch that fixes the way stats are accumulated for alloc-pools. We can account for all of the df-refs and the peak usage according to the new alloc-pool stats is very close to the number used by the largest function. Once those patches are installed, I will consider this bugzilla resolved with respect to the df issues. Subject: Bug 37448 Author: zadeck Date: Wed Oct 8 02:52:28 2008 New Revision: 140960 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=140960 Log: 2008-10-07 Kenneth Zadeck <zadeck@naturalbridge.com> PR rtl-optimization/37448 alloc_pool_desc (elt_size): New field. alloc_pool_desc (created, allocated, current, peak): Make unsigned long. output_info (count): Renamed total_created and made unsigned long. output_info (size): Renamed total_allocated and made unsigned long. alloc-pool.c (create_alloc_pool, empty_alloc_pool, pool_alloc, pool_free): Properly keep track of desc->size. (print_statistics, dump_alloc_pool_statistics): Enhance the printing of statistics to print the number of elements and to use unsigned longs. Modified: trunk/gcc/ChangeLog trunk/gcc/alloc-pool.c I'm sorry, I haven't been reading gcc-patches recently, but this is quite similar to my patch suggested here: http://gcc.gnu.org/ml/gcc-patches/2008-09/msg01270.html which also "fixed" the counters for bitmaps. The consensus seemed to be to use HOST_WIDEST_INT for the counters (for 64-bit Windows, pointers are 64-bit and longs are 32-bit). Subject: Bug 37448 Author: zadeck Date: Sat Oct 11 23:39:21 2008 New Revision: 141067 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=141067 Log: 2008-10-11 Kenneth Zadeck <zadeck@naturalbridge.com> PR rtl-optimization/37448 * df.h: (df_ref_class): New enum. (DF_REF_TYPE_NAMES, df_ref_extract): Removed. (struct df_ref): Replaced with union df_ref_d. (df_base_ref, df_artificial_ref, df_regular_ref, df_extract_ref): New members of df_ref_d union. (DF_REF_REAL_REG, DF_REF_REGNO, DF_REF_REAL_LOC, DF_REF_REG, DF_REF_LOC, DF_REF_BB, DF_REF_INSN_INFO, DF_REF_INSN, DF_REF_CLASS, DF_REF_TYPE, DF_REF_CHAIN, DF_REF_ID, DF_REF_FLAGS, DF_REF_ORDER, DF_REF_IS_ARTIFICIAL, DF_REF_NEXT_REG, DF_REF_PREV_REG, DF_REF_EXTRACT_WIDTH, DF_REF_EXTRACT_OFFSET, DF_REF_EXTRACT_MODE): Replaced definition to access union df_ref_d. (DF_MWS_REG_DEF_P, DF_MWS_REG_USE_P, DF_MWS_TYPE): New macros. (df_scan_bb_info, df_bb_regno_first_def_find, df_bb_regno_last_def_find, df_find_def, df_find_use, df_refs_chain_dump, df_regs_chain_dump, df_ref_debug, debug_df_ref, df_chain_create, df_chain_unlink, df_chain_copy, df_ref_create, df_ref_remove, df_compute_accessed_bytes, df_get_artificial_defs, df_get_artificial_uses, union_defs) Replaced struct df_ref * with df_ref. * df-scan.c (df_collection_rec, df_null_ref_rec, df_ref_chain_delete_du_chain, df_ref_chain_delete, df_install_ref, df_grow_ref_info, df_ref_create, df_reg_chain_unlink, df_ref_compress_rec, df_ref_remove, df_ref_chain_delete_du_chain, df_ref_chain_delete, df_free_collection_rec, df_insn_rescan, df_reorganize_refs_by_reg_by_reg, df_reorganize_refs_by_reg_by_insn, df_reorganize_refs_by_reg, df_ref_change_reg_with_loc_1, df_notes_rescan, df_swap_refs, df_sort_and_compress_refs, df_install_ref, df_install_refs, df_ref_record, df_get_conditional_uses, df_get_call_refs, df_bb_refs_record, df_exit_block_uses_collect, df_record_exit_block_uses, df_reg_chain_mark, df_reg_chain_verify_unmarked, df_refs_verify): Replaced struct df_ref * with df_ref. (df_ref_record, df_uses_record, df_ref_create_structure): Added df_ref_class parameter. (df_scan_problem_data): Added new pools for different types of refs. (df_scan_free_internal, df_scan_alloc, df_free_ref, df_ref_create_structure): Processed new ref pools. (df_scan_start_dump): Added counts of refs and insns. (df_ref_create, df_notes_rescan, df_def_record_1, df_uses_record, df_get_call_refs, df_insn_refs_collect, df_bb_refs_collect, df_entry_block_defs_collect, df_exit_block_uses_collect): Added code to pass df_ref_class down to ref creation functions. (df_reg_chain_unlink, df_ref_remove, df_ref_change_reg_with_loc_1, df_reg_chain_mark): Use macros to hide references to df_refs. (df_ref_chain_change_bb): Removed. (df_insn_change_bb): Remove calls to df_ref_insn_change_bb. (df_ref_equal_p, df_ref_compare, df_ref_create_structure): Enhanced to understand df_ref union structure. * fwprop.c (local_ref_killed_between_p, use_killed_between, all_uses_available_at, update_df, try_fwprop_subst, forward_propagate_subreg, forward_propagate_and_simplify, forward_propagate_into, fwprop, fwprop_addr): Replaced struct df_ref * with df_ref. (use_killed_between, all_uses_available_at): Use macros to hide references to df_refs. * regstat.c (regstat_bb_compute_ri, regstat_bb_compute_calls_crossed): Replaced struct df_ref * with df_ref. * see.c (see_handle_relevant_defs, see_handle_relevant_uses, see_handle_relevant_refs, see_analyze_one_def, see_update_relevancy, see_propagate_extensions_to_uses): Replaced struct df_ref * with df_ref. * ra-conflict.c (record_one_conflict, clear_reg_in_live, global_conflicts): Replaced struct df_ref * with df_ref. * ddg.c (create_ddg_dep_from_intra_loop_link, add_cross_iteration_register_deps, build_inter_loop_deps): Replaced struct df_ref * with df_ref. (create_ddg_dep_from_intra_loop_link, add_cross_iteration_register_deps): Use macros to hide references to df_refs. * auto-inc-dec.c (find_inc, merge_in_block): Replaced struct df_ref * with df_ref. * df-core.c (df_bb_regno_first_def_find, df_bb_regno_last_def_find, df_find_def, df_find_use, df_refs_chain_dump, df_regs_chain_dump, df_ref_debug, debug_df_ref): Replaced struct df_ref * with df_ref. (df_mws_dump, df_ref_debug): Use macros to hide references to df_refs. * cse.c (cse_extended_basic_block): Replaced struct df_ref * with df_ref. * web.c (union_defs, entry_register, replace_ref, web_main): Replaced struct df_ref * with df_ref. (union_defs, replace_ref): Use macros to hide references to df_refs. * global.c (compute_regs_asm_clobbered, build_insn_chain): Replaced struct df_ref * with df_ref. * ifcvt.c (dead_or_predicable): Replaced struct df_ref * with df_ref. * sel-sched-ir.c (maybe_downgrade_id_to_use, setup_id_reg_sets, ): Replaced struct df_ref * with df_ref. * ira-lives.c (mark_ref_live, def_conflicts_with_inputs_p, mark_ref_dead, process_bb_node_lives): Replaced struct df_ref * with df_ref. * local-alloc.c (block_alloc): Replaced struct df_ref * with df_ref. * df-byte-scan.c (df_compute_accessed_bytes_extract, df_compute_accessed_bytes_strict_low_part, df_compute_accessed_bytes_subreg, df_compute_accessed_bytes): Replaced struct df_ref * with df_ref. (df_compute_accessed_bytes): Use macros to hide references to df_refs. * init-regs.c (initialize_uninitialized_regs): Replaced struct df_ref * with df_ref. * loop-invariant.c (invariant_for_use, hash_invariant_expr_1, check_dependency, check_dependencies, record_uses): Replaced struct df_ref * with df_ref. (invariant_for_use, check_dependency): Use macros to hide references to df_refs. * loop-iv.c (iv_analysis_loop_init, iv_get_reaching_def, get_biv_step_1, get_biv_step, record_iv, iv_analyze_def, iv_analyze, biv_p): Replaced struct df_ref * with df_ref. (iv_analysis_loop_init, iv_get_reaching_def): Use macros to hide references to df_refs. * ira.c (compute_regs_asm_clobbered): Replaced struct df_ref * with df_ref. * combine.c (create_log_links): Replaced struct df_ref * with df_ref. * df-problems.c (df_rd_bb_local_compute_process_def, df_lr_bb_local_compute, df_live_bb_local_compute, df_chain_create, df_chain_unlink_1, df_chain_unlink, df_chain_copy, df_chain_remove_problem, df_chain_create_bb_process_use, df_chain_create_bb, df_chain_top_dump, df_chain_bottom_dump, df_byte_lr_check_regs, df_byte_lr_bb_local_compute, df_byte_lr_simulate_defs, df_byte_lr_simulate_uses, df_byte_lr_simulate_artificial_refs_at_top, df_byte_lr_simulate_artificial_refs_at_end, df_create_unused_note, df_note_bb_compute, df_note_add_problem, df_simulate_defs, df_simulate_uses, df_simulate_artificial_refs_at_end, df_simulate_artificial_refs_at_top): Replaced struct df_ref * with df_ref. (df_chain_dump): Use macros to hide references to df_refs. * config/mips/mips.c (r10k_simplify_address): Replaced struct df_ref * with df_ref. * dce.c (mark_nonreg_stores, delete_corresponding_reg_eq_notes, mark_artificial_uses, mark_reg_dependencies, byte_dce_process_block): Replaced struct df_ref * with df_ref. Modified: trunk/gcc/ChangeLog trunk/gcc/auto-inc-dec.c trunk/gcc/combine.c trunk/gcc/config/mips/mips.c trunk/gcc/cse.c trunk/gcc/dce.c trunk/gcc/ddg.c trunk/gcc/df-byte-scan.c trunk/gcc/df-core.c trunk/gcc/df-problems.c trunk/gcc/df-scan.c trunk/gcc/df.h trunk/gcc/fwprop.c trunk/gcc/global.c trunk/gcc/ifcvt.c trunk/gcc/init-regs.c trunk/gcc/ira-lives.c trunk/gcc/ira.c trunk/gcc/local-alloc.c trunk/gcc/loop-invariant.c trunk/gcc/loop-iv.c trunk/gcc/ra-conflict.c trunk/gcc/regstat.c trunk/gcc/see.c trunk/gcc/sel-sched-ir.c trunk/gcc/web.c GCC 4.3.3 is being released, adjusting target milestone. The testcase is now failing on mainline. Reason seems to be new pure-const pass causing PRE to take unacceptable amount of time producing extraordinarily large bitmaps. I stoppped it after while and out of 1GB of memory use, 700MB was PRE bitmap. Honza tree-ssa-pre.c:4201 (init_pre) 68 544319360 2960 2920 35303438 tree-ssa-pre.c:590 (bitmap_set_new) 187851 167503080 38440200 12734360 183944666 tree-ssa-pre.c:591 (bitmap_set_new) 187851 96183960 26228960 14024680 223113273 GCC 4.3.4 is being released, adjusting target milestone. Updated timings and memory: > ~/bin/maxmem2.sh /usr/bin/time gcc-4.5 -S -o /dev/null lgwam.c 32.62user 1.48system 0:34.41elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+333822minor)pagefaults 0swaps total: 1333745 kB > ~/bin/maxmem2.sh /usr/bin/time /space/rguenther/install/gcc-4.4.3/bin/gcc -S -o /dev/null lgwam.c 35.01user 1.54system 0:36.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+329139minor)pagefaults 0swaps total: 1306898 kB > ~/bin/maxmem2.sh /usr/bin/time /space/rguenther/install/gcc-4.3.4/bin/gcc -S -o /dev/null lgwam.c 27.42user 1.83system 0:29.61elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k 16inputs+0outputs (0major+374338minor)pagefaults 0swaps total: 1341721 kB > ~/bin/maxmem2.sh /usr/bin/time /space/rguenther/install/gcc-4.2.3/bin/gcc -S -o /dev/null lgwam.c 15.33user 0.80system 0:16.31elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k 256inputs+0outputs (2major+197427minor)pagefaults 0swaps total: 598733 kB time-report for 4.5 trunk (only interesting parts): expand : 6.72 (20%) usr 0.72 (26%) sys 7.45 (20%) wall 140609 kB (35%) ggc integrated RA : 8.56 (25%) usr 0.12 ( 4%) sys 8.76 (24%) wall 5151 kB ( 1%) ggc reload : 7.72 (23%) usr 0.22 ( 8%) sys 7.94 (21%) wall TOTAL : 34.17 2.79 37.15 402913 kB memory-usage is still high compared to 4.2. At -O2 the picture is similar (memory peaks at 2.5GB): Execution times (seconds) garbage collection : 0.64 ( 0%) usr 0.00 ( 0%) sys 0.64 ( 0%) wall 0 kB ( 0%) ggc callgraph construction: 0.31 ( 0%) usr 0.08 ( 1%) sys 0.39 ( 0%) wall 20525 kB ( 3%) ggc callgraph optimization: 1.40 ( 1%) usr 0.03 ( 0%) sys 1.46 ( 1%) wall 639 kB ( 0%) ggc ipa cp : 0.20 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 9607 kB ( 1%) ggc ipa reference : 0.14 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 0%) wall 0 kB ( 0%) ggc ipa pure const : 0.33 ( 0%) usr 0.00 ( 0%) sys 0.32 ( 0%) wall 199 kB ( 0%) ggc cfg cleanup : 0.92 ( 0%) usr 0.00 ( 0%) sys 0.89 ( 0%) wall 1168 kB ( 0%) ggc trivially dead code : 0.92 ( 0%) usr 0.00 ( 0%) sys 0.89 ( 0%) wall 0 kB ( 0%) ggc df multiple defs : 0.86 ( 0%) usr 0.01 ( 0%) sys 0.88 ( 0%) wall 0 kB ( 0%) ggc df reaching defs : 2.69 ( 1%) usr 0.96 (12%) sys 3.94 ( 2%) wall 0 kB ( 0%) ggc df live regs : 7.15 ( 3%) usr 0.07 ( 1%) sys 7.13 ( 3%) wall 0 kB ( 0%) ggc df live&initialized regs: 3.33 ( 2%) usr 0.06 ( 1%) sys 3.36 ( 1%) wall 0 kB ( 0%) ggc df use-def / def-use chains: 0.85 ( 0%) usr 0.03 ( 0%) sys 0.88 ( 0%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 6.31 ( 3%) usr 0.05 ( 1%) sys 12.84 ( 6%) wall 18222 kB ( 2%) ggc register information : 3.42 ( 2%) usr 0.00 ( 0%) sys 3.46 ( 2%) wall 0 kB ( 0%) ggc alias analysis : 1.14 ( 1%) usr 0.01 ( 0%) sys 1.17 ( 1%) wall 10447 kB ( 1%) ggc alias stmt walking : 4.39 ( 2%) usr 0.26 ( 3%) sys 4.53 ( 2%) wall 0 kB ( 0%) ggc register scan : 0.20 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 7 kB ( 0%) ggc rebuild jump labels : 0.53 ( 0%) usr 0.00 ( 0%) sys 0.51 ( 0%) wall 0 kB ( 0%) ggc preprocessing : 0.64 ( 0%) usr 0.35 ( 4%) sys 0.93 ( 0%) wall 23140 kB ( 3%) ggc lexical analysis : 0.30 ( 0%) usr 0.54 ( 7%) sys 0.88 ( 0%) wall 0 kB ( 0%) ggc parser : 0.69 ( 0%) usr 0.38 ( 5%) sys 1.09 ( 0%) wall 38129 kB ( 5%) ggc inline heuristics : 1.18 ( 1%) usr 0.01 ( 0%) sys 1.15 ( 1%) wall 29832 kB ( 4%) ggc integration : 3.42 ( 2%) usr 0.45 ( 6%) sys 3.66 ( 2%) wall 175322 kB (22%) ggc tree gimplify : 1.08 ( 1%) usr 0.09 ( 1%) sys 1.17 ( 1%) wall 104718 kB (13%) ggc tree eh : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc tree CFG construction : 0.14 ( 0%) usr 0.01 ( 0%) sys 0.14 ( 0%) wall 11817 kB ( 1%) ggc tree CFG cleanup : 1.23 ( 1%) usr 0.00 ( 0%) sys 1.21 ( 1%) wall 60 kB ( 0%) ggc tree VRP : 3.08 ( 1%) usr 0.01 ( 0%) sys 3.09 ( 1%) wall 6719 kB ( 1%) ggc tree copy propagation : 1.22 ( 1%) usr 0.02 ( 0%) sys 1.14 ( 0%) wall 585 kB ( 0%) ggc tree find ref. vars : 0.11 ( 0%) usr 0.01 ( 0%) sys 0.12 ( 0%) wall 6045 kB ( 1%) ggc tree PTA : 0.62 ( 0%) usr 0.00 ( 0%) sys 0.68 ( 0%) wall 2178 kB ( 0%) ggc tree PHI insertion : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 185 kB ( 0%) ggc tree SSA rewrite : 2.92 ( 1%) usr 0.06 ( 1%) sys 3.04 ( 1%) wall 54148 kB ( 7%) ggc tree SSA other : 0.24 ( 0%) usr 0.08 ( 1%) sys 0.34 ( 0%) wall 589 kB ( 0%) ggc tree SSA incremental : 3.46 ( 2%) usr 0.03 ( 0%) sys 3.50 ( 2%) wall 187 kB ( 0%) ggc tree operand scan : 1.81 ( 1%) usr 0.31 ( 4%) sys 2.40 ( 1%) wall 53424 kB ( 7%) ggc dominator optimization: 0.73 ( 0%) usr 0.00 ( 0%) sys 0.73 ( 0%) wall 5666 kB ( 1%) ggc tree SRA : 0.01 ( 0%) usr 0.01 ( 0%) sys 0.02 ( 0%) wall 2 kB ( 0%) ggc tree CCP : 0.95 ( 0%) usr 0.00 ( 0%) sys 0.93 ( 0%) wall 617 kB ( 0%) ggc tree PHI const/copy prop: 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 8 kB ( 0%) ggc tree reassociation : 0.20 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 280 kB ( 0%) ggc tree PRE : 2.02 ( 1%) usr 1.01 (13%) sys 3.02 ( 1%) wall 2092 kB ( 0%) ggc tree FRE : 2.62 ( 1%) usr 1.12 (14%) sys 3.75 ( 2%) wall 1674 kB ( 0%) ggc tree code sinking : 0.15 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 0%) wall 1126 kB ( 0%) ggc tree linearize phis : 0.07 ( 0%) usr 0.01 ( 0%) sys 0.09 ( 0%) wall 4 kB ( 0%) ggc tree forward propagate: 0.19 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 0%) wall 75 kB ( 0%) ggc tree phiprop : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree conservative DCE : 0.61 ( 0%) usr 0.20 ( 2%) sys 0.91 ( 0%) wall 5 kB ( 0%) ggc tree aggressive DCE : 0.65 ( 0%) usr 0.09 ( 1%) sys 0.80 ( 0%) wall 1486 kB ( 0%) ggc tree buildin call DCE : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc tree DSE : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 0%) wall 34 kB ( 0%) ggc PHI merge : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 26 kB ( 0%) ggc tree loop bounds : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 17 kB ( 0%) ggc tree loop invariant motion: 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 4 kB ( 0%) ggc scev constant prop : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 112 kB ( 0%) ggc complete unrolling : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall 379 kB ( 0%) ggc tree iv optimization : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 473 kB ( 0%) ggc tree loop init : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall 390 kB ( 0%) ggc tree copy headers : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 330 kB ( 0%) ggc tree SSA uncprop : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc tree rename SSA copies: 0.31 ( 0%) usr 0.00 ( 0%) sys 0.32 ( 0%) wall 0 kB ( 0%) ggc dominance frontiers : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc dominance computation : 1.26 ( 1%) usr 0.04 ( 0%) sys 1.31 ( 1%) wall 0 kB ( 0%) ggc control dependences : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc expand : 10.92 ( 5%) usr 0.70 ( 9%) sys 11.74 ( 5%) wall 124907 kB (16%) ggc forward prop : 2.08 ( 1%) usr 0.04 ( 0%) sys 2.06 ( 1%) wall 3631 kB ( 0%) ggc CSE : 3.19 ( 1%) usr 0.00 ( 0%) sys 3.14 ( 1%) wall 512 kB ( 0%) ggc dead code elimination : 1.24 ( 1%) usr 0.01 ( 0%) sys 1.25 ( 1%) wall 0 kB ( 0%) ggc dead store elim1 : 1.26 ( 1%) usr 0.04 ( 0%) sys 1.33 ( 1%) wall 965 kB ( 0%) ggc dead store elim2 : 1.51 ( 1%) usr 0.02 ( 0%) sys 1.54 ( 1%) wall 7466 kB ( 1%) ggc loop analysis : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 229 kB ( 0%) ggc loop invariant motion : 0.10 ( 0%) usr 0.03 ( 0%) sys 0.13 ( 0%) wall 2 kB ( 0%) ggc CPROP : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 452 kB ( 0%) ggc PRE : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall 15 kB ( 0%) ggc CSE 2 : 3.20 ( 1%) usr 0.00 ( 0%) sys 3.22 ( 1%) wall 494 kB ( 0%) ggc branch prediction : 0.28 ( 0%) usr 0.00 ( 0%) sys 0.30 ( 0%) wall 2539 kB ( 0%) ggc combiner : 1.42 ( 1%) usr 0.02 ( 0%) sys 1.42 ( 1%) wall 11673 kB ( 1%) ggc if-conversion : 0.19 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 338 kB ( 0%) ggc regmove : 0.39 ( 0%) usr 0.00 ( 0%) sys 0.35 ( 0%) wall 0 kB ( 0%) ggc integrated RA : 78.17 (37%) usr 0.46 ( 6%) sys 78.94 (34%) wall 4788 kB ( 1%) ggc reload : 25.79 (12%) usr 0.07 ( 1%) sys 26.09 (11%) wall 26499 kB ( 3%) ggc reload CSE regs : 3.00 ( 1%) usr 0.05 ( 1%) sys 3.08 ( 1%) wall 15550 kB ( 2%) ggc thread pro- & epilogue: 1.39 ( 1%) usr 0.01 ( 0%) sys 1.48 ( 1%) wall 1006 kB ( 0%) ggc if-conversion 2 : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall 138 kB ( 0%) ggc combine stack adjustments: 0.08 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc peephole 2 : 0.67 ( 0%) usr 0.01 ( 0%) sys 0.69 ( 0%) wall 3275 kB ( 0%) ggc hard reg cprop : 1.33 ( 1%) usr 0.02 ( 0%) sys 1.32 ( 1%) wall 979 kB ( 0%) ggc scheduling 2 : 6.59 ( 3%) usr 0.05 ( 1%) sys 6.63 ( 3%) wall 294 kB ( 0%) ggc machine dep reorg : 0.69 ( 0%) usr 0.01 ( 0%) sys 0.68 ( 0%) wall 10 kB ( 0%) ggc reorder blocks : 0.34 ( 0%) usr 0.00 ( 0%) sys 0.39 ( 0%) wall 2878 kB ( 0%) ggc final : 1.63 ( 1%) usr 0.04 ( 0%) sys 1.64 ( 1%) wall 201 kB ( 0%) ggc tree if-combine : 0.01 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 1 kB ( 0%) ggc plugin execution : 0.00 ( 0%) usr 0.03 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 214.12 8.02 229.70 793292 kB At -O1 we seem to never finish early inlining into testsuite though... (the loop over all callees calling check-inline-limits which again loops over all callees looks quadratic - at least because we do not not consider the duplicates in the caller? We seem to be looping over edges but check limits for !one_only. Why do we not consider edges individually and avoid calling cgraph_check_inline_limits with !one_only at all?). Finally it finished. -O1: inline heuristics : 810.76 (63%) usr 4.58 (24%) sys 957.16 (53%) wall 0 kB ( 0%) ggc integrated RA : 115.69 ( 9%) usr 3.86 (20%) sys 290.09 (16%) wall 5979 kB ( 1%) ggc reload : 312.19 (24%) usr 4.99 (26%) sys 453.32 (25%) wall 5139 kB ( 1%) ggc TOTAL :1290.20 19.28 1807.46 486943 kB ugh. The whole early-inlining stuff is made ugly because we jump through hoops to handle callgraph cycles where some callees may not yet be in SSA form. If we do not want to go the route to go into SSA for all functions before starting early IPA passes then we at least should do so for callgraph SCCs. Subject: Re: [4.3/4.4/4.5 Regression] gcc 4.3.1 cannot compile big function
> The whole early-inlining stuff is made ugly because we jump through hoops
> to handle callgraph cycles where some callees may not yet be in SSA form.
Hmm, it don't seem to me that keeping the reverse postorder is bad idea
here (not that going to SSA earlier would not be good idea for different
reasons). Still top-down inliner needs to deal with cycles one way or
another. I will take a look what is taking quadratic time.
I have a fix for the inline-heuristic slowness. With the patch: inline heuristics : 0.77 ( 0%) usr 0.01 ( 0%) sys 1.37 ( 0%) wall 0 kB ( 0%) ggc expand : 9.42 ( 2%) usr 0.70 (11%) sys 10.36 ( 2%) wall 138780 kB (29%) ggc integrated RA : 115.01 (24%) usr 1.33 (21%) sys 117.04 (24%) wall 5975 kB ( 1%) ggc reload : 311.26 (65%) usr 0.44 ( 7%) sys 312.77 (64%) wall 5139 kB ( 1%) ggc TOTAL : 478.22 6.28 491.79 486545 kB But it miscompares. Honza?! Index: gcc/ipa-inline.c =================================================================== *** gcc/ipa-inline.c (revision 156321) --- gcc/ipa-inline.c (working copy) *************** cgraph_decide_inlining_incrementally (st *** 1510,1515 **** --- 1510,1517 ---- /* Never inline regular functions into always-inline functions during incremental inlining. */ && !node->local.disregard_inline_limits) + { + bitmap visited = BITMAP_ALLOC (NULL); for (e = node->callees; e; e = e->next_callee) { int allowed_growth = 0; *************** cgraph_decide_inlining_incrementally (st *** 1517,1522 **** --- 1519,1528 ---- || !e->inline_failed || e->callee->local.disregard_inline_limits) continue; + /* We are inlining a function to all call-sites in node + or to none. So visit each candidate only once. */ + if (!bitmap_set_bit (visited, e->callee->uid)) + continue; if (dump_file) fprintf (dump_file, "Considering inline candidate %s.\n", cgraph_node_name (e->callee)); *************** cgraph_decide_inlining_incrementally (st *** 1601,1606 **** --- 1607,1614 ---- if (cgraph_default_inline_p (e->callee, &failed_reason)) inlined |= try_inline (e, mode, depth); } + BITMAP_FREE (visited); + } node->aux = (void *)(size_t) old_mode; return inlined; } Nope, bootstrap is generally broken. Subject: Bug 37448 Author: rguenth Date: Fri Jan 29 11:26:27 2010 New Revision: 156343 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=156343 Log: 2010-01-29 Richard Guenther <rguenther@suse.de> PR middle-end/37448 * ipa-inline.c (cgraph_decide_inlining_incrementally): Avoid quadratic behavior in most cases. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline.c Subject: Re: [4.3/4.4/4.5 Regression] gcc 4.3.1 cannot compile big function
> But it miscompares. Honza?!
Patch seems sane to me, so I guess it was other bootstrap issue.
I was actually working on similar fix yesterday just in more general
setting. Similar issue can be triggered with main inliner too, that
might provoke need to later make this caching global, but lets wait for
testcase since there might be other issues if you get that many inline
candidates and inlined calls.
Thanks for looking into this,
Honza
GCC 4.3.5 is being released, adjusting target milestone. PR47344 now tracks the regression property of this bug. Now at -O2 the main inliner seems to degenerate ;( 168767 78.4974 cc1 cgraph_edge_badness 9426 4.3842 cc1 update_edge_key 3529 1.6414 cc1 update_caller_keys 2498 1.1619 cc1 bitmap_set_bit 1776 0.8261 opreport /usr/bin/opreport 1212 0.5637 cc1 bitmap_elt_insert_after 1166 0.5423 oprofiled /usr/bin/oprofiled 1106 0.5144 libc-2.11.1.so _IO_vfscanf 965 0.4488 cc1 bitmap_ior_into and later IRA samples % app name symbol name 25826 56.8667 cc1 allocno_spill_priority_compare 6812 14.9994 cc1 ira_build_conflicts 4308 9.4859 cc1 splay_tree_splay 2847 6.2689 cc1 color_pass 1661 3.6574 cc1 push_allocno_to_stack 1336 2.9418 cc1 build_object_conflicts 468 1.0305 cc1 bitmap_bit_p 443 0.9754 cc1 splay_tree_insert df reaching defs : 1.65 ( 1%) usr 0.42 (10%) sys 2.10 ( 1%) wall 0 kB ( 0%) ggc df live regs : 4.27 ( 2%) usr 0.00 ( 0%) sys 4.24 ( 2%) wall 0 kB ( 0%) ggc df live&initialized regs: 1.31 ( 1%) usr 0.01 ( 0%) sys 1.32 ( 1%) wall 0 kB ( 0%) ggc df use-def / def-use chains: 0.72 ( 0%) usr 0.00 ( 0%) sys 0.72 ( 0%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 4.32 ( 2%) usr 0.00 ( 0%) sys 4.31 ( 2%) wall 8508 kB ( 1%) ggc register information : 2.05 ( 1%) usr 0.00 ( 0%) sys 2.02 ( 1%) wall 0 kB ( 0%) ggc 362 0.7971 cc1 splay_tree_remove 193 0.4250 cc1 splay_tree_free 167 0.3677 cc1 add_copies inline heuristics : 81.88 (35%) usr 0.04 ( 1%) sys 82.07 (35%) wall 44238 kB ( 5%) ggc CSE : 4.09 ( 2%) usr 0.00 ( 0%) sys 4.12 ( 2%) wall 138 kB ( 0%) ggc CSE 2 : 4.13 ( 2%) usr 0.00 ( 0%) sys 4.12 ( 2%) wall 125 kB ( 0%) ggc integrated RA : 50.18 (22%) usr 0.18 ( 4%) sys 50.48 (21%) wall 7472 kB ( 1%) ggc reload : 9.21 ( 4%) usr 0.00 ( 0%) sys 9.25 ( 4%) wall 12180 kB ( 1%) ggc scheduling 2 : 3.81 ( 2%) usr 0.00 ( 0%) sys 3.79 ( 2%) wall 356 kB ( 0%) ggc So mostly IRA and main inliner. I am not sure how much I can do for main inliner except for adding some hard limit on how much work we put into updating the priority queue and ignoring inconsistencies when it is met... Honza (In reply to comment #15) > We seem to make nice progress on the testcase ;) I wonder where things stand with trunk today.. Created attachment 28534 [details]
memory and time statistics for compiling lgwam.c
With today's mainline
leibniz-252% /tmp/lucier/mainline/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/tmp/lucier/mainline/bin/gcc
COLLECT_LTO_WRAPPER=/tmp/lucier/mainline/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../../gcc/configure --prefix=/tmp/lucier/mainline --enable-checking=release --enable-languages=c --disable-multilib
Thread model: posix
gcc version 4.8.0 20121026 (experimental) (GCC)
and the command:
leibniz-270% /tmp/lucier/mainline/bin/gcc -O3 -c lgwam.c -fmem-report -ftime-report > & ! lgwam.c.stat-O3
These are the statistics.
Brad
Hmm, good timming. I just started looking into re-tunning inliner heuristics for 4.8 so I should look again into this inline bomb... Looks like inliner did not get faster ;) The problem here is walking of callgraph edges to sum the code size of the caller function in both early inliner and late inliner. I am still not very keen on making the cost calucuations incremental, but I will see if keeping the sum of all edges makes sense here... Honza (In reply to comment #47) > Hmm, good timming. I just started looking into re-tunning inliner > heuristics for 4.8 so I should look again into this inline bomb... > Looks like inliner did not get faster ;) While everything else did get faster. Much of the work from PR54146 also helps for this PR. But the inliner heuristics are even more insane than this test case. What are the reasons for your aversion against incrementally updating the inliner heuristics? Note this is a regression. See comment #42 ... Even at -O1 the inliner degenerates (and var-tracking gives up if you enable -g) ipa inlining heuristics : 260.14 (84%) usr 0.50 (23%) sys 260.91 (84%) wall 102217 kB (10%) ggc I have a simple patch :P Author: rguenth Date: Mon Feb 22 09:32:35 2016 New Revision: 233598 URL: https://gcc.gnu.org/viewcvs?rev=233598&root=gcc&view=rev Log: 2016-02-22 Richard Biener <rguenther@suse.de> PR ipa/37448 * ipa-inline-transform.c (inline_call): When not updating overall summaries adjust self size by the growth estimate. * ipa-inline.c (inline_to_all_callers_1): Add to the callers hash-set, do not update overall summaries here. Renamed from ... (inline_to_all_callers): ... this which is now wrapping the above and performing delayed overall summary update. (early_inline_small_functions): Delay updating of the overall summary. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline-transform.c trunk/gcc/ipa-inline.c Fixed for GCC 6. |