Created attachment 32182 [details]
ghc14232_3.hc.i.bz2 - compressed source file (30 MB uncompressed)

GHC (the Glasgow Haskell Compiler) in its portable mode generates intermediate C source files. Sometimes they are really large: 20-100 MB. Even on a Core i7, building such a file takes 3-5 minutes. On slower boxes/arches things are worse and go up to 20-60 minutes.

clang does not seem to have such problems [1].

The gcc version should not be very relevant. The timings are for [2].

If the problem is hard to fix in gcc, what options would you suggest enabling to get sane build times? -fno-unit-at-a-time does not seem to help.

Thanks!

[1]:
[sf] /tmp/__z:time gcc -O0 -Wno-ignored-attributes -c ghc14232_3.hc.i -o gcc.o
real 5m19.975s
user 5m18.403s
sys 0m0.629s

[sf] /tmp/__z:time gcc -O1 -Wno-ignored-attributes -c ghc14232_3.hc.i -o gcc.o
real 3m0.557s
user 2m58.496s
sys 0m0.623s

[sf] /tmp/__z:time gcc -O2 -Wno-ignored-attributes -c ghc14232_3.hc.i -o gcc.o
real 3m21.315s
user 3m19.691s
sys 0m0.550s

[sf] /tmp/__z:time clang -O0 -Wno-ignored-attributes -c ghc14232_3.hc.i -o clang.o
real 0m19.661s
user 0m19.356s
sys 0m0.234s

[sf] /tmp/__z:time clang -O1 -Wno-ignored-attributes -c ghc14232_3.hc.i -o clang.o
real 0m49.612s
user 0m49.145s
sys 0m0.295s

[sf] /tmp/__z:time clang -O2 -Wno-ignored-attributes -c ghc14232_3.hc.i -o clang.o
real 0m48.991s
user 0m48.539s
sys 0m0.278s

[2]:
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.2/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.8.2/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /subvolumes/var_tmp/portage/sys-devel/gcc-4.8.2-r1/work/gcc-4.8.2/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.2 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.2/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.2 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.2/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.2/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.2/include/g++-v4 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.8.2/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.8.2-r1 p1.4-ssptest, pie-0.5.9-ssptest' --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp --enable-lto --without-cloog
Thread model: posix
gcc version 4.8.2 (Gentoo 4.8.2-r1 p1.4-ssptest, pie-0.5.9-ssptest)
With -ftime-report -O1, the lines that stand out are:

 phase opt and generate : 175.81 (98%) usr   4.61 (51%) sys 180.95 (95%) wall 1833542 kB (86%) ggc
 expand                 :  20.05 (11%) usr   0.71 ( 8%) sys  20.29 (11%) wall  171351 kB ( 8%) ggc
 remove unused locals   : 116.72 (65%) usr   0.90 (10%) sys 117.77 (62%) wall       0 kB ( 0%) ggc
It looks like clang ignores all __attribute__((aligned (8))):

 % clang -c -O2 ghc14232_3.hc.i 2>&1 | grep Wignored-attributes | wc -l
16710
 %

Using -flto=4 with gcc-4.9 almost halves the compile time on my machine (ancient AMD Phenom 4 core):

From: 5:27.73 total
To  : 2:23.85 total
(clang takes 1:22.60 total)
Perf shows:

 24.26% cc1 cc1 [.] bitmap_set_bit(bitmap_head*, int)
 20.88% cc1 cc1 [.] mark_all_vars_used_1(tree_node**, int*, void*)
 14.18% cc1 cc1 [.] operand_equal_p(tree_node const*, tree_node const*,...
  9.15% cc1 cc1 [.] mem_attrs_eq_p(mem_attrs const*, mem_attrs const*)
  4.17% cc1 cc1 [.] walk_tree_1(tree_node**, tree_node* (*)(tree_node**,...
  1.69% cc1 cc1 [.] tree_block(tree_node*)
Confirmed.

(In reply to Markus Trippelsdorf from comment #3)
> Perf shows:
>
> 24.26% cc1 cc1 [.] bitmap_set_bit(bitmap_head*, int)
> 20.88% cc1 cc1 [.] mark_all_vars_used_1(tree_node**, int*, void*)

  /* Only need to mark VAR_DECLS; parameters and return results are not
     eliminated as unused.  */
  if (TREE_CODE (t) == VAR_DECL)
    {
      /* When a global var becomes used for the first time also walk its
         initializer (non global ones don't have any).  */
      if (set_is_used (t) && is_global_var (t))
        mark_all_vars_used (&DECL_INITIAL (t));

Not sure why we do that... (we've had such a compile-time hog in the former referenced-vars tracking as well). That's quadratic if you refer to a large global from all functions.

> 14.18% cc1 cc1 [.] operand_equal_p(tree_node const*, tree_node const*,...
> 9.15% cc1 cc1 [.] mem_attrs_eq_p(mem_attrs const*, mem_attrs const*)
> 4.17% cc1 cc1 [.] walk_tree_1(tree_node**, tree_node*
> (*)(tree_node**,...
> 1.69% cc1 cc1 [.] tree_block(tree_node*)
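To make the cost concrete, here is a toy model in C (not GCC code) of the behaviour described above. It assumes that the "used" marker is reset for every function, so each function that mentions the global walks its whole initializer once. The names toy_global, toy_set_is_used and toy_nodes_walked are made up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a global variable with a large initializer.  */
struct toy_global {
    bool used;         /* per-function "used" marker (assumption: reset per function) */
    size_t init_size;  /* number of nodes in the initializer */
};

/* Mark g as used; return true only when the marker flips from unset to set.  */
static bool toy_set_is_used(struct toy_global *g)
{
    if (g->used)
        return false;
    g->used = true;
    return true;
}

/* Simulate the marking pass over n_functions functions that each mention
   the same global refs_per_function times.  Returns the total number of
   initializer nodes visited.  */
size_t toy_nodes_walked(size_t n_functions, size_t refs_per_function,
                        size_t init_size, bool walk_initializer)
{
    struct toy_global g = { false, init_size };
    size_t walked = 0;

    assert(init_size > 0);
    for (size_t f = 0; f < n_functions; f++) {
        g.used = false;  /* a new function starts with a clean marker */
        for (size_t r = 0; r < refs_per_function; r++) {
            /* Only the first reference in each function triggers a walk.  */
            if (toy_set_is_used(&g) && walk_initializer)
                walked += g.init_size;
        }
    }
    return walked;
}
```

In this model, toy_nodes_walked (1000, 3, 5000, true) visits 1000 * 5000 initializer nodes, i.e. the work grows with the number of functions times the initializer size, while passing false for walk_initializer (not walking the initializer at all) visits none.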
remove_unused_locals is called from at least cfgcleanup-post-optimizing at -O0.

At -O0 I have (trunk)

 expand                 : 481.98 (94%) usr   1.15 (17%) sys 481.94 (93%) wall  293891 kB (15%) ggc
 TOTAL                  : 512.44             6.86           519.82            2023628 kB

so it's not that particular spot at -O0. At -O1 it becomes dominant though:

 expand                 :  36.83 (18%) usr   0.78 ( 8%) sys  37.18 (17%) wall  177246 kB ( 8%) ggc
 remove unused locals   : 122.09 (58%) usr   0.79 ( 8%) sys 122.80 (56%) wall       0 kB ( 0%) ggc
 TOTAL                  : 210.18             9.30           219.40            2258487 kB

fixed by removing that walking of DECL_INITIAL. I'm sure somebody put thought into it but I can't think of any case that would break with not doing that walking.

-O1 without that:

 phase parsing          :   5.50 ( 6%) usr   4.49 (48%) sys   9.98 (11%) wall  294905 kB (13%) ggc
 expand                 :  34.99 (41%) usr   0.92 (10%) sys  36.44 (38%) wall  177246 kB ( 8%) ggc
 integrated RA          :   4.65 ( 5%) usr   0.29 ( 3%) sys   4.60 ( 5%) wall  478130 kB (21%) ggc
 TOTAL                  :  85.57             9.39            94.92            2258487 kB

so probably the very same issue as with -O0.
For reference (in testing)

Index: gcc/tree-ssa-live.c
===================================================================
--- gcc/tree-ssa-live.c (revision 207960)
+++ gcc/tree-ssa-live.c (working copy)
@@ -432,12 +432,7 @@ mark_all_vars_used_1 (tree *tp, int *wal
   /* Only need to mark VAR_DECLS; parameters and return results are not
      eliminated as unused.  */
   if (TREE_CODE (t) == VAR_DECL)
-    {
-      /* When a global var becomes used for the first time also walk its
-         initializer (non global ones don't have any).  */
-      if (set_is_used (t) && is_global_var (t))
-        mark_all_vars_used (&DECL_INITIAL (t));
-    }
+    set_is_used (t);
   /* remove_unused_scope_block_p requires information about labels
      which are not DECL_IGNORED_P to tell if they might be used in the IL.  */
   else if (TREE_CODE (t) == LABEL_DECL)

fastest is -Og:

 expand                 :  30.56 (42%) usr   0.74 (10%) sys  30.15 (38%) wall  165633 kB ( 8%) ggc
 TOTAL                  :  72.66             7.70            80.33            2047648 kB

throwing that to callgrind now (perf sucks - no backtraces :/)
(In reply to Richard Biener from comment #6)
> For reference (in testing)

Looks promising:

Without LTO: 2:27.39 total
With LTO: 35.485 total (60% faster than clang)

> throwing that to callgrind now (perf sucks - no backtraces :/)

There is "-g" for "perf record" and "-g -G" for "perf report".
Won't that break with function-local statics? Those can certainly refer to other function-local statics; with this patch gcc might think the other ones are unused.
I mean something like:

int **
foo (void)
{
  static int a = 0;
  static int *b = &a;
  static int **c = &b;
  return c;
}
Ok, so the rest is hash collisions in the mem-attrs hash. We also do useless work here:

    case MEM_REF:
      {
        addr_space_t as
          = TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (TREE_OPERAND (exp, 0))));
...
        set_mem_attributes (temp, exp, 0);
        set_mem_addr_space (temp, as);

set_mem_attributes already handles MEM_REFs properly:

      /* Address-space information is on the base object.  */
      if (TREE_CODE (base) == MEM_REF
          || TREE_CODE (base) == TARGET_MEM_REF)
        as = TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (TREE_OPERAND (base, 0))));

so removing that would cut expand time in half. Oh, and expand_debug_expr even gets addr-spaces wrong (in cfgexpand.c).

We have 1107000 calls to set_mem_attrs but 151459000 calls to mem_attrs_htab_eq. The hash function is somewhat ad-hoc apart from hashing MEM_EXPR via iterative_hash_expr. We hash MEM_REFs via hashing their operands recursively but we compare them more strictly, factoring in properties of the types involved. I don't see an easy way to make the hashing stronger (given PR58521 and its fix).

There is always the old idea of simply removing that mem_attr sharing completely ...

At -O0 we could simply always force a NULL_TREE MEM_EXPR, for example.
(In reply to Jakub Jelinek from comment #8)
> Won't that break with function-local statics? Those can certainly refer to
> other function-local static, with this patch gcc might think the other ones
> are unused. I mean something like:
>
> int **
> foo (void)
> {
>   static int a = 0;
>   static int *b = &a;
>   static int **c = &b;
>   return c;
> }

int **
foo (void)
{
  static int a = 0;
  static int *b = &a;
  static int **c = &b;
  return c;
}
int main()
{
  return **foo();
}

I can still step into foo() and print a, b and c when compiling with -O -fno-inline -g (I cannot seem to step into foo with inlining enabled, but that doesn't work without the patch either - didn't expect that because we cannot preserve DECL_INITIAL in the inlined blocks).
With the redundant set_mem_addr_space removed, -Og now takes

 expand                 :  19.21 (31%) usr   0.60 ( 8%) sys  20.06 (29%) wall  165633 kB ( 8%) ggc
 TOTAL                  :  61.60             7.58            69.16            2047648 kB

collisions are like

MEM[(StgWord *)_5 + -64B] != MEM[(StgWord *)_5 + -64B]
MEM[(StgWord *)_5 + -64B] != MEM[(StgWord *)_5 + -64B]
MEM[(StgWord *)_5 + -64B] != MEM[(StgWord *)_5 + -64B]
MEM[(StgWord *)_20 + 24B] != MEM[(StgWord *)_5 + -64B]

investigating ... (clearly not having a recorded hash to compare that quickly with the hashtab collision handling makes things worse here).

Ouch. This mem-attr hashtable is _global_! The above _5 are different SSA name objects (from different functions).

The hash is global because we also have DECL_RTL for global variables, so we can't really clear it (well, we could - we'd just lose mem-attr sharing at that point).

Clearing the mem-attrs htab in rest_of_clean_state () gets us to

 phase parsing          :   5.33 (13%) usr   4.41 (58%) sys   9.74 (19%) wall  294905 kB (14%) ggc
 tree gimplify          :   2.05 ( 5%) usr   0.20 ( 3%) sys   2.05 ( 4%) wall  252760 kB (12%) ggc
 tree CCP               :   1.25 ( 3%) usr   0.13 ( 2%) sys   1.13 ( 2%) wall   57081 kB ( 3%) ggc
 expand                 :   1.50 ( 4%) usr   0.13 ( 2%) sys   1.63 ( 3%) wall  169767 kB ( 8%) ggc
 CSE                    :   1.29 ( 3%) usr   0.12 ( 2%) sys   1.05 ( 2%) wall   13532 kB ( 1%) ggc
 combiner               :   2.06 ( 5%) usr   0.15 ( 2%) sys   2.10 ( 4%) wall   11785 kB ( 1%) ggc
 integrated RA          :   4.51 (11%) usr   0.24 ( 3%) sys   4.30 ( 9%) wall  427273 kB (21%) ggc
 LRA non-specific       :   1.26 ( 3%) usr   0.09 ( 1%) sys   1.40 ( 3%) wall    6517 kB ( 0%) ggc
 reload CSE regs        :   1.19 ( 3%) usr   0.06 ( 1%) sys   1.40 ( 3%) wall   13638 kB ( 1%) ggc
 rest of compilation    :   2.31 ( 5%) usr   0.18 ( 2%) sys   2.20 ( 4%) wall   40076 kB ( 2%) ggc
 TOTAL                  :  42.32             7.66            50.19            2052932 kB
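The effect of a global table whose entries hash alike but compare unequal across functions can be sketched with a toy chained hash table in C. This is not the emit-rtl.c code: it assumes, purely for illustration, that the hash sees only the SSA version number and the offset, while equality also distinguishes which function the SSA name belongs to. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_BUCKETS 1024u

/* Toy stand-in for a memory reference such as MEM[(StgWord *)_5 + -64B].  */
struct toy_ref {
    unsigned function_id;  /* function owning the SSA name (not hashed) */
    unsigned ssa_version;  /* e.g. 5 for _5 */
    long offset;           /* e.g. -64 */
    struct toy_ref *next;  /* bucket chain */
};

/* Assumed weak hash: identical for _5 + -64 in every function.  */
static unsigned toy_hash(unsigned ssa_version, long offset)
{
    return (ssa_version * 31u + (unsigned) offset) % TOY_BUCKETS;
}

/* Free every entry and leave all buckets empty.  */
static void toy_clear(struct toy_ref **table)
{
    for (unsigned b = 0; b < TOY_BUCKETS; b++) {
        while (table[b]) {
            struct toy_ref *dead = table[b];
            table[b] = dead->next;
            free(dead);
        }
    }
}

/* Insert refs_per_function references for each of n_functions functions
   and return how many equality comparisons the lookups needed.  */
size_t toy_count_compares(unsigned n_functions, unsigned refs_per_function,
                          bool clear_per_function)
{
    struct toy_ref *table[TOY_BUCKETS] = { 0 };
    size_t n_compares = 0;

    for (unsigned fn = 0; fn < n_functions; fn++) {
        for (unsigned ver = 0; ver < refs_per_function; ver++) {
            long off = -64;
            unsigned b = toy_hash(ver, off);
            struct toy_ref *r = table[b];

            /* Walk the chain; every stale entry from another function
               costs one strict comparison that fails.  */
            while (r) {
                n_compares++;
                if (r->function_id == fn && r->ssa_version == ver
                    && r->offset == off)
                    break;
                r = r->next;
            }
            if (!r) {
                r = malloc(sizeof *r);
                assert(r != NULL);
                r->function_id = fn;
                r->ssa_version = ver;
                r->offset = off;
                r->next = table[b];
                table[b] = r;
            }
        }
        if (clear_per_function)
            toy_clear(table);
    }
    toy_clear(table);
    return n_compares;
}
```

With 100 functions and 10 references each, the never-cleared table needs 10 * (0 + 1 + ... + 99) = 49500 failed comparisons in this model, and clearing after every function needs none. The real table holds more kinds of entries, so the numbers are only meant to show the shape of the growth.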
-O0 time with both patches

 phase parsing          :   6.34 (17%) usr   5.18 (71%) sys  11.53 (25%) wall  294905 kB (15%) ggc
 tree gimplify          :   2.17 ( 6%) usr   0.19 ( 3%) sys   2.40 ( 5%) wall  323021 kB (16%) ggc
 expand                 :   2.66 ( 7%) usr   0.13 ( 2%) sys   2.64 ( 6%) wall  262544 kB (13%) ggc
 integrated RA          :   8.86 (24%) usr   0.32 ( 4%) sys   8.48 (18%) wall  482040 kB (24%) ggc
 TOTAL                  :  37.29             7.34            46.18            1992937 kB
GCC 4.3 needs 22.5s at -O0.

GCC 4.7 doesn't exhibit the unused-vars-remove slowness:

 expand                 :  46.53 (42%) usr   0.96 ( 9%) sys  47.17 (39%) wall  173344 kB ( 7%) ggc
 remove unused locals   :   0.56 ( 1%) usr   0.07 ( 1%) sys   0.48 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                  : 110.45            10.58           121.70            2343278 kB
Author: rguenth
Date: Fri Feb 21 13:14:23 2014
New Revision: 207991

URL: http://gcc.gnu.org/viewcvs?rev=207991&root=gcc&view=rev
Log:
2014-02-21 Richard Biener <rguenther@suse.de>

	PR middle-end/60291
	* tree-ssa-live.c (mark_all_vars_used_1): Do not walk
	DECL_INITIAL for globals not in the current function context.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-live.c
And with http://gcc.gnu.org/ml/gcc-patches/2014-02/msg01314.html we get at -O0

 expand                 :   1.59 ( 5%) usr   0.05 ( 1%) sys   1.72 ( 5%) wall  261220 kB (13%) ggc
 TOTAL                  :  29.96             6.17            36.12            1991207 kB

and -Og and -O1.

 expand                 :   1.03 ( 2%) usr   0.16 ( 2%) sys   1.44 ( 2%) wall  196031 kB ( 9%) ggc
 TOTAL                  :  49.82             9.01            58.80            2280015 kB
Fixed on trunk so far.
Author: rguenth
Date: Tue Feb 25 08:59:10 2014
New Revision: 208113

URL: http://gcc.gnu.org/viewcvs?rev=208113&root=gcc&view=rev
Log:
2014-02-25 Richard Biener <rguenther@suse.de>

	PR middle-end/60291
	* emit-rtl.c (mem_attrs_htab): Remove.
	(mem_attrs_htab_hash): Likewise.
	(mem_attrs_htab_eq): Likewise.
	(set_mem_attrs): Always allocate new mem-attrs when something
	changed.
	(init_emit_once): Do not allocate mem_attrs_htab.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/emit-rtl.c
Status now the same as 4.7 on the 4.8 branch (thus only the long-term regression against 4.4 remains).
Author: rguenth
Date: Tue Feb 25 10:47:21 2014
New Revision: 208118

URL: http://gcc.gnu.org/viewcvs?rev=208118&root=gcc&view=rev
Log:
2014-02-25 Richard Biener <rguenther@suse.de>

	Backport from mainline
	2014-02-21 Richard Biener <rguenther@suse.de>

	PR middle-end/60291
	* tree-ssa-live.c (mark_all_vars_used_1): Do not walk
	DECL_INITIAL for globals not in the current function context.

	2014-02-20 Richard Biener <rguenther@suse.de>

	PR middle-end/60221
	* tree-eh.c (execute_cleanup_eh_1): Also cleanup empty EH
	regions at -O0.

	2014-02-14 Richard Biener <rguenther@suse.de>

	PR tree-optimization/60183
	* tree-ssa-phiprop.c (propagate_with_phi): Avoid speculating
	loads.
	(tree_ssa_phiprop): Calculate and free post-dominators.

	* gcc.dg/torture/pr60183.c: New testcase.

Added:
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/torture/pr60183.c
Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_8-branch/gcc/tree-eh.c
    branches/gcc-4_8-branch/gcc/tree-ssa-live.c
    branches/gcc-4_8-branch/gcc/tree-ssa-phiprop.c
The 4.7 branch is being closed, moving target milestone to 4.8.4.
I think we can declare this fixed as far as possible on the 4.8 branch.
*** Bug 66682 has been marked as a duplicate of this bug. ***