I was trying to compile the file 'gap_TlnLv4.c' autogenerated by 'spiral' (to be attached). Compiling it with the command line

/mnt/sdb8/sergei/AFSWD_debug/install/gcc-4.3.3/binsh/gcc -O1 -fomit-frame-pointer -malign-double -fstrict-aliasing -march=native -c /tmp/spiral-sergei/gap_TlnLv4.c -o gap_TlnLv4.o

I got this:

/tmp/spiral-sergei/gap_TlnLv4.c: In function ‘RDFT_49152_1’:
/tmp/spiral-sergei/gap_TlnLv4.c:172282: internal compiler error: Segmentation fault
Please submit a full bug report,

Before that, compiling it with the command

/mnt/sdb8/sergei/AFSWD_debug/install/gcc-4.3.3/binsh/gcc -O2 -fomit-frame-pointer -malign-double -fstrict-aliasing -march=native -c /tmp/spiral-sergei/gap_TlnLv4.c

I got this:

cc1: out of memory allocating 4184025948 bytes after a total of 205533184 bytes
Created attachment 17377 [details] source file causing the failure The source is not preprocessed, it has only standard #include <math.h> in it.
My OS: Linux amdam2 2.6.22.19-0.2-default #1 SMP 2008-12-18 10:17:03 +0100 i686 athlon i386 GNU/Linux - in plain terms, SUSE 10.3; 'gcc' is a self-built 'gcc-4.3.3'.
There is no failure with -O0.
FWIW, 'gcc-3.4.6', when run as

/mnt/sdb8/sergei/AFSWD_debug/install/gcc-3.4.6/binsh/gcc -O1 -fomit-frame-pointer -malign-double -fstrict-aliasing -c /tmp/spiral-sergei/gap_TlnLv4.c -o gap_TlnLv4.o

fails with

cc1: out of memory allocating 182853324 bytes after a total of 14716928 bytes

rather than with a segmentation fault like 'gcc-4.3.3'.
Try -fno-move-loop-invariants. This is probably a killer on alias-improvements branch as well.
As this seems to be autogenerated from

! The SPL Program: (compose (sparse (coords (...12288 x 2 ...))(values (...1 x 12288 ...)))(compose (conjugate (..)(..))(compose (..)(..))))
! node size: 12288 X 12288

can you produce a testcase with an order of magnitude fewer loads/stores (thus likely just "node size: 1228 x 1228")? Thanks!
One problem seems to be that we blow up the stack during GC recursing following the VUSE -> def_stmt links.
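The usual remedy for this class of crash is to replace the recursion over the def links with an explicit iteration or worklist, so stack usage stays constant no matter how long the chain of statements is. A minimal generic sketch (plain C with hypothetical node names, not GCC's actual VUSE/def_stmt data structures):

```c
/* Generic sketch: walking a chain of back links recursively vs.
   iteratively.  The recursive walk needs stack depth proportional to
   the chain length and overflows on huge autogenerated functions;
   the iterative walk uses constant stack space. */
#include <assert.h>
#include <stdlib.h>

struct def_node {
  int id;
  struct def_node *def;   /* hypothetical stand-in for a VUSE -> def_stmt link */
};

/* Recursive walk: depth equals chain length. */
static int chain_len_rec (const struct def_node *n)
{
  if (!n)
    return 0;
  return 1 + chain_len_rec (n->def);
}

/* Iterative walk: constant stack usage regardless of chain length. */
static int chain_len_iter (const struct def_node *n)
{
  int len = 0;
  while (n)
    {
      len++;
      n = n->def;
    }
  return len;
}

/* Build a chain of N nodes for demonstration. */
static struct def_node *make_chain (int n)
{
  struct def_node *head = NULL;
  for (int i = 0; i < n; i++)
    {
      struct def_node *d = malloc (sizeof *d);
      d->id = i;
      d->def = head;
      head = d;
    }
  return head;
}
```

With a chain of a few million nodes the recursive version would overflow a default 8MB stack while the iterative one keeps working, which is the failure mode described above.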
Created attachment 17378 [details] a smaller file with hopefully the same pattern - the file name is gap_bzAJWH.c.gz .
(In reply to comment #6)
> As this seems to be autogenerated from
>
> ! The SPL Program: (compose (sparse (coords (...12288 x 2 ...))(values (...1 x
> 12288 ...)))(compose (conjugate (..)(..))(compose (..)(..))))
> ! node size: 12288 X 12288
>
> can you produce a testcase with an order of magnitude less loads/stores?
> (thus likely just "node size: 1228 x 1228")?
>
> Thanks!

I've uploaded a smaller file: gap_bzAJWH.c.gz. These are intermediate files which are deleted upon successful compilation, so I decreased the number of points and managed to "catch" this one. It is: "node size: 1536 X 1536".
I am not sure whether it's clear - the smaller 'gap_bzAJWH.c.gz' file can be found as http://gcc.gnu.org/bugzilla/attachment.cgi?id=17378&action=view attachment.
(In reply to comment #5)
> Try -fno-move-loop-invariants. This is probably a killer on alias-improvements
> branch as well.

Still a segfault:

gap_TlnLv4.c: In function ‘RDFT_49152_1’:
gap_TlnLv4.c:172282: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

real 33m52.152s
user 16m47.843s
sys 0m9.333s
[1] Exit 1 time /mnt/sdb8/sergei/AFSWD_debug/install/gcc-4.3.3/binsh/gcc -O1 -fomit-frame-pointer -malign-double -fstrict-aliasing -fno-move-loop-invariants -c gap_TlnLv4.c
The small testcase at -O2 is bound by PRE time

 PRE                  :  27.08 (55%) usr   0.02 ( 4%) sys  27.09 (55%) wall    1551 kB ( 2%) ggc

nothing interesting at -O1, and it still nicely fits into my memory. Which makes me ask for a little bigger testcase ;) (as I suspect there's quadratic behavior in memory usage somewhere, a size around 2000 * 2000 would be nice)

Thanks again.
And for trunk we have

 combiner             :  24.76 (15%) usr   0.80 (23%) sys  25.76 (16%) wall  448053 kB (82%) ggc
 integrated RA        :  68.41 (43%) usr   1.06 (31%) sys  70.23 (42%) wall    3197 kB ( 1%) ggc

as the worst offenders at -O1. I notice that with --enable-checking GC params the top memory usage is way lower than without, so something odd happens here as well (maybe a ggc_free wrapped inside ENABLE_CHECKING?). Most memory is used by combine and IRA.
'spiral' has produced another testcase which segfaults with -O2 - the original testcase segfaults with -O1.

The testcase, though it has half the points in terms of FFT, is big as a file:

-rw-r--r-- 1 sergei users 1656419 2009-03-03 06:36 gap_CPEodL.c.bz2

so I think I can't upload it directly, or can I? I can send it as an Email attachment to, say, rguenth.
Subject: Re: Segmentation fault with -O1, out of memory with -O2 On Tue, 3 Mar 2009, sergstesh at yahoo dot com wrote: > ------- Comment #14 from sergstesh at yahoo dot com 2009-03-03 13:36 ------- > 'spiral' has produced another testcase which segfaults with -O2 - the original > testcase segfaults with -O1. > > The testcase, though has half the points if terms of FFT, is big as a file: > > -rw-r--r-- 1 sergei users 1656419 2009-03-03 06:36 gap_CPEodL.c.bz2 > > , so I think I can't upload it directly, or can I ? > > I can send it as Email attachment to, say, rguenth. That would be fine. Richard.
(In reply to comment #15) > Subject: Re: Segmentation fault with -O1, out of memory > with -O2 > > On Tue, 3 Mar 2009, sergstesh at yahoo dot com wrote: > > > ------- Comment #14 from sergstesh at yahoo dot com 2009-03-03 13:36 ------- > > 'spiral' has produced another testcase which segfaults with -O2 - the original > > testcase segfaults with -O1. > > > > The testcase, though has half the points if terms of FFT, is big as a file: > > > > -rw-r--r-- 1 sergei users 1656419 2009-03-03 06:36 gap_CPEodL.c.bz2 > > > > , so I think I can't upload it directly, or can I ? > > > > I can send it as Email attachment to, say, rguenth. > > That would be fine. > > Richard. > I have just sent Richard an Email with 'gap_CPEodL.c.bz2' file attached.
Created attachment 17476 [details] the other testcase
Vlad, for the second testcase I see

-O0:
 expand               :   0.78 (19%) usr   0.04 ( 5%) sys   0.83 (16%) wall   44335 kB (49%) ggc
 integrated RA        :   1.05 (25%) usr   0.03 ( 4%) sys   1.13 (22%) wall    1036 kB ( 1%) ggc
 reload               :   0.55 (13%) usr   0.02 ( 3%) sys   0.59 (12%) wall    3585 kB ( 4%) ggc

-O1:
 integrated RA        :  73.95 (77%) usr   1.18 (50%) sys  76.20 (76%) wall    2898 kB ( 3%) ggc

(on the alias-improvements branch I see
 combiner             :  24.89 (17%) usr   0.51 (28%) sys  25.41 (17%) wall  448053 kB (82%) ggc
 integrated RA        :  69.81 (47%) usr   0.36 (20%) sys  70.22 (46%) wall    3212 kB ( 1%) ggc
as tree loop invariant motion does lots of stuff there)

-O2:
 PRE                  :  25.49 (26%) usr   0.03 ( 3%) sys  25.52 (26%) wall    3513 kB ( 3%) ggc
 integrated RA        :  36.98 (38%) usr   0.24 (21%) sys  37.31 (38%) wall    2158 kB ( 2%) ggc
For trunk -O1 I see

CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
368029   19.8218  allocno_spill_priority_compare
244104   13.1473  color_pass
205725   11.0802  build_allocno_conflicts
141509    7.6216  push_allocno_to_stack
106750    5.7495  assign_hard_reg
95558     5.1467  bitmap_bit_p
88785     4.7819  add_to_allocno_conflicts
51962     2.7986  splay_tree_splay
42764     2.3032  ira_build_conflicts
40718     2.1930  ira_flattening
27977     1.5068  canon_true_dependence
25608     1.3792  ira_allocno_live_ranges_intersect_p
20403     1.0989  nonoverlapping_memrefs_p
20005     1.0775  ira_add_allocno_conflict
Btw, it looks like the internal IRA data structures could be shrunk considerably on 64-bit platforms by avoiding padding between pointer and integer members.
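The padding issue can be illustrated with generic C (these are not the actual IRA structures): on an LP64 target, every pointer member that follows an int member forces 4 bytes of padding, which reordering the members by size recovers.

```c
/* Illustration of struct padding on LP64 targets.  Interleaving
   8-byte pointers with 4-byte ints inserts 4 bytes of padding before
   each pointer so it stays 8-byte aligned; grouping members by size
   removes the padding entirely. */
#include <assert.h>
#include <stddef.h>

struct interleaved {    /* 32 bytes on LP64 */
  void *p1;             /* offset 0 */
  int a;                /* offset 8, then 4 bytes padding */
  void *p2;             /* offset 16 */
  int b;                /* offset 24, then 4 bytes tail padding */
};

struct grouped {        /* 24 bytes on LP64 */
  void *p1;             /* offset 0 */
  void *p2;             /* offset 8 */
  int a;                /* offset 16 */
  int b;                /* offset 20 - no padding needed */
};
```

With millions of allocno/conflict records live at once, saving 8 bytes per record adds up to a real reduction in peak memory.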
IRA now is very different from IRA in early 2009. Some new timings for GCC 4.8 trunk would be nice. Perhaps this bug is already fixed.
4.7.2 -O0 25s 2189981kB
 integrated RA        :   8.96 (35%) usr   0.89 (28%) sys   9.89 (34%) wall  206439 kB (16%) ggc
 reload               :   2.98 (12%) usr   0.07 ( 2%) sys   3.05 (11%) wall   43197 kB ( 3%) ggc

4.8.0 -O0 29s 2111318kB
 integrated RA        :   8.88 (31%) usr   0.41 (13%) sys   9.32 (29%) wall  206439 kB (18%) ggc
 LRA non-specific     :   5.50 (19%) usr   0.08 ( 3%) sys   5.59 (17%) wall    4600 kB ( 0%) ggc

4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I suspect domwalk ...) >2.5GB

applying domwalk fix ...

4.8.0 -O1 still awfully slow

the smaller testcase is also tested on http://gcc.opensuse.org/c++bench/random/ but it runs OOM at -O1 and -O2 there (there is a virtual ulimit of 1GB due to lack of resources on the machine), but -O3 surprisingly works so you can see a recent time report there (also for -O0). At -O3 you can see there (everything > 10%):

 tree loop invariant motion :  53.83 (13%) usr
 PRE                        : 116.26 (29%) usr
 LRA hard reg assignment    :  73.68 (18%) usr
 load CSE after reload      :  38.67 (10%) usr

ISTR the testcases are not exactly exposing the same issues.
FYI, the original file ( http://gcc.gnu.org/bugzilla/attachment.cgi?id=17377 ) can be compiled with 'clang', albeit slowly:

sergei@amdam2:~/gcc_bug> time ~/AFSWD/install/LLVM-3.2/binsh/clang -v gap_TlnLv4.c -O2 -c
clang version 3.2 (tags/RELEASE_32/final)
Target: i386-pc-linux-gnu
Thread model: posix
 "/mnt/sdb5/qemu/20121021/LLVM-3.2/bin/clang" -cc1 -triple i386-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name gap_TlnLv4.c -mrelocation-model static -fmath-errno -masm-verbose -mconstructor-aliases -target-cpu pentium4 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file /home/sergei/gcc_bug/gap_TlnLv4.o -resource-dir /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2 -fmodule-cache-path /tmp/clang-module-cache -internal-isystem /usr/local/include -internal-isystem /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O2 -fdebug-compilation-dir /home/sergei/gcc_bug -ferror-limit 19 -fmessage-length 182 -mstackrealign -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o gap_TlnLv4.o -x c gap_TlnLv4.c
clang -cc1 version 3.2 based upon LLVM 3.2svn default target i386-pc-linux-gnu
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include
 /usr/include
End of search list.

real 58m50.364s
user 53m25.128s
sys 0m11.641s
sergei@amdam2:~/gcc_bug>

Memory consumption is about 186M.
(In reply to comment #22)
> 4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I
> suspect domwalk ...) >2.5GB
>
> applying domwalk fix ...

It is LIM, for sure. I've been watching in GDB for a while at some back traces, and it's spent minutes already in this DOM walk:

#5 0x0000000000b841e1 in walk_dominator_tree (walk_data=0x7fffffffdd60, bb=0x7fffef033958) at ../../trunk/gcc/domwalk.c:187
#6 0x0000000000c02d73 in determine_invariantness () at ../../trunk/gcc/tree-ssa-loop-im.c:1189
#7 tree_ssa_lim () at ../../trunk/gcc/tree-ssa-loop-im.c:2632
#8 0x000000000075bcd7 in execute_one_pass (pass=0x12323e0 <pass_lim>) at ../../trunk/gcc/passes.c:2330
#9 0x000000000075c0f5 in execute_pass_list (pass=0x12323e0 <pass_lim>) at ../../trunk/gcc/passes.c:2378

This is supposed to be cheap. Is this a known bottleneck?

Pathetic...
(In reply to comment #24)
> It is LIM, for sure. I've been watching in GDB for a while at some
> back traces, and it's spent minutes already in this DOM walk:

After this, it's spending even more time in refs_independent_p, doing bitmap tests (ah! a test case for my splay tree bitmaps!).

Is the refs_independent_p relation symmetric? It certainly looks that way. If so, one way to halve the work done here is to build only half the "independence" graph. Currently it builds a full square:

      if (mem_refs_may_alias_p (ref1->mem, ref2->mem,
                                &memory_accesses.ttae_cache))
        {
          bitmap_set_bit (ref1->dep_ref, ref2->id);
          bitmap_set_bit (ref2->dep_ref, ref1->id);
          if (dump_file && (dump_flags & TDF_DETAILS))
            fprintf (dump_file, "dependent.\n");
          return false;
        }
      else
        {
          bitmap_set_bit (ref1->indep_ref, ref2->id);
          bitmap_set_bit (ref2->indep_ref, ref1->id);
          if (dump_file && (dump_flags & TDF_DETAILS))
            fprintf (dump_file, "independent.\n");
          return true;
        }

It's hard to say how much memory is wasted here (obstack_memory_used is still broken and overflows), but it's probably x*100MB if not more.
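The "build only half the graph" idea above can be sketched generically: because the relation is symmetric, each unordered pair {i, j} needs only one cache entry, so a triangular matrix indexed by the canonicalized pair halves the storage. A minimal illustration in plain C (a dense triangular array with illustrative names, not GCC's per-ref bitmap implementation):

```c
/* Sketch of halving a symmetric dependence cache.  Since
   dep(a,b) == dep(b,a), canonicalize each query to (min,max) and keep
   one triangular matrix instead of setting two bits per pair. */
#include <assert.h>
#include <string.h>

#define NREFS 64

/* One byte per unordered pair: NREFS*(NREFS+1)/2 entries instead of
   NREFS*NREFS. */
static unsigned char dep_tri[NREFS * (NREFS + 1) / 2];

/* Index of the unordered pair {i, j} in the upper triangle:
   row i starts after rows 0..i-1, which hold (NREFS - k) entries
   each, i.e. at offset i*NREFS - i*(i-1)/2. */
static int tri_index (int i, int j)
{
  if (i > j)
    {
      int t = i; i = j; j = t;  /* canonicalize: i <= j */
    }
  return i * NREFS - i * (i - 1) / 2 + (j - i);
}

static void set_dependent (int i, int j)
{
  dep_tri[tri_index (i, j)] = 1;
}

static int is_dependent (int i, int j)
{
  return dep_tri[tri_index (i, j)];
}
```

One write per pair then serves queries in both orders, which is exactly the saving over setting both `ref1->dep_ref` and `ref2->dep_ref`.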
(In reply to comment #23) > FYI, the original file ( http://gcc.gnu.org/bugzilla/attachment.cgi?id=17377 ) > can be compiled with 'clang', albeit slowly: ... > Memory consumption is about 186M. How have you measured this?
Compilation finished after ~3 hours and consuming at least 3GB (from top - I forgot to use memmax2...).

Winners in the "geez, I'm slow for this test case" list:

 PRE                        : 6241.75 (57%) usr
 tree loop invariant motion : 4194.83 (38%) usr
 TOTAL                      : 11021.46

Time is mostly spent in the alias oracle.

For LIM I'm going to test using a triangle instead of a square dep/indep matrix, as a first step. Richi, PRE is yours so... ;-)
On Wed, 6 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > Steven Bosscher <steven at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |steven at gcc dot gnu.org > > --- Comment #24 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-06 23:39:27 UTC --- > (In reply to comment #22) > > 4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I > > suspect domwalk ...) >2.5GB > > > > applying domwalk fix ... > > It is LIM, for sure. I've been watching in GDB for a while at some > back traces, and it's spent minutes already in this DOM walk: > > #5 0x0000000000b841e1 in walk_dominator_tree (walk_data=0x7fffffffdd60, > bb=0x7fffef033958) at ../../trunk/gcc/domwalk.c:187 > #6 0x0000000000c02d73 in determine_invariantness () at > ../../trunk/gcc/tree-ssa-loop-im.c:1189 > #7 tree_ssa_lim () at ../../trunk/gcc/tree-ssa-loop-im.c:2632 > #8 0x000000000075bcd7 in execute_one_pass (pass=0x12323e0 <pass_lim>) at > ../../trunk/gcc/passes.c:2330 > #9 0x000000000075c0f5 in execute_pass_list (pass=0x12323e0 <pass_lim>) at > ../../trunk/gcc/passes.c:2378 > > This is supposed to be cheap. Is this a known bottle-neck? > > Pathetic... Yes, it's known - and there are several known (to me ...) ways to make constant factor compile-time and memory-usage improvements... (I _think_ we have a bug for LIMs slowness, if you can't find it quickly you can create one and assign me - I have some TLC patches locally queued for 4.9, but they don't help the slowness very much)
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > --- Comment #25 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 00:08:26 UTC --- > (In reply to comment #24) > > (In reply to comment #22) > > > 4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I > > > suspect domwalk ...) >2.5GB > > > > > > applying domwalk fix ... > > > > It is LIM, for sure. I've been watching in GDB for a while at some > > back traces, and it's spent minutes already in this DOM walk: > > After this, it's spending even more time in refs_independent_p, doing > bitmap tests (ah! a test case for my splay tree bitmaps!). > > Is the refs_independent_p relation symmetric? It certainly looks that > way. If so, one way to halve the work done here is to build only half > the "independence" graph. Currently it builds a full square: > > if (mem_refs_may_alias_p (ref1->mem, ref2->mem, > &memory_accesses.ttae_cache)) > { > bitmap_set_bit (ref1->dep_ref, ref2->id); > bitmap_set_bit (ref2->dep_ref, ref1->id); > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, "dependent.\n"); > return false; > } > else > { > bitmap_set_bit (ref1->indep_ref, ref2->id); > bitmap_set_bit (ref2->indep_ref, ref1->id); > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, "independent.\n"); > return true; > } > > It's hard to say how much memory is wasted here (obstack_memory_used > is still broken and overflows), but it's probably x*100MB if not more. Yeah, one of my minor TLC patches. Most of the excessive memory usage for regular testcases can be fixed by doing LIM on all siblings of the loop tree independently, instead of keeping data on all loops in the function live at the same time (I've not finished _that_ TLC patch - and it won't help for insanely deep nests, of course)
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > --- Comment #27 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 08:09:59 UTC --- > Compilation finished after ~3 hours and consuming at least 3GB (from top - I > forgot to use memmax2...). > > Winners in the "geez, I'm slow for this test case" list: > > PRE : 6241.75 (57%) usr > tree loop invariant motion : 4194.83 (38%) usr > TOTAL : 11021.46 > > Time is mostly spent in the alias oracle. > > For LIM I'm going to test using triangle instead of square dep/indep > matrix, as a first step. Richi, PRE is yours so... ;-) Actually PRE is yours - it's the RTL PRE ;) Richard.
(In reply to comment #30) Hmm, RTL PRE isn't really mine either, but I probably know it as well as anyone else, so I will have a look. It's probably a similar issue as for gimple LIM: alias checks. Can you attach your TLC patches for gimple LIM to this PR, please?
(In reply to comment #26) > (In reply to comment #23) > > FYI, the original file ( http://gcc.gnu.org/bugzilla/attachment.cgi?id=17377 ) > > can be compiled with 'clang', albeit slowly: > ... > > Memory consumption is about 186M. > > How have you measured this? From time to time I was looking at output of 'top'. The compiler pretty quickly reaches the 186M +/- mark and stays at it.
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > Steven Bosscher <steven at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |ASSIGNED > AssignedTo|unassigned at gcc dot |steven at gcc dot gnu.org > |gnu.org | > > --- Comment #31 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 09:58:03 UTC --- > (In reply to comment #30) > Hmm, RTL PRE isn't really mine either, but I probably know it as well as > anyone else, so I will have a look. It's probably a similar issue as for > gimple LIM: alias checks. > > Can you attach your TLC patches for gimple LIM to this PR, please? Bah, I seem to have them in some local tree only but cannot find it. Maybe I accidentially reverted :/ ISTR moving stuff to obstacks as well (LIM has the heaviest load on malloc/free). Oh well.
Somehow, with -O3 LLVM clang works a little bit faster than with -O2 - 54 minutes instead of 58 minutes, though this might be a random variation:

sergei@amdam2:~/gcc_bug> time ~/AFSWD/install/LLVM-3.2/binsh/clang -v gap_TlnLv4.c -O3 -c
clang version 3.2 (tags/RELEASE_32/final)
Target: i386-pc-linux-gnu
Thread model: posix
 "/mnt/sdb5/qemu/20121021/LLVM-3.2/bin/clang" -cc1 -triple i386-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name gap_TlnLv4.c -mrelocation-model static -fmath-errno -masm-verbose -mconstructor-aliases -target-cpu pentium4 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file /home/sergei/gcc_bug/gap_TlnLv4.o -resource-dir /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2 -fmodule-cache-path /tmp/clang-module-cache -internal-isystem /usr/local/include -internal-isystem /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -fdebug-compilation-dir /home/sergei/gcc_bug -ferror-limit 19 -fmessage-length 182 -mstackrealign -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o gap_TlnLv4.o -x c gap_TlnLv4.c
clang -cc1 version 3.2 based upon LLVM 3.2svn default target i386-pc-linux-gnu
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include
 /usr/include
End of search list.

real 54m18.212s
user 52m56.062s
sys 0m8.085s
sergei@amdam2:~/gcc_bug>

Memory consumption appears to be the same as with -O2.
(In reply to comment #34)
> Memory consumption appears to be the same as with -O2.

Can you measure the peak memory with time?

/usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%I outs=%O mfaults=%R waits=%w'
(In reply to comment #29) > Yeah, one of my minor TLC patches. Most of the excessive memory > usage for regular testcases can be fixed by doing LIM on > all siblings of the loop tree independently, instead of keeping > data on all loops in the function live at the same time > (I've not finished _that_ TLC patch - and it won't help for > insanely deep nests, of course) It would be very helpful if you can try to recover this one at least, or explain a bit more what you had done. I don't see right away in the code what you mean.
(In reply to comment #35)
> Can you measure the peak memory with time?
>
> /usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%I
> outs=%O mfaults=%R waits=%w'

+ /usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%Iouts=%O mfaults=%R waits=%w' /mnt/sdb8/sergei/AFSWD_debug/20121021/LLVM-3.2/bin/clang -v gap_TlnLv4.c -O3 -c
clang version 3.2 (tags/RELEASE_32/final)
Target: i386-pc-linux-gnu
Thread model: posix
 "/mnt/sdb5/qemu/20121021/LLVM-3.2/bin/clang" -cc1 -triple i386-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name gap_TlnLv4.c -mrelocation-model static -fmath-errno -masm-verbose -mconstructor-aliases -target-cpu pentium4 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file /home/sergei/gcc_bug/gap_TlnLv4.o -resource-dir /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2 -fmodule-cache-path /tmp/clang-module-cache -internal-isystem /usr/local/include -internal-isystem /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -fdebug-compilation-dir /home/sergei/gcc_bug -ferror-limit 19 -fmessage-length 182 -mstackrealign -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o gap_TlnLv4.o -x c gap_TlnLv4.c
clang -cc1 version 3.2 based upon LLVM 3.2svn default target i386-pc-linux-gnu
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include
 /usr/include
End of search list.
real=3323.86 user=3224.22 system=8.70 share=97%% maxrss=0 ins=82598outs=8720 mfaults=193511 waits=669

I am not sure 'maxrss=0' makes sense. Anyway, several minutes before completion 'top' showed 224m (or 228m - I do not remember exactly) in the VIRT column.
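The maxrss=0 above is most likely a kernel limitation rather than a 'time' bug: to my knowledge Linux only began maintaining ru_maxrss (which /usr/bin/time's %M reports) in kernel 2.6.32, and this machine runs 2.6.22. On a newer kernel the peak RSS can also be read programmatically; a minimal sketch (Linux reports ru_maxrss in kilobytes):

```c
/* Read the process's peak resident set size via getrusage().
   On Linux ru_maxrss is in kilobytes; kernels before 2.6.32 did not
   maintain it and report 0, matching the maxrss=0 seen above. */
#include <sys/resource.h>

static long peak_rss_kb (void)
{
  struct rusage ru;
  if (getrusage (RUSAGE_SELF, &ru) != 0)
    return -1;                  /* getrusage failed */
  return ru.ru_maxrss;          /* 0 on pre-2.6.32 kernels */
}
```

On an old kernel, falling back to watching VIRT/RES in 'top' (as done above) or reading VmHWM from /proc/self/status is the practical alternative.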
When 'gcc' works, the machine becomes very slow to respond due to memory usage growth; with 'clang' I do not notice any slowdown. My machine is dual-core with 2G of physical memory.
Created attachment 29612 [details]
Punt on loops with more memory references than LIM can handle

For the LIM problem, I'm thinking of a patch like this one (not tested yet). Basically, this means giving up on really large loops with many loads and/or stores. That's not an unreasonable thing to do anyway. Code motion from really big loops usually isn't an improvement.

Richi, could you have a look at this, and see if I've got it all right, more-or-less? LIM is quite complicated and I'm not sure if I should look at, or update, the set of optimizable loops somewhere.

With the patch, and with "-O2 -fgcse", I now have:

gap_TlnLv4.c: In function 'RDFT_49152_1':
gap_TlnLv4.c:37502:18: warning: -ftree-loop-im: number of memory references in loop exceeds the --param loops-max-datarefs-for-datadeps threshold [-Wdisabled-optimization]
 t12308[500] = t12304[6144*i0+1000];
             ^

 dead store elim1     :  57.70 ( 8%)
 dead store elim2     :  76.82 (10%)
 combiner             : 360.07 (48%)
 reload CSE regs      :  55.44 ( 7%)
 TOTAL                : 743.77
Memory usage is still pathetic. Some stats (mem stats from /proc/self/statm on *entry* of pass):

pass (#)                        size        resident
*warn_unused_result ( -1)       177106944   140742656
...
*init_datastructures ( -1)      209670144   169570304
ssa ( 20)                       209670144   169582592
*rebuild_cgraph_edges ( -1)     302575616   259780608
...
expand (169)                    390647808   349757440
vregs (170)                     477892608   425406464
...
dfinit (174)                    477892608   425627648
cse1 (175)                      615616512   565694464
...
dse1 (195)                      727654400   634572800
fwprop2 (196)                   791678976   723767296
...
combine (200)                   823115776   752177152
ce2 (201)                       6589759488  6512082944
...
csa (222)                       6938136576  1057169408

This means that "combine" explodes the memory footprint by ~5.4 GB. On entry to combine, the memory footprint is ~750MB. I don't think that is unreasonable. But after combine, the memory footprint jumps to ~7GB max, and ~1GB resident (after a ggc_collect before csa). I'm guessing most of that resident memory is combine leaving memory scattered so that pages can't be released.

dfinit is also a major contributor to the huge memory footprint. I have seen that in other test cases also, but so far I can't figure out how it can consume this much memory.

Interestingly, the initial memory footprint, out of the front end, is ~180 MB. From there, the memory footprint gradually goes up, but that is probably in part because ggc_collect is just never called (this is on a box with 16GB RAM), with GGC params:

--param ggc-min-expand=100 --param ggc-min-heapsize=131072

With both parameters halved, the combine problem obviously remains, and the dfinit problem becomes more pronounced:

expand (169)                    328032256   287195136
vregs (170)                     474415104   424263680
...
dfinit (174)                    474415104   424484864
cse1 (175)                      613801984   564596736
...
combine (200)                   823496704   752828416
ce2 (201)                       6590169088  6512717824

The good news: there clearly is a lot of room for improvement for this test case. :-)
(In reply to comment #31) > (In reply to comment #30) > Hmm, RTL PRE isn't really mine either, but I probably know it as well as > anyone else, so I will have a look. It's probably a similar issue as for > gimple LIM: alias checks. > > Can you attach your TLC patches for gimple LIM to this PR, please? Found it on some testing machine. I merged it and saw to it that it compiles and passes tree-ssa.exp (some ???s remain, I've added a TODO). The idea should be clear, and it's memory savings only and it will eventually be a bit slower (unless it is optimized further).
Created attachment 29616 [details] make LIM work per outermost loop
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > --- Comment #38 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 22:15:39 UTC --- > Created attachment 29612 [details] > --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29612 > Punt on loops with more memory references than LIM can handle > > For the LIM problem, I'm thinking of a patch like this one (not tested > yet). Basically, this means giving up on really large loops with many > loads and/or stores. That's not an unreasonable thing to do anyway. > Code motion from really big loops usually isn't an improvement. > > Richi, could you have a look at this, and see if I've got it all right, > more-or-less? LIM is quite complicated and I'm not sure if I should > look at, or update, the set of optimizable loops somewhere. Yeah, well - it should be easy to avoid most overhead (even collecting the memory references) with my proposed scheme (see patch). First, for each BB count the number of memory stmts (easy, look for VUSEs), then, when walking the set of outermost loops we want to apply LIM to sum over its BB counts and instead walk over its siblings if the number is too large. But yes, limiting it should be done as it performs O(n^2) dependence checks (well, O(#of stores * #of memory references), so for a low number of stores it's quite cheap). I've yet to recover the obstack-ification of struct depend, struct mem_ref, struct mem_ref_locs and lim_aux_data allocations ... (LIM is the biggest load on malloc/free). 
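Richard's proposed scheme - pre-count the memory statements per basic block, then, when walking the set of outermost loops, sum a loop's block counts and fall back to its children when the total exceeds the limit - can be sketched generically. The loop-tree structures below are illustrative stand-ins, not GCC's `struct loop`:

```c
/* Generic sketch of "skip too-big loops but still try their children".
   A loop's memory-statement count is the sum over the blocks that
   belong directly to it plus the counts of its sub-loops. */
#include <assert.h>
#include <stddef.h>

struct bb
{
  int n_mem_stmts;              /* pre-counted, e.g. by scanning for VUSEs */
};

struct loop
{
  struct bb **bbs;              /* blocks belonging directly to this loop */
  int n_bbs;
  struct loop *inner;           /* first child loop */
  struct loop *next;            /* next sibling loop */
};

static int
loop_mem_stmts (const struct loop *l)
{
  int n = 0;
  for (int i = 0; i < l->n_bbs; i++)
    n += l->bbs[i]->n_mem_stmts;
  for (const struct loop *c = l->inner; c; c = c->next)
    n += loop_mem_stmts (c);
  return n;
}

/* Count the loop nests that would get invariant motion applied: a
   loop under the threshold is processed as a whole nest; one over the
   threshold is skipped, but its children are tried instead. */
static int
count_lim_roots (const struct loop *l, int max_refs)
{
  int n = 0;
  for (; l; l = l->next)
    {
      if (loop_mem_stmts (l) <= max_refs)
        n++;                                      /* process whole nest */
      else
        n += count_lim_roots (l->inner, max_refs); /* too big: recurse */
    }
  return n;
}
```

This keeps the dependence analysis O(#stores * #refs) per processed nest while still optimizing the small inner loops of a huge outer loop.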
> With the patch, and with "-O2 -fgcse", I now have: > > gap_TlnLv4.c: In function 'RDFT_49152_1': > gap_TlnLv4.c:37502:18: warning: -ftree-loop-im: number of memory references in > loop exceeds the --param loops-max-datarefs-for-datadeps threshold > [-Wdisabled-optimization] > t12308[500] = t12304[6144*i0+1000]; > ^ > > dead store elim1 : 57.70 ( 8%) > dead store elim2 : 76.82 (10%) > combiner : 360.07 (48%) > reload CSE regs : 55.44 ( 7%) > TOTAL : 743.77 Nice. Well ... ;)
The problem with combine is only "collateral damage" from what dse1 is doing to this function. It's loading stored values into registers and replacing re-loads with those registers, as in:

@@ -747831,6 +780399,7 @@
   474783: r569578:SF=r569577:SF-r177936:SF
       REG_DEAD r569577:SF
       REG_DEAD r177936:SF
+  605734: r662460:SF=r569578:SF
   474784: [const(`t18571.26403'+0x5fe8)]=r569578:SF
       REG_DEAD r569578:SF
   474786: r177942:SF=[const(`t12303.26335'+0x9008)]
...
   487110: [const(`t24716.26406'+0x2ff0)]=r578824:SF
       REG_DEAD r578824:SF
-  487113: r578827:SF=[const(`t18571.26403'+0x5fe8)]
+  487113: r578827:SF=r662460:SF
   487114: [const(`t24716.26406'+0x2ff4)]=r578827:SF
       REG_DEAD r578827:SF

The distance between the store and load is large, and that is what is causing combine to blow up -- walking all the insns between 605734 and 487113 (about 500 of them, which appears to be typical). This also increases register pressure, causing the poor IRA/LRA/postreload numbers.

Also not great: dse is loading the same stored value into multiple registers, e.g.:

@@ -147272,6 +175190,8 @@
   110394: r288307:SF=r288306:SF-r37078:SF
       REG_DEAD r288306:SF
       REG_DEAD r37078:SF
+  601084: r657810:SF=r288307:SF
+  601085: r657811:SF=r288307:SF
   110395: [r288302:DI]=r288307:SF
       REG_DEAD r288307:SF

but I'm guessing (and verifying now) that this is cleaned up by fwprop2.
Created attachment 29628 [details] Re-use store register if possible This patch resolves the issue mentioned in comment #43, and makes combine run in reasonable time.
Patches posted:

* Restrict GIMPLE loop invariant code motion of loop-invariant loads and stores to loops with fewer memory references than a certain maximum that is controlled with --param loops-max-datarefs-for-datadeps from the command line.
  http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00380.html

* Do not create new pseudo-registers for load-after-store transformations in RTL dead store elimination. This reduces the memory footprint after DSE by ~2 percent, and avoids the compile time and memory usage explosion in combine because it gets presented fewer single-def/single-use register moves that are really just register copies.
  http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00379.html

* Make gcse.c respect -fno-gcse-lm. For the RTL PRE problem, this means compile time is reasonable with -fno-gcse-lm. A follow-up patch will implement some mechanism to disable load motion automatically on extreme test cases like the one from this PR.
  http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00386.html

The remaining compile time bottlenecks are:

- RTL dead store elimination in its analysis phase. This is mostly time spent in dependence tests in alias analysis for instructions in a single basic block, so it's only a problem for test cases where there is a huge number of loads and stores in each basic block. I don't think it is worth speeding up DSE for such extreme cases.

- Post-reload CSE, because it is in the worst case quadratic in the number of instructions in a basic block. In most practical cases, post-reload CSE scales linearly with the number of instructions in a basic block, but with a large constant bound. It looks up and down through the instruction chain to see if a reg is not clobbered between a use and a def. Because it only has to do so with hard registers, the typical bound is closer to "number of insns in basic block" * "number of hard registers". This is fine; I am not going to try and improve this.
Created attachment 29649 [details]
symmetry in reference testing

Exploit symmetry in reference testing.
With caching affine-combination compute and ao_ref compute I have it down to

 tree loop invariant motion: 596.91 (79%) usr   0.73 (29%) sys 599.77 (78%) wall   31135 kB ( 3%) ggc

without limiting anything.

I tried a more intelligent ref_indep_in_loop (avoiding another
quadraticness in loop depth ...) but it gets slower ...

*************** ref_indep_loop_p_1 (struct loop *loop, m
*** 2230,2242 ****
    bitmap refs_to_check;
    unsigned i;
    bitmap_iterator bi;
!   bool ret = true, stored = bitmap_bit_p (ref->stored, loop->num);
    mem_ref_p aref;

!   if (stored)
!     refs_to_check = memory_accesses.all_refs_in_loop[loop->num];
    else
!     refs_to_check = memory_accesses.all_refs_stored_in_loop[loop->num];

    EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
      {
--- 2233,2245 ----
    bitmap refs_to_check;
    unsigned i;
    bitmap_iterator bi;
!   bool ret = true;
    mem_ref_p aref;

!   if (bitmap_bit_p (ref->stored, loop->num))
!     refs_to_check = memory_accesses.refs_in_loop[loop->num];
    else
!     refs_to_check = memory_accesses.refs_stored_in_loop[loop->num];

    EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
      {
*************** ref_indep_loop_p_1 (struct loop *loop, m
*** 2259,2272 ****
  static bool
  ref_indep_loop_p (struct loop *loop, mem_ref_p ref)
  {
-   bool ret;
-
    if (bitmap_bit_p (ref->indep_loop, loop->num))
      return true;
    if (bitmap_bit_p (ref->dep_loop, loop->num))
      return false;

!   ret = ref_indep_loop_p_1 (loop, ref);

    if (dump_file && (dump_flags & TDF_DETAILS))
      fprintf (dump_file, "Querying dependencies of ref %u in loop %d: %s\n",
--- 2262,2286 ----
  static bool
  ref_indep_loop_p (struct loop *loop, mem_ref_p ref)
  {
    if (bitmap_bit_p (ref->indep_loop, loop->num))
      return true;
    if (bitmap_bit_p (ref->dep_loop, loop->num))
      return false;

!   bool ret = true;
!   struct loop *inner = loop->inner;
!   while (inner)
!     {
!       if (!ref_indep_loop_p (inner, ref))
!         {
!           ret = false;
!           break;
!         }
!       inner = inner->next;
!     }
!
!   if (ret)
!     ret = ref_indep_loop_p_1 (loop, ref);

    if (dump_file && (dump_flags & TDF_DETAILS))
      fprintf (dump_file, "Querying dependencies of ref %u in loop %d: %s\n",

with it we can also get rid of the all_refs_in_loop bitmap (only
all_refs_stored_in_loop is then used, by store-motion).

I don't see why the above should result in slower operation :/  It should
save us the time to re-query references in sub-loops (yes, they should be
cached already ... still walking the bitmap is not free).

To mimic parts of this idea the dep_loop/indep_loop checking/setting can
be improved to cover outer loops like with

@@ -2296,7 +2218,12 @@ record_indep_loop (struct loop *loop, me
   if (indep)
     bitmap_set_bit (ref->indep_loop, loop->num);
   else
-    bitmap_set_bit (ref->dep_loop, loop->num);
+    do
+      {
+        bitmap_set_bit (ref->dep_loop, loop->num);
+        loop = loop_outer (loop);
+      }
+    while (loop != current_loops->tree_root);
 }

and

@@ -2339,8 +2266,13 @@ ref_indep_loop_p (struct loop *loop, mem
 {
   bool ret;

-  if (bitmap_bit_p (ref->indep_loop, loop->num))
-    return true;
+  struct loop *oloop = loop;
+  do
+    {
+      if (bitmap_bit_p (ref->indep_loop, oloop->num))
+        return true;
+      oloop = loop_outer (oloop);
+    }
+  while (oloop != current_loops->tree_root);

   if (bitmap_bit_p (ref->dep_loop, loop->num))
     return false;

(no difference in compile-time)

anyway, another obvious improvement is to merge the two two-bitmap sets
(they are tri-state: dependent, independent and unknown). That's more
cache-friendly (now, where's that tri-state "bitmap" implementation that
also saves the half bit we waste ... ;))  It helps a bit.
Removing all the caching (dep/indep_loop and dep/indep_ref) makes things
faster ... (they have a hit rate of only 5% resp. 6.2%)

Fastest timing so far:

 tree loop invariant motion:  57.42 (26%) usr

that's without any limiting, with just the dep/indep_ref cache removed,
and with a (working) patch according to comment #47.
Author: rguenth
Date: Mon Mar 18 08:42:57 2013
New Revision: 196768

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=196768
Log:
2013-03-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/39326
	* tree-ssa-loop-im.c (refs_independent_p): Exploit symmetry.
	(struct mem_ref): Replace mem member with ao_ref typed member.
	(MEM_ANALYZABLE): Adjust.
	(memref_eq): Likewise.
	(mem_ref_alloc): Likewise.
	(gather_mem_refs_stmt): Likewise.
	(mem_refs_may_alias_p): Use the ao_ref to query the alias oracle.
	(execute_sm_if_changed_flag_set): Adjust.
	(execute_sm): Likewise.
	(ref_always_accessed_p): Likewise.
	(refs_independent_p): Likewise.
	(can_sm_ref_p): Likewise.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-loop-im.c
Author: rguenth
Date: Thu Mar 21 14:45:36 2013
New Revision: 196878

URL: http://gcc.gnu.org/viewcvs?rev=196878&root=gcc&view=rev
Log:
2013-03-21  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/39326
	* tree-ssa-loop-im.c (UNANALYZABLE_MEM_ID): New define.
	(MEM_ANALYZABLE): Adjust.
	(record_mem_ref_loc): Move bitmap ops ...
	(gather_mem_refs_stmt): ... here.  Use the shared mem-ref for
	unanalyzable refs, do not record locations for it.
	(analyze_memory_references): Allocate ref zero as shared
	unanalyzable ref.
	(refs_independent_p): Do not test for unanalyzed mems here.
	(ref_indep_loop_p_1): Special-case disambiguation against
	the unanalyzed ref.
	(ref_indep_loop_p): Assert we are not queried for the unanalyzed mem.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-loop-im.c
> (struct mem_ref): Replace mem member with ao_ref typed member.

RTL gcse (-O2) suffers from the same slowness in its dependence tests.
Caching ao_ref instead of just mem and alias-set in RTL mem-attrs would
improve this tremendously.

A quick try reveals that gengtype is not at all happy with including
something in rtl.h though :/
(In reply to comment #51)
> > (struct mem_ref): Replace mem member with ao_ref typed member.
>
> RTL gcse (-O2) suffers from the same slowness in its dependence tests.
> Caching ao_ref instead of just mem and alias-set in RTL mem-attrs would
> improve this tremendously.
>
> A quick try reveals that gengtype is not at all happy with including
> something in rtl.h though :/

Doing that (caching the ao_ref) doesn't help as much as it did on the
GIMPLE level. The most time-consuming part of canon_true_dependence
(which consumes 56% of compile-time) is find_base_term (37% of that 56%),
followed by memrefs_conflict_p, and only then (after caching the ao_ref)
rtx_refs_may_alias_p (18%).

The find_base_term result is another thing that could easily be cached
when doing dependence checks that repeatedly use one or another operand.
For this testcase the most expensive caller is compute_transp.

There is an obvious way to do fewer find_base_term calls in alias.c
itself. I'm going to clean things up there a bit.
With the PR59802 and PR38518 fixes on trunk I see for the 2nd testcase at -O2

 PRE                     :  19.23 (35%) usr   0.01 ( 1%) sys  19.22 (34%) wall    1421 kB ( 0%) ggc
 combiner                :  13.24 (24%) usr   0.14 (19%) sys  13.37 (24%) wall  402488 kB (73%) ggc
 integrated RA           :   9.28 (17%) usr   0.35 (47%) sys   9.65 (17%) wall   26707 kB ( 5%) ggc
 TOTAL                   :  55.56             0.74            56.34           550215 kB

compared to the 4.8 branch:

 tree loop invariant motion:  17.29 (22%) usr
 PRE                     :  24.25 (31%) usr
 combiner                :  12.52 (16%) usr   0.14 (17%) sys  12.66 (16%) wall  402491 kB (74%) ggc
 integrated RA           :   7.96 (10%) usr   0.32 (38%) sys   8.25 (11%) wall   26420 kB ( 5%) ggc
 TOTAL                   :  77.53             0.84            78.53           542544 kB

so it's an improvement (tree LIM is off the radar completely, RTL PRE is
improved a bit - looking at it now).
For the original testcase on trunk we get at -O1

 tree loop invariant motion:  37.20 (16%) usr   0.02 ( 1%) sys  37.20 (16%) wall   12127 kB ( 1%) ggc
 dead store elim1        :  17.42 ( 7%) usr   0.04 ( 2%) sys  17.44 ( 7%) wall   28345 kB ( 3%) ggc
 dead store elim2        :  26.22 (11%) usr   0.00 ( 0%) sys  26.20 (11%) wall   18664 kB ( 2%) ggc
 combiner                :  15.36 ( 6%) usr   0.04 ( 2%) sys  15.39 ( 6%) wall   70073 kB ( 7%) ggc
 LRA hard reg assignment :  74.07 (31%) usr   0.02 ( 1%) sys  74.05 (31%) wall       0 kB ( 0%) ggc
 reload CSE regs         :  12.67 ( 5%) usr   0.00 ( 0%) sys  12.66 ( 5%) wall   13315 kB ( 1%) ggc
 TOTAL                   : 237.86             1.83           239.60           999181 kB

so that's still slow (memory usage is peaking at about 1.5GB now, during RA).

tree LIM:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds      calls   s/call   s/call  name
 4.22    120.72      9.83     299080     0.00     0.00  ref_indep_loop_p_2(loop*, mem_ref*, bool)
 1.89    151.82      4.41 1712943886     0.00     0.00  mem_refs_may_alias_p(mem_ref*, mem_ref*, pointer_map_t**) [clone .constprop.60]
 0.21    197.02      0.48      12192     0.00     0.00  force_move_till(tree_node*, tree_node**, void*)

at -O2 this will blow up completely as it then runs into the affine-based
disambiguation.
Current GCC 6 numbers:

-O1 -g

 var-tracking dataflow   :  89.37 (60%) usr   0.09 (23%) sys  89.53 (59%) wall   11542 kB ( 3%) ggc
 var-tracking emit       :  47.70 (32%) usr   0.05 (13%) sys  47.81 (32%) wall    4546 kB ( 1%) ggc
 TOTAL                   : 150.06             0.39           150.60           441618 kB

and 580MB peak memory use

-O2

 PRE                     :  11.41 (40%) usr   0.01 ( 3%) sys  11.43 (39%) wall     921 kB ( 0%) ggc
 integrated RA           :   4.78 (17%) usr   0.09 (26%) sys   4.86 (17%) wall   14129 kB ( 4%) ggc
 TOTAL                   :  28.84             0.34            29.21           375189 kB

and 820MB peak memory use

-O3

 load CSE after reload   :  30.42 (49%) usr   0.02 ( 6%) sys  30.46 (49%) wall       0 kB ( 0%) ggc
 TOTAL                   :  62.08             0.32            62.47           415741 kB

and 700MB peak memory use
"-O2 ... and 820MB peak memory use" vs "-O3 ... and 700MB peak memory use" -
according to my common sense -O3 is a stronger optimization than -O2, and
one should expect greater memory use.

So, can the above be an indication of yet another bug? I.e. -O3 in reality
does not do something it is supposed to do, while -O2 does?
On Wed, 17 Feb 2016, sergstesh at yahoo dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326
>
> --- Comment #56 from Sergei Steshenko <sergstesh at yahoo dot com> ---
> "-O2 ... and 820MB peak memory use" vs "-O3 ... and 700MB peak memory use" -
> according to my common sense -O3 is stronger than -O2 optimization, and one
> should expect greater memory use.
>
> So, can the above be an indication of yet another bug ? I.e. -O3 in reality
> does not do something it is supposed to do while -O2 does this ?

More like -O3 cleans up the code, due to stronger optimization, before the
memory-hogging part comes in, and thus that part needs less memory to do
its work.
Not working on this => unassigned...
GCC 8.3, x86_64:

First testcase:

-O0: 1GB, 8.8s

-O1: 1.7GB, 242s

 alias stmt walking                 :  85.84 ( 35%)   0.82 ( 22%)  86.84 ( 35%)    2189 kB (  0%)
 tree CFG cleanup                   :  13.73 (  6%)   0.00 (  0%)  13.75 (  6%)    2055 kB (  0%)
 tree loop invariant motion         :  66.65 ( 27%)   0.09 (  2%)  66.85 ( 27%)   30968 kB (  2%)
 combiner                           :  16.87 (  7%)   0.02 (  1%)  16.90 (  7%)   18054 kB (  1%)
 reload CSE regs                    :  11.46 (  5%)   0.01 (  0%)  11.47 (  5%)   10633 kB (  1%)

-O2: 4.3GB, 1994s

 alias stmt walking                 : 126.85 (  6%)   1.14 ( 23%) 128.38 (  6%)    2875 kB (  0%)
 tree loop invariant motion         :  67.50 (  3%)   0.08 (  2%)  67.64 (  3%)   62356 kB (  2%)
 PRE                                :1644.36 ( 82%)   0.07 (  1%)1644.49 ( 82%)    2877 kB (  0%)

Second, smaller testcase:

-O1: 420MB, 16s

 alias stmt walking                 :   7.06 ( 43%)   0.08 ( 13%)   7.15 ( 42%)     621 kB (  0%)
 tree loop invariant motion         :   0.90 (  5%)   0.01 (  2%)   0.92 (  5%)    3497 kB (  2%)
 integrated RA                      :   2.89 ( 18%)   0.05 (  8%)   2.91 ( 17%)   14261 kB (  9%)

-O2: 780MB, 48s

 alias stmt walking                 :  10.23 ( 21%)   0.09 ( 14%)  10.31 ( 21%)     785 kB (  0%)
 tree loop invariant motion         :   0.94 (  2%)   0.01 (  2%)   0.96 (  2%)    7264 kB (  2%)
 PRE                                :  19.39 ( 40%)   0.00 (  0%)  19.40 ( 39%)     281 kB (  0%)
 integrated RA                      :   5.38 ( 11%)   0.09 ( 14%)   5.46 ( 11%)   14033 kB (  4%)
 reload CSE regs                    :   2.28 (  5%)   0.00 (  0%)   2.28 (  5%)    2486 kB (  1%)

The other testcase:

-O1: 2.8GB, 761s

 alias stmt walking                 : 188.53 ( 25%)   1.55 ( 20%) 190.19 ( 25%)    3400 kB (  0%)
 tree CFG cleanup                   :  82.33 ( 11%)   0.02 (  0%)  82.35 ( 11%)    4103 kB (  0%)
 tree DSE                           :  18.31 (  2%)   0.00 (  0%)  18.32 (  2%)       0 kB (  0%)
 tree loop invariant motion         : 296.98 ( 39%)   0.18 (  2%) 297.22 ( 39%)   62472 kB (  2%)
 forward prop                       :  10.22 (  1%)   0.03 (  0%)  10.24 (  1%)   57889 kB (  2%)
 CSE                                :  14.47 (  2%)   0.01 (  0%)  14.47 (  2%)   54258 kB (  2%)
 combiner                           :  75.19 ( 10%)   0.03 (  0%)  75.22 ( 10%)   35852 kB (  1%)
 reload CSE regs                    :  24.39 (  3%)   0.01 (  0%)  24.40 (  3%)   21083 kB (  1%)

-O2: 9.2GB, 7634s

 alias stmt walking                 : 273.85 (  4%)   2.30 ( 22%) 276.44 (  4%)    4533 kB (  0%)
 tree loop invariant motion         : 284.31 (  4%)   0.46 (  4%) 284.89 (  4%)  125719 kB (  1%)
 PRE                                :6714.77 ( 88%)   0.44 (  4%)6716.39 ( 88%)    5747 kB (  0%)
PRE is all find_base_term exploding ...

The LIM case is all store_motion (), which is quadratic and the only user
of the memory-quadratic all_refs_stored_in_loop. The latter would be
reasonably easy to fix by iterating. The former is more difficult to fix;
we'd need to rewrite SM to work on loops inner-to-outer somehow. But
that's certainly doable.

Gating store-motion on optimize > 1 is possible as well (store motion also
has the bad side-effect of increasing register pressure by introducing IVs).
r12-7592, first testcase, x86_64:

-O0: 6s, 1GB

-O1: 264s, 1.4GB

 callgraph ipa passes               :  30.47 ( 12%)
 alias stmt walking                 :  67.44 ( 26%)
 tree loop invariant motion         : 120.86 ( 46%)

-O2: 1383s, 11.8GB

 alias stmt walking                 : 174.01 ( 13%)
 tree PRE                           :  87.85 (  6%)
 tree loop invariant motion         : 130.54 (  9%)
 tree slp vectorization             : 100.01 (  7%)
 PRE                                : 716.64 ( 52%)

quite some regression on the peak memory usage side here.
(In reply to Richard Biener from comment #61)
> r12-7592, first testcase, x86_64:
>
> -O0: 6s, 1GB
>
> -O1: 264s, 1.4GB
>
>  callgraph ipa passes               :  30.47 ( 12%)
>  alias stmt walking                 :  67.44 ( 26%)
>  tree loop invariant motion         : 120.86 ( 46%)
>
> -O2: 1383s, 11.8GB
>
>  alias stmt walking                 : 174.01 ( 13%)
>  tree PRE                           :  87.85 (  6%)
>  tree loop invariant motion         : 130.54 (  9%)
>  tree slp vectorization             : 100.01 (  7%)
>  PRE                                : 716.64 ( 52%)
>
> quite some regression on the peak memory usage side here.

r14-9064 -O2: 683s, 5GB

 alias stmt walking                 :  95.35 ( 14%)
 tree PRE                           :  58.27 (  9%)
 tree loop invariant motion         :  75.47 ( 11%)
 PRE                                : 335.15 ( 49%)

so improvements on the memory usage side. -O1 didn't change.