I was trying to compile the file 'gap_TlnLv4.c' autogenerated by 'spiral' (to be attached). Compiling it with the command line

/mnt/sdb8/sergei/AFSWD_debug/install/gcc-4.3.3/binsh/gcc -O1 -fomit-frame-pointer -malign-double -fstrict-aliasing -march=native -c /tmp/spiral-sergei/gap_TlnLv4.c -o gap_TlnLv4.o

I got this:

/tmp/spiral-sergei/gap_TlnLv4.c: In function ‘RDFT_49152_1’:
/tmp/spiral-sergei/gap_TlnLv4.c:172282: internal compiler error: Segmentation fault
Please submit a full bug report,

Before that, compiling it with the command

/mnt/sdb8/sergei/AFSWD_debug/install/gcc-4.3.3/binsh/gcc -O2 -fomit-frame-pointer -malign-double -fstrict-aliasing -march=native -c /tmp/spiral-sergei/gap_TlnLv4.c

I got this:

cc1: out of memory allocating 4184025948 bytes after a total of 205533184 bytes
Created attachment 17377 [details] source file causing the failure The source is not preprocessed, it has only standard #include <math.h> in it.
My OS: Linux amdam2 2.6.22.19-0.2-default #1 SMP 2008-12-18 10:17:03 +0100 i686 athlon i386 GNU/Linux - in plain terms, SUSE 10.3; 'gcc' is a self-built 'gcc-4.3.3'.
There is no failure with -O0.
FWIW, 'gcc-3.4.6', when run as

/mnt/sdb8/sergei/AFSWD_debug/install/gcc-3.4.6/binsh/gcc -O1 -fomit-frame-pointer -malign-double -fstrict-aliasing -c /tmp/spiral-sergei/gap_TlnLv4.c -o gap_TlnLv4.o

fails with

cc1: out of memory allocating 182853324 bytes after a total of 14716928 bytes

rather than with a segmentation fault like 'gcc-4.3.3'.
Try -fno-move-loop-invariants. This is probably a killer on alias-improvements branch as well.
As this seems to be autogenerated from

! The SPL Program: (compose (sparse (coords (...12288 x 2 ...))(values (...1 x 12288 ...)))(compose (conjugate (..)(..))(compose (..)(..))))
! node size: 12288 X 12288

can you produce a testcase with an order of magnitude fewer loads/stores (thus likely just "node size: 1228 x 1228")? Thanks!
One problem seems to be that we blow up the stack during GC recursing following the VUSE -> def_stmt links.
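The usual remedy for this class of crash is to replace the recursion over the def links with an explicit iteration or worklist, so stack usage stays constant no matter how long the chain of statements is. A minimal generic sketch (plain C with hypothetical node names, not GCC's actual VUSE/def_stmt data structures):

```c
/* Generic sketch: walking a chain of back links recursively vs.
   iteratively.  The recursive walk needs stack depth proportional to
   the chain length and overflows on huge autogenerated functions;
   the iterative walk uses constant stack space. */
#include <assert.h>
#include <stdlib.h>

struct def_node {
  int id;
  struct def_node *def;   /* hypothetical stand-in for a VUSE -> def_stmt link */
};

/* Recursive walk: depth equals chain length. */
static int chain_len_rec (const struct def_node *n)
{
  if (!n)
    return 0;
  return 1 + chain_len_rec (n->def);
}

/* Iterative walk: constant stack usage regardless of chain length. */
static int chain_len_iter (const struct def_node *n)
{
  int len = 0;
  while (n)
    {
      len++;
      n = n->def;
    }
  return len;
}

/* Build a chain of N nodes for demonstration. */
static struct def_node *make_chain (int n)
{
  struct def_node *head = NULL;
  for (int i = 0; i < n; i++)
    {
      struct def_node *d = malloc (sizeof *d);
      d->id = i;
      d->def = head;
      head = d;
    }
  return head;
}
```

With a chain of a few million nodes the recursive version would overflow a default 8MB stack while the iterative one keeps working, which is the failure mode described above.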
Created attachment 17378 [details] a smaller file with hopefully the same pattern - the file name is gap_bzAJWH.c.gz .
(In reply to comment #6)
> As this seems to be autogenerated from
>
> ! The SPL Program: (compose (sparse (coords (...12288 x 2 ...))(values (...1 x
> 12288 ...)))(compose (conjugate (..)(..))(compose (..)(..))))
> ! node size: 12288 X 12288
>
> can you produce a testcase with an order of magnitude less loads/stores?
> (thus likely just "node size: 1228 x 1228")?
>
> Thanks!

I've uploaded a smaller file: gap_bzAJWH.c.gz. These are intermediate files which are deleted upon successful compilation, so I decreased the number of points and managed to "catch" this one. It is: "node size: 1536 X 1536".
I am not sure whether it's clear - the smaller 'gap_bzAJWH.c.gz' file can be found as http://gcc.gnu.org/bugzilla/attachment.cgi?id=17378&action=view attachment.
(In reply to comment #5)
> Try -fno-move-loop-invariants. This is probably a killer on alias-improvements
> branch as well.

Still a segfault:

gap_TlnLv4.c: In function ‘RDFT_49152_1’:
gap_TlnLv4.c:172282: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

real 33m52.152s
user 16m47.843s
sys 0m9.333s
[1] Exit 1 time /mnt/sdb8/sergei/AFSWD_debug/install/gcc-4.3.3/binsh/gcc -O1 -fomit-frame-pointer -malign-double -fstrict-aliasing -fno-move-loop-invariants -c gap_TlnLv4.c
The small testcase at -O2 is bound by PRE time

 PRE                  :  27.08 (55%) usr   0.02 ( 4%) sys  27.09 (55%) wall    1551 kB ( 2%) ggc

nothing interesting at -O1, and it still nicely fits into my memory. Which makes me ask for a little bigger testcase ;) (as I suspect there's quadratic behavior in memory usage somewhere, a size around 2000 * 2000 would be nice)

Thanks again.
And for trunk we have

 combiner             :  24.76 (15%) usr   0.80 (23%) sys  25.76 (16%) wall  448053 kB (82%) ggc
 integrated RA        :  68.41 (43%) usr   1.06 (31%) sys  70.23 (42%) wall    3197 kB ( 1%) ggc

as the worst offenders at -O1. I notice that with --enable-checking GC params the top memory usage is way lower than without, so something odd happens here as well (maybe a ggc_free wrapped inside ENABLE_CHECKING?). Most memory is used by combine and IRA.
'spiral' has produced another testcase which segfaults with -O2 - the original testcase segfaults with -O1.

The testcase, though it has half the points in terms of FFT, is big as a file:

-rw-r--r-- 1 sergei users 1656419 2009-03-03 06:36 gap_CPEodL.c.bz2

so I think I can't upload it directly, or can I? I can send it as an Email attachment to, say, rguenth.
Subject: Re: Segmentation fault with -O1, out of memory with -O2 On Tue, 3 Mar 2009, sergstesh at yahoo dot com wrote: > ------- Comment #14 from sergstesh at yahoo dot com 2009-03-03 13:36 ------- > 'spiral' has produced another testcase which segfaults with -O2 - the original > testcase segfaults with -O1. > > The testcase, though has half the points if terms of FFT, is big as a file: > > -rw-r--r-- 1 sergei users 1656419 2009-03-03 06:36 gap_CPEodL.c.bz2 > > , so I think I can't upload it directly, or can I ? > > I can send it as Email attachment to, say, rguenth. That would be fine. Richard.
(In reply to comment #15) > Subject: Re: Segmentation fault with -O1, out of memory > with -O2 > > On Tue, 3 Mar 2009, sergstesh at yahoo dot com wrote: > > > ------- Comment #14 from sergstesh at yahoo dot com 2009-03-03 13:36 ------- > > 'spiral' has produced another testcase which segfaults with -O2 - the original > > testcase segfaults with -O1. > > > > The testcase, though has half the points if terms of FFT, is big as a file: > > > > -rw-r--r-- 1 sergei users 1656419 2009-03-03 06:36 gap_CPEodL.c.bz2 > > > > , so I think I can't upload it directly, or can I ? > > > > I can send it as Email attachment to, say, rguenth. > > That would be fine. > > Richard. > I have just sent Richard an Email with 'gap_CPEodL.c.bz2' file attached.
Created attachment 17476 [details] the other testcase
Vlad, for the second testcase I see

-O0:
 expand               :   0.78 (19%) usr   0.04 ( 5%) sys   0.83 (16%) wall   44335 kB (49%) ggc
 integrated RA        :   1.05 (25%) usr   0.03 ( 4%) sys   1.13 (22%) wall    1036 kB ( 1%) ggc
 reload               :   0.55 (13%) usr   0.02 ( 3%) sys   0.59 (12%) wall    3585 kB ( 4%) ggc

-O1:
 integrated RA        :  73.95 (77%) usr   1.18 (50%) sys  76.20 (76%) wall    2898 kB ( 3%) ggc

(on the alias-improvements branch I see
 combiner             :  24.89 (17%) usr   0.51 (28%) sys  25.41 (17%) wall  448053 kB (82%) ggc
 integrated RA        :  69.81 (47%) usr   0.36 (20%) sys  70.22 (46%) wall    3212 kB ( 1%) ggc
as tree loop invariant motion does lots of stuff there)

-O2:
 PRE                  :  25.49 (26%) usr   0.03 ( 3%) sys  25.52 (26%) wall    3513 kB ( 3%) ggc
 integrated RA        :  36.98 (38%) usr   0.24 (21%) sys  37.31 (38%) wall    2158 kB ( 2%) ggc
For trunk -O1 I see

CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
368029   19.8218  allocno_spill_priority_compare
244104   13.1473  color_pass
205725   11.0802  build_allocno_conflicts
141509    7.6216  push_allocno_to_stack
106750    5.7495  assign_hard_reg
95558     5.1467  bitmap_bit_p
88785     4.7819  add_to_allocno_conflicts
51962     2.7986  splay_tree_splay
42764     2.3032  ira_build_conflicts
40718     2.1930  ira_flattening
27977     1.5068  canon_true_dependence
25608     1.3792  ira_allocno_live_ranges_intersect_p
20403     1.0989  nonoverlapping_memrefs_p
20005     1.0775  ira_add_allocno_conflict
Btw, it looks like the internal IRA data structures could be shrunk considerably on 64-bit platforms by avoiding padding between pointer and integer members.
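The padding issue can be illustrated with generic C (these are not the actual IRA structures): on an LP64 target, every pointer member that follows an int member forces 4 bytes of padding, which reordering the members by size recovers.

```c
/* Illustration of struct padding on LP64 targets.  Interleaving
   8-byte pointers with 4-byte ints inserts 4 bytes of padding before
   each pointer so it stays 8-byte aligned; grouping members by size
   removes the padding entirely. */
#include <assert.h>
#include <stddef.h>

struct interleaved {    /* 32 bytes on LP64 */
  void *p1;             /* offset 0 */
  int a;                /* offset 8, then 4 bytes padding */
  void *p2;             /* offset 16 */
  int b;                /* offset 24, then 4 bytes tail padding */
};

struct grouped {        /* 24 bytes on LP64 */
  void *p1;             /* offset 0 */
  void *p2;             /* offset 8 */
  int a;                /* offset 16 */
  int b;                /* offset 20 - no padding needed */
};
```

With millions of allocno/conflict records live at once, saving 8 bytes per record adds up to a real reduction in peak memory.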
IRA now is very different from IRA in early 2009. Some new timings for GCC 4.8 trunk would be nice. Perhaps this bug is already fixed.
4.7.2 -O0 25s 2189981kB
 integrated RA        :   8.96 (35%) usr   0.89 (28%) sys   9.89 (34%) wall  206439 kB (16%) ggc
 reload               :   2.98 (12%) usr   0.07 ( 2%) sys   3.05 (11%) wall   43197 kB ( 3%) ggc

4.8.0 -O0 29s 2111318kB
 integrated RA        :   8.88 (31%) usr   0.41 (13%) sys   9.32 (29%) wall  206439 kB (18%) ggc
 LRA non-specific     :   5.50 (19%) usr   0.08 ( 3%) sys   5.59 (17%) wall    4600 kB ( 0%) ggc

4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I suspect domwalk ...) >2.5GB

applying domwalk fix ...

4.8.0 -O1 still awfully slow

the smaller testcase is also tested on http://gcc.opensuse.org/c++bench/random/ but it runs OOM at -O1 and -O2 there (there is a virtual ulimit of 1GB due to lack of resources on the machine), but -O3 surprisingly works so you can see a recent time report there (also for -O0). At -O3 you can see there (everything > 10%):

 tree loop invariant motion :  53.83 (13%) usr
 PRE                        : 116.26 (29%) usr
 LRA hard reg assignment    :  73.68 (18%) usr
 load CSE after reload      :  38.67 (10%) usr

ISTR the testcases are not exactly exposing the same issues.
FYI, the original file ( http://gcc.gnu.org/bugzilla/attachment.cgi?id=17377 ) can be compiled with 'clang', albeit slowly:

sergei@amdam2:~/gcc_bug> time ~/AFSWD/install/LLVM-3.2/binsh/clang -v gap_TlnLv4.c -O2 -c
clang version 3.2 (tags/RELEASE_32/final)
Target: i386-pc-linux-gnu
Thread model: posix
 "/mnt/sdb5/qemu/20121021/LLVM-3.2/bin/clang" -cc1 -triple i386-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name gap_TlnLv4.c -mrelocation-model static -fmath-errno -masm-verbose -mconstructor-aliases -target-cpu pentium4 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file /home/sergei/gcc_bug/gap_TlnLv4.o -resource-dir /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2 -fmodule-cache-path /tmp/clang-module-cache -internal-isystem /usr/local/include -internal-isystem /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O2 -fdebug-compilation-dir /home/sergei/gcc_bug -ferror-limit 19 -fmessage-length 182 -mstackrealign -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o gap_TlnLv4.o -x c gap_TlnLv4.c
clang -cc1 version 3.2 based upon LLVM 3.2svn default target i386-pc-linux-gnu
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include
 /usr/include
End of search list.

real 58m50.364s
user 53m25.128s
sys 0m11.641s
sergei@amdam2:~/gcc_bug>

Memory consumption is about 186M.
(In reply to comment #22)
> 4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I
> suspect domwalk ...) >2.5GB
>
> applying domwalk fix ...

It is LIM, for sure. I've been watching in GDB for a while at some back traces, and it's spent minutes already in this DOM walk:

#5 0x0000000000b841e1 in walk_dominator_tree (walk_data=0x7fffffffdd60, bb=0x7fffef033958) at ../../trunk/gcc/domwalk.c:187
#6 0x0000000000c02d73 in determine_invariantness () at ../../trunk/gcc/tree-ssa-loop-im.c:1189
#7 tree_ssa_lim () at ../../trunk/gcc/tree-ssa-loop-im.c:2632
#8 0x000000000075bcd7 in execute_one_pass (pass=0x12323e0 <pass_lim>) at ../../trunk/gcc/passes.c:2330
#9 0x000000000075c0f5 in execute_pass_list (pass=0x12323e0 <pass_lim>) at ../../trunk/gcc/passes.c:2378

This is supposed to be cheap. Is this a known bottleneck?

Pathetic...
(In reply to comment #24)
> It is LIM, for sure. I've been watching in GDB for a while at some
> back traces, and it's spent minutes already in this DOM walk:

After this, it's spending even more time in refs_independent_p, doing bitmap tests (ah! a test case for my splay tree bitmaps!).

Is the refs_independent_p relation symmetric? It certainly looks that way. If so, one way to halve the work done here is to build only half the "independence" graph. Currently it builds a full square:

      if (mem_refs_may_alias_p (ref1->mem, ref2->mem,
                                &memory_accesses.ttae_cache))
        {
          bitmap_set_bit (ref1->dep_ref, ref2->id);
          bitmap_set_bit (ref2->dep_ref, ref1->id);
          if (dump_file && (dump_flags & TDF_DETAILS))
            fprintf (dump_file, "dependent.\n");
          return false;
        }
      else
        {
          bitmap_set_bit (ref1->indep_ref, ref2->id);
          bitmap_set_bit (ref2->indep_ref, ref1->id);
          if (dump_file && (dump_flags & TDF_DETAILS))
            fprintf (dump_file, "independent.\n");
          return true;
        }

It's hard to say how much memory is wasted here (obstack_memory_used is still broken and overflows), but it's probably x*100MB if not more.
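The "build only half the graph" idea above can be sketched generically: because the relation is symmetric, each unordered pair {i, j} needs only one cache entry, so a triangular matrix indexed by the canonicalized pair halves the storage. A minimal illustration in plain C (a dense triangular array with illustrative names, not GCC's per-ref bitmap implementation):

```c
/* Sketch of halving a symmetric dependence cache.  Since
   dep(a,b) == dep(b,a), canonicalize each query to (min,max) and keep
   one triangular matrix instead of setting two bits per pair. */
#include <assert.h>
#include <string.h>

#define NREFS 64

/* One byte per unordered pair: NREFS*(NREFS+1)/2 entries instead of
   NREFS*NREFS. */
static unsigned char dep_tri[NREFS * (NREFS + 1) / 2];

/* Index of the unordered pair {i, j} in the upper triangle:
   row i starts after rows 0..i-1, which hold (NREFS - k) entries
   each, i.e. at offset i*NREFS - i*(i-1)/2. */
static int tri_index (int i, int j)
{
  if (i > j)
    {
      int t = i; i = j; j = t;  /* canonicalize: i <= j */
    }
  return i * NREFS - i * (i - 1) / 2 + (j - i);
}

static void set_dependent (int i, int j)
{
  dep_tri[tri_index (i, j)] = 1;
}

static int is_dependent (int i, int j)
{
  return dep_tri[tri_index (i, j)];
}
```

One write per pair then serves queries in both orders, which is exactly the saving over setting both `ref1->dep_ref` and `ref2->dep_ref`.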
(In reply to comment #23) > FYI, the original file ( http://gcc.gnu.org/bugzilla/attachment.cgi?id=17377 ) > can be compiled with 'clang', albeit slowly: ... > Memory consumption is about 186M. How have you measured this?
Compilation finished after ~3 hours and consuming at least 3GB (from top - I forgot to use memmax2...).

Winners in the "geez, I'm slow for this test case" list:

 PRE                        : 6241.75 (57%) usr
 tree loop invariant motion : 4194.83 (38%) usr
 TOTAL                      : 11021.46

Time is mostly spent in the alias oracle.

For LIM I'm going to test using a triangle instead of a square dep/indep matrix, as a first step. Richi, PRE is yours so... ;-)
On Wed, 6 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > Steven Bosscher <steven at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |steven at gcc dot gnu.org > > --- Comment #24 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-06 23:39:27 UTC --- > (In reply to comment #22) > > 4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I > > suspect domwalk ...) >2.5GB > > > > applying domwalk fix ... > > It is LIM, for sure. I've been watching in GDB for a while at some > back traces, and it's spent minutes already in this DOM walk: > > #5 0x0000000000b841e1 in walk_dominator_tree (walk_data=0x7fffffffdd60, > bb=0x7fffef033958) at ../../trunk/gcc/domwalk.c:187 > #6 0x0000000000c02d73 in determine_invariantness () at > ../../trunk/gcc/tree-ssa-loop-im.c:1189 > #7 tree_ssa_lim () at ../../trunk/gcc/tree-ssa-loop-im.c:2632 > #8 0x000000000075bcd7 in execute_one_pass (pass=0x12323e0 <pass_lim>) at > ../../trunk/gcc/passes.c:2330 > #9 0x000000000075c0f5 in execute_pass_list (pass=0x12323e0 <pass_lim>) at > ../../trunk/gcc/passes.c:2378 > > This is supposed to be cheap. Is this a known bottle-neck? > > Pathetic... Yes, it's known - and there are several known (to me ...) ways to make constant factor compile-time and memory-usage improvements... (I _think_ we have a bug for LIMs slowness, if you can't find it quickly you can create one and assign me - I have some TLC patches locally queued for 4.9, but they don't help the slowness very much)
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > --- Comment #25 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 00:08:26 UTC --- > (In reply to comment #24) > > (In reply to comment #22) > > > 4.8.0 -O2 (terminated after 9 minutes waiting, LIM being the offender, I > > > suspect domwalk ...) >2.5GB > > > > > > applying domwalk fix ... > > > > It is LIM, for sure. I've been watching in GDB for a while at some > > back traces, and it's spent minutes already in this DOM walk: > > After this, it's spending even more time in refs_independent_p, doing > bitmap tests (ah! a test case for my splay tree bitmaps!). > > Is the refs_independent_p relation symmetric? It certainly looks that > way. If so, one way to halve the work done here is to build only half > the "independence" graph. Currently it builds a full square: > > if (mem_refs_may_alias_p (ref1->mem, ref2->mem, > &memory_accesses.ttae_cache)) > { > bitmap_set_bit (ref1->dep_ref, ref2->id); > bitmap_set_bit (ref2->dep_ref, ref1->id); > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, "dependent.\n"); > return false; > } > else > { > bitmap_set_bit (ref1->indep_ref, ref2->id); > bitmap_set_bit (ref2->indep_ref, ref1->id); > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, "independent.\n"); > return true; > } > > It's hard to say how much memory is wasted here (obstack_memory_used > is still broken and overflows), but it's probably x*100MB if not more. Yeah, one of my minor TLC patches. Most of the excessive memory usage for regular testcases can be fixed by doing LIM on all siblings of the loop tree independently, instead of keeping data on all loops in the function live at the same time (I've not finished _that_ TLC patch - and it won't help for insanely deep nests, of course)
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > --- Comment #27 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 08:09:59 UTC --- > Compilation finished after ~3 hours and consuming at least 3GB (from top - I > forgot to use memmax2...). > > Winners in the "geez, I'm slow for this test case" list: > > PRE : 6241.75 (57%) usr > tree loop invariant motion : 4194.83 (38%) usr > TOTAL : 11021.46 > > Time is mostly spent in the alias oracle. > > For LIM I'm going to test using triangle instead of square dep/indep > matrix, as a first step. Richi, PRE is yours so... ;-) Actually PRE is yours - it's the RTL PRE ;) Richard.
(In reply to comment #30) Hmm, RTL PRE isn't really mine either, but I probably know it as well as anyone else, so I will have a look. It's probably a similar issue as for gimple LIM: alias checks. Can you attach your TLC patches for gimple LIM to this PR, please?
(In reply to comment #26) > (In reply to comment #23) > > FYI, the original file ( http://gcc.gnu.org/bugzilla/attachment.cgi?id=17377 ) > > can be compiled with 'clang', albeit slowly: > ... > > Memory consumption is about 186M. > > How have you measured this? From time to time I was looking at output of 'top'. The compiler pretty quickly reaches the 186M +/- mark and stays at it.
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > Steven Bosscher <steven at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |ASSIGNED > AssignedTo|unassigned at gcc dot |steven at gcc dot gnu.org > |gnu.org | > > --- Comment #31 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 09:58:03 UTC --- > (In reply to comment #30) > Hmm, RTL PRE isn't really mine either, but I probably know it as well as > anyone else, so I will have a look. It's probably a similar issue as for > gimple LIM: alias checks. > > Can you attach your TLC patches for gimple LIM to this PR, please? Bah, I seem to have them in some local tree only but cannot find it. Maybe I accidentially reverted :/ ISTR moving stuff to obstacks as well (LIM has the heaviest load on malloc/free). Oh well.
Somehow, with -O3 LLVM clang works a little bit faster than with -O2 - 54 minutes instead of 58 minutes, though this might be a random variation:

sergei@amdam2:~/gcc_bug> time ~/AFSWD/install/LLVM-3.2/binsh/clang -v gap_TlnLv4.c -O3 -c
clang version 3.2 (tags/RELEASE_32/final)
Target: i386-pc-linux-gnu
Thread model: posix
 "/mnt/sdb5/qemu/20121021/LLVM-3.2/bin/clang" -cc1 -triple i386-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name gap_TlnLv4.c -mrelocation-model static -fmath-errno -masm-verbose -mconstructor-aliases -target-cpu pentium4 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file /home/sergei/gcc_bug/gap_TlnLv4.o -resource-dir /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2 -fmodule-cache-path /tmp/clang-module-cache -internal-isystem /usr/local/include -internal-isystem /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -fdebug-compilation-dir /home/sergei/gcc_bug -ferror-limit 19 -fmessage-length 182 -mstackrealign -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o gap_TlnLv4.o -x c gap_TlnLv4.c
clang -cc1 version 3.2 based upon LLVM 3.2svn default target i386-pc-linux-gnu
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include
 /usr/include
End of search list.

real 54m18.212s
user 52m56.062s
sys 0m8.085s
sergei@amdam2:~/gcc_bug>

Memory consumption appears to be the same as with -O2.
(In reply to comment #34)
> Memory consumption appears to be the same as with -O2.

Can you measure the peak memory with time?

/usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%I outs=%O mfaults=%R waits=%w'
(In reply to comment #29) > Yeah, one of my minor TLC patches. Most of the excessive memory > usage for regular testcases can be fixed by doing LIM on > all siblings of the loop tree independently, instead of keeping > data on all loops in the function live at the same time > (I've not finished _that_ TLC patch - and it won't help for > insanely deep nests, of course) It would be very helpful if you can try to recover this one at least, or explain a bit more what you had done. I don't see right away in the code what you mean.
(In reply to comment #35)
> Can you measure the peak memory with time?
>
> /usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%I
> outs=%O mfaults=%R waits=%w'

+ /usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%Iouts=%O mfaults=%R waits=%w' /mnt/sdb8/sergei/AFSWD_debug/20121021/LLVM-3.2/bin/clang -v gap_TlnLv4.c -O3 -c
clang version 3.2 (tags/RELEASE_32/final)
Target: i386-pc-linux-gnu
Thread model: posix
 "/mnt/sdb5/qemu/20121021/LLVM-3.2/bin/clang" -cc1 -triple i386-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name gap_TlnLv4.c -mrelocation-model static -fmath-errno -masm-verbose -mconstructor-aliases -target-cpu pentium4 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file /home/sergei/gcc_bug/gap_TlnLv4.o -resource-dir /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2 -fmodule-cache-path /tmp/clang-module-cache -internal-isystem /usr/local/include -internal-isystem /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -fdebug-compilation-dir /home/sergei/gcc_bug -ferror-limit 19 -fmessage-length 182 -mstackrealign -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o gap_TlnLv4.o -x c gap_TlnLv4.c
clang -cc1 version 3.2 based upon LLVM 3.2svn default target i386-pc-linux-gnu
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /mnt/sdb5/qemu/20121021/LLVM-3.2/bin/../lib/clang/3.2/include
 /usr/include
End of search list.
real=3323.86 user=3224.22 system=8.70 share=97%% maxrss=0 ins=82598outs=8720 mfaults=193511 waits=669

I am not sure 'maxrss=0' makes sense. Anyway, several minutes before completion 'top' showed 224m (or 228m - I do not remember exactly) in the VIRT column.
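The maxrss=0 above is most likely a kernel limitation rather than a 'time' bug: to my knowledge Linux only began maintaining ru_maxrss (which /usr/bin/time's %M reports) in kernel 2.6.32, and this machine runs 2.6.22. On a newer kernel the peak RSS can also be read programmatically; a minimal sketch (Linux reports ru_maxrss in kilobytes):

```c
/* Read the process's peak resident set size via getrusage().
   On Linux ru_maxrss is in kilobytes; kernels before 2.6.32 did not
   maintain it and report 0, matching the maxrss=0 seen above. */
#include <sys/resource.h>

static long peak_rss_kb (void)
{
  struct rusage ru;
  if (getrusage (RUSAGE_SELF, &ru) != 0)
    return -1;                  /* getrusage failed */
  return ru.ru_maxrss;          /* 0 on pre-2.6.32 kernels */
}
```

On an old kernel, falling back to watching VIRT/RES in 'top' (as done above) or reading VmHWM from /proc/self/status is the practical alternative.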
When 'gcc' works, the machine becomes very slow to respond due to memory usage growth; with 'clang' I do not notice any slowdown. My machine is dual-core with 2G of physical memory.
Created attachment 29612 [details]
Punt on loops with more memory references than LIM can handle

For the LIM problem, I'm thinking of a patch like this one (not tested yet). Basically, this means giving up on really large loops with many loads and/or stores. That's not an unreasonable thing to do anyway. Code motion from really big loops usually isn't an improvement.

Richi, could you have a look at this, and see if I've got it all right, more-or-less? LIM is quite complicated and I'm not sure if I should look at, or update, the set of optimizable loops somewhere.

With the patch, and with "-O2 -fgcse", I now have:

gap_TlnLv4.c: In function 'RDFT_49152_1':
gap_TlnLv4.c:37502:18: warning: -ftree-loop-im: number of memory references in loop exceeds the --param loops-max-datarefs-for-datadeps threshold [-Wdisabled-optimization]
 t12308[500] = t12304[6144*i0+1000];
             ^

 dead store elim1     :  57.70 ( 8%)
 dead store elim2     :  76.82 (10%)
 combiner             : 360.07 (48%)
 reload CSE regs      :  55.44 ( 7%)
 TOTAL                : 743.77
Memory usage is still pathetic. Some stats (mem stats from /proc/self/statm on *entry* of pass):

pass (#)                        size        resident
*warn_unused_result ( -1)       177106944   140742656
...
*init_datastructures ( -1)      209670144   169570304
ssa ( 20)                       209670144   169582592
*rebuild_cgraph_edges ( -1)     302575616   259780608
...
expand (169)                    390647808   349757440
vregs (170)                     477892608   425406464
...
dfinit (174)                    477892608   425627648
cse1 (175)                      615616512   565694464
...
dse1 (195)                      727654400   634572800
fwprop2 (196)                   791678976   723767296
...
combine (200)                   823115776   752177152
ce2 (201)                       6589759488  6512082944
...
csa (222)                       6938136576  1057169408

This means that "combine" explodes the memory footprint by ~5.4 GB. On entry to combine, the memory footprint is ~750MB. I don't think that is unreasonable. But after combine, the memory footprint jumps to ~7GB max, and ~1GB resident (after a ggc_collect before csa). I'm guessing most of that resident memory is combine leaving memory scattered so that pages can't be released.

dfinit is also a major contributor to the huge memory footprint. I have seen that in other test cases also, but so far I can't figure out how it can consume this much memory.

Interestingly, the initial memory footprint, out of the front end, is ~180 MB. From there, the memory footprint gradually goes up, but that is probably in part because ggc_collect is just never called (this is on a box with 16GB RAM), with GGC params:

--param ggc-min-expand=100 --param ggc-min-heapsize=131072

With both parameters halved, the combine problem obviously remains, and the dfinit problem becomes more pronounced:

expand (169)                    328032256   287195136
vregs (170)                     474415104   424263680
...
dfinit (174)                    474415104   424484864
cse1 (175)                      613801984   564596736
...
combine (200)                   823496704   752828416
ce2 (201)                       6590169088  6512717824

The good news: there clearly is a lot of room for improvement for this test case. :-)
(In reply to comment #31) > (In reply to comment #30) > Hmm, RTL PRE isn't really mine either, but I probably know it as well as > anyone else, so I will have a look. It's probably a similar issue as for > gimple LIM: alias checks. > > Can you attach your TLC patches for gimple LIM to this PR, please? Found it on some testing machine. I merged it and saw to it that it compiles and passes tree-ssa.exp (some ???s remain, I've added a TODO). The idea should be clear, and it's memory savings only and it will eventually be a bit slower (unless it is optimized further).
Created attachment 29616 [details] make LIM work per outermost loop
On Thu, 7 Mar 2013, steven at gcc dot gnu.org wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326 > > --- Comment #38 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-07 22:15:39 UTC --- > Created attachment 29612 [details] > --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29612 > Punt on loops with more memory references than LIM can handle > > For the LIM problem, I'm thinking of a patch like this one (not tested > yet). Basically, this means giving up on really large loops with many > loads and/or stores. That's not an unreasonable thing to do anyway. > Code motion from really big loops usually isn't an improvement. > > Richi, could you have a look at this, and see if I've got it all right, > more-or-less? LIM is quite complicated and I'm not sure if I should > look at, or update, the set of optimizable loops somewhere. Yeah, well - it should be easy to avoid most overhead (even collecting the memory references) with my proposed scheme (see patch). First, for each BB count the number of memory stmts (easy, look for VUSEs), then, when walking the set of outermost loops we want to apply LIM to sum over its BB counts and instead walk over its siblings if the number is too large. But yes, limiting it should be done as it performs O(n^2) dependence checks (well, O(#of stores * #of memory references), so for a low number of stores it's quite cheap). I've yet to recover the obstack-ification of struct depend, struct mem_ref, struct mem_ref_locs and lim_aux_data allocations ... (LIM is the biggest load on malloc/free). 
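Richard's proposed scheme - pre-count the memory statements per basic block, then, when walking the set of outermost loops, sum a loop's block counts and fall back to its children when the total exceeds the limit - can be sketched generically. The loop-tree structures below are illustrative stand-ins, not GCC's `struct loop`:

```c
/* Generic sketch of "skip too-big loops but still try their children".
   A loop's memory-statement count is the sum over the blocks that
   belong directly to it plus the counts of its sub-loops. */
#include <assert.h>
#include <stddef.h>

struct bb
{
  int n_mem_stmts;              /* pre-counted, e.g. by scanning for VUSEs */
};

struct loop
{
  struct bb **bbs;              /* blocks belonging directly to this loop */
  int n_bbs;
  struct loop *inner;           /* first child loop */
  struct loop *next;            /* next sibling loop */
};

static int
loop_mem_stmts (const struct loop *l)
{
  int n = 0;
  for (int i = 0; i < l->n_bbs; i++)
    n += l->bbs[i]->n_mem_stmts;
  for (const struct loop *c = l->inner; c; c = c->next)
    n += loop_mem_stmts (c);
  return n;
}

/* Count the loop nests that would get invariant motion applied: a
   loop under the threshold is processed as a whole nest; one over the
   threshold is skipped, but its children are tried instead. */
static int
count_lim_roots (const struct loop *l, int max_refs)
{
  int n = 0;
  for (; l; l = l->next)
    {
      if (loop_mem_stmts (l) <= max_refs)
        n++;                                      /* process whole nest */
      else
        n += count_lim_roots (l->inner, max_refs); /* too big: recurse */
    }
  return n;
}
```

This keeps the dependence analysis O(#stores * #refs) per processed nest while still optimizing the small inner loops of a huge outer loop.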
> With the patch, and with "-O2 -fgcse", I now have: > > gap_TlnLv4.c: In function 'RDFT_49152_1': > gap_TlnLv4.c:37502:18: warning: -ftree-loop-im: number of memory references in > loop exceeds the --param loops-max-datarefs-for-datadeps threshold > [-Wdisabled-optimization] > t12308[500] = t12304[6144*i0+1000]; > ^ > > dead store elim1 : 57.70 ( 8%) > dead store elim2 : 76.82 (10%) > combiner : 360.07 (48%) > reload CSE regs : 55.44 ( 7%) > TOTAL : 743.77 Nice. Well ... ;)
The problem with combine is only "collateral damage" from what dse1 is doing to this function. It's loading stored values into registers and replacing re-loads with those registers, as in:

@@ -747831,6 +780399,7 @@
   474783: r569578:SF=r569577:SF-r177936:SF
       REG_DEAD r569577:SF
       REG_DEAD r177936:SF
+  605734: r662460:SF=r569578:SF
   474784: [const(`t18571.26403'+0x5fe8)]=r569578:SF
       REG_DEAD r569578:SF
   474786: r177942:SF=[const(`t12303.26335'+0x9008)]
...
   487110: [const(`t24716.26406'+0x2ff0)]=r578824:SF
       REG_DEAD r578824:SF
-  487113: r578827:SF=[const(`t18571.26403'+0x5fe8)]
+  487113: r578827:SF=r662460:SF
   487114: [const(`t24716.26406'+0x2ff4)]=r578827:SF
       REG_DEAD r578827:SF

The distance between the store and load is large, and that is what is causing combine to blow up -- walking all the insns between 605734 and 487113 (about 500 of them, which appears to be typical). This also increases register pressure, causing the poor IRA/LRA/postreload numbers.

Also not great: dse is loading the same stored value into multiple registers, e.g.:

@@ -147272,6 +175190,8 @@
   110394: r288307:SF=r288306:SF-r37078:SF
       REG_DEAD r288306:SF
       REG_DEAD r37078:SF
+  601084: r657810:SF=r288307:SF
+  601085: r657811:SF=r288307:SF
   110395: [r288302:DI]=r288307:SF
       REG_DEAD r288307:SF

but I'm guessing (and verifying now) that this is cleaned up by fwprop2.
Created attachment 29628 [details] Re-use store register if possible This patch resolves the issue mentioned in comment #43, and makes combine run in reasonable time.
Patches posted:

* Restrict GIMPLE loop invariant code motion of loop-invariant loads and stores to loops with fewer memory references than a certain maximum that is controlled with --param loops-max-datarefs-for-datadeps from the command line.
  http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00380.html

* Do not create new pseudo-registers for load-after-store transformations in RTL dead store elimination. This reduces the memory footprint after DSE by ~2 percent, and avoids the compile time and memory usage explosion in combine because it gets presented fewer single-def/single-use register moves that are really just register copies.
  http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00379.html

* Make gcse.c respect -fno-gcse-lm. For the RTL PRE problem, this means compile time is reasonable with -fno-gcse-lm. A follow-up patch will implement some mechanism to disable load motion automatically on extreme test cases like the one from this PR.
  http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00386.html

The remaining compile time bottlenecks are:

- RTL dead store elimination in its analysis phase. This is mostly time spent in dependence tests in alias analysis for instructions in a single basic block, so it's only a problem for test cases where there is a huge number of loads and stores in each basic block. I don't think it is worth speeding up DSE for such extreme cases.

- Post-reload CSE, because it is in the worst case quadratic in the number of instructions in a basic block. In most practical cases, post-reload CSE scales linearly with the number of instructions in a basic block, but with a large constant bound. It looks up and down through the instruction chain to see if a reg is not clobbered between a use and a def. Because it only has to do so with hard registers, the typical bound is closer to "number of insns in basic block" * "number of hard registers". This is fine; I am not going to try and improve this.
Created attachment 29649 [details]
symmetry in reference testing

Exploit symmetry in reference testing.
With caching affine-combination compute and ao_ref compute I have it down to

 tree loop invariant motion: 596.91 (79%) usr   0.73 (29%) sys 599.77 (78%) wall   31135 kB ( 3%) ggc

without limiting anything.

I tried a more intelligent ref_indep_in_loop (avoiding another
quadraticness in loop depth ...) but it gets slower ...

*************** ref_indep_loop_p_1 (struct loop *loop, m
*** 2230,2242 ****
    bitmap refs_to_check;
    unsigned i;
    bitmap_iterator bi;
!   bool ret = true, stored = bitmap_bit_p (ref->stored, loop->num);
    mem_ref_p aref;

!   if (stored)
!     refs_to_check = memory_accesses.all_refs_in_loop[loop->num];
    else
!     refs_to_check = memory_accesses.all_refs_stored_in_loop[loop->num];

    EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
      {
--- 2233,2245 ----
    bitmap refs_to_check;
    unsigned i;
    bitmap_iterator bi;
!   bool ret = true;
    mem_ref_p aref;

!   if (bitmap_bit_p (ref->stored, loop->num))
!     refs_to_check = memory_accesses.refs_in_loop[loop->num];
    else
!     refs_to_check = memory_accesses.refs_stored_in_loop[loop->num];

    EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
      {
*************** ref_indep_loop_p_1 (struct loop *loop, m
*** 2259,2272 ****
  static bool
  ref_indep_loop_p (struct loop *loop, mem_ref_p ref)
  {
-   bool ret;
-
    if (bitmap_bit_p (ref->indep_loop, loop->num))
      return true;
    if (bitmap_bit_p (ref->dep_loop, loop->num))
      return false;

!   ret = ref_indep_loop_p_1 (loop, ref);

    if (dump_file && (dump_flags & TDF_DETAILS))
      fprintf (dump_file, "Querying dependencies of ref %u in loop %d: %s\n",
--- 2262,2286 ----
  static bool
  ref_indep_loop_p (struct loop *loop, mem_ref_p ref)
  {
    if (bitmap_bit_p (ref->indep_loop, loop->num))
      return true;
    if (bitmap_bit_p (ref->dep_loop, loop->num))
      return false;

!   bool ret = true;
!   struct loop *inner = loop->inner;
!   while (inner)
!     {
!       if (!ref_indep_loop_p (inner, ref))
!         {
!           ret = false;
!           break;
!         }
!       inner = inner->next;
!     }
!
!   if (ret)
!     ret = ref_indep_loop_p_1 (loop, ref);

    if (dump_file && (dump_flags & TDF_DETAILS))
      fprintf (dump_file, "Querying dependencies of ref %u in loop %d: %s\n",

with it we can also get rid of the all_refs_in_loop bitmap (only
all_refs_stored_in_loop is then used, by store-motion).

I don't see why the above should result in slower operation :/  It should
save us the time to re-query references in sub-loops (yes, they should be
cached already ... still walking the bitmap is not free).

To mimic parts of this idea the dep_loop/indep_loop checking/setting can
be improved to cover outer loops like with

@@ -2296,7 +2218,12 @@ record_indep_loop (struct loop *loop, me
   if (indep)
     bitmap_set_bit (ref->indep_loop, loop->num);
   else
-    bitmap_set_bit (ref->dep_loop, loop->num);
+    do
+      {
+        bitmap_set_bit (ref->dep_loop, loop->num);
+        loop = loop_outer (loop);
+      }
+    while (loop != current_loops->tree_root);
 }

and

@@ -2339,8 +2266,13 @@ ref_indep_loop_p (struct loop *loop, mem
 {
   bool ret;

-  if (bitmap_bit_p (ref->indep_loop, loop->num))
-    return true;
+  struct loop *oloop = loop;
+  do
+    {
+      if (bitmap_bit_p (ref->indep_loop, oloop->num))
+        return true;
+      oloop = loop_outer (oloop);
+    }
+  while (oloop != current_loops->tree_root);

   if (bitmap_bit_p (ref->dep_loop, loop->num))
     return false;

(no difference in compile-time)

anyway, another obvious improvement is to merge the two two-bitmap sets
(they are tri-state: dependent, independent and unknown). That's more
cache-friendly (now, where's that tri-state "bitmap" implementation that
also saves the half bit we waste ... ;))  It helps a bit.
Removing all the caching (dep/indep_loop and dep/indep_ref) makes things
faster ... (they have a hit rate of only 5% resp. 6.2%)

Fastest timing so far:

 tree loop invariant motion:  57.42 (26%) usr

that's without any limiting, with just the dep/indep_ref cache removed,
and with a (working) patch according to comment #47.
Author: rguenth
Date: Mon Mar 18 08:42:57 2013
New Revision: 196768

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=196768
Log:
2013-03-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/39326
	* tree-ssa-loop-im.c (refs_independent_p): Exploit symmetry.
	(struct mem_ref): Replace mem member with ao_ref typed member.
	(MEM_ANALYZABLE): Adjust.
	(memref_eq): Likewise.
	(mem_ref_alloc): Likewise.
	(gather_mem_refs_stmt): Likewise.
	(mem_refs_may_alias_p): Use the ao_ref to query the alias oracle.
	(execute_sm_if_changed_flag_set): Adjust.
	(execute_sm): Likewise.
	(ref_always_accessed_p): Likewise.
	(refs_independent_p): Likewise.
	(can_sm_ref_p): Likewise.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-loop-im.c
Author: rguenth
Date: Thu Mar 21 14:45:36 2013
New Revision: 196878

URL: http://gcc.gnu.org/viewcvs?rev=196878&root=gcc&view=rev
Log:
2013-03-21  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/39326
	* tree-ssa-loop-im.c (UNANALYZABLE_MEM_ID): New define.
	(MEM_ANALYZABLE): Adjust.
	(record_mem_ref_loc): Move bitmap ops ...
	(gather_mem_refs_stmt): ... here.  Use the shared mem-ref for
	unanalyzable refs, do not record locations for it.
	(analyze_memory_references): Allocate ref zero as shared
	unanalyzable ref.
	(refs_independent_p): Do not test for unanalyzed mems here.
	(ref_indep_loop_p_1): Special-case disambiguation against
	the unanalyzed ref.
	(ref_indep_loop_p): Assert we are not queried for the unanalyzed mem.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-loop-im.c
> (struct mem_ref): Replace mem member with ao_ref typed member.

RTL gcse (-O2) suffers from the same slowness in its dependence tests.
Caching ao_ref instead of just mem and alias-set in RTL mem-attrs would
improve this tremendously.

A quick try reveals that gengtype is not at all happy with including
something in rtl.h though :/
(In reply to comment #51)
> > (struct mem_ref): Replace mem member with ao_ref typed member.
>
> RTL gcse (-O2) suffers from the same slowness in its dependence tests.
> Caching ao_ref instead of just mem and alias-set in RTL mem-attrs would
> improve this tremendously.
>
> A quick try reveals that gengtype is not at all happy with including
> something in rtl.h though :/

Doing that (caching the ao_ref) doesn't help as much as it did on the
GIMPLE level. The most time-consuming part of canon_true_dependence
(which consumes 56% of compile-time) is find_base_term (37% of that 56%),
followed by memrefs_conflict_p, and only then (after caching the ao_ref)
rtx_refs_may_alias_p (18%).

The find_base_term result is another thing that could easily be cached
when doing dependence checks that repeatedly use one or another operand.
For this testcase the most expensive caller is compute_transp.

There is an obvious way to do fewer find_base_term calls in alias.c
itself. I'm going to clean things up there a bit.
With the PR59802 and PR38518 fixes on trunk I see for the 2nd testcase at -O2

 PRE                     :  19.23 (35%) usr   0.01 ( 1%) sys  19.22 (34%) wall    1421 kB ( 0%) ggc
 combiner                :  13.24 (24%) usr   0.14 (19%) sys  13.37 (24%) wall  402488 kB (73%) ggc
 integrated RA           :   9.28 (17%) usr   0.35 (47%) sys   9.65 (17%) wall   26707 kB ( 5%) ggc
 TOTAL                   :  55.56             0.74            56.34           550215 kB

compared to the 4.8 branch:

 tree loop invariant motion:  17.29 (22%) usr
 PRE                     :  24.25 (31%) usr
 combiner                :  12.52 (16%) usr   0.14 (17%) sys  12.66 (16%) wall  402491 kB (74%) ggc
 integrated RA           :   7.96 (10%) usr   0.32 (38%) sys   8.25 (11%) wall   26420 kB ( 5%) ggc
 TOTAL                   :  77.53             0.84            78.53           542544 kB

so it's an improvement (tree LIM is off the radar completely, RTL PRE is
improved a bit - looking at it now).
For the original testcase on trunk we get at -O1

 tree loop invariant motion:  37.20 (16%) usr   0.02 ( 1%) sys  37.20 (16%) wall   12127 kB ( 1%) ggc
 dead store elim1        :  17.42 ( 7%) usr   0.04 ( 2%) sys  17.44 ( 7%) wall   28345 kB ( 3%) ggc
 dead store elim2        :  26.22 (11%) usr   0.00 ( 0%) sys  26.20 (11%) wall   18664 kB ( 2%) ggc
 combiner                :  15.36 ( 6%) usr   0.04 ( 2%) sys  15.39 ( 6%) wall   70073 kB ( 7%) ggc
 LRA hard reg assignment :  74.07 (31%) usr   0.02 ( 1%) sys  74.05 (31%) wall       0 kB ( 0%) ggc
 reload CSE regs         :  12.67 ( 5%) usr   0.00 ( 0%) sys  12.66 ( 5%) wall   13315 kB ( 1%) ggc
 TOTAL                   : 237.86             1.83           239.60           999181 kB

so that's still slow (memory usage is peaking at about 1.5GB now, during RA).

tree LIM:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds      calls   s/call   s/call  name
 4.22    120.72      9.83     299080     0.00     0.00  ref_indep_loop_p_2(loop*, mem_ref*, bool)
 1.89    151.82      4.41 1712943886     0.00     0.00  mem_refs_may_alias_p(mem_ref*, mem_ref*, pointer_map_t**) [clone .constprop.60]
 0.21    197.02      0.48      12192     0.00     0.00  force_move_till(tree_node*, tree_node**, void*)

at -O2 this will blow up completely as it then runs into the affine-based
disambiguation.
Current GCC 6 numbers:

-O1 -g

 var-tracking dataflow   :  89.37 (60%) usr   0.09 (23%) sys  89.53 (59%) wall   11542 kB ( 3%) ggc
 var-tracking emit       :  47.70 (32%) usr   0.05 (13%) sys  47.81 (32%) wall    4546 kB ( 1%) ggc
 TOTAL                   : 150.06             0.39           150.60           441618 kB

and 580MB peak memory use

-O2

 PRE                     :  11.41 (40%) usr   0.01 ( 3%) sys  11.43 (39%) wall     921 kB ( 0%) ggc
 integrated RA           :   4.78 (17%) usr   0.09 (26%) sys   4.86 (17%) wall   14129 kB ( 4%) ggc
 TOTAL                   :  28.84             0.34            29.21           375189 kB

and 820MB peak memory use

-O3

 load CSE after reload   :  30.42 (49%) usr   0.02 ( 6%) sys  30.46 (49%) wall       0 kB ( 0%) ggc
 TOTAL                   :  62.08             0.32            62.47           415741 kB

and 700MB peak memory use
"-O2 ... and 820MB peak memory use" vs "-O3 ... and 700MB peak memory use" -
according to my common sense -O3 is a stronger optimization than -O2, and
one should expect greater memory use.

So, can the above be an indication of yet another bug? I.e. -O3 in reality
does not do something it is supposed to do, while -O2 does?
On Wed, 17 Feb 2016, sergstesh at yahoo dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39326
>
> --- Comment #56 from Sergei Steshenko <sergstesh at yahoo dot com> ---
> "-O2 ... and 820MB peak memory use" vs "-O3 ... and 700MB peak memory use" -
> according to my common sense -O3 is stronger than -O2 optimization, and one
> should expect greater memory use.
>
> So, can the above be an indication of yet another bug ? I.e. -O3 in reality
> does not do something it is supposed to do while -O2 does this ?

More like -O3 cleans up the code, due to stronger optimization, before the
memory-hogging part comes in, and thus that part needs less memory to do
its work.
Not working on this => unassigned...
GCC 8.3, x86_64:

First testcase:

-O0: 1GB, 8.8s

-O1: 1.7GB, 242s

 alias stmt walking                 :  85.84 ( 35%)   0.82 ( 22%)  86.84 ( 35%)    2189 kB (  0%)
 tree CFG cleanup                   :  13.73 (  6%)   0.00 (  0%)  13.75 (  6%)    2055 kB (  0%)
 tree loop invariant motion         :  66.65 ( 27%)   0.09 (  2%)  66.85 ( 27%)   30968 kB (  2%)
 combiner                           :  16.87 (  7%)   0.02 (  1%)  16.90 (  7%)   18054 kB (  1%)
 reload CSE regs                    :  11.46 (  5%)   0.01 (  0%)  11.47 (  5%)   10633 kB (  1%)

-O2: 4.3GB, 1994s

 alias stmt walking                 : 126.85 (  6%)   1.14 ( 23%) 128.38 (  6%)    2875 kB (  0%)
 tree loop invariant motion         :  67.50 (  3%)   0.08 (  2%)  67.64 (  3%)   62356 kB (  2%)
 PRE                                :1644.36 ( 82%)   0.07 (  1%)1644.49 ( 82%)    2877 kB (  0%)

Second, smaller testcase:

-O1: 420MB, 16s

 alias stmt walking                 :   7.06 ( 43%)   0.08 ( 13%)   7.15 ( 42%)     621 kB (  0%)
 tree loop invariant motion         :   0.90 (  5%)   0.01 (  2%)   0.92 (  5%)    3497 kB (  2%)
 integrated RA                      :   2.89 ( 18%)   0.05 (  8%)   2.91 ( 17%)   14261 kB (  9%)

-O2: 780MB, 48s

 alias stmt walking                 :  10.23 ( 21%)   0.09 ( 14%)  10.31 ( 21%)     785 kB (  0%)
 tree loop invariant motion         :   0.94 (  2%)   0.01 (  2%)   0.96 (  2%)    7264 kB (  2%)
 PRE                                :  19.39 ( 40%)   0.00 (  0%)  19.40 ( 39%)     281 kB (  0%)
 integrated RA                      :   5.38 ( 11%)   0.09 ( 14%)   5.46 ( 11%)   14033 kB (  4%)
 reload CSE regs                    :   2.28 (  5%)   0.00 (  0%)   2.28 (  5%)    2486 kB (  1%)

The other testcase:

-O1: 2.8GB, 761s

 alias stmt walking                 : 188.53 ( 25%)   1.55 ( 20%) 190.19 ( 25%)    3400 kB (  0%)
 tree CFG cleanup                   :  82.33 ( 11%)   0.02 (  0%)  82.35 ( 11%)    4103 kB (  0%)
 tree DSE                           :  18.31 (  2%)   0.00 (  0%)  18.32 (  2%)       0 kB (  0%)
 tree loop invariant motion         : 296.98 ( 39%)   0.18 (  2%) 297.22 ( 39%)   62472 kB (  2%)
 forward prop                       :  10.22 (  1%)   0.03 (  0%)  10.24 (  1%)   57889 kB (  2%)
 CSE                                :  14.47 (  2%)   0.01 (  0%)  14.47 (  2%)   54258 kB (  2%)
 combiner                           :  75.19 ( 10%)   0.03 (  0%)  75.22 ( 10%)   35852 kB (  1%)
 reload CSE regs                    :  24.39 (  3%)   0.01 (  0%)  24.40 (  3%)   21083 kB (  1%)

-O2: 9.2GB, 7634s

 alias stmt walking                 : 273.85 (  4%)   2.30 ( 22%) 276.44 (  4%)    4533 kB (  0%)
 tree loop invariant motion         : 284.31 (  4%)   0.46 (  4%) 284.89 (  4%)  125719 kB (  1%)
 PRE                                :6714.77 ( 88%)   0.44 (  4%)6716.39 ( 88%)    5747 kB (  0%)
PRE is all find_base_term exploding ...

The LIM case is all store_motion (), which is quadratic and the only user
of the memory-quadratic all_refs_stored_in_loop. The latter would be
reasonably easy to fix by iterating. The former is more difficult to fix;
we'd need to rewrite SM to work on loops inner-to-outer somehow. But
that's certainly doable.

Gating store-motion on optimize > 1 is possible as well (store motion also
has the bad side-effect of increasing register pressure by introducing IVs).
r12-7592, first testcase, x86_64:

-O0: 6s, 1GB

-O1: 264s, 1.4GB

 callgraph ipa passes               :  30.47 ( 12%)
 alias stmt walking                 :  67.44 ( 26%)
 tree loop invariant motion         : 120.86 ( 46%)

-O2: 1383s, 11.8GB

 alias stmt walking                 : 174.01 ( 13%)
 tree PRE                           :  87.85 (  6%)
 tree loop invariant motion         : 130.54 (  9%)
 tree slp vectorization             : 100.01 (  7%)
 PRE                                : 716.64 ( 52%)

quite some regression on the peak memory usage side here.
(In reply to Richard Biener from comment #61)
> r12-7592, first testcase, x86_64:
>
> -O0: 6s, 1GB
>
> -O1: 264s, 1.4GB
>
>  callgraph ipa passes               :  30.47 ( 12%)
>  alias stmt walking                 :  67.44 ( 26%)
>  tree loop invariant motion         : 120.86 ( 46%)
>
> -O2: 1383s, 11.8GB
>
>  alias stmt walking                 : 174.01 ( 13%)
>  tree PRE                           :  87.85 (  6%)
>  tree loop invariant motion         : 130.54 (  9%)
>  tree slp vectorization             : 100.01 (  7%)
>  PRE                                : 716.64 ( 52%)
>
> quite some regression on the peak memory usage side here.

r14-9064 -O2: 683s, 5GB

 alias stmt walking                 :  95.35 ( 14%)
 tree PRE                           :  58.27 (  9%)
 tree loop invariant motion         :  75.47 ( 11%)
 PRE                                : 335.15 ( 49%)

so improvements on the memory usage side. -O1 didn't change.