39157 – Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher

Bug 39157 - Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher

Summary: Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* ...

Status:	RESOLVED DUPLICATE of bug 26854

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.3.3

Importance:	P3 major
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	memory-hog

Depends on:
Blocks:

Reported:	2009-02-11 21:41 UTC by lucier
Modified:	2012-11-05 22:02 UTC (History)
CC List:	12 users (show)

See Also:
Host:	x86_64-unknown-linux-gnu
Target:	x86_64-unknown-linux-gnu
Build:	x86_64-unknown-linux-gnu
Known to work:	4.1.2
Known to fail:
Last reconfirmed:	2009-02-12 12:01:30

Attachments
gcc44-pr39157.patch (335 bytes, patch) 2009-02-12 14:21 UTC, Jakub Jelinek	Details \| Diff
gcc44-pr39157.patch (2.25 KB, patch) 2009-02-12 15:57 UTC, Jakub Jelinek	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description lucier 2009-02-11 21:41:47 UTC

With this compiler

[lucier@descartes gambit]$ gcc -v
Using built-in specs.
Target: powerpc64-unknown-linux-gnu
Configured with: ../../gcc-4.3.3/configure --prefix=/pkgs/gcc-4.3.3 --enable-languages=c --with-cpu=default64
Thread model: posix
gcc version 4.3.3 (GCC) 

with the file compiler.i found here:

http://www.math.purdue.edu/~lucier/bugzilla/8/

attempting to compile with these options:

gcc -m64 -mcpu=970 -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -rdynamic -shared

can't compile in 8GB of RAM.  With this compiler:

euler-77% /pkgs/gcc-4.2.3/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../gcc-4.2.3/configure --prefix=/pkgs/gcc-4.2.3 --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2 --with-mpfr=/pkgs/gmp-4.2.2
Thread model: posix
gcc version 4.2.3

and these options:

gcc -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared

it can't compiler in 20GB of RAM.  (That machine has only 16GB of RAM, so I killed the compile when it hit 20GB of physical+virtual memory.)

It compiles just fine in about 1GB of RAM with 

euler-76% gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/pkgs/gcc-4.1.2
Thread model: posix
gcc version 4.1.2

compiler.i is the output from the Gambit Scheme->C compiler; the source scheme program is from a standard benchmark suite for Scheme compilers.  So I found this by trying to change the code generator for Gambit and running the benchmark suite on x86_64.

I don't know how this can be "fixed".  Basically, the entire middle-end infrastructure since 4.1.* is telling people like me with computer-generated code like this to just go away (to put it very politely).  On Mac OS X 10.5.*, Apple bundles their version of 4.0.1, which compiles this just fine; on Red Hat 5.2, they bundle their version of 4.1.2 (I think, my RH5.2 box is down at the moment), which compiles this just fine; but on Ubuntu 8.10 or Fedora 10 you can't compile this because they bundle newer compilers.  (I guess I'll see if I can install 4.1.* on both of these.)

As a stopgap measure, perhaps someone can tell me what optimization level to use.  As you can see, I use -O1 and a few others (mainly -fschedule-insns2).  gcc 4.1.* and earlier compiled something like this just fine, but -O1 must mean something different now.

Comment 1 Richard Biener 2009-02-12 09:58:44 UTC

Is this the same as PR26854?  Note that it would be extremely helpful to have
a testcase that fits into less memory to be able to analyze this.  As you say
it is autogenerated source, can you shrink the testcase?

Thanks.

Comment 2 Jakub Jelinek 2009-02-12 10:41:15 UTC

The peak on x86_64 (with normal checking cc1) seems to be around 29.3GB, but most of the time top numbers were in the .5GB - 1GB or at most 1.3GB range, with just a signle very high peak.  Will now look closer where all the memory is needed.

Comment 3 Richard Biener 2009-02-12 10:46:07 UTC

If it is like PR26854 then it is all df memory.  The gambit stuff consists of
a state machine with loads of computed gotos (thus a nearly fully connected
CFG) IIRC.

Comment 4 Steven Bosscher 2009-02-12 11:26:39 UTC

Even for code with lots of computed gotos, the CFG should not be fully connected. We factorize computed gotos to avoid exactly that.  At least we used to.  Maybe the factorizing is broken, or it is undone somewhere too early?

Comment 5 Eric Botcazou 2009-02-12 11:40:23 UTC

> Even for code with lots of computed gotos, the CFG should not be fully
> connected. We factorize computed gotos to avoid exactly that.  At least we used
> to.  Maybe the factorizing is broken, or it is undone somewhere too early?

If I'm not mistaken the factoring is undone during RTL expansion.

Comment 6 Steven Bosscher 2009-02-12 12:01:30 UTC

In the past, we did not unfactor them (see e.g. Bug 15242).

Comment 7 Jakub Jelinek 2009-02-12 12:16:09 UTC

Adding -fno-move-loop-invariants to the x86_64 mentioned options results in
VIRT memory topping around 1224m (only because of IRA, before RA it never went above 1GB).  Seems it is really the loop2_invariant pass that ate 28GB of RAM.

Comment 8 Steven Bosscher 2009-02-12 12:49:33 UTC

If there is a test case that compiles in less than 4GB, I'll take this bug (I have no access to machines with more memory than that ;-)

Comment 9 Richard Biener 2009-02-12 12:58:55 UTC

PR26854 is "fixed" as well with -fno-move-loop-invariants.  It has a little
less peak memory requirement than the testcase here.

Comment 10 Jakub Jelinek 2009-02-12 13:16:36 UTC

Most of the memory is allocated in df_chain->block_pool:
p *df->problems_by_index[4]->block_pool
$37 = {name = 0xbc3114 "df_chain_block pool", id = 475, elts_per_block = 50, returned_free_list = 0x0, 
  virgin_free_list = 0x72ce7cae8 "", virgin_elts_remaining = 18, elts_allocated = 775734800, elts_free = 18, 
  blocks_allocated = 15514696, block_list = 0x72ce7c6e0, block_size = 1608, elt_size = 32}
p *cfun->cfg
$39 = {x_entry_block_ptr = 0x7ffff0cb11e0, x_exit_block_ptr = 0x7ffff0cb1240, x_basic_block_info = 0x7ffff283f000, 
  x_n_basic_blocks = 76672, x_n_edges = 135946, x_last_basic_block = 76676, x_label_to_block_map = 0x0, 
  x_profile_status = PROFILE_GUESSED, x_dom_computed = {DOM_NO_FAST_QUERY, DOM_NONE}, x_n_bbs_in_dom_tree = {76672, 0}, 
  max_jumptable_ents = 0, last_label_uid = 97987}

Comment 11 Jakub Jelinek 2009-02-12 14:21:21 UTC

Created attachment 17288 [details]
gcc44-pr39157.patch

Quick hack to avoid loop invariant motion from excessively large loops at -O1.
With this cc1 still tops at 1.2GB, but loop invariants are still moved out of reasonably sized loops.  For the huge ones I wonder how much it buys in anyway, there it increases register pressure a lot.  10000 is pretty arbitrary, I wonder if we want a param or something.

Comment 12 rguenther@suse.de 2009-02-12 14:23:21 UTC

Subject: Re:  Code that compiles fine in 1GB of memory
 with 4.1.2 requires > 20GB in 4.2.* and higher

On Thu, 12 Feb 2009, jakub at gcc dot gnu dot org wrote:

> ------- Comment #11 from jakub at gcc dot gnu dot org  2009-02-12 14:21 -------
> Created an attachment (id=17288)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17288&action=view)
> gcc44-pr39157.patch
> 
> Quick hack to avoid loop invariant motion from excessively large loops at -O1.
> With this cc1 still tops at 1.2GB, but loop invariants are still moved out of
> reasonably sized loops.  For the huge ones I wonder how much it buys in anyway,
> there it increases register pressure a lot.  10000 is pretty arbitrary, I
> wonder if we want a param or something.

I'd say for -O1 disable it completely, for -O2 and up use the
heuristic and add a param for it.

Richard.

Comment 13 Richard Biener 2009-02-12 15:47:05 UTC

Zdenek, in this case (and PR26854) can we make sure not to recognize loops
that involve the single non-local goto BB?  Maybe this would solve the
problem as well.

Comment 14 Jakub Jelinek 2009-02-12 15:57:14 UTC

Created attachment 17291 [details]
gcc44-pr39157.patch

Patch to add loop-invariant-max-bbs-in-loop parameter.

Comment 15 lucier 2009-02-12 16:35:01 UTC

Some comments (a lot went on while I was sleeping):

1.  Yes, this is similar to the test case of PR26854, but the C code generator has changed significantly since that test case was filed.  I don't know if the changes in the code generator really affect what's happening here, however.

2.  I'm trying to get a "moderately" sized test case that will compile in about 3GB of RAM, as Steven requested.  (The test case from PR26854 takes at least 7GB of RAM to compile on ppc64.)

3.  If the amount of memory and cpu time required by the test case at -O1 doesn't increase significantly when loop-invariant motion is performed on loops of size up to 10,000, then it would be good if the parameter at -O1 could be 10,000 instead of 100 (or at least larger than 100), as is suggested in the most recent patch.

Comment 16 Richard Biener 2009-02-12 16:52:14 UTC

Actually for PR26854 it is just one loop that is detected, covering all of
the function (with approx. 56000 basic blocks and one basic-block that has
edges to all other basic blocks in the loop).  So the default for the param
could indeed be 10000 here.  And possibly it's not really the large number
of blocks but that the in/out-degree of that single basic-block with the
computed goto.

Thus, I'd suggest again to not recognize such loops or maybe mark them specially?

Comment 17 rakdver@kam.mff.cuni.cz 2009-02-12 18:29:12 UTC

Subject: Re:  Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher

> Zdenek, in this case (and PR26854) can we make sure not to recognize loops
> that involve the single non-local goto BB?  Maybe this would solve the
> problem as well.

maybe; however, I think it would even make sense to add an option to the
loop analysis not to detect loops over certain size at all.  Most of the
loop optimization passes become senseless on loops with thousands of
basic blocks, just wasting the resources.

Comment 18 lucier 2009-02-12 19:54:47 UTC

There is now a file slatex.i at

http://www.math.purdue.edu/~lucier/bugzilla/8/

that compiles in about 650MB of memory with gcc-4.2.3 on x86-64 with the same options; I don't know if that will help Steven.

Comment 19 lucier 2009-02-12 20:51:09 UTC

Subject: Re:  Code that compiles fine in 1GB of
 memory with 4.1.2 requires > 20GB in 4.2.* and higher

On Thu, 2009-02-12 at 16:52 +0000, rguenth at gcc dot gnu dot org wrote:
> ------- Comment #16 from rguenth at gcc dot gnu dot org  2009-02-12 16:52 -------
> Actually for PR26854 it is just one loop that is detected, covering all of
> the function (with approx. 56000 basic blocks and one basic-block that has
> edges to all other basic blocks in the loop).

Richard:

I'm wondering if you could look at a smaller file that's generated in a
somewhat different way.

At

http://www.math.purdue.edu/~lucier/bugzilla/8/

there's a file _num.i.gz that I think should have smaller (nested) loops
than the entire file, for example, from the label

___L189__23__23_bignum_2e__2a_:

at line 50031 to just before label

___L190__23__23_bignum_2e__2a_:

at line 50105.

Moving loop invariants out of this loop might help if it detected as a
loop, but I don't know how to check whether it is.

Perhaps you might check and report whether this small loop is treated as
a loop or whether, again, the entire function is the only "loop"
detected.

Brad

Comment 20 Steven Bosscher 2009-02-13 07:44:13 UTC

Re: "Moving loop invariants out of this loop might help if it detected as a
loop, but I don't know how to check whether it is." (Comment #19):

It's not like there will not be any loop invariant code motion (LICM) at all anymore if the RTL LICM pass is disabled. There is an LICM pass on GIMPLE, and there is also PRE for GIMPLE (and lazy code motion for RTL but I think it disables itself for your test case).

The RTL LICM pass mostly cleans up after expand, i.e. moves things that are not exposed in GIMPLE. This is mostly just address calculations.

I think Zdenek is right in Comment #17. LICM on big loops probably does more harm than good. Still, randomly disabling RTL LICM on big loops, when we do nothing about the other LICM-like passes, looks like a hack to me and not like a permanent solution.

There are a couple more ways to fix the problem (ISTR I've listed them before somewhere...):

1. Stop using the DF UD chains and build them locally only. This is not something I would like much, but clearly there is something with the DF chains that consumes too many resources.

2. Figure out why *only* RTL LICM has a big problem with the DF chains for this test case. Why does fwprop work? Does it shut down itself somehow? Is fwprop asking only for UD chains for pseudos and LICM also for hardregs, perhaps?

3. Make RTL FUD chains work. Shame on me for being unable so far to make that work. Not 4.4 material.

4. Poor-man's-FUD: For CFGs with many basic blocks and a large join-split point (like the block that holds the factored computed goto), add a no-op move for all live registers for which a valid reg-reg move instruction is available. This effectively splits the UD chains, and the no-op moves would be cleaned up later (during RA, or maybe earlier).

Just to see if it works, I want to give the poor-man's-FUD option a try with Brad's smaller test case.

Comment 21 Jakub Jelinek 2009-02-13 09:17:14 UTC

To answer 2), I bet fwprop would suffer the same problem, but fwprop is disabled at -O1, LICM is not.  I can try it today.

Comment 22 rguenther@suse.de 2009-02-13 11:06:44 UTC

Subject: Re:  Code that compiles fine in 1GB of memory
 with 4.1.2 requires > 20GB in 4.2.* and higher

On Thu, 12 Feb 2009, lucier at math dot purdue dot edu wrote:

> ------- Comment #19 from lucier at math dot purdue dot edu  2009-02-12 20:51 -------
> Subject: Re:  Code that compiles fine in 1GB of
>  memory with 4.1.2 requires > 20GB in 4.2.* and higher
> 
> On Thu, 2009-02-12 at 16:52 +0000, rguenth at gcc dot gnu dot org wrote:
> > ------- Comment #16 from rguenth at gcc dot gnu dot org  2009-02-12 16:52 -------
> > Actually for PR26854 it is just one loop that is detected, covering all of
> > the function (with approx. 56000 basic blocks and one basic-block that has
> > edges to all other basic blocks in the loop).
> 
> Richard:
> 
> I'm wondering if you could look at a smaller file that's generated in a
> somewhat different way.
> 
> At
> 
> http://www.math.purdue.edu/~lucier/bugzilla/8/
> 
> there's a file _num.i.gz that I think should have smaller (nested) loops
> than the entire file, for example, from the label
> 
> ___L189__23__23_bignum_2e__2a_:
> 
> at line 50031 to just before label
> 
> ___L190__23__23_bignum_2e__2a_:
> 
> at line 50105.
> 
> Moving loop invariants out of this loop might help if it detected as a
> loop, but I don't know how to check whether it is.
> 
> Perhaps you might check and report whether this small loop is treated as
> a loop or whether, again, the entire function is the only "loop"
> detected.

Yes, there are several small loops detected for this testcase (139 in
total, including one large with the computed goto).

Btw, thanks for the smaller testcase.

Richard.

Comment 23 Richard Biener 2009-02-13 11:08:42 UTC

Lemme close this bug as a dup of the one marked as regression.

*** This bug has been marked as a duplicate of 26854 ***

Comment 24 Jakub Jelinek 2009-02-20 12:56:14 UTC

Subject: Bug 39157

Author: jakub
Date: Fri Feb 20 12:56:01 2009
New Revision: 144320

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144320
Log:
	PR middle-end/39157
	* Makefile.in (loop-invariant.o): Depend on $(PARAMS_H).
	* params.h (LOOP_INVARIANT_MAX_BBS_IN_LOOP): Define.
	* params.def (loop-invariant-max-bbs-in-loop): New parameter.
	* opts.c (decode_options): Set loop-invariant-max-bbs-in-loop
	parameter to 1000 for -O1 by default.
	* doc/invoke.texi (loop-invariant-max-bbs-in-loop): Document new
	parameter.
	* loop-invariant.c: Include params.h.
	(move_loop_invariants): Don't call move_single_loop_invariants on
	very large loops.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/doc/invoke.texi
    trunk/gcc/loop-invariant.c
    trunk/gcc/opts.c
    trunk/gcc/params.def
    trunk/gcc/params.h

Comment 25 Steven Bosscher 2012-11-05 22:02:18 UTC

This problem has been fixed in DF with the DF_RD_PRUNE_DEAD_DEFS flag.
I see no good reason to deprecate the param, though.  For such a huge
loop, any loop invariant code motion is unlikely to be a win.