With this compiler [lucier@descartes gambit]$ gcc -v Using built-in specs. Target: powerpc64-unknown-linux-gnu Configured with: ../../gcc-4.3.3/configure --prefix=/pkgs/gcc-4.3.3 --enable-languages=c --with-cpu=default64 Thread model: posix gcc version 4.3.3 (GCC) with the file compiler.i found here: http://www.math.purdue.edu/~lucier/bugzilla/8/ attempting to compile with these options: gcc -m64 -mcpu=970 -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -rdynamic -shared can't compile in 8GB of RAM. With this compiler: euler-77% /pkgs/gcc-4.2.3/bin/gcc -v Using built-in specs. Target: x86_64-unknown-linux-gnu Configured with: ../../gcc-4.2.3/configure --prefix=/pkgs/gcc-4.2.3 --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2 --with-mpfr=/pkgs/gmp-4.2.2 Thread model: posix gcc version 4.2.3 and these options: gcc -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared it can't compiler in 20GB of RAM. (That machine has only 16GB of RAM, so I killed the compile when it hit 20GB of physical+virtual memory.) It compiles just fine in about 1GB of RAM with euler-76% gcc -v Using built-in specs. Target: x86_64-unknown-linux-gnu Configured with: ../configure --prefix=/pkgs/gcc-4.1.2 Thread model: posix gcc version 4.1.2 compiler.i is the output from the Gambit Scheme->C compiler; the source scheme program is from a standard benchmark suite for Scheme compilers. So I found this by trying to change the code generator for Gambit and running the benchmark suite on x86_64. I don't know how this can be "fixed". Basically, the entire middle-end infrastructure since 4.1.* is telling people like me with computer-generated code like this to just go away (to put it very politely). On Mac OS X 10.5.*, Apple bundles their version of 4.0.1, which compiles this just fine; on Red Hat 5.2, they bundle their version of 4.1.2 (I think, my RH5.2 box is down at the moment), which compiles this just fine; but on Ubuntu 8.10 or Fedora 10 you can't compile this because they bundle newer compilers. (I guess I'll see if I can install 4.1.* on both of these.) As a stopgap measure, perhaps someone can tell me what optimization level to use. As you can see, I use -O1 and a few others (mainly -fschedule-insns2). gcc 4.1.* and earlier compiled something like this just fine, but -O1 must mean something different now.
Is this the same as PR26854? Note that it would be extremely helpful to have a testcase that fits into less memory to be able to analyze this. As you say it is autogenerated source, can you shrink the testcase? Thanks.
The peak on x86_64 (with normal checking cc1) seems to be around 29.3GB, but most of the time top numbers were in the .5GB - 1GB or at most 1.3GB range, with just a signle very high peak. Will now look closer where all the memory is needed.
If it is like PR26854 then it is all df memory. The gambit stuff consists of a state machine with loads of computed gotos (thus a nearly fully connected CFG) IIRC.
Even for code with lots of computed gotos, the CFG should not be fully connected. We factorize computed gotos to avoid exactly that. At least we used to. Maybe the factorizing is broken, or it is undone somewhere too early?
> Even for code with lots of computed gotos, the CFG should not be fully > connected. We factorize computed gotos to avoid exactly that. At least we used > to. Maybe the factorizing is broken, or it is undone somewhere too early? If I'm not mistaken the factoring is undone during RTL expansion.
In the past, we did not unfactor them (see e.g. Bug 15242).
Adding -fno-move-loop-invariants to the x86_64 mentioned options results in VIRT memory topping around 1224m (only because of IRA, before RA it never went above 1GB). Seems it is really the loop2_invariant pass that ate 28GB of RAM.
If there is a test case that compiles in less than 4GB, I'll take this bug (I have no access to machines with more memory than that ;-)
PR26854 is "fixed" as well with -fno-move-loop-invariants. It has a little less peak memory requirement than the testcase here.
Most of the memory is allocated in df_chain->block_pool: p *df->problems_by_index[4]->block_pool $37 = {name = 0xbc3114 "df_chain_block pool", id = 475, elts_per_block = 50, returned_free_list = 0x0, virgin_free_list = 0x72ce7cae8 "", virgin_elts_remaining = 18, elts_allocated = 775734800, elts_free = 18, blocks_allocated = 15514696, block_list = 0x72ce7c6e0, block_size = 1608, elt_size = 32} p *cfun->cfg $39 = {x_entry_block_ptr = 0x7ffff0cb11e0, x_exit_block_ptr = 0x7ffff0cb1240, x_basic_block_info = 0x7ffff283f000, x_n_basic_blocks = 76672, x_n_edges = 135946, x_last_basic_block = 76676, x_label_to_block_map = 0x0, x_profile_status = PROFILE_GUESSED, x_dom_computed = {DOM_NO_FAST_QUERY, DOM_NONE}, x_n_bbs_in_dom_tree = {76672, 0}, max_jumptable_ents = 0, last_label_uid = 97987}
Created attachment 17288 [details] gcc44-pr39157.patch Quick hack to avoid loop invariant motion from excessively large loops at -O1. With this cc1 still tops at 1.2GB, but loop invariants are still moved out of reasonably sized loops. For the huge ones I wonder how much it buys in anyway, there it increases register pressure a lot. 10000 is pretty arbitrary, I wonder if we want a param or something.
Subject: Re: Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher On Thu, 12 Feb 2009, jakub at gcc dot gnu dot org wrote: > ------- Comment #11 from jakub at gcc dot gnu dot org 2009-02-12 14:21 ------- > Created an attachment (id=17288) > --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17288&action=view) > gcc44-pr39157.patch > > Quick hack to avoid loop invariant motion from excessively large loops at -O1. > With this cc1 still tops at 1.2GB, but loop invariants are still moved out of > reasonably sized loops. For the huge ones I wonder how much it buys in anyway, > there it increases register pressure a lot. 10000 is pretty arbitrary, I > wonder if we want a param or something. I'd say for -O1 disable it completely, for -O2 and up use the heuristic and add a param for it. Richard.
Zdenek, in this case (and PR26854) can we make sure not to recognize loops that involve the single non-local goto BB? Maybe this would solve the problem as well.
Created attachment 17291 [details] gcc44-pr39157.patch Patch to add loop-invariant-max-bbs-in-loop parameter.
Some comments (a lot went on while I was sleeping): 1. Yes, this is similar to the test case of PR26854, but the C code generator has changed significantly since that test case was filed. I don't know if the changes in the code generator really affect what's happening here, however. 2. I'm trying to get a "moderately" sized test case that will compile in about 3GB of RAM, as Steven requested. (The test case from PR26854 takes at least 7GB of RAM to compile on ppc64.) 3. If the amount of memory and cpu time required by the test case at -O1 doesn't increase significantly when loop-invariant motion is performed on loops of size up to 10,000, then it would be good if the parameter at -O1 could be 10,000 instead of 100 (or at least larger than 100), as is suggested in the most recent patch.
Actually for PR26854 it is just one loop that is detected, covering all of the function (with approx. 56000 basic blocks and one basic-block that has edges to all other basic blocks in the loop). So the default for the param could indeed be 10000 here. And possibly it's not really the large number of blocks but that the in/out-degree of that single basic-block with the computed goto. Thus, I'd suggest again to not recognize such loops or maybe mark them specially?
Subject: Re: Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher > Zdenek, in this case (and PR26854) can we make sure not to recognize loops > that involve the single non-local goto BB? Maybe this would solve the > problem as well. maybe; however, I think it would even make sense to add an option to the loop analysis not to detect loops over certain size at all. Most of the loop optimization passes become senseless on loops with thousands of basic blocks, just wasting the resources.
There is now a file slatex.i at http://www.math.purdue.edu/~lucier/bugzilla/8/ that compiles in about 650MB of memory with gcc-4.2.3 on x86-64 with the same options; I don't know if that will help Steven.
Subject: Re: Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher On Thu, 2009-02-12 at 16:52 +0000, rguenth at gcc dot gnu dot org wrote: > ------- Comment #16 from rguenth at gcc dot gnu dot org 2009-02-12 16:52 ------- > Actually for PR26854 it is just one loop that is detected, covering all of > the function (with approx. 56000 basic blocks and one basic-block that has > edges to all other basic blocks in the loop). Richard: I'm wondering if you could look at a smaller file that's generated in a somewhat different way. At http://www.math.purdue.edu/~lucier/bugzilla/8/ there's a file _num.i.gz that I think should have smaller (nested) loops than the entire file, for example, from the label ___L189__23__23_bignum_2e__2a_: at line 50031 to just before label ___L190__23__23_bignum_2e__2a_: at line 50105. Moving loop invariants out of this loop might help if it detected as a loop, but I don't know how to check whether it is. Perhaps you might check and report whether this small loop is treated as a loop or whether, again, the entire function is the only "loop" detected. Brad
Re: "Moving loop invariants out of this loop might help if it detected as a loop, but I don't know how to check whether it is." (Comment #19): It's not like there will not be any loop invariant code motion (LICM) at all anymore if the RTL LICM pass is disabled. There is an LICM pass on GIMPLE, and there is also PRE for GIMPLE (and lazy code motion for RTL but I think it disables itself for your test case). The RTL LICM pass mostly cleans up after expand, i.e. moves things that are not exposed in GIMPLE. This is mostly just address calculations. I think Zdenek is right in Comment #17. LICM on big loops probably does more harm than good. Still, randomly disabling RTL LICM on big loops, when we do nothing about the other LICM-like passes, looks like a hack to me and not like a permanent solution. There are a couple more ways to fix the problem (ISTR I've listed them before somewhere...): 1. Stop using the DF UD chains and build them locally only. This is not something I would like much, but clearly there is something with the DF chains that consumes too many resources. 2. Figure out why *only* RTL LICM has a big problem with the DF chains for this test case. Why does fwprop work? Does it shut down itself somehow? Is fwprop asking only for UD chains for pseudos and LICM also for hardregs, perhaps? 3. Make RTL FUD chains work. Shame on me for being unable so far to make that work. Not 4.4 material. 4. Poor-man's-FUD: For CFGs with many basic blocks and a large join-split point (like the block that holds the factored computed goto), add a no-op move for all live registers for which a valid reg-reg move instruction is available. This effectively splits the UD chains, and the no-op moves would be cleaned up later (during RA, or maybe earlier). Just to see if it works, I want to give the poor-man's-FUD option a try with Brad's smaller test case.
To answer 2), I bet fwprop would suffer the same problem, but fwprop is disabled at -O1, LICM is not. I can try it today.
Subject: Re: Code that compiles fine in 1GB of memory with 4.1.2 requires > 20GB in 4.2.* and higher On Thu, 12 Feb 2009, lucier at math dot purdue dot edu wrote: > ------- Comment #19 from lucier at math dot purdue dot edu 2009-02-12 20:51 ------- > Subject: Re: Code that compiles fine in 1GB of > memory with 4.1.2 requires > 20GB in 4.2.* and higher > > On Thu, 2009-02-12 at 16:52 +0000, rguenth at gcc dot gnu dot org wrote: > > ------- Comment #16 from rguenth at gcc dot gnu dot org 2009-02-12 16:52 ------- > > Actually for PR26854 it is just one loop that is detected, covering all of > > the function (with approx. 56000 basic blocks and one basic-block that has > > edges to all other basic blocks in the loop). > > Richard: > > I'm wondering if you could look at a smaller file that's generated in a > somewhat different way. > > At > > http://www.math.purdue.edu/~lucier/bugzilla/8/ > > there's a file _num.i.gz that I think should have smaller (nested) loops > than the entire file, for example, from the label > > ___L189__23__23_bignum_2e__2a_: > > at line 50031 to just before label > > ___L190__23__23_bignum_2e__2a_: > > at line 50105. > > Moving loop invariants out of this loop might help if it detected as a > loop, but I don't know how to check whether it is. > > Perhaps you might check and report whether this small loop is treated as > a loop or whether, again, the entire function is the only "loop" > detected. Yes, there are several small loops detected for this testcase (139 in total, including one large with the computed goto). Btw, thanks for the smaller testcase. Richard.
Lemme close this bug as a dup of the one marked as regression. *** This bug has been marked as a duplicate of 26854 ***
Subject: Bug 39157 Author: jakub Date: Fri Feb 20 12:56:01 2009 New Revision: 144320 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144320 Log: PR middle-end/39157 * Makefile.in (loop-invariant.o): Depend on $(PARAMS_H). * params.h (LOOP_INVARIANT_MAX_BBS_IN_LOOP): Define. * params.def (loop-invariant-max-bbs-in-loop): New parameter. * opts.c (decode_options): Set loop-invariant-max-bbs-in-loop parameter to 1000 for -O1 by default. * doc/invoke.texi (loop-invariant-max-bbs-in-loop): Document new parameter. * loop-invariant.c: Include params.h. (move_loop_invariants): Don't call move_single_loop_invariants on very large loops. Modified: trunk/gcc/ChangeLog trunk/gcc/Makefile.in trunk/gcc/doc/invoke.texi trunk/gcc/loop-invariant.c trunk/gcc/opts.c trunk/gcc/params.def trunk/gcc/params.h
This problem has been fixed in DF with the DF_RD_PRUNE_DEAD_DEFS flag. I see no good reason to deprecate the param, though. For such a huge loop, any loop invariant code motion is unlikely to be a win.