Created attachment 33749
The attached Fortran program compiles slowly and needs lots of memory
if compiled with basic optimization (-O) but without -fPIC.
With -fPIC, memory usage and compile time are much lower.
$ gfortran abbrevd408h0.f90 -c -O
needs 45s and 630MB RAM
$ gfortran abbrevd408h0.f90 -c -O -fPIC
needs only 6s and 105MB RAM.
$ gfortran abbrevd408h0.f90 -c
also compiles quickly.
The attached file is shortened. On the full file, the effect is more pronounced (with gfortran 4.8.1: >8GB RAM usage, several minutes).
With gfortran 4.7, the issue does not appear.
GNU Fortran (Debian 4.9.1-17) 4.9.1 [~ SVN r216240]
GNU Fortran (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]
GNU Fortran (SUSE Linux) 4.8.3 20140627 [gcc-4_8-branch revision 212064]
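For anyone wanting to reproduce numbers like the ones above, here is a minimal sketch of one way to measure wall time and peak memory of a single compile. It is not how the figures in this report were obtained. `measure` is a hypothetical helper, and it assumes a Unix system; on Linux `ru_maxrss` is reported in kB and should also cover the compiler proper that the gfortran driver starts and waits for.

```python
import os
import subprocess
import time


def measure(cmd):
    """Run cmd and return (wall_seconds, peak_rss, exit_code).

    peak_rss is ru_maxrss as reported by the OS for the child:
    kB on Linux, bytes on macOS (an assumption about the platform).
    """
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    # wait4 returns the resource usage of exactly this child, so
    # consecutive measurements do not contaminate each other.
    _, status, usage = os.wait4(proc.pid, 0)
    wall = time.monotonic() - start
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        code = -os.WTERMSIG(status)
    # Tell Popen the child has already been reaped.
    proc.returncode = code
    return wall, usage.ru_maxrss, code
```

Assuming gfortran and the attachment are in the current directory, calling `measure(["gfortran", "abbrevd408h0.f90", "-c", "-O"])` and then the same with `"-fPIC"` appended gives the two data points to compare.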
Confirmed with 4.9:
combiner : 31.14 (79%) usr 0.46 (74%) sys 31.65 (78%) wall 1029289 kB (96%) ggc
TOTAL : 39.48 0.62 40.77 1071504 kB
--param max-combine-insns=2 helps a bit compile-time wise but not fully memory-usage-wise (I suppose log-links are expensive and of course still set up).
Only available on trunk, of course.
The LOG_LINKS take up only a few hundred kB, tops; the gigantic memory
use is from all the garbage RTL produced for all the failed combine
attempts.
GCC 4.8.4 has been released.
Does the combiner have any GC pointers stored in non-GTY memory? I mean e.g. LOG_LINKS, ... If we could ggc_collect within the pass, the memory consumption problem would be fixed.
[ I missed your last comment, sorry. ]
Both the log_links and the reg_stat point to insns in the insn stream
(all of those are either live or never again referred to), so that
might be fine, but you really should make sure you only GC between
try_combine calls, never inside one -- that would be rather disastrous.
Do you want to try this for GCC 5?
It is a regression, so perhaps; it depends on how risky the patch would be.
Most likely it would need to be tested with always-collect params on a few larger testcases.
It's not a very new regression, and it is quite risky in my opinion;
I prefer to have this dealt with in stage1.
Also note that doing GC during the pass will not reduce the compile
time or the amount of garbage created at all, so won't fix the actual
problem; it does of course make it more bearable on smaller machines.
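
To illustrate the point about collecting during the pass, here is a toy model (not GCC code). `run_pass`, `garbage_per_attempt` and `collect_every` are made-up names, and the model assumes every combination attempt leaves a fixed amount of garbage RTL behind. Collection happens only between attempts, never inside one.

```python
def run_pass(attempts, garbage_per_attempt, collect_every=None):
    """Return (total_allocated, peak_held) for a toy combine pass."""
    held = 0    # garbage allocated but not yet collected
    peak = 0    # largest amount held at any one time
    total = 0   # everything ever allocated
    for n in range(1, attempts + 1):
        # One attempt: allocate scratch RTL that ends up as garbage.
        held += garbage_per_attempt
        total += garbage_per_attempt
        peak = max(peak, held)
        # Collect only between attempts, never in the middle of one.
        if collect_every and n % collect_every == 0:
            held = 0
    return total, peak


print(run_pass(1000, 10))                      # no collection in the pass
print(run_pass(1000, 10, collect_every=100))   # collect every 100 attempts
```

In this model the total allocated is the same in both runs; only the peak differs. That matches the argument above: collecting makes the pass fit in less memory but does not remove any of the work.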
I'll have another look at what causes this; from what I remember last
time I looked there simply *are* very many opportunities to combine
some insns (most of which fail; maybe we could short-circuit some).
Segher, could you please look at this again before we get into stage4? Thanks.
I cannot reproduce the problem with GCC 6; combine takes about 1% of
the time and little memory. I do not know what changed.
Ideally we'd move RTL back out of garbage-collected memory and onto obstacks. The combiner was designed to throw away garbage RTL after each failed combination.
So two changes are responsible for huge improvements here.
First is Richi's fix for 63677:
2014-11-20 Richard Biener <firstname.lastname@example.org>
* tree-ssa-dom.c: Include gimplify.h for unshare_expr.
(avail_exprs_stack): Make a vector of pairs.
(struct hash_expr_elt): Replace stmt member with vop member.
(remove_local_expressions_from_table): Restore previous state
(record_equivalences_from_stmt): Record &x + CST as constant
&MEM[&x, CST] for further propagation.
(vuse_eq): New function.
(lookup_avail_expr): For loads use the alias oracle to see
whether a candidate from the expr hash is usable.
(avail_expr_hash): Do not hash VUSEs.
* gcc.dg/tree-ssa/ssa-dom-cse-2.c: New testcase.
* gcc.dg/tree-ssa/ssa-dom-cse-3.c: Likewise.
Which reduces the memory consumption by ~.5G, presumably by simplifying things in the tree optimizers, long before we get into the RTL bits.
Second is the introduction of the early DSE pass by Jan which removes another 1/2G of memory and the remaining time of significance in combine.
Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Wed Apr 22 01:32:14 2015 +0000
* passes.def (early_optimizations): Add pass_dse.
* g++.dg/tree-ssa/pr61034.C: Update template.
* g++.dg/warn/Warray-bounds.C: Harden for DSE.
* gcc.dg/Warray-bounds-11.c: Likewise.
* gcc.dg/Warray-bounds.c: Likewise.
I'm going to declare this regression fixed for gcc-6. Given the timing of the two patches which helped here, it's a good bet that gcc-4.9 is, of course, bad and that gcc-5 improved, but was still bad.
Thanks for tracking this down, Jeff.
This seems too invasive to backport to the release branches, or is this
compile-time regression considered important enough for that?
I'd agree it's too invasive to backport -- both changes are new optimizations and there may well have been follow-up patches for both. The potential for destabilizing the release branches would seem to outweigh the benefits for reducing the compile time on this code.
Agreed, though the underlying issue in combine didn't get fixed (so we're just waiting for a new testcase to pop up).
I wouldn't say there really _is_ an issue in combine; not an
implementation issue, at least.
Combine is designed to blindly try every combination it can try,
without first seeing if that is likely to result in success. This
is what gives it its power. (*)
In the testcase a lot of combinations are possible for the one
small group of patterns that is repeated many times. This is why
it takes so much time (and then also produces a lot of garbage RTL).
I put the blame for that on the huge input to combine though, which
shouldn't be there in the first place.
Combine is sort of linear, just with a huge constant. Nothing I've
seen in here violates that.
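
A toy sketch of the "linear with a huge constant" claim, under simplifying assumptions and not how combine.c is written: the input is one small group of insns repeated many times, each insn is fed by at most `fanin` earlier insns of its own group, and only chained 2-insn and 3-insn attempts are counted. `build_links` and `count_attempts` are hypothetical names.

```python
def build_links(groups, group_size=6, fanin=2):
    """links[i] lists the earlier insns feeding insn i (a toy LOG_LINKS)."""
    links = []
    for g in range(groups):
        base = g * group_size
        for j in range(group_size):
            # Insn j of a group uses up to `fanin` preceding insns
            # of the same group; nothing crosses group boundaries.
            links.append([base + p for p in range(max(0, j - fanin), j)])
    return links


def count_attempts(links):
    """Count blind 2-insn and chained 3-insn combination attempts."""
    two = three = 0
    for i3, feeders in enumerate(links):
        for i2 in feeders:
            two += 1                 # try (i2, i3)
            for i1 in links[i2]:
                three += 1           # try (i1, i2, i3)
    return two, three


for groups in (1000, 2000, 4000):
    print(groups, count_attempts(build_links(groups)))
```

Doubling the number of groups doubles both counts, so the work grows linearly with the input. The per-group constant depends only on how densely the insns inside a group feed each other, which is where a pattern with many possible combinations gets expensive.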
Things will improve if we can abort a combination attempt earlier, if
we can detect it cannot possibly result in a successful combination.
Things are complicated a lot by the fact that e.g. sometimes a 2-insn
combination fails, but then a 3-insn combination works while in
effect it just does that 2-insn combination plus it leaves a single
insn untouched: this happens because when it tries the 3-insn combo
combine knows more about the register values. The whole reg_stat
thing needs a big overhaul.
(*) This is weakened somewhat for four-insn combinations, those
just would take way too long.