Bug 63577 - [4.9/5 Regression]: Huge compile time and memory usage with -O and not -fPIC
Summary: [4.9/5 Regression]: Huge compile time and memory usage with -O and not -fPIC
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.9.1
: P2 normal
Target Milestone: 6.0
Assignee: Not yet assigned to anyone
Keywords: compile-time-hog, memory-hog
Depends on:
Blocks: 47344
  Show dependency treegraph
Reported: 2014-10-17 14:59 UTC by jfsoden
Modified: 2016-02-11 16:28 UTC (History)
4 users (show)

See Also:
Target: x86_64-*-*
Known to work: 6.0
Known to fail:
Last reconfirmed: 2014-10-20 00:00:00

Test program. (15.76 KB, application/x-tar)
2014-10-17 14:59 UTC, jfsoden

Note You need to log in before you can comment on or make changes to this bug.
Description jfsoden 2014-10-17 14:59:23 UTC
Created attachment 33749 [details]
Test program.

The attached Fortran programs compiles slowly and needs lots of memory
if compiled with basic optimization (-O) but without -fPIC.
With -fPIC, memory usage and compile time are much shorter.

$ gfortran abbrevd408h0.f90 -c -O  
needs 45s and 630MB RAM


$ gfortran abbrevd408h0.f90 -c -O -fPIC
needs only 6s and 105MB RAM.

$ gfortran abbrevd408h0.f90 -c
works also quickly.

The attached file is shortened. On the full file, the effect is more distinct (with gfortran 4.8.1: >8GB RAM usage, several minutes).

With gfortran 4.7, the issue does not appear.

Used version:
GNU Fortran (Debian 4.9.1-17) 4.9.1     [~ SVN r216240]

GNU Fortran (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]
GNU Fortran (SUSE Linux) 4.8.3 20140627 [gcc-4_8-branch revision 212064] 
are affected.
Comment 1 Richard Biener 2014-10-20 13:03:00 UTC
Confirmed with 4.9:

 combiner                :  31.14 (79%) usr   0.46 (74%) sys  31.65 (78%) wall 1029289 kB (96%) ggc
 TOTAL                 :  39.48             0.62            40.77            1071504 kB
Comment 2 Richard Biener 2014-10-20 13:07:22 UTC
--param max-combine-insns=2 helps a bit compile-time wise but not fully memory-usage-wise (I suppose log-links are expensive and of course still set up).
Only available on trunk, of course.
Comment 3 Segher Boessenkool 2014-10-20 17:35:54 UTC
The LOG_LINKS take up only a few hundred kB, tops; the gigantic memory
use is from of all the garbage RTL produced for all the failed combine
Comment 4 Jakub Jelinek 2014-12-19 13:32:04 UTC
GCC 4.8.4 has been released.
Comment 5 Jakub Jelinek 2015-01-12 14:42:23 UTC
Does the combiner have any GC pointers stored in non-GTY memory?  I mean e.g. LOG_LINKS, ...  If we could ggc_collect within the pass, the memory consumption problem would be fixed.
Comment 6 Segher Boessenkool 2015-01-28 02:46:16 UTC
[ I missed your last comment, sorry. ]

Both the log_links and the reg_stat point to insns in the insn stream,
(all of those are either live or never again referred to), so that
might be fine, but you really should make sure you only GC between
try_combine calls, never inside one -- that would be rather disastrous.

Do you want to try this for GCC 5?
Comment 7 Jakub Jelinek 2015-01-28 07:41:51 UTC
It is a regression, so perhaps, depends on how risky the patch would be.
Most likely it would need to be tested with always-collect params on a few larger testcases.
Comment 8 Segher Boessenkool 2015-01-28 20:30:04 UTC
It's not a very new regression, and it is quite risky in my opinion;
I prefer to have this dealt with in stage1.
Comment 9 Jakub Jelinek 2015-01-28 20:41:26 UTC
Ok then.
Comment 10 Segher Boessenkool 2015-01-31 20:19:30 UTC
Also note that doing GC during the pass will not reduce the compile
time or the amount of garbage created at all, so won't fix the actual
problem; it does of course make it more bearable on smaller machines.

I'll have another look at what causes this; from what I remember last
time I looked there simply *are* very many opportunities to combine
some insns (most of which fail, maybe we could short-circuit some).
Comment 11 Jakub Jelinek 2015-12-14 17:14:10 UTC
Segher, could you please look at this again before we get into stage4?  Thanks.
Comment 12 Segher Boessenkool 2015-12-15 04:16:59 UTC
I cannot reproduce the problem with GCC 6, combine takes about 1%
time and little memory.  I do not know what changed.
Comment 13 Bernd Schmidt 2016-01-15 17:32:12 UTC
Ideally we'd move RTL back out of garbage-collected memory and onto obstacks. The combiner was designed to throw away garbage RTL after each failed combination.
Comment 14 Jeffrey A. Law 2016-02-09 23:15:12 UTC
So two changes are responsible for huge improvements here.

First is Richi's fix for 63677:

   2014-11-20   Richard Biener  <rguenther@suse.de>
        PR tree-optimization/63677
        * tree-ssa-dom.c: Include gimplify.h for unshare_expr.
        (avail_exprs_stack): Make a vector of pairs.
        (struct hash_expr_elt): Replace stmt member with vop member.
        (expr_elt_hasher::equal): Simplify.
        (initialize_hash_element): Adjust.
        (initialize_hash_element_from_expr): Likewise.
        (dom_opt_dom_walker::thread_across_edge): Likewise.
        (record_cond): Likewise.
        (dom_opt_dom_walker::before_dom_children): Likewise.
        (print_expr_hash_elt): Likewise.
        (remove_local_expressions_from_table): Restore previous state
        if requested.
        (record_equivalences_from_stmt): Record &x + CST as constant
        &MEM[&x, CST] for further propagation.
        (vuse_eq): New function.
        (lookup_avail_expr): For loads use the alias oracle to see
        whether a candidate from the expr hash is usable.
        (avail_expr_hash): Do not hash VUSEs.
        * gcc.dg/tree-ssa/ssa-dom-cse-2.c: New testcase.
        * gcc.dg/tree-ssa/ssa-dom-cse-3.c: Likewise.

Which reduces the memory consumption by ~.5G, presumably by simplifying things in the tree optimizers, long before we get into the RTL bits.

Second is the introduction of the early DSE pass by Jan which removes another 1/2G of memory and the remaining time of significance in combine.

Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Wed Apr 22 01:32:14 2015 +0000

        PR ipa/65076
        * passes.def (early_optimizations): Add pass_dse.
        * g++.dg/tree-ssa/pr61034.C: Update template.
        * g++.dg/warn/Warray-bounds.C: Harden for DSE.
        * gcc.dg/Warray-bounds-11.c: Likewise.
        * gcc.dg/Warray-bounds.c: Likewise.

I'm going to declare this regression fixed for gcc-6.  Given the timing of the two patches which helped here, it's a good bet that gcc-4.9 is, of course, bad and that gcc-5 improved, but was still bad.
Comment 15 Segher Boessenkool 2016-02-10 23:56:28 UTC
Thanks for tracking this down Jeff.

This seems too invasive to backport to the release branches, or is this
compile-time regression considered important enough for that?
Comment 16 Jeffrey A. Law 2016-02-11 04:53:39 UTC
I'd agree it's too invasive to backport -- both changes are new optimizations and there may well have been follow-up patches for both.  The potential for destabilizing the release branches would seem to outweigh the benefits for reducing the compile time on this code.
Comment 17 Richard Biener 2016-02-11 08:50:38 UTC
Agreed though the underlying issue in combine didn't get fixed (so we're just waiting for a new testcase to pop up).
Comment 18 Segher Boessenkool 2016-02-11 16:28:50 UTC
I wouldn't say there really _is_ an issue in combine; not an
implementation issue, at least.

Combine is designed to blindly try every combination it can try,
without first seeing if that is likely to result in success.  This
is what gives it its power.  (*)

In the testcase a lot of combinations are possible for the one
small group of patterns that is repeated many times.  This is why
it takes so much time (and then also produces a lot of garbage RTL).
I put the blame for that on the huge input to combine though, which
shouldn't be there in the first place.

Combine is sort of linear, just with a huge constant.  Nothing I've
seen in here violates that.

Things will improve if can abort a combination attempt earlier, if
we can detect it cannot possibly result in a successful combination.
Things are much complicated by the fact that e.g. sometimes a 2-insn
combination fails, but then a 3-insn combination works while in
effect it just does that 2-insn combination plus it leaves a single
insn untouched: this happens because when it tries the 3-insn combo
combine knows more about the register values.  The whole reg_stat
thing needs a big overhaul.

(*) This is weakened somewhat for four-insn combinations, those
just would take way too long.