Bug 89115 - compile time and memory hog
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 8.2.1
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
Keywords: compile-time-hog, memory-hog, ra
Reported: 2019-01-30 10:48 UTC by Matthias Klose
Modified: 2019-01-31 12:07 UTC

Known to work: 7.4.1, 8.2.1, 9.0
Known to fail: 7.4.0, 8.2.0
Last reconfirmed: 2019-01-30 00:00:00


Attachments
preprocessed source (462.42 KB, application/x-xz)
2019-01-30 10:48 UTC, Matthias Klose

Description Matthias Klose 2019-01-30 10:48:38 UTC
Created attachment 45565
preprocessed source

[forwarded from https://bugs.debian.org/918329]

This is a compile-time and memory hog, seen on the gcc-8 branch and also on the gcc-7 branch.

The source files are fairly large:
-rw-rw-r-- 1 doko doko 2884231 Jan 30 10:44 tagCircle49h12.c
-rw-rw-r-- 1 doko doko 1850759 Jan 30 10:44 tagCustom48h12.c
-rw-rw-r-- 1 doko doko 2137073 Jan 30 10:44 tagStandard52h13.c

Using a powerpc64le cross compiler, the attached preprocessed source needs 4 GB of memory at -O1. According to the bug reporter, compile time and memory usage at -O2 seem to be worse on AArch64 and POWER than on x86.
Comment 1 Matthias Klose 2019-01-30 10:52:13 UTC
$ time powerpc64le-linux-gnu-gcc-8 -c -O1 tagCircle49h12.i 

real    1m43.636s
user    1m40.884s
sys     0m1.759s
Comment 2 Richard Biener 2019-01-30 11:26:59 UTC
On x86_64-linux:

> /usr/bin/time gcc-8 -S tagCircle49h12.i 
38.02user 0.19system 0:38.22elapsed 99%CPU (0avgtext+0avgdata 547252maxresident)k
0inputs+14440outputs (0major+157121minor)pagefaults 0swaps
> /usr/bin/time gcc-8 -S tagCircle49h12.i -O
49.24user 0.99system 0:50.23elapsed 99%CPU (0avgtext+0avgdata 3801468maxresident)k
320inputs+9808outputs (3major+962644minor)pagefaults 0swaps
> /usr/bin/time gcc-8 -S tagCircle49h12.i -O2
76.06user 0.17system 1:16.24elapsed 99%CPU (0avgtext+0avgdata 494480maxresident)k
0inputs+7280outputs (0major+140687minor)pagefaults 0swaps


There's a big function initializing an array, which is the culprit:

__attribute__((visibility("default")))
apriltag_family_t *tagCircle49h12_create()
{
   apriltag_family_t *tf = calloc(1, sizeof(apriltag_family_t));
   tf->name = strdup("tagCircle49h12");
   tf->h = 12;
   tf->ncodes = 65698;
   tf->codes = calloc(65698, sizeof(uint64_t));
   tf->codes[0] = 0x0000c6c921d8614aUL;
...
   tf->codes[65697] = 0x000092506b5ec3aaUL;
   tf->nbits = 49;
...

During its compilation we build up a lot of garbage as well:

Assembling functions:
 <materialize-all-clones> <simdclone> tagCircle49h12_create {GC 3275766k -> 118073k} tagCircle49h12_destroy

Time variable                                   usr           sys          wall               GGC
 dead store elim1                   :   9.38 ( 19%)   0.74 ( 33%)  10.12 ( 20%) 3048002 kB ( 88%)
 LRA reload inheritance             :  32.58 ( 66%)   0.00 (  0%)  32.60 ( 63%)       0 kB (  0%)
 TOTAL                              :  49.57          2.27         51.85        3446647 kB

Maybe we can disable
reload inheritance with some limit, at least at -O1 (which is what we intend
to "support" for insane testcases).  Vlad?

On trunk we have (with detailed-mem-stats and release checking):

 dead store elim1                   :  44.34 ( 49%)   6.42 ( 76%)  50.76 ( 51%) 3048002 kB ( 89%)
 LRA reload inheritance             :  32.51 ( 36%)   0.00 (  0%)  32.51 ( 33%)       0 kB (  0%)

Looks like there's no -fno-lra-inheritance, but "not doing" inheritance is
supported, as seen by the existence of LRA_MAX_INHERITANCE_PASSES (hard-coded
to 2 rather than being a --param or conditional on the optimization level).

All of the memory goes here:

explow.c:198 (plus_constant)                          3025M: 95.8%        0 :  0.0%       72 :  0.0%        0 :  0.0%      126M

(not very informative, might also be a sign of a target[hook] issue).
Comment 3 Richard Biener 2019-01-30 12:34:27 UTC
callgrind shows lra_inheritance -> inherit_in_ebb -> htab_find_slot -> ... -> rtx_equal_p as the most time-consuming path.  That's from insert_invariant.
Likely the hash function is bad for this particular testcase (no hash
statistics are printed for this hashtable).  We call htab_find_slot
462 000 times but invariant_eq_p 2 150 000 000 times.  Likely
we have many (mem (plus (symbol-ref) CONST_INT)) with different constants,
but lra_rtx_hash does

    case SCRATCH:
    case CONST_DOUBLE:
    case CONST_INT:
    case CONST_VECTOR:
      return val;

which means it ignores the actual constant value (for whatever reason)?
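
To illustrate why a hash that ignores the constant is so costly, here is a minimal sketch using a toy chained hash table.  It is not GCC's htab or lra_rtx_hash; the table, the key set (offsets 0, 8, 16, ...) and all function names are assumptions made up for the illustration.  It only counts how many equality comparisons are needed to insert the same keys with each kind of hash.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy chained hash table; an illustration only, NOT GCC's htab.  */
#define NBUCKETS 1021   /* prime, so strided keys spread out */
#define NKEYS 2000

struct node
{
  unsigned long offset;   /* stands in for the CONST_INT value */
  struct node *next;
};

static struct node *buckets[NBUCKETS];
static unsigned long n_compares;

/* Mimics a hash that ignores the constant: every key hashes alike.  */
unsigned long
hash_ignoring_value (unsigned long offset)
{
  (void) offset;
  return 42;
}

/* Mixes the constant into the hash value.  */
unsigned long
hash_with_value (unsigned long offset)
{
  return 42 + offset;
}

/* Insert OFFSET unless already present, counting equality checks.  */
static void
insert (unsigned long offset, unsigned long (*hash) (unsigned long))
{
  struct node **slot = &buckets[hash (offset) % NBUCKETS];
  for (struct node *n = *slot; n; n = n->next)
    {
      n_compares++;
      if (n->offset == offset)
        return;
    }
  struct node *n = malloc (sizeof *n);
  if (!n)
    abort ();
  n->offset = offset;
  n->next = *slot;
  *slot = n;
}

/* Free all chains so that the table can be reused.  */
static void
clear_table (void)
{
  for (size_t i = 0; i < NBUCKETS; i++)
    {
      struct node *n = buckets[i];
      while (n)
        {
          struct node *next = n->next;
          free (n);
          n = next;
        }
      buckets[i] = NULL;
    }
}

/* Insert NKEYS distinct offsets and return the comparisons needed.  */
unsigned long
count_compares (unsigned long (*hash) (unsigned long))
{
  clear_table ();
  n_compares = 0;
  for (unsigned long i = 0; i < NKEYS; i++)
    insert (i * 8, hash);
  unsigned long result = n_compares;
  clear_table ();
  return result;
}
```

Calling count_compares (hash_ignoring_value) puts every key on a single chain, so the number of comparisons grows quadratically with the number of keys; count_compares (hash_with_value) keeps the chains short.  The same effect would explain the ratio of invariant_eq_p calls to htab_find_slot calls seen above.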

Doing a simple

Index: gcc/lra.c
===================================================================
--- gcc/lra.c   (revision 268383)
+++ gcc/lra.c   (working copy)
@@ -1719,10 +1719,12 @@ lra_rtx_hash (rtx x)
 
     case SCRATCH:
     case CONST_DOUBLE:
-    case CONST_INT:
     case CONST_VECTOR:
       return val;
 
+    case CONST_INT:
+      return val + UINTVAL (x);
+
     default:
       break;
     }

improves compile time to

> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O 
18.82user 0.90system 0:19.73elapsed 100%CPU (0avgtext+0avgdata 3789340maxresident)k
0inputs+9808outputs (0major+933676minor)pagefaults 0swaps

For sub-fmts of CONST_INT the hash function already performs this
operation.

 dead store elim1                   :  10.20 ( 54%)   0.77 ( 36%)  10.96 ( 52%) 3048002 kB ( 89%)
 LRA reload inheritance             :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)       0 kB (  0%)

I'm going to test something like the above.  It does nothing for the memory
use, though.
Comment 4 Richard Biener 2019-01-30 13:23:26 UTC
The DSE issue is (of course) alias queries, and within those, find_base_term.
200 000 calls to check_mem_read_use result in 25 600 000 calls to
canon_true_dependence.  I suppose we could cache the result of find_base_term
and have a canon_true_dependence_with_bases.

Eventually DSE should just give up on overly long next_local_store chains.

Btw, the plus_constant calls all originate from true_dependence_1, which ends
up calling get_addr, and that rebuilds RTL that must already exist
somewhere.  So in the end the memory use comes from the excessive number of
alias queries done by DSE.  Even that get_addr() part could be cached,
though: canon_true_dependence_with_bases_and_addrs.
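
The caching idea can be sketched roughly as follows.  This is not GCC code: struct store, expensive_base and both query functions are hypothetical stand-ins (expensive_base plays the role of something like find_base_term), and the "alias check" is a toy comparison.  The sketch only shows how passing in precomputed bases changes the number of expensive computations when many reads are checked against many stores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a store record; not GCC's store_info.  */
struct store
{
  long addr;         /* raw address of the store */
  long base;         /* cached result of expensive_base (addr) */
  bool base_valid;   /* whether BASE has been computed yet */
};

static unsigned long n_base_computations;

/* Stand-in for an expensive analysis such as find_base_term.  */
static long
expensive_base (long addr)
{
  n_base_computations++;
  return addr & ~0xfffL;
}

/* Uncached query: recomputes both bases on every call.  */
static bool
may_alias_uncached (const struct store *s, long read_addr)
{
  return expensive_base (s->addr) == expensive_base (read_addr);
}

/* Cached query: the caller passes the read's base, and the store's
   base is computed at most once and then reused.  */
static bool
may_alias_with_bases (struct store *s, long read_base)
{
  if (!s->base_valid)
    {
      s->base = expensive_base (s->addr);
      s->base_valid = true;
    }
  return s->base == read_base;
}

/* Check NREADS reads against NSTORES stores and return how many
   expensive base computations were needed.  */
unsigned long
count_base_computations (struct store *stores, size_t nstores,
                         size_t nreads, bool cached)
{
  n_base_computations = 0;
  for (size_t s = 0; s < nstores; s++)
    stores[s].base_valid = false;

  for (size_t r = 0; r < nreads; r++)
    {
      long read_addr = (long) r * 8;
      long read_base = cached ? expensive_base (read_addr) : 0;
      for (size_t s = 0; s < nstores; s++)
        {
          if (cached)
            (void) may_alias_with_bases (&stores[s], read_base);
          else
            (void) may_alias_uncached (&stores[s], read_addr);
        }
    }
  return n_base_computations;
}
```

In this toy model the uncached variant performs two base computations per (read, store) pair, while the cached variant performs one per read plus one per store.  Whether a real canon_true_dependence_with_bases would save a comparable amount depends on details not shown here.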

Unfortunately --param max-dse-active-local-stores is a bail-out limit,
so we need to get below a magic threshold that lies somewhere between 1000
and 1250:

> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=5000 (default)
18.59user 1.03system 0:19.62elapsed 99%CPU (0avgtext+0avgdata 3787968maxresident)k
0inputs+9808outputs (0major+933497minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=2500
18.13user 1.09system 0:19.22elapsed 99%CPU (0avgtext+0avgdata 3787792maxresident)k
0inputs+9904outputs (0major+934009minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=2000
18.57user 0.98system 0:19.56elapsed 99%CPU (0avgtext+0avgdata 3786852maxresident)k
0inputs+9808outputs (0major+933789minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=1500
18.71user 1.01system 0:19.74elapsed 99%CPU (0avgtext+0avgdata 3789372maxresident)k
0inputs+9808outputs (0major+933920minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=1250
18.54user 0.94system 0:19.49elapsed 99%CPU (0avgtext+0avgdata 3788452maxresident)k
0inputs+9808outputs (0major+933435minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=1000
7.63user 0.22system 0:07.86elapsed 99%CPU (0avgtext+0avgdata 715704maxresident)k
0inputs+9808outputs (0major+170563minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=500
7.66user 0.24system 0:07.90elapsed 100%CPU (0avgtext+0avgdata 717116maxresident)k
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=250
7.73user 0.16system 0:07.90elapsed 100%CPU (0avgtext+0avgdata 715960maxresident)k
0inputs+9904outputs (0major+170918minor)pagefaults 0swaps


I am testing

Index: gcc/opts.c
===================================================================
--- gcc/opts.c  (revision 268383)
+++ gcc/opts.c  (working copy)
@@ -670,7 +670,16 @@ default_options_optimization (struct gcc
   /* For -O1 only do loop invariant motion for very small loops.  */
   maybe_set_param_value
     (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP,
-     opt2 ? default_param_value (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP) : 1000,
+     opt2 ? default_param_value (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP)
+     : default_param_value (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP) / 10,
+     opts->x_param_values, opts_set->x_param_values);
+
+  /* For -O1 reduce the maximum number of active local stores for RTL DSE
+     since this can consume huge amounts of memory (PR89115).  */
+  maybe_set_param_value
+    (PARAM_MAX_DSE_ACTIVE_LOCAL_STORES,
+     opt2 ? default_param_value (PARAM_MAX_DSE_ACTIVE_LOCAL_STORES)
+     : default_param_value (PARAM_MAX_DSE_ACTIVE_LOCAL_STORES) / 10,
      opts->x_param_values, opts_set->x_param_values);
 
   /* At -Ofast, allow store motion to introduce potential race conditions.  */
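
As a worked example of what the patch above selects, the following sketch mirrors just the conditional expression for the DSE parameter.  The helper name is hypothetical, and the default of 5000 is taken from the timing run marked "(default)" above; the real code goes through maybe_set_param_value and default_param_value rather than a plain function.

```c
#include <stdbool.h>

/* Default as shown in the timings above ("=5000 (default)").  */
#define DEFAULT_MAX_DSE_ACTIVE_LOCAL_STORES 5000

/* Hypothetical helper mirroring the expression in the patch:
   keep the default when OPT2 is set, otherwise use a tenth of it.  */
int
max_dse_active_local_stores (bool opt2)
{
  return opt2
         ? DEFAULT_MAX_DSE_ACTIVE_LOCAL_STORES
         : DEFAULT_MAX_DSE_ACTIVE_LOCAL_STORES / 10;
}
```

With a default of 5000 the non-opt2 case yields 500, which is below the 1000 to 1250 threshold observed in the timings and matches the 7.66user run measured with --param max-dse-active-local-stores=500.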
Comment 5 Richard Biener 2019-01-30 15:11:35 UTC
Author: rguenth
Date: Wed Jan 30 15:11:04 2019
New Revision: 268394

URL: https://gcc.gnu.org/viewcvs?rev=268394&root=gcc&view=rev
Log:
2019-01-30  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/89115
	* opts.c (default_options_optimization): Reduce
	PARAM_MAX_DSE_ACTIVE_LOCAL_STORES by a factor of 10 at -O1.
	Make PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP reduction relative
	to the default.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/opts.c
Comment 6 Richard Biener 2019-01-31 08:10:30 UTC
Author: rguenth
Date: Thu Jan 31 08:09:59 2019
New Revision: 268414

URL: https://gcc.gnu.org/viewcvs?rev=268414&root=gcc&view=rev
Log:
2019-01-31  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/89115
	* lra.c (lra_rtx_hash): Properly hash CONST_INT values.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra.c
Comment 7 Richard Biener 2019-01-31 09:01:55 UTC
On trunk, compile time at -O1 should now be reasonable; I am currently testing backports.  The DSE issue still exists at -O2+, but compared to the LRA issue
it was "minor".
Comment 8 Richard Biener 2019-01-31 10:00:57 UTC
Author: rguenth
Date: Thu Jan 31 10:00:26 2019
New Revision: 268416

URL: https://gcc.gnu.org/viewcvs?rev=268416&root=gcc&view=rev
Log:
2019-01-31  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2019-01-31  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/89115
	* lra.c (lra_rtx_hash): Properly hash CONST_INT values.

	2019-01-30  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/89115
	* opts.c (default_options_optimization): Reduce
	PARAM_MAX_DSE_ACTIVE_LOCAL_STORES by a factor of 10 at -O1.
	Make PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP reduction relative
	to the default.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/lra.c
    branches/gcc-8-branch/gcc/opts.c
Comment 9 Richard Biener 2019-01-31 12:05:50 UTC
Author: rguenth
Date: Thu Jan 31 12:05:19 2019
New Revision: 268418

URL: https://gcc.gnu.org/viewcvs?rev=268418&root=gcc&view=rev
Log:
2019-01-31  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2019-01-31  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/89115
	* lra.c (lra_rtx_hash): Properly hash CONST_INT values.

	2019-01-30  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/89115
	* opts.c (default_options_optimization): Reduce
	PARAM_MAX_DSE_ACTIVE_LOCAL_STORES by a factor of 10 at -O1.
	Make PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP reduction relative
	to the default.

Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/lra.c
    branches/gcc-7-branch/gcc/opts.c
Comment 10 Richard Biener 2019-01-31 12:07:56 UTC
Fixed for GCC 7.5/8.3.