Created attachment 45565
preprocessed source

[forwarded from https://bugs.debian.org/918329]

This is a compile-time and memory hog, seen with the gcc-8-branch and also with the gcc-7 branch. The source files are rather big:

-rw-rw-r-- 1 doko doko 2884231 Jan 30 10:44 tagCircle49h12.c
-rw-rw-r-- 1 doko doko 1850759 Jan 30 10:44 tagCustom48h12.c
-rw-rw-r-- 1 doko doko 2137073 Jan 30 10:44 tagStandard52h13.c

Using a powerpc64le cross compiler, the attached preprocessed source needs 4G of memory with -O1. According to the bug reporter, compile times and memory usage with -O2 seem to be worse on AArch64 and POWER than on x86.
$ time powerpc64le-linux-gnu-gcc-8 -c -O1 tagCircle49h12.i

real 1m43.636s
user 1m40.884s
sys 0m1.759s
On x86_64-linux:

> /usr/bin/time gcc-8 -S tagCircle49h12.i
38.02user 0.19system 0:38.22elapsed 99%CPU (0avgtext+0avgdata 547252maxresident)k
0inputs+14440outputs (0major+157121minor)pagefaults 0swaps
> /usr/bin/time gcc-8 -S tagCircle49h12.i -O
49.24user 0.99system 0:50.23elapsed 99%CPU (0avgtext+0avgdata 3801468maxresident)k
320inputs+9808outputs (3major+962644minor)pagefaults 0swaps
> /usr/bin/time gcc-8 -S tagCircle49h12.i -O2
76.06user 0.17system 1:16.24elapsed 99%CPU (0avgtext+0avgdata 494480maxresident)k
0inputs+7280outputs (0major+140687minor)pagefaults 0swaps

There's a big function initializing an array, which is the culprit:

__attribute__((visibility("default"))) apriltag_family_t *tagCircle49h12_create()
{
   apriltag_family_t *tf = calloc(1, sizeof(apriltag_family_t));
   tf->name = strdup("tagCircle49h12");
   tf->h = 12;
   tf->ncodes = 65698;
   tf->codes = calloc(65698, sizeof(uint64_t));
   tf->codes[0] = 0x0000c6c921d8614aUL;
...
   tf->codes[65697] = 0x000092506b5ec3aaUL;
   tf->nbits = 49;
...

While compiling it we build up a lot of garbage as well:

Assembling functions:
<materialize-all-clones> <simdclone> tagCircle49h12_create {GC 3275766k -> 118073k} tagCircle49h12_destroy

Time variable                         usr           sys          wall               GGC
 dead store elim1         :   9.38 ( 19%)   0.74 ( 33%)  10.12 ( 20%) 3048002 kB ( 88%)
 LRA reload inheritance   :  32.58 ( 66%)   0.00 (  0%)  32.60 ( 63%)       0 kB (  0%)
 TOTAL                    :  49.57          2.27         51.85        3446647 kB

Maybe we can disable reload inheritance with some limit, at least at -O1 (which is what we intend to "support" for insane testcases). Vlad?

On trunk we have (with detailed-mem-stats and release checking):

 dead store elim1         :  44.34 ( 49%)   6.42 ( 76%)  50.76 ( 51%) 3048002 kB ( 89%)
 LRA reload inheritance   :  32.51 ( 36%)   0.00 (  0%)  32.51 ( 33%)       0 kB (  0%)

It looks like there is no -fno-lra-inheritance, but "not doing" inheritance is supported, as seen by the existence of LRA_MAX_INHERITANCE_PASSES (hard-defined to 2 rather than being a --param or conditional on the optimization level).
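The shape of the offending function described above (one calloc followed by tens of thousands of constant stores) could be mimicked with a small generator, which may help when bisecting the threshold at which compile time or memory explodes. This is only a sketch: emit_store_testcase, synthetic_create and the constants are made up for illustration and are not the real tag codes, so the generated file may or may not behave exactly like the attached source.

```c
#include <stdint.h>
#include <stdio.h>

/* Write a synthetic C translation unit to OUT whose single function
   performs N_STORES constant stores into a calloc'ed array, mimicking
   the shape of tagCircle49h12_create.  Like the original, the generated
   code does not check the calloc result.  All names and constants in
   the output are hypothetical.
   Returns the number of store statements written, or -1 on a write error.  */
static int
emit_store_testcase (FILE *out, unsigned n_stores)
{
  int written = 0;

  if (fprintf (out,
               "#include <stdint.h>\n"
               "#include <stdlib.h>\n"
               "uint64_t *synthetic_create (void)\n"
               "{\n"
               "  uint64_t *codes = calloc (%u, sizeof (uint64_t));\n",
               n_stores) < 0)
    return -1;

  for (unsigned i = 0; i < n_stores; i++)
    {
      /* Arbitrary 48-bit pattern so the stored constants vary.  */
      uint64_t value = ((uint64_t) i * 0x9E3779B97F4A7C15ULL) >> 16;

      if (fprintf (out, "  codes[%u] = 0x%016llxUL;\n",
                   i, (unsigned long long) value) < 0)
        return -1;
      written++;
    }

  if (fprintf (out, "  return codes;\n}\n") < 0)
    return -1;
  return written;
}
```

Calling emit_store_testcase on a file opened with fopen and compiling the result with -O1 and -ftime-report for increasing store counts would be one way to see how the two passes scale.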
All of the memory goes here:

explow.c:198 (plus_constant)   3025M: 95.8%   0 : 0.0%   72 : 0.0%   0 : 0.0%   126M

(not very informative, might also be a sign of a target[hook] issue).
callgrind computes lra_inheritance -> inherit_in_ebb -> htab_find_slot -> ... -> rtx_equal_p as the most time-consuming part. That's from insert_invariant. Likely the hash function is bad for this particular testcase (no hash statistics are printed for this hashtable). But we call htab_find_slot 462 000 times and invariant_eq_p 2 150 000 000 times. Likely we have many (mem (plus (symbol-ref) CONST_INT)) with different constants, but lra_rtx_hash does

    case SCRATCH:
    case CONST_DOUBLE:
    case CONST_INT:
    case CONST_VECTOR:
      return val;

which means it ignores the actual constant value (for whatever reason)?

Doing a simple

Index: gcc/lra.c
===================================================================
--- gcc/lra.c   (revision 268383)
+++ gcc/lra.c   (working copy)
@@ -1719,10 +1719,12 @@ lra_rtx_hash (rtx x)
 
     case SCRATCH:
     case CONST_DOUBLE:
-    case CONST_INT:
     case CONST_VECTOR:
       return val;
 
+    case CONST_INT:
+      return val + UINTVAL (x);
+
     default:
       break;
     }

improves compile time to

> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O
18.82user 0.90system 0:19.73elapsed 100%CPU (0avgtext+0avgdata 3789340maxresident)k
0inputs+9808outputs (0major+933676minor)pagefaults 0swaps

For sub-fmts of CONST_INT the hash function already performs this operation.

 dead store elim1         :  10.20 ( 54%)   0.77 ( 36%)  10.96 ( 52%) 3048002 kB ( 89%)
 LRA reload inheritance   :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)       0 kB (  0%)

I'm going to test something like the above. It does nothing for the memory use, though.
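The effect of a hash that ignores the constant can be reproduced in isolation. The following is a toy model only: it uses a simple chained table rather than GCC's hashtab, and toy_addr, hash_ignoring_offset, hash_with_offset and count_key_comparisons are hypothetical names introduced for this sketch. It counts how many key comparisons are needed to insert distinct "symbol + offset" keys with and without the offset contributing to the hash.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for an address like (plus (symbol_ref) (const_int)).  */
struct toy_addr
{
  int symbol;
  uint64_t offset;
};

typedef size_t (*toy_hash_fn) (const struct toy_addr *);

/* Models the old behaviour: the constant does not contribute.  */
static size_t
hash_ignoring_offset (const struct toy_addr *a)
{
  return (size_t) a->symbol;
}

/* Models the patched behaviour: the constant is added in.  */
static size_t
hash_with_offset (const struct toy_addr *a)
{
  return (size_t) a->symbol + (size_t) a->offset;
}

struct toy_node
{
  struct toy_addr key;
  struct toy_node *next;
};

/* Insert N distinct keys (one symbol, offsets 0, 8, 16, ...) into a
   chained table with NBUCKETS (nonzero) buckets and return how many key
   comparisons the lookups needed.  Returns 0 if allocation fails.  */
static unsigned long long
count_key_comparisons (toy_hash_fn hash, size_t n, size_t nbuckets)
{
  struct toy_node **buckets = calloc (nbuckets, sizeof *buckets);
  struct toy_node *nodes = calloc (n, sizeof *nodes);
  unsigned long long comparisons = 0;

  if (buckets != NULL && nodes != NULL)
    for (size_t i = 0; i < n; i++)
      {
        struct toy_addr key = { 1, (uint64_t) i * 8 };
        size_t b = hash (&key) % nbuckets;
        int found = 0;

        /* Walk the chain, comparing against every colliding entry.  */
        for (struct toy_node *p = buckets[b]; p != NULL && !found; p = p->next)
          {
            comparisons++;
            found = (p->key.symbol == key.symbol
                     && p->key.offset == key.offset);
          }

        if (!found)
          {
            nodes[i].key = key;
            nodes[i].next = buckets[b];
            buckets[b] = &nodes[i];
          }
      }

  free (nodes);
  free (buckets);
  return comparisons;
}
```

With 1000 keys and 1031 buckets, the offset-ignoring hash puts every key into one chain and needs 1000 * 999 / 2 comparisons, while the offset-aware hash spreads these particular keys over distinct buckets and needs none. That quadratic growth is the same pattern as the ratio of invariant_eq_p calls to htab_find_slot calls reported above.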
The DSE thing is (of course) alias queries and, there, find_base_term. 200 000 calls to check_mem_read_use result in 25 600 000 calls to canon_true_dependence. I suppose we could cache the result of find_base_term and have a canon_true_dependence_with_bases. Eventually DSE should just give up on overly long next_local_store chains.

Btw, the plus_constant calls all originate from true_dependence_1 ending up calling get_addr, which re-builds RTL that must already exist somewhere. Thus in the end the memory use originates from the excessive number of alias queries done by DSE. Even that get_addr() part could be cached, though: canon_true_dependence_with_bases_and_addrs.

Unfortunately --param max-dse-active-local-stores is a bail-out thing, so we need to cross a magic barrier, which is somewhere between 1000 and 1250:

> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=5000 (default)
18.59user 1.03system 0:19.62elapsed 99%CPU (0avgtext+0avgdata 3787968maxresident)k
0inputs+9808outputs (0major+933497minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=2500
18.13user 1.09system 0:19.22elapsed 99%CPU (0avgtext+0avgdata 3787792maxresident)k
0inputs+9904outputs (0major+934009minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=2000
18.57user 0.98system 0:19.56elapsed 99%CPU (0avgtext+0avgdata 3786852maxresident)k
0inputs+9808outputs (0major+933789minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=1500
18.71user 1.01system 0:19.74elapsed 99%CPU (0avgtext+0avgdata 3789372maxresident)k
0inputs+9808outputs (0major+933920minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=1250
18.54user 0.94system 0:19.49elapsed 99%CPU (0avgtext+0avgdata 3788452maxresident)k
0inputs+9808outputs (0major+933435minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=1000
7.63user 0.22system 0:07.86elapsed 99%CPU (0avgtext+0avgdata 715704maxresident)k
0inputs+9808outputs (0major+170563minor)pagefaults 0swaps
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=500
7.66user 0.24system 0:07.90elapsed 100%CPU (0avgtext+0avgdata 717116maxresident)k
> /usr/bin/time /abuild/rguenther/obj/gcc/cc1 -quiet tagCircle49h12.i -O --param max-dse-active-local-stores=250
7.73user 0.16system 0:07.90elapsed 100%CPU (0avgtext+0avgdata 715960maxresident)k
0inputs+9904outputs (0major+170918minor)pagefaults 0swaps

I am testing

Index: gcc/opts.c
===================================================================
--- gcc/opts.c  (revision 268383)
+++ gcc/opts.c  (working copy)
@@ -670,7 +670,16 @@ default_options_optimization (struct gcc
   /* For -O1 only do loop invariant motion for very small loops.  */
   maybe_set_param_value
     (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP,
-     opt2 ? default_param_value (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP) : 1000,
+     opt2 ? default_param_value (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP)
+     : default_param_value (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP) / 10,
+     opts->x_param_values, opts_set->x_param_values);
+
+  /* For -O1 reduce the maximum number of active local stores for RTL DSE
+     since this can consume huge amounts of memory (PR89115).  */
+  maybe_set_param_value
+    (PARAM_MAX_DSE_ACTIVE_LOCAL_STORES,
+     opt2 ? default_param_value (PARAM_MAX_DSE_ACTIVE_LOCAL_STORES)
+     : default_param_value (PARAM_MAX_DSE_ACTIVE_LOCAL_STORES) / 10,
      opts->x_param_values, opts_set->x_param_values);
 
   /* At -Ofast, allow store motion to introduce potential race conditions.  */
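The arithmetic behind the patch can be checked with a tiny stand-alone sketch. scaled_param_value is a hypothetical helper, not a GCC function, and it assumes that opt2 means "optimization level of at least 2". The default of 5000 is the value marked "(default)" in the timings above.

```c
/* Hypothetical helper mirroring the expression used in the patch:
   keep the default at -O2 and above, use a tenth of it otherwise.
   Assumes opt2 corresponds to an optimization level >= 2.  */
static int
scaled_param_value (int default_value, int optimize_level)
{
  int opt2 = optimize_level >= 2;

  return opt2 ? default_value : default_value / 10;
}
```

With a default of 5000 this yields 500 at -O1, which is below the barrier observed between 1000 and 1250, while -O2 and above keep 5000.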
Author: rguenth
Date: Wed Jan 30 15:11:04 2019
New Revision: 268394

URL: https://gcc.gnu.org/viewcvs?rev=268394&root=gcc&view=rev
Log:
2019-01-30  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/89115
        * opts.c (default_options_optimization): Reduce
        PARAM_MAX_DSE_ACTIVE_LOCAL_STORES by a factor of 10 at -O1.
        Make PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP reduction relative
        to the default.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/opts.c
Author: rguenth
Date: Thu Jan 31 08:09:59 2019
New Revision: 268414

URL: https://gcc.gnu.org/viewcvs?rev=268414&root=gcc&view=rev
Log:
2019-01-31  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/89115
        * lra.c (lra_rtx_hash): Properly hash CONST_INT values.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra.c
On trunk, compile time at -O1 should now be reasonable; I am currently testing backports. The DSE issue still exists at -O2+, but compared to the LRA issue it was "minor".
Author: rguenth
Date: Thu Jan 31 10:00:26 2019
New Revision: 268416

URL: https://gcc.gnu.org/viewcvs?rev=268416&root=gcc&view=rev
Log:
2019-01-31  Richard Biener  <rguenther@suse.de>

        Backport from mainline
        2019-01-31  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/89115
        * lra.c (lra_rtx_hash): Properly hash CONST_INT values.

        2019-01-30  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/89115
        * opts.c (default_options_optimization): Reduce
        PARAM_MAX_DSE_ACTIVE_LOCAL_STORES by a factor of 10 at -O1.
        Make PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP reduction relative
        to the default.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/lra.c
    branches/gcc-8-branch/gcc/opts.c
Author: rguenth
Date: Thu Jan 31 12:05:19 2019
New Revision: 268418

URL: https://gcc.gnu.org/viewcvs?rev=268418&root=gcc&view=rev
Log:
2019-01-31  Richard Biener  <rguenther@suse.de>

        Backport from mainline
        2019-01-31  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/89115
        * lra.c (lra_rtx_hash): Properly hash CONST_INT values.

        2019-01-30  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/89115
        * opts.c (default_options_optimization): Reduce
        PARAM_MAX_DSE_ACTIVE_LOCAL_STORES by a factor of 10 at -O1.
        Make PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP reduction relative
        to the default.

Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/lra.c
    branches/gcc-7-branch/gcc/opts.c
Fixed for GCC 7.5/8.3.