Bug 55135

Summary: Segfault of gcc on a big file
Product: gcc Reporter: benoit.barbot
Component: middle-endAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: dimhen, hubicka, mpolacek, octoploid, rguenth, steven, webrown.cpp
Priority: P3 Keywords: compile-time-hog, memory-hog
Version: 4.8.0   
Target Milestone: ---   
Host: Target:
Build: Known to work: 4.6.3
Known to fail: 10.2.1, 4.7.2, 4.8.0 Last reconfirmed: 2024-02-19 00:00:00
Bug Depends on:    
Bug Blocks: 47344    
Attachments: Collected hacks to make the test case compile in reasonable time with -O0

Description benoit.barbot 2012-10-30 10:18:20 UTC
When i compile the given file with gcc, i obtained a segfault:
>gcc LHA.ii -v    
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-8) 
COLLECT_GCC_OPTIONS='-v' '-mtune=generic'
 /usr/lib/gcc/x86_64-linux-gnu/4.4.5/cc1plus -fpreprocessed LHA.ii -quiet -dumpbase LHA.ii -mtune=generic -auxbase LHA -version -o /tmp/ccDvgh6v.s
GNU C++ (Debian 4.4.5-8) version 4.4.5 (x86_64-linux-gnu)
	compiled by GNU C version 4.4.5, GMP version 4.3.2, MPFR version 3.0.0-p3.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 5a2e15051eaa06a84cf6320b754ba993
gcc: Internal error: Erreur de segmentation (program cc1plus)

The code in LHA.ii is generated which explained why it is so big.
The compilation work with similar but smaller file generated by the same
program. It also fail on similar and bigger file generated by the same program

If the crash is due to the size of the file gcc should return an error instead
of a segfault.
Comment 1 benoit.barbot 2012-10-30 10:28:22 UTC
The file is too big to be attached. Here is a URL where you can find it: 
http://www.lsv.ens-cachan.fr/~barbot/cosmos/files/buggcc.ii
Comment 2 Marek Polacek 2012-10-30 10:39:43 UTC
Please try newer version of GCC.
Comment 3 Paolo Carlini 2012-10-30 11:48:23 UTC
And if you can still reproduce, please attach here a preprocessed file of manageable size. Thanks.
Comment 4 benoit.barbot 2012-10-30 13:29:04 UTC
When i remove line in the file the segfault disappear. The size of the file seams to trigger the segfault.
Comment 5 Markus Trippelsdorf 2012-10-30 14:03:47 UTC
I can't reproduce the crash, but what's interesting are the compile times
(without optimization just "-c buggcc.ii"):

clang++:     26.95 total
gcc-4.6.3: 1:39.92 total
gcc-4.7.2: 6:04.07 total
gcc-4.8  : 7:16.84 total
Comment 6 benoit.barbot 2012-10-31 16:31:37 UTC
I try the same file but on a different computer with a newer version of gcc(gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3) with the same problem:
>g++ buggcc.ii
g++: erreur interne du compilateur: Processus arrêté (program cc1plus)
Comment 7 Markus Trippelsdorf 2012-10-31 16:45:43 UTC
(In reply to comment #6)
> I try the same file but on a different computer with a newer version of gcc(gcc
> (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3) with the same problem:
> >g++ buggcc.ii
> g++: erreur interne du compilateur: Processus arrêté (program cc1plus)

Make sure you have enough RAM (over 4GB) to compile this testcase.
Check your dmesg for the OOM-killer.
Comment 8 Paolo Carlini 2013-02-27 16:20:47 UTC
I can't reproduce the crash either. I'm not sure if we should keep the PR open as regards compile-time performance issues (or we have already a similar testcase in Bugzilla?) Steven?
Comment 9 Steven Bosscher 2013-02-28 22:49:16 UTC
I first tried at -O0, only to run into even worse compile time issues,
hitting quadratic behavior in the number of EH regions, and having a
huge number of them:

 void LHA::Load();; remove_queued_eh_handlers: # eh regions: 179972

The remove_queued_eh_handlers function is new, I'll attach a patch
here after proper testing. With that problem out of the way, the next
hurdle is IRA but I'm still trying to figure that one out.
Comment 10 Steven Bosscher 2013-02-28 23:58:38 UTC
Created attachment 29557 [details]
Collected hacks to make the test case compile in reasonable time with -O0

Patch does 2 things:

- Queue up to-be-removed EH regions, instead of removing them one-by-one.
  Removing them one at a time results in walking the list of EH regions
  repeatedly, thus taking O(# of EH regions ** 2) time.

- Rewrite init_subregs_of_mode and subroutines to first collect the
  invalid mode change subregs in sbitmaps, and then converting the final
  sbitmap to a bitmap. This trades memory for time: the bitmap lookups are
  also potentially O(# of registers ** 2) and this test case has more than
  one million registers, many of them with invalid mode changes (to be fixed
  up by IRA/LRA).

Peak memory at -O0 is <4GB. Compile time on a "Quad-Core AMD Opteron(tm) Processor 8354" at 2200MHz is 240s, half of it still taken up by IRA+LRA.

At -O1 the einline pass is consuming almost all compile time again.
-> Honza: Can we please have a proper permanent fix for this recurring
problem? What's there now just Does Not Work!
Comment 11 Steven Bosscher 2013-03-01 00:01:47 UTC
This is a regression on various things from previous releases.

I will take care of the compile time explosion at -O0.

The -O1+ compile time explosion (and probably the memory explosion) are
due to the ever-changing inliner heuristics that still just don't scale.
Comment 12 Steven Bosscher 2013-03-01 07:50:43 UTC
Last night's compilation at -O1 with my hacks applied finished after
a whopping >6 hours. Top compile time consumers:

 early inlining heuristics: 12409.92 (55%) usr
 integration              :  1417.65 ( 6%) usr
 tree eh                  :  1140.75 ( 5%) usr
 tree PTA                 :   309.46 ( 1%) usr
 tree SSA incremental     :  6065.43 (27%) usr
 tree split crit edges    :   515.67 ( 2%) usr
 TOTAL                    : 22448.04

Peak memory: 4294647808 (~4GB).
Comment 13 Steven Bosscher 2013-03-01 07:52:41 UTC
For reference, my numbers are for GCC 4.8 trunk r196182, configured
with release checking, on x86_64-unknown-linux-gnu, on AMD Opteron
Processor 8354 at 2200MHz.
Comment 14 Paolo Carlini 2013-03-01 09:35:41 UTC
Thanks Steven for analyzing / fixing this.
Comment 15 Richard Biener 2013-03-01 10:44:10 UTC
(In reply to comment #10)
> Created attachment 29557 [details]
> Collected hacks to make the test case compile in reasonable time with -O0
> 
> Patch does 2 things:
> 
> - Queue up to-be-removed EH regions, instead of removing them one-by-one.
>   Removing them one at a time results in walking the list of EH regions
>   repeatedly, thus taking O(# of EH regions ** 2) time.

This (properly cleaned up) looks reasonable to me.

> - Rewrite init_subregs_of_mode and subroutines to first collect the
>   invalid mode change subregs in sbitmaps, and then converting the final
>   sbitmap to a bitmap. This trades memory for time: the bitmap lookups are
>   also potentially O(# of registers ** 2) and this test case has more than
>   one million registers, many of them with invalid mode changes (to be fixed
>   up by IRA/LRA).

Hmm - this is because we hit the O(n) complexity we have on our bitmap
implementation?  Can't we improve init_subregs_of_mode by first collecting
all mode changes we see for a pseudo (eventually using DF info?) and
then do the processing in some more optimal order?

Trading memory O(number of pseudos) with a large constant factor sounds
like something waiting for trouble for other testcases ...

> Peak memory at -O0 is <4GB. Compile time on a "Quad-Core AMD Opteron(tm)
> Processor 8354" at 2200MHz is 240s, half of it still taken up by IRA+LRA.
> 
> At -O1 the einline pass is consuming almost all compile time again.
> -> Honza: Can we please have a proper permanent fix for this recurring
> problem? What's there now just Does Not Work!
Comment 16 Steven Bosscher 2013-03-01 14:35:20 UTC
(In reply to comment #15)
> > - Queue up to-be-removed EH regions, instead of removing them one-by-one.
> >   Removing them one at a time results in walking the list of EH regions
> >   repeatedly, thus taking O(# of EH regions ** 2) time.
> This (properly cleaned up) looks reasonable to me.

It's not yet complete, I think I need to update the outer region pointers
for the inner region if an outer region is removed. But I think this is
the right approach.


> > - Rewrite init_subregs_of_mode and subroutines to first collect the
> >   invalid mode change subregs in sbitmaps, and then converting the final
> >   sbitmap to a bitmap. This trades memory for time: the bitmap lookups are
> >   also potentially O(# of registers ** 2) and this test case has more than
> >   one million registers, many of them with invalid mode changes (to be fixed
> >   up by IRA/LRA).
> Hmm - this is because we hit the O(n) complexity we have on our bitmap
> implementation?

Yes.

>  Can't we improve init_subregs_of_mode by first collecting
> all mode changes we see for a pseudo (eventually using DF info?) and
> then do the processing in some more optimal order?

Yes. That is the plan, this was just a proof-of-concept fix (I didn't call
it a patch, I called it a hack - for the good reasons you mentioned :-).

I also want to add a better way to lookup bits as random-access in bitmaps:
change the "view" of the bitmap, much like what tree-ssa-live does with its
maps).
Comment 17 Jan Hubicka 2013-03-01 16:14:08 UTC
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55135
> 
> --- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-01 07:50:43 UTC ---
> Last night's compilation at -O1 with my hacks applied finished after
> a whopping >6 hours. Top compile time consumers:
> 
>  early inlining heuristics: 12409.92 (55%) usr
>  integration              :  1417.65 ( 6%) usr
>  tree eh                  :  1140.75 ( 5%) usr
>  tree PTA                 :   309.46 ( 1%) usr
>  tree SSA incremental     :  6065.43 (27%) usr
>  tree split crit edges    :   515.67 ( 2%) usr
>  TOTAL                    : 22448.04
> 
> Peak memory: 4294647808 (~4GB).

I will take care of the early inlining problem.  I wonder, you don't have oprofile of that, by any chance?

Honza
Comment 18 Jan Hubicka 2013-03-01 16:19:47 UTC
> I will take care of the early inlining problem.  I wonder, you don't have oprofile of that, by any chance?

Aha, callee walking in update_inline_summary. Perhaps I will really need to
make this incremental despite the risk of accmulating roundoff errors..  I will
play with this a bit more.

Honza
Comment 19 Steven Bosscher 2013-03-01 19:13:32 UTC
(In reply to comment #18)

I thought you had already done that, to handle attribute flatten for
bug 54146 (http://gcc.gnu.org/PR54146#c43). This test case doesn't use
the flatten attribute, but explodes in the same way as the test case
for bug 54146.
Comment 20 Steven Bosscher 2013-03-01 21:05:00 UTC
(In reply to comment #15)
> Trading memory O(number of pseudos) with a large constant factor sounds
> like something waiting for trouble for other testcases ...

FWIW, for  this particular test case, almost all registers end up in
the set. I'm not sure why.  And since there are NUM_MACHINE_MODES bits
per register (NUM_MACHINE_MODES==87 on x86) the supposed-to-be sparse
bitmaps are huge (>100,000 bitmap_elements).

I have a fix proper for this problem in testing.
Comment 21 Steven Bosscher 2013-03-05 14:45:32 UTC
Author: steven
Date: Tue Mar  5 14:45:23 2013
New Revision: 196464

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=196464
Log:
gcc/
	PR c++/55135
	* except.h (remove_unreachable_eh_regions): New prototype.
	* except.c (remove_eh_handler_splicer): New function, split out
	of remove_eh_handler.
	(remove_eh_handler): Use remove_eh_handler_splicer.  Add comment
	warning about running it on many EH regions one at a time.
	(remove_unreachable_eh_regions_worker): New function, walk the
	EH tree in depth-first order and remove non-marked regions.
	(remove_unreachable_eh_regions): New function.
	* tree-eh.c (mark_reachable_handlers): New function, split out
	from remove_unreachable_handlers.
	(remove_unreachable_handlers): Use mark_reachable_handlers and
	remove_unreachable_eh_regions.
	(remove_unreachable_handlers_no_lp): Use mark_reachable_handlers
	and remove_unreachable_eh_regions.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/except.c
    trunk/gcc/except.h
    trunk/gcc/tree-eh.c
Comment 22 Steven Bosscher 2013-03-05 16:40:40 UTC
(In reply to comment #20)
> I have a fix proper for this problem in testing.

Posted for discussion here:
http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00193.html
Comment 23 Richard Biener 2013-03-06 10:47:18 UTC
PR47344 tracks the regression property of this bug.
Comment 24 Steven Bosscher 2013-03-06 12:09:26 UTC
(In reply to comment #23)
> PR47344 tracks the regression property of this bug.

?! This is also a regression from GCC 4.6 (commen #5), how in the world 
does that qualify as an "old regression"?
Comment 25 Steven Bosscher 2013-03-06 12:10:31 UTC
(NB 4.6.3 known to work w.r.t. comment #5, not w.r.t. original bug report)
Comment 26 rguenther@suse.de 2013-03-06 12:14:03 UTC
On Wed, 6 Mar 2013, steven at gcc dot gnu.org wrote:

> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55135
> 
> Steven Bosscher <steven at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>       Known to work|                            |4.6.3
>       Known to fail|                            |4.7.2, 4.8.0
> 
> --- Comment #24 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-06 12:09:26 UTC ---
> (In reply to comment #23)
> > PR47344 tracks the regression property of this bug.
> 
> ?! This is also a regression from GCC 4.6 (commen #5), how in the world 
> does that qualify as an "old regression"?

Ah, just because nobody has tried 4.5 doesn't say it isn't a regression
in 4.6!

(what is a regression in compile-time / memory-usage?  technically
I'd say if T2 > T1 or M2 > M1 it's a regression ... welcome to
the world of an ever increasing number of open "regressions")
Comment 27 Richard Biener 2013-03-06 12:16:06 UTC
Btw, I wouldn't call

> gcc-4.6.3: 1:39.92 total

"work" ;)  Also the reporter says the bug is in 4.4.5 (so we are again
turning a bug into a different bug ... :/)
Comment 28 Steven Bosscher 2013-03-06 12:18:01 UTC
(In reply to comment #22)
> Posted for discussion here:
> http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00193.html

OT: Another trivial speed-up for bitmaps used as regsets (and probably
in general) is to look at head->first if head->current is not the element
containing the sought bit, and *not* update head->current if head->first
is the right element.  This speeds up regsets because a common access
pattern is to look at sets containing both pseudos and hardregs, and on
most targets all hardregs are in head->first.  Not updating head->current
preserves a pointer to the latest accessed pseudos.

I'll implement this idea and come back with some timings.
Comment 29 rguenther@suse.de 2013-03-06 12:23:21 UTC
On Wed, 6 Mar 2013, steven at gcc dot gnu.org wrote:

> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55135
> 
> --- Comment #28 from Steven Bosscher <steven at gcc dot gnu.org> 2013-03-06 12:18:01 UTC ---
> (In reply to comment #22)
> > Posted for discussion here:
> > http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00193.html
> 
> OT: Another trivial speed-up for bitmaps used as regsets (and probably
> in general) is to look at head->first if head->current is not the element
> containing the sought bit, and *not* update head->current if head->first
> is the right element.  This speeds up regsets because a common access
> pattern is to look at sets containing both pseudos and hardregs, and on
> most targets all hardregs are in head->first.  Not updating head->current
> preserves a pointer to the latest accessed pseudos.
> 
> I'll implement this idea and come back with some timings.

Indeed a nice idea ;)  I suppose ->current should only be updated
if its new distance to head->first is bigger than <magic number>
(and zero is of course an obvious one)
Comment 30 Jan Hubicka 2019-11-21 15:34:52 UTC
Reconfirmed that we still take ages to build the testcase (early inliner is still running for me)

The early inliner issue here is caused by tree-inline removing individual clones one by one.  Each time a clone is removed a new clone becomes a root of the clone tree and it takes long time to update all pointers.
Comment 31 Richard Biener 2020-09-28 07:24:27 UTC
GCC 10.2 takes 1.8s and ~500MB for parsing (-fsyntax-only), compiling at -O0
is 17.8s peaking at ~3GB of memory.

Time variable                                   usr           sys          wall               GGC
 phase parsing                      :   1.82 ( 10%)   0.71 ( 59%)   2.53 ( 13%)  516872 kB ( 29%)
 phase opt and generate             :  15.96 ( 90%)   0.48 ( 40%)  16.44 ( 87%) 1273663 kB ( 71%)
 callgraph ipa passes               :   0.91 (  5%)   0.15 ( 13%)   1.07 (  6%)  198535 kB ( 11%)
 df scan insns                      :   1.00 (  6%)   0.18 ( 15%)   1.20 (  6%)      19 kB (  0%)
 expand                             :   0.92 (  5%)   0.05 (  4%)   0.95 (  5%)  308140 kB ( 17%)
 integrated RA                      :   3.98 ( 22%)   0.06 (  5%)   4.02 ( 21%)  129194 kB (  7%)
 LRA non-specific                   :   1.04 (  6%)   0.01 (  1%)   1.02 (  5%)     868 kB (  0%)
 TOTAL                              :  17.79          1.20         19.00        1796092 kB

that's everything >= 5%

At -O1 things blow up as expected:

Time variable                                   usr           sys          wall               GGC
 callgraph functions expansion      : 335.20 ( 14%)   2.31 ( 51%) 337.66 ( 14%) 1570346 kB ( 41%)
 callgraph ipa passes               :2040.93 ( 86%)   1.43 ( 31%)2042.99 ( 86%) 1326596 kB ( 34%)
 ipa inlining heuristics            :1505.86 ( 63%)   0.01 (  0%)1506.33 ( 63%)  146038 kB (  4%)
 integration                        : 637.70 ( 27%)   1.61 ( 35%) 639.78 ( 27%)  968497 kB ( 25%)
 tree SSA incremental               : 110.12 (  5%)   0.00 (  0%) 110.16 (  5%)     410 kB (  0%)
 TOTAL                              :2379.80          4.54       2385.13        3854592 kB

and ~4GB of ram used.

Honza - you put the finger to it, can't we refactor things so we apply
this update in the caller after all stmts were processed?
Comment 32 Jan Hubicka 2020-09-28 08:16:18 UTC
> 
> Honza - you put the finger to it, can't we refactor things so we apply
> this update in the caller after all stmts were processed?

The clone tree issue should not be too hard to solve.  We need to keep
the top of tree as a deleted symbol rather then actual inline clone. I
plan to take a look. It was I think only inliner related compile time
issue I gave up on last stage2 because it was bit exotic.
I plan to look into this again soon and also change way we materialize
clones at the beggining of build and do that on demand instead. This
requires to cleanup some of old logic deciding on when function body
is stil needed for compilation.

Honza
> 
> -- 
> You are receiving this mail because:
> You are on the CC list for the bug.
Comment 33 Richard Biener 2024-02-19 12:35:04 UTC
Still as comment#31 says.

 ipa inlining heuristics            : 813.42 ( 67%)
 integration                        : 228.18 ( 19%)
 TOTAL                              :1219.52          7.03       1226.86         3438M