Bug 62291 - PRE uses too much memory and compile-time
Summary: PRE uses too much memory and compile-time
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.9.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: compile-time-hog, memory-hog
Depends on:
Blocks: 47344
  Show dependency treegraph
Reported: 2014-08-28 13:08 UTC by Richard Biener
Modified: 2014-09-02 13:58 UTC (History)
0 users

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2014-08-29 00:00:00

preprocessed testcase (145.55 KB, application/octet-stream)
2014-08-28 13:08 UTC, Richard Biener

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2014-08-28 13:08:31 UTC
Created attachment 33409 [details]
preprocessed testcase

For the attached testcase (from GHC) PRE blows up memory and compile-time wise
at -O2, -O2 -fno-tree-pre and -O1 are fine.

Usual suspect is AVAIL_OUT which was only eliminated for FRE.
Comment 1 Richard Biener 2014-08-29 12:40:22 UTC
Author: rguenth
Date: Fri Aug 29 12:39:50 2014
New Revision: 214727

URL: https://gcc.gnu.org/viewcvs?rev=214727&root=gcc&view=rev
2014-08-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/62291
	* tree-ssa-pre.c (sorted_array_from_bitmap_set): Reserve
	exactly the vector size needed and use quick_push.
	(phi_translate_1): Adjust comment.
	(valid_in_sets): Remove block argument and remove pointless
	checking of NAMEs.
	(dependent_clean): Adjust for removal of block argument.
	(clean): Likewise.
	(compute_antic_aux): Likewise.
	(compute_partial_antic_aux): Likewise.

Comment 2 Richard Biener 2014-08-29 12:51:05 UTC
As said in the mail for the just applied patch:

"The second patch (once done) will refactor insertion phase
to do a dominator walk similar to what elimination does
computing AVAIL_OUT on-the-fly (and only keeping it live
for a single block).  It's a bit more complicated than
for elimination as insertion looks at a block and all its
predecessors, it's also a bit more costly as we iterate
insertion.  It's also more costly because computing
AVAIL_OUT on-the-fly requires looking at all statements
which insertion currently doesn't do at all.
So I've not yet settled on a final idea how to best re-factor it,
eventually insertion and elimination can be merged (and then
still iterate(?)).

Ideas welcome ;)"
Comment 3 Richard Biener 2014-09-01 13:31:12 UTC
So what remains is insertion computing whether a value is available on some
predecessor (partial redundancy) or all (full redundancy) which needs AVAIL_OUT
for the predecessors and the immediate dominator.

Then the actual insertion code needs to lookup leaders for expression operands
it wants to insert.  I think this part can be easily delayed until eliminate
if we store all necessary information in some new data structure.

The first part is harder - especially as it is that what needs to be iterated
(with taking into account newly inserted stuff).  One could imagine
iterating over sucessors of a block (and immediate dominated-bys) and compute
"candidate" sets, both pruning and extending it during the iteration.
Comment 4 Richard Biener 2014-09-02 13:58:41 UTC
Btw, the testcase is absymally slow in compiling even with -O1 if you enable
glibc memory checking with both(!) MALLOC_CHECK_=3 and MALLOC_PERTURB_=69

I have no idea why... (constant overhead seems to be on the order of a
factor of tree for pure malloc/free/realloc workload, but compiling
the testcase goes from 30s to infinity - aborted after an hour).

We certainly do some stupid reallocs (c.f. vec::safe_grow[_cleared]
doing a reserve_exact and thus some vectors grow element-by-element).