This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug rtl-optimization/59811] [4.9/5/6 Regression] Huge increase in memory usage and compile time in combine
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 11 Feb 2016 10:42:56 +0000
- Subject: [Bug rtl-optimization/59811] [4.9/5/6 Regression] Huge increase in memory usage and compile time in combine
- Auto-submitted: auto-generated
- References: <bug-59811-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
So SCCVN is limited by --param sccvn-max-alias-queries-per-access, which has
a quite high default value (1000). There is nothing more elaborate than that,
such as an overall budget on top of the per-access limit.
Lowering the default value to 100 I get:
 alias stmt walking    : 12.15 (16%) usr  0.06 (12%) sys 12.23 (16%) wall       2 kB ( 0%) ggc
 ...
 tree PRE              :  9.48 (13%) usr  0.01 ( 2%) sys  9.48 (13%) wall    1333 kB ( 0%) ggc
 tree FRE              :  9.91 (13%) usr  0.03 ( 6%) sys  9.92 (13%) wall     310 kB ( 0%) ggc
 ...
 combiner              : 31.87 (43%) usr  0.24 (46%) sys 32.10 (43%) wall  528441 kB (80%) ggc
 rest of compilation   : 51.05 (69%) usr  0.31 (60%) sys 51.25 (69%) wall  620347 kB (94%) ggc
 TOTAL                 : 73.85             0.52            74.34           661274 kB
(see the disclaimer about things not adding up). Since we run FRE twice and
PRE only once (though PRE uses a cheaper walking mode that ends up giving up
earlier), a reasonable result would be FRE taking around twice the PRE time.
Note that with this testcase, which has loads of stores and loads
(non-aliased, of course), we run into the above limit all the time. And we
do seem to optimize things, since raising the limit results in a faster
overall compile time.
I think we need to limit the walking more intelligently, and/or design some
caching of queries that works better, or simply make the queries cheaper.
Most of the query cost comes from decomposing our reference trees and
repeatedly computing the overall access offset/size (get_ref_base_and_extent).
So if we could cache that, we might win.
The number of loads/stores is not that bad in the testcase: we have
1700 stores and 547 loads after fre1, and 1142 stores and 547 loads after fre2.
Before fre1 we had 1714 stores and 15334 loads. So it's clearly fre1 that
runs into this compile-time hog (it performs at most #loads * #stores alias
queries). OTOH we _do_ want to optimize those 14000 redundant loads!
It simply looks like the defs are far away (which happens, for example, if
you have an initializer at the start of the function that we can optimize from).
The 547 loads after fre1 is actually with the limit raised to 10000; with
the default limit we still have 5548 loads, and with it lowered to 100 we
get 11616 loads.
Note that compile time would be worse if there were nothing to optimize here
(then you hit the worst case of #loads * #stores alias queries).