Bug 54146 - Very slow compile with attribute((flatten))
Summary: Very slow compile with attribute((flatten))
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog, memory-hog
Depends on:
Blocks:
 
Reported: 2012-07-31 18:33 UTC by Marc Glisse
Modified: 2020-05-22 11:53 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-07-31 00:00:00


Attachments
testcase (456.99 KB, application/x-gzip)
2012-07-31 18:33 UTC, Marc Glisse
Details
Speed up stack var conflict matric computation (1.73 KB, patch)
2012-07-31 22:16 UTC, Steven Bosscher
Details | Diff
Slighty less broken version of the previous attachment (2.60 KB, patch)
2012-07-31 22:59 UTC, Steven Bosscher
Details | Diff
Do not inline_merge_summary if called via flatten_function (916 bytes, text/plain)
2012-08-03 08:59 UTC, Steven Bosscher
Details
Speed up ifcvt.c:check_cond_move_block (2.69 KB, patch)
2012-08-05 21:13 UTC, Steven Bosscher
Details | Diff
Be memory friendlier in build_insn_chain (1.42 KB, patch)
2012-08-06 20:22 UTC, Steven Bosscher
Details | Diff
Do not traverse sibling loops (2.76 KB, patch)
2012-08-07 22:28 UTC, Steven Bosscher
Details | Diff
Path for inliner slowness (2.18 KB, patch)
2012-08-10 05:18 UTC, Jan Hubicka
Details | Diff
Faster rewrite_into_loop_closed_ssa (4.73 KB, patch)
2012-08-15 14:33 UTC, Steven Bosscher
Details | Diff
More dedicated obstacks (8.46 KB, patch)
2012-08-16 23:04 UTC, Steven Bosscher
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Marc Glisse 2012-07-31 18:33:30 UTC
Created attachment 27912 [details]
testcase

Hello,

trying to compile the attached program on x86_64, g++ slow.cc -frounding-math -c -w -O2 fails after about an hour. The memory consumption (as shown by top) never went above 1.5G. clang compiles the whole thing in 1 minute (well, it won't compile the preprocessed source attached here, but the original program).

If you remove most of the check<...> lines at then end (keep only the first one), the program takes about 10s to compile at -O0, and 30s at -O1 (or -O2, -O3), with 2/3 of the time spent in "expand vars" according to -ftime-report. The most relevant report I found in bugzilla is PR 38474, but it seems to say that expand vars should be fast now...
Comment 1 Andrew Pinski 2012-07-31 18:38:20 UTC
Are you compiling GCC with --enable-checking=release to do the timings?
Comment 2 Marc Glisse 2012-07-31 18:47:48 UTC
(In reply to comment #1)
> Are you compiling GCC with --enable-checking=release to do the timings?

I first noticed the issue with the compiler provided by debian (4.7.1), which has --enable-checking=release. I then checked with a snapshot from 3 days ago which doesn't have the option, and the results were about the same.
Comment 3 Steven Bosscher 2012-07-31 20:33:56 UTC
Time is spent in add_scope_conflicts() in this loop:

  FOR_EACH_BB (bb)
    add_scope_conflicts_1 (bb, work, true);
Comment 4 Steven Bosscher 2012-07-31 22:16:58 UTC
Created attachment 27915 [details]
Speed up stack var conflict matric computation

The patch speeds up the conflict matrix computation in three ways:

1. Nested EXECUTE_IF_SET_IN_BITMAP is extremely inefficient. The same conflicts result if one IORs the bitmaps.

2. Self-conflicts should not be recorded (this helps stack_var_conflict_p too).

3. The conflicts matrix can be stored in upper triangular form.


Debugging, bootstrapping, testing, and other activities involved in contributing this formally are left as an exercise to the reader... ;-)
Comment 5 Steven Bosscher 2012-07-31 22:59:53 UTC
Created attachment 27917 [details]
Slighty less broken version of the previous attachment
Comment 6 Steven Bosscher 2012-08-01 12:22:29 UTC
With my patch applied, and with a couple of lines commented out at the bottom:

//  check<CGAL::Quotient<CGAL::Gmpz> >();
//  check<CGAL::Lazy_exact_nt<CGAL::Gmpq> >();
//  check<CORE::BigInt>();
//  check<CORE::BigRat>();
//  check<CORE::BigFloat>();
//  check<CORE::Expr>();

and with the compiler patched with the patch from comment #5 and configured as:
$ cat configargs.h
/* Generated automatically. */
static const char configuration_arguments[] = "../trunk/configure --with-mpfr=/opt/cfarm/mpfr-latest --with-gmp=/opt/cfarm/gmp-latest --with-mpc=/opt/cfarm/mpc-latest --with-isl=/opt/cfarm/isl-latest --with-cloog=/opt/cfarm/cloog-latest --enable-languages=c,c++ --disable-nls --disable-libmudflap --disable-libssp --disable-libitm --disable-multilib --disable-bootstrap --enable-checking=release";
static const char thread_model[] = "posix";

static const struct {
  const char *name, *value;
} configure_default_options[] = { { "cpu", "generic" }, { "arch", "x86-64" } };


I have "TOTAL :8223.99" seconds compile time. Top 10 big spenders:

 inline heuristics       :6393.58 (78%) usr 175699 kB (27%) ggc
 tree loop init          : 270.76 ( 3%) usr 242922 kB (37%) ggc
 if-conversion           : 220.55 ( 3%) usr   3774 kB ( 1%) ggc
 integrated RA           : 199.12 ( 2%) usr 593696 kB (90%) ggc
 reload                  : 128.67 ( 2%) usr  52399 kB ( 8%) ggc
 tree SSA incremental    : 195.26 ( 2%) usr 160581 kB (24%) ggc
 tree SSA rewrite        : 112.87 ( 1%) usr  39761 kB ( 6%) ggc
 df live regs            :  79.90 ( 1%) usr      0 kB ( 0%) ggc
 df live&initialized regs:  77.14 ( 1%) usr      0 kB ( 0%) ggc
 out of ssa              :  63.59 ( 1%) usr    661 kB ( 0%) ggc

The inline heuristics stuff is probably also due to stack-vars handling, I will look into that.

The loop init stuff is something for Richi.

IRA produces the most garbage, by far, and in fact causes OOM if I do not comment out those last few lines.
Comment 7 Steven Bosscher 2012-08-01 14:08:01 UTC
(In reply to comment #6)
> The inline heuristics stuff is probably also due to stack-vars handling,
> I will look into that.

Turns out this is due to the use of attribute((flatten)) on gebp_kernel.
Comment 8 Steven Bosscher 2012-08-01 14:14:11 UTC
With the attribute((flatten)) removed, the full test case compiles in less than a minute.
Comment 9 Steven Bosscher 2012-08-01 14:24:48 UTC
clang compiles the test case with attribute((flatten)) because it doesn't support that attribute (http://llvm.org/bugs/show_bug.cgi?id=7559).

I'm beginning to think this is one of those cases of "Doctor it hurts if I ..." that should be closed as WONTFIX. Marc, do you know where the use of the flatten attribute comes from in your code?
Comment 10 Richard Biener 2012-08-01 14:32:15 UTC
(In reply to comment #9)
> clang compiles the test case with attribute((flatten)) because it doesn't
> support that attribute (http://llvm.org/bugs/show_bug.cgi?id=7559).
> 
> I'm beginning to think this is one of those cases of "Doctor it hurts if I ..."
> that should be closed as WONTFIX. Marc, do you know where the use of the
> flatten attribute comes from in your code?

indeed "flatten" will override any inlining heuristic that avoids creating
gigant function bodies.  Still eventually improving worst-case performance
of some of our algorithms sounds good, even if it will never fix all issues
that pop up when you for example put flatten on main () ;)
Comment 11 Marc Glisse 2012-08-01 15:59:43 UTC
(In reply to comment #9)
> I'm beginning to think this is one of those cases of "Doctor it hurts if I ..."
> that should be closed as WONTFIX.

Indeed, might even call it INVALID. Thanks a lot for the investigation and sorry for the bad example. Hopefully the patches you wrote are still useful.

> Marc, do you know where the use of the
> flatten attribute comes from in your code?

Comes from the Eigen library, I'll talk to them about it and see if they can comment. They mostly deal with simple number types (double, float), so the behavior with heavy types might have been unnoticed.
Comment 12 Steven Bosscher 2012-08-02 10:01:03 UTC
(In reply to comment #10)
> indeed "flatten" will override any inlining heuristic that avoids creating
> gigant function bodies.  Still eventually improving worst-case performance
> of some of our algorithms sounds good, even if it will never fix all issues
> that pop up when you for example put flatten on main () ;)

Feel free to have a stab at it, if you must.

It would be great if you could have a look at the patch I attached to this bug report for stack var conflicts (I'd finish it myself, but I won't have time for it before September or so).

Oh, and document your loop changes, because loop_optimizer_init accounts for ~20% of the compile time (half of it incorrectly accounted to if-conversion, I'll post a bunch oftimevar fixes this week).

Note that the flattened functions in themselves do not present much of a problem to the compiler (not at -O1, anyway), so the problem here is not the by-passing of the inlining heuristics per-se. It is the flattening operation itself that causes the compiler to explode. You can only see this by adding a timevar around flatten_function (I used TV_LOCAL_ALLOC for that, but I'll add a proper timevar for it :-) because the compiler as-is incorrectly accounts the time for flatten_function to TV_INLINE_HEURISTICS. There must be something non-linear in the flattening algorithm. I'm still trying to find out where, but it's kinda hard to debug a problem with a test case that takes literally hours to compile :-)
Comment 13 rguenther@suse.de 2012-08-02 10:35:56 UTC
On Thu, 2 Aug 2012, steven at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146
> 
> Steven Bosscher <steven at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>           Component|middle-end                  |tree-optimization
>             Summary|Very slow compile at -O1    |Very slow compile with
>                    |(expand vars)               |attribute((flatten))
> 
> --- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> 2012-08-02 10:01:03 UTC ---
> (In reply to comment #10)
> > indeed "flatten" will override any inlining heuristic that avoids creating
> > gigant function bodies.  Still eventually improving worst-case performance
> > of some of our algorithms sounds good, even if it will never fix all issues
> > that pop up when you for example put flatten on main () ;)
> 
> Feel free to have a stab at it, if you must.
> 
> It would be great if you could have a look at the patch I attached to this bug
> report for stack var conflicts (I'd finish it myself, but I won't have time for
> it before September or so).

It looks sensible - I hope Micha will pick it up though, I'll ping him
personally.

> Oh, and document your loop changes, because loop_optimizer_init accounts for
> ~20% of the compile time (half of it incorrectly accounted to if-conversion,
> I'll post a bunch oftimevar fixes this week).

It's on my list of things (for stage3, I really want to push more invasive
cleanups for stage1 ... heh).

> Note that the flattened functions in themselves do not present much of a
> problem to the compiler (not at -O1, anyway), so the problem here is not the
> by-passing of the inlining heuristics per-se. It is the flattening operation
> itself that causes the compiler to explode. You can only see this by adding a
> timevar around flatten_function (I used TV_LOCAL_ALLOC for that, but I'll add a
> proper timevar for it :-) because the compiler as-is incorrectly accounts the
> time for flatten_function to TV_INLINE_HEURISTICS. There must be something
> non-linear in the flattening algorithm. I'm still trying to find out where, but
> it's kinda hard to debug a problem with a test case that takes literally hours
> to compile :-)

I suppose it's the old issue that we update fibheap keys along each
inlining decision - and with flatten there are just very many ... Honza?
It also seems we flatten depth-first (thus inline leafs) instead of
the other way around:

      orig_callee = callee;
      inline_call (e, true, NULL, NULL);
      if (e->callee != orig_callee)
        orig_callee->symbol.aux = (void *) node;
      flatten_function (e->callee, early);
      if (e->callee != orig_callee)
        orig_callee->symbol.aux = NULL;

This means we will materialize all intermediate flattenings in
functions that will not be reclaimed, right?  flattening foo
should inline everything into foo, but not affect remaining
callee bodies.

Richard.
Comment 14 Steven Bosscher 2012-08-03 08:59:29 UTC
Created attachment 27930 [details]
Do not inline_merge_summary if called via flatten_function

As I noted in comment #12, the flattening itself is the problem, not the heuristics.  I have narrowed this down to inline_merge_summary. With the attached hack, compile time (for only "check<CGAL::Gmpfi>();") drops from 6459s to 1560s. I hi-jacked the unused timevar TV_LOCAL_ALLOC (from pre-IRA days) to measure the time spent in inline_call. Without my hack, this is 4801s, and with the hack it is only 0.37s (!). Do the math: 6459s-4801s=1658s~=1560s.
(NB: These timings are all with my stack_vars patch applied as well.)

With the hack, the remaining compile time is mostly spent in:
 tree loop init                  : 236.16
 integrated RA                   : 215.70
 if-conversion                   : 177.26 (but due to loop_optimizer_init)
 tree SSA incremental            : 154.69
 reload                          : 124.33

I don't understand how inline_merge_summary is supposed to work, so I'm going to leave that one for Richi and Honza.
Comment 15 Michael Matz 2012-08-03 14:43:14 UTC
Author: matz
Date: Fri Aug  3 14:43:09 2012
New Revision: 190126

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190126
Log:
	PR tree-optimization/54146
	* cfgexpand.c (add_scope_conflicts_1): Use bitmap_ior_into.
	(add_scope_conflicts): Iterate in RPO order.
	(add_stack_protection_conflicts): Iterate over the other triangle.
	(fini_vars_expansion): Clear stack_vars_sorted.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
Comment 16 Steven Bosscher 2012-08-05 11:38:36 UTC
Comment on attachment 27917 [details]
Slighty less broken version of the previous attachment

Patch obsolete, similar fix committed as r190126:
http://gcc.gnu.org/viewcvs?view=revision&revision=190126
Comment 17 Steven Bosscher 2012-08-05 18:48:55 UTC
(In reply to comment #14)
>  if-conversion             : 177.26 (but due to loop_optimizer_init)

Hmm, this is not loop_optimizer_init. All time is spent in the two memset calls in cond_move_process_if_block:

  /* Build a mapping for each block to the value used for each
     register.  */
  max_reg = max_reg_num ();
  size = (max_reg + 1) * sizeof (rtx);
  then_vals = (rtx *) alloca (size);
  else_vals = (rtx *) alloca (size);
  memset (then_vals, 0, size);
  memset (else_vals, 0, size);

There are O(1e6) registers in the test case, and O(1e5) basic blocks. So we end up memset'ing O(10)MB X O(1e5)times = O(1e6)MB (assuming 64 bits pointers so that each rtx is O(10) bytes).

I have a few ideas to fix this problem.
Comment 18 Steven Bosscher 2012-08-05 21:13:26 UTC
Created attachment 27946 [details]
Speed up ifcvt.c:check_cond_move_block

This speeds up the pre-reload if-conversion passes by using a smarter data structure to keep track of register values.
Comment 19 Richard Biener 2012-08-06 08:45:12 UTC
(In reply to comment #17)
> (In reply to comment #14)
> >  if-conversion             : 177.26 (but due to loop_optimizer_init)
> 
> Hmm, this is not loop_optimizer_init. All time is spent in the two memset calls
> in cond_move_process_if_block:
> 
>   /* Build a mapping for each block to the value used for each
>      register.  */
>   max_reg = max_reg_num ();
>   size = (max_reg + 1) * sizeof (rtx);
>   then_vals = (rtx *) alloca (size);
>   else_vals = (rtx *) alloca (size);
>   memset (then_vals, 0, size);
>   memset (else_vals, 0, size);
> 
> There are O(1e6) registers in the test case, and O(1e5) basic blocks. So we end
> up memset'ing O(10)MB X O(1e5)times = O(1e6)MB (assuming 64 bits pointers so
> that each rtx is O(10) bytes).
> 
> I have a few ideas to fix this problem.

Ick, I suppose similar issues exist on the tree level for passes that
think that memory / compile-time usage O(number-of-ssa-names or basic-blocks)
is "ok" (and I suppose it really _is_ ok ...?)
Comment 20 stevenb.gcc@gmail.com 2012-08-06 09:09:02 UTC
On Mon, Aug 6, 2012 at 10:45 AM, rguenth at gcc dot gnu.org
<gcc-bugzilla@gcc.gnu.org> wrote:
> Ick, I suppose similar issues exist on the tree level for passes that
> think that memory / compile-time usage O(number-of-ssa-names or basic-blocks)
> is "ok" (and I suppose it really _is_ ok ...?)

I think this is OK, yes. In general, anything linear in some measure
of the function should be OK.
NB, this thing in ifcvt is *not* O(max_reg_num()) but
O(max_reg_num()*n_basic_blocks).

I suspect that in the tree optimizers, a non-linear issue exists for
rewriting into loop-closed SSA form. It looks like it is
O(num_ssa_names*n_basic_blocks). I'm trying to confirm that (in my
evening free time, so don't hold your breath ;-).
Comment 21 Steven Bosscher 2012-08-06 15:36:04 UTC
(In reply to comment #20)

Yay, it's always nice to be right the first time when diagnosing a problem.

The tree loop optimizers spend 285s out of 1360s total compile time (with my flatten hack and ifcvt patch applied) in compute_global_livein. That's 21% of the total compile time.

I have a few ideas for how to fix this.
Comment 22 Steven Bosscher 2012-08-06 19:36:35 UTC
IRA/reload spends a rather significant amount of time here:

  FOR_EACH_BB_REVERSE (bb)
    { 
      bitmap_iterator bi;
      rtx insn;

      CLEAR_REG_SET (live_relevant_regs);
-->   memset (live_subregs_used, 0, max_regno * sizeof (int));

This is running with max_regno=O(1e6) and n_basic_blocks=O(1e5) ...
Comment 23 Steven Bosscher 2012-08-06 20:22:02 UTC
Created attachment 27953 [details]
Be memory friendlier in build_insn_chain

My first ever reload patch! :-)
Comment 24 Steven Bosscher 2012-08-06 20:55:37 UTC
(In reply to comment #23)
> Created attachment 27953 [details]

Needs this extra bit:

diff -u ira.c ira.c
--- ira.c       (working copy)
+++ ira.c       (working copy)
@@ -3539,7 +3539,8 @@
   *p = NULL;
 
   for (i = 0; i < (unsigned int) max_regno; i++)
-    sbitmap_free (live_subregs[i]);
+    if (live_subregs[i] != NULL)
+      sbitmap_free (live_subregs[i]);
   free (live_subregs);
   BITMAP_FREE (live_subregs_used);
   BITMAP_FREE (live_relevant_regs);


With that, reload time goes down to ~14s, down from ~124s.
Comment 25 Steven Bosscher 2012-08-06 22:42:12 UTC
(In reply to comment #21)
> The tree loop optimizers spend 285s out of 1360s total compile time (with my
> flatten hack and ifcvt patch applied) in compute_global_livein. That's 21% of
> the total compile time.

Hmm, I've underestimated this problem. It's not a simple matter of speeding up compute_global_livein with a few caches. I've collected some statistics:
* The largest function has 408254 basic blocks
* The largest number of blocks visited in compute_global_livein is 185939
* compute_global_livein looked at 148056 edges to visit those blocks
* The maximum size of the work list was only 6419

185939 is the number of basic blocks that end up in livein. That is a bitmap, so most time is spent in traversing bitmap linked lists.

Perhaps this can be fixed with a fibheap, which may improve the access pattern on the livein bitmap.
Comment 26 Steven Bosscher 2012-08-06 22:58:22 UTC
(In reply to comment #25)
> 185939 is the number of basic blocks that end up in livein. That is a bitmap,
> so most time is spent in traversing bitmap linked lists.

Oh, and this doesn't happen just once, but several 100 times.
Only to prune the set to (livein & loop_exits) ...
Comment 27 Richard Biener 2012-08-07 11:59:07 UTC
Martin, can you look at comment #14 and the patch?  I think what we want to
do in flatten_function is before

      inline_call (e, true, NULL, NULL);

reset the edge predicates so that inline_merge_summary becomes very cheap.
Unfortunately that beast seems to have no early out (but instead uses
true_predicate () ...).  Can we speed it up for the case where we just
want "fast" operation rather than precise accounting of sizes/time in the
inlined-to caller?
Comment 28 Steven Bosscher 2012-08-07 19:58:00 UTC
To illustrate the rewrite_into_closed_loop_ssa problem, consider this test case:

extern void use1 (int);
extern void use2 (int);
extern int confuse_loop (void);

void
foo (void)
{
  int i, j, k;

  for (i = 0; i < 100; i++)
    {
      for (j = 0; j < 100; j++)
        {
          k = i + j;
          if (j > 2)
            use1 (k);
          if (confuse_loop ())
            break;
        }
      use2 (k);
    }
}

The only PHI needed for loop-closed SSA is one for k before "use2(k)". 

One problem is that GCC ends up inserting two PHI nodes (one on the exit after confuse_loop(), which is wasteful if the exit block is empty. So problem 1 is that too many PHIs are inserted.

The second problem is that with multiple loop exits, a large part of the CFG is walked in compute_global_livein. The CFG for the test case (compiled with -fno-tree-ch) looks like this:

             +---<------------ 14 ------<-----+
E           /                                  \          E
N          /                                    \         X
T -> 2 -  9 --> 3 --> 4 --> 5 -> 11 ----------> 7 -> 8 -> I
R              / \         / \                 /          T
Y             /   +-> 10 -+   +-> 6 -> 13 ->--+
              \                   /
               +----<-- 12 --<---+

And the following blocks are visited (with a patch I will attach in a minute -- without the patch even more blocks are visited). Block 3 is the block containing the def, so the CFG walk ends there.

XXX visiting block 7
XXX visiting block 13
XXX visiting block 6
XXX visiting block 5
XXX visiting block 4
XXX visiting block 10
XXX visiting block 11
;; Created LCSSA PHI: k_4 = PHI <k_14(5)>
;; Created LCSSA PHI: k_8 = PHI <k_14(6)>

;; Resulting GIMPLE function IR:

foo ()
{
  int k;
  int j;
  int i;
  int D.1727;

  <bb 2>:
  goto <bb 9>;

  <bb 12>:

  <bb 3>:
  # j_29 = PHI <j_18(12), 0(9)>
  k_14 = i_28 + j_29;
  if (j_29 > 2)
    goto <bb 4>;
  else
    goto <bb 10>;

  <bb 10>:
  goto <bb 5>;

  <bb 4>:
  use1 (k_14);

  <bb 5>:
  D.1727_17 = confuse_loop ();
  if (D.1727_17 != 0)
    goto <bb 11>;
  else
    goto <bb 6>;

  <bb 11>:
  # k_4 = PHI <k_14(5)>
  goto <bb 7>;

  <bb 6>:
  j_18 = j_29 + 1;
  if (j_18 <= 99)
    goto <bb 12>;
  else
    goto <bb 13>;

  <bb 13>:
  # k_8 = PHI <k_14(6)>

  <bb 7>:
  # k_21 = PHI <k_4(11), k_8(13)>
  use2 (k_21);
  i_20 = i_28 + 1;
  if (i_20 <= 99)
    goto <bb 14>;
  else
    goto <bb 8>;

  <bb 8>:
  return;

  <bb 14>:

  <bb 9>:
  # i_28 = PHI <i_20(14), 0(2)>
  goto <bb 3>;

}
Comment 29 Steven Bosscher 2012-08-07 22:28:11 UTC
Created attachment 27957 [details]
Do not traverse sibling loops

The idea here is to note that for a nested loop we know for sure that the loop exits into a common loop father must all need PHI nodes, because we have have loop pre-headers. I'm not sure what the effect of this is on irreducible loops, and I haven't tested this much yet. Comments sought...
Comment 30 Steven Bosscher 2012-08-07 22:36:30 UTC
> Created attachment 27957 [details]

With the attachment, time spent in rewrite_into_loop_closed_ssa is almost 0 (and that includes the time in the verifier). Compile time for the test case (still with only "check<CGAL::Gmpfi>();") is now 924s wall time. Top spenders:

integrated RA                   : 224.26 (24%)
tree SSA incremental            :  75.07 ( 8%)
df live&initialized regs        :  69.58 ( 8%)
df live regs                    :  66.19 ( 7%)
remove unused locals            :  53.43 ( 6%)
out of ssa                      :  51.93 ( 6%)
Comment 31 Steven Bosscher 2012-08-08 06:28:16 UTC
Author: steven
Date: Wed Aug  8 06:28:10 2012
New Revision: 190222

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190222
Log:
	PR middle-end/54146
	* ifcvt.c: Include pointer-set.h.
	(cond_move_process_if_block): Change type of then_regs and
	else_regs from alloca'd array to pointer_sets.
	(check_cond_move_block): Update for this change.
	(cond_move_convert_if_block): Likewise.
	* Makefile.in: Fix dependencies for ifcvt.o.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/ifcvt.c
Comment 32 Steven Bosscher 2012-08-08 06:29:16 UTC
Author: steven
Date: Wed Aug  8 06:29:12 2012
New Revision: 190223

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190223
Log:
	PR middle-end/54146
	* ira.c (init_live_subregs): Take live_subregs_used as a bitmap.
	(build_insn_chain): Make live_subregs_used a bitmap.
	Use SBITMAP_SIZE to ignore the paradoxical bytes of subregs.
	Use sbitmap_free to free the live_subreg sbitmaps.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira.c
Comment 33 Richard Biener 2012-08-08 08:09:25 UTC
(In reply to comment #29)
> Created attachment 27957 [details]
> Do not traverse sibling loops
> 
> The idea here is to note that for a nested loop we know for sure that the loop
> exits into a common loop father must all need PHI nodes, because we have have
> loop pre-headers. I'm not sure what the effect of this is on irreducible loops,
> and I haven't tested this much yet. Comments sought...

Nice results.  A few comments - if you rely on preheaders then please assert
in rewrite_into_loop_closed_ssa that loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS).

Irreducible loops are "one" loop to cfgloop and to loop-closed SSA form, so
they should be fine.

What do you mean by "for a nested loop we know for sure that the loop
exits into a common loop father must all need PHI nodes"?  What happens for
nested loops that do not exit into a common loop father?  That is what's
"common" here?  I think what you want to say is that a loop cannot exit
to a child of a sibling of a father, thus:

  for (;;)
    for (;;)
      goto x;
  for (;;)
    x:

cannot happen in the sense that loop detection would not detect the loop
nest as written (but had a single irreducible loop)?

I think you should simply move compute_global_livein to its single use
and make it static.
Comment 34 Steven Bosscher 2012-08-08 10:10:46 UTC
(In reply to comment #33)
> I think you should simply move compute_global_livein to its single use
> and make it static.

Yes, and I need to add the same smarts there as in find_uses_to_rename_use to look through loops at the same nesting level, because this test case fails:

extern unsigned use (unsigned);
void foo (void)
{
  unsigned i, j;
  do { i = use (0); } while (i);
  do { j = use (0); } while (j);
  if (i) use (j);
}

It fails in check_loop_closed_ssa_use at -O2 but passes at -O2 -fno-tree-vrp, so VRP is doing something destroying the initially correct loop-closed SSA form.
Comment 35 Richard Biener 2012-08-08 11:49:21 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > I think you should simply move compute_global_livein to its single use
> > and make it static.
> 
> Yes, and I need to add the same smarts there as in find_uses_to_rename_use to
> look through loops at the same nesting level, because this test case fails:
> 
> extern unsigned use (unsigned);
> void foo (void)
> {
>   unsigned i, j;
>   do { i = use (0); } while (i);
>   do { j = use (0); } while (j);
>   if (i) use (j);
> }
> 
> It fails in check_loop_closed_ssa_use at -O2 but passes at -O2 -fno-tree-vrp,
> so VRP is doing something destroying the initially correct loop-closed SSA
> form.

Jump threading I suppose.  I think you can drop loop-closed SSA form
right before calling vrp_finalize ().
Comment 36 Steven Bosscher 2012-08-08 17:39:49 UTC
Author: steven
Date: Wed Aug  8 17:39:46 2012
New Revision: 190235

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190235
Log:
	PR middle-end/54146
	* gimpify.c (gimplify_body): Only verify_gimple_in_seq with
	checking enabled.
	* tree-ssa-loop-manip.c (add_exit_phis_var): Assert that var is
	a gimple_reg if checking is enabled.
	(find_uses_to_rename_stmt): Only look at non-virtual USE operands.
	* tree-into-ssa (compute_global_livein): Change the worklist
	type from an array to a VEC.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimplify.c
    trunk/gcc/tree-into-ssa.c
    trunk/gcc/tree-ssa-loop-manip.c
Comment 37 Jan Hubicka 2012-08-10 03:45:31 UTC
> 
> I suppose it's the old issue that we update fibheap keys along each
> inlining decision - and with flatten there are just very many ... Honza?

Well, I killed most of updating for mainline and at flattening time the keys are not
even computed.  So it is probably some other bookeping.  I will profile it.

> It also seems we flatten depth-first (thus inline leafs) instead of
> the other way around:
> 
>       orig_callee = callee;
>       inline_call (e, true, NULL, NULL);
>       if (e->callee != orig_callee)
>         orig_callee->symbol.aux = (void *) node;
>       flatten_function (e->callee, early);
>       if (e->callee != orig_callee)
>         orig_callee->symbol.aux = NULL;
> 
> This means we will materialize all intermediate flattenings in
> functions that will not be reclaimed, right?  flattening foo
> should inline everything into foo, but not affect remaining
> callee bodies.

I do not follow here.  Callee is already the inline clone, so we will
not affect other copies of foo...

Honza
> 
> Richard.
> 
> -- 
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
Comment 38 Jan Hubicka 2012-08-10 03:50:26 UTC
> I don't understand how inline_merge_summary is supposed to work, so I'm going
> to leave that one for Richi and Honza.
Well, it produces inline summary for function that is result of inlininig based on summaries
of the former functions.  You can't just bypass it for flattening or the summary
of the function after flatting will be off. Obviously the function does fair amount of
propagation. I will try to speed it up or limit the bounds. Since the summaries are of
constant size, we should not explode here.

Honza
Comment 39 Jan Hubicka 2012-08-10 03:53:46 UTC
> Martin, can you look at comment #14 and the patch?  I think what we want to
> do in flatten_function is before
> 
>       inline_call (e, true, NULL, NULL);
> 
> reset the edge predicates so that inline_merge_summary becomes very cheap.

Well, this will still result in (conservatively) wrong inline summary for
flattened function. I am not sure if "flatten" attribute should laso mean
"do not care about inlining of the flattened function much".
I think we want to handle this generically even if it is harder for
small function inliner to explode here because it does a lot less cascaded
inlining.

> Unfortunately that beast seems to have no early out (but instead uses
> true_predicate () ...).  Can we speed it up for the case where we just
> want "fast" operation rather than precise accounting of sizes/time in the
> inlined-to caller?

I will look.

Honza
Comment 40 Jan Hubicka 2012-08-10 04:34:54 UTC
OK,
my simple ^C profiling shows:
#14 0x000000000091f17f in estimate_edge_size_and_time (e=0x7fffb7192d68, size=<optimized out>, time=0x7fffc04110b0, prob=10000) at ../../gcc/ipa-inline-analysis.c:2183
#15 0x000000000091f23a in estimate_calls_size_and_time (node=0x7fffb7193af8, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2274
#16 0x000000000091f2d1 in estimate_calls_size_and_time (node=0x7fffb71939c0, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277
#17 0x000000000091f2d1 in estimate_calls_size_and_time (node=0x7fffb7193888, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277
#18 0x000000000091f2d1 in estimate_calls_size_and_time (node=0x7fffb7191c30, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277
#19 0x000000000091f2d1 in estimate_calls_size_and_time (node=0x7fffb7180c30, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277
#20 0x000000000091f2d1 in estimate_calls_size_and_time (node=0x7fffb7147d68, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277
#21 0x000000000091f2d1 in estimate_calls_size_and_time (node=0x7fffcbb039c0, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, 
    known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277
#22 0x00000000009232a1 in inline_merge_summary (edge=0x7fffb6fba068) at ../../gcc/ipa-inline-analysis.c:2687
#23 0x0000000000ed0cdf in inline_call (e=0x7fffb6fba068, update_original=<optimized out>, new_edges=0x0, overall_size=0x0) at ../../gcc/ipa-inline-transform.c:246
#24 0x0000000000ece3a9 in flatten_function (node=0x7fffb6fb9ea0, early=1 '\001') at ../../gcc/ipa-inline.c:1605
#25 0x0000000000ece3bf in flatten_function (node=0x7fffb6fb4c30, early=1 '\001') at ../../gcc/ipa-inline.c:1608
#26 0x0000000000ece3bf in flatten_function (node=0x7fffb6f87270, early=1 '\001') at ../../gcc/ipa-inline.c:1608
#27 0x0000000000ece3bf in flatten_function (node=0x7fffcbb039c0, early=1 '\001') at ../../gcc/ipa-inline.c:1608
#28 0x0000000000ece616 in early_inliner () at ../../gcc/ipa-inline.c:1881

so the time is actually not spent in the predicate merging logic, but in computing overall size/time of the function after inline operation is performed. This is because of following call in inline_merge_summary:

  estimate_calls_size_and_time (to, &info->size, &info->time,
                                ~(clause_t)(1 << predicate_false_condition),
                                NULL, NULL);

Here we simply recompute the size/time by walking all callees of the function and since this involve recursive walk of the inlined callees that are many we hit quadratic time complexity.
Ugly.  One (particularly ugly) workaround would be to make inline_merge_summary to skip updating after each step of flattening because flattening do not care, but that is quite a kludge (similar to Steven's patch but with one extra update on the very end).

Other solution is to work hard enough to update everything incrementally, but it is not 100% trivial (that is why I simply recompute). 
I actually tought of this case and concluded that if we do so deep inline trees we will blow up somewhere else.  Quite an achivement that Steven managed to chase out all the other cases.

Honza
Comment 41 Jan Hubicka 2012-08-10 05:18:52 UTC
Created attachment 27979 [details]
Path for inliner slowness

Hi,
this is patch I am testing. After some consideration I do not like the incremental update idea much. It asks for bookkeeping issues and has roundoff problems so unless it is really necessary I will try to stick with the recomputation scheme. This is more careful variant of Steven's patch. We skip the recomputation after each incremental inline and update the info once we are done with the flattening.

Honza
Comment 42 Steven Bosscher 2012-08-10 06:32:37 UTC
(In reply to comment #40)
> Quite an achivement that Steven managed to
> chase out all the other cases.

Thanks for the compliment :-)

I'm still working on the rewrite_into_loop_closed_ssa slowness. I will have a patch ready this weekend.

Next: On to -O2 :-)
Comment 43 Jan Hubicka 2012-08-10 07:52:31 UTC
Author: hubicka
Date: Fri Aug 10 07:52:23 2012
New Revision: 190283

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190283
Log:

	PR middle-end/54146
	* ipa-inline-transform.c (inline_call): Add UPDATE_OVERALL_SUMMARY
	parameter; honnor it.
	* ipa-inline.c (recursive_inlining): Update call
	of inline_call.
	(inline_small_functions): Likewise.
	(ipa_inline): Likewise.
	(inline_always_inline_functions): Likewise.
	(early_inline_small_functions): Likewise.
	(flatten_function): Do separate update of summary info.
	* ipa-inline.h (inline_update_overall_summary): Declare.
	(inline_call): Update.
	* ipa-inline-analysis.c (inline_merge_summary): Break out
	updating code to ...
	(inline_update_overall_summary): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline-transform.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
Comment 44 Richard Biener 2012-08-14 12:38:41 UTC
Author: rguenth
Date: Tue Aug 14 12:38:32 2012
New Revision: 190382

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190382
Log:
2012-08-14  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/54146
	* tree-ssa-pre.c (do_regular_insertion): Use a VEC
	indexed by pred edge index for avail.
	(do_partial_partial_insertion): Likewise.
	(insert_into_preds_of_block): Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
Comment 45 Marc Glisse 2012-08-14 13:20:55 UTC
(In reply to comment #11)
> > Marc, do you know where the use of the
> > flatten attribute comes from in your code?
> Comes from the Eigen library, I'll talk to them about it and see if they can
> comment. They mostly deal with simple number types (double, float), so the
> behavior with heavy types might have been unnoticed.

For the record:
after doing some benchmarks, they have removed the attribute, it didn't improve performance anymore (though it probably used to in the past).

I have to say I am extremely impressed by all the improvements you guys have done based on this bad code (which I realize is now just a handy example of big code with the flatten attribute high enough in the call chain). If you keep at it, next time I might not even notice there is something wrong with the code ;-)
Comment 46 Steven Bosscher 2012-08-15 14:33:27 UTC
Created attachment 28020 [details]
Faster rewrite_into_loop_closed_ssa

This reduces time spent in rewrite_into_loop_closed_ssa to something too small to show up in the timevar measurements.
Comment 47 Steven Bosscher 2012-08-15 15:07:05 UTC
(In reply to comment #46)
> Created attachment 28020 [details]
> Faster rewrite_into_loop_closed_ssa

After this patch, IRA is the only major bottle-neck left, although there are still a few passes that take more time than I think is reasonable (out-of-SSA in particular). New top 10 spenders:

 integrated RA           : 227.36 (30%) usr     678602 kB (19%) ggc
 tree SSA incremental    :  66.17 ( 9%) usr     151828 kB ( 4%) ggc
 out of ssa              :  43.76 ( 6%) usr        608 kB ( 0%) ggc
 tree PTA                :  43.06 ( 6%) usr      27466 kB ( 1%) ggc
 df live regs            :  35.03 ( 5%) usr          0 kB ( 0%) ggc
 df live&initialized regs:  34.25 ( 5%) usr          0 kB ( 0%) ggc
 tree SSA rewrite        :  18.37 ( 2%) usr      24888 kB ( 1%) ggc
 remove unused locals    :  17.97 ( 2%) usr          0 kB ( 0%) ggc
 dominance computation   :  15.06 ( 2%) usr          0 kB ( 0%) ggc
 tree CFG cleanup        :  11.79 ( 2%) usr       1123 kB ( 0%) ggc
Comment 48 Steven Bosscher 2012-08-16 10:52:20 UTC
Author: steven
Date: Thu Aug 16 10:52:14 2012
New Revision: 190442

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190442
Log:
	PR middle-end/54146
	* tree-flow.h (compute_global_livein): Remove prototype.
	* tree-into-ssa.c (compute_global_livein): Remove function.
	* tree-ssa-loop-manip.c: Include gimple-pretty-print.h.
	(find_sibling_superloop): New function.
	(compute_live_loop_exits): New function.
	(add_exit_phis_edge): Rename to add_exit_phi.  Do not allow
	inserting a PHI in a block that is not a loop exit for VAR.
	Add dumping if TDF_DETAILS.
	(add_exit_phis_var): Rewrite.
	(add_exit_phis): Update.
	(get_loops_exits): Rewrite to return an array of per-loop exits
	rather than one bitmap with all loop exits.
	(find_uses_to_rename_bb): Ignore virtual PHI nodes.
	(rewrite_into_loop_closed_ssa): Update.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-flow.h
    trunk/gcc/tree-into-ssa.c
    trunk/gcc/tree-ssa-loop-manip.c
Comment 49 Richard Biener 2012-08-16 12:10:56 UTC
We still use very much memory (full testcase doesn't fit in 4GB ram).  With

...
  check<CGAL::Gmpfr>();
  //check<CGAL::Gmpfi>();
  //check<CGAL::Quotient<CGAL::Gmpz> >();
  //check<CGAL::Lazy_exact_nt<CGAL::Gmpq> >();
  //check<CORE::BigInt>();
  //check<CORE::BigRat>();
  //check<CORE::BigFloat>();
  //check<CORE::Expr>();

I see (--enable-gather-detailed-mem-stats):

Kind                   Nodes      Bytes
---------------------------------------
decls                1116380  181227400
types                 535841   90021288
blocks                222522   17801760
stmts                  45749    2068872
refs                  854577   43178392
exprs                2147881   95117920
constants             140132    4176820
identifiers            73710    6486480
vecs                 2654190  131466648
binfos                 48162    4873432
ssa names             339369   27149520
constructors           26426     634224
random kinds         1885177   73894984
lang_decl kinds       352117   13918872
lang_type kinds        48218    7329136
omp clauses                0          0
---------------------------------------
Total                10490451  699345748

a lot of memory in TREE_VECs for some reason.

GIMPLE statements
Kind                   Stmts      Bytes
---------------------------------------
assignments           316019   30602376
phi nodes              54994   15777472
conditionals           26090    2504640
everything else       237509   23110772
---------------------------------------
Total                 634612   71995260

gimple is lean, so is RTL ;)

Alloc-pool Kind         Elt size  Pools  Allocated (elts)            Peak (elts)            Leak (elts)
--------------------------------------------------------------------------------------------------------------
live ranges                40        513   19250760(    481269)   10800800(    270020)          0(         0)
df_scan ref base           56       1026  331010008(   5910893)   14059808(    251068)          0(         0)
df_scan ref artificial     64       1026   20113600(    314275)    4239872(     66248)          0(         0)
df_scan ref regular        64       1026   66557568(   1039962)    4543872(     70998)          0(         0)

are by far the biggest alloc-pool users.

bitmap stats are confusing because they show leaks for bitmaps we free
by releasing their obstack.  DF and PTA bitmaps are largest.

We leak some VECs ...

c-family/c-pragma.c:619 (push_visibility)                24: 0.0%         24               1: 0.0%
cp/pt.c:471 (maybe_begin_member_template_process         24: 0.0%         24               1: 0.0%
function.c:4513 (push_struct_function)                   40: 0.0%         40               1: 0.0%
vec.c:307 (vec_stack_p_reserve_exact_1)                  40: 0.0%         40               1: 0.0%
tree-ssa-loop-ivopts.c:3070 (multiplier_allowed_        328: 0.0%        608               3: 0.0%
tree-ssa-loop-ivopts.c:3153 (get_address_cost)          328: 0.0%        608               3: 0.0%
tree-ssa-sccvn.c:745 (copy_reference_ops_from_re        392: 0.0%     806232          102098: 4.6%
cfgloop.h:583 (fel_init)                                480: 0.0%        860             106: 0.0%
c-family/c-pragma.c:1246 (c_register_pragma_1)          584: 0.0%        696               4: 0.0%
function.c:156 (push_function_context)                  976: 0.0%       1200               8: 0.0%
ira.c:3699 (find_moveable_pseudos)                     1240: 0.0%     221128             513: 0.0%
passes.c:2188 (execute_one_pass)                       4360: 0.1%     655320           16466: 0.7%
tree-ssa-structalias.c:3870 (handle_lhs_call)          9576: 0.2%      18360             133: 0.0%
ipa-ref.c:55 (ipa_record_reference)                   60184: 1.1%     327640            5813: 0.3%
cfgloop.c:1143 (get_loop_exit_edges)                  73184: 1.3%     157888           62221: 2.8%
tree-into-ssa.c:940 (mark_phi_for_rewrite)           153360: 2.7%     164096              17: 0.0%
cfgloop.c:1134 (get_loop_exit_edges)                 166592: 3.0%     238712           11639: 0.5%
ipa-reference.c:184 (set_reference_optimization_     180248: 3.2%     248064              47: 0.0%
tree-into-ssa.c:321 (get_ssa_name_ann)               627448:11.2%     716496              14: 0.0%
tree-ssa-sccvn.c:3657 (extract_and_process_scc_f    1246864:22.3%    1291960          105903: 4.7%
tree-ssa-loop-im.c:1562 (record_mem_ref_loc)        1292560:23.1%    1392576           55465: 2.5%
tree-ssa-loop-im.c:1551 (record_mem_ref_loc)        1771800:31.7%    3141360           52717: 2.4%

I'll look at the loop and sccvn parts.
Comment 50 stevenb.gcc@gmail.com 2012-08-16 13:55:40 UTC
On Thu, Aug 16, 2012 at 2:10 PM, rguenth at gcc dot gnu.org
<gcc-bugzilla@gcc.gnu.org> wrote:
> bitmap stats are confusing because they show leaks for bitmaps we free
> by releasing their obstack.  DF and PTA bitmaps are largest.

The bitmap obstack stats are not very reliable anyway, I think. I've
been using my own stats for this (with a size_t version of
obstack_memory_used:

+static size_t
+obstack_memory_used2 (struct obstack *h)
+{
+  struct _obstack_chunk* lp;
+  size_t nbytes = 0;
+
+  for (lp = h->chunk; lp != 0; lp = lp->prev)
+    {
+      nbytes += (size_t) (lp->limit - (char *) lp);
+    }
+  return nbytes;
+}
+

so that also the "freed" bitmap elements are counted). With that, you
can show that the reg_obstack and bitmap_default_obstack grow up to
several GB that isn't released between passes. An added problem is
that IRA puts its bitmap on its own obstack (as it should, really) but
that means the >3GB of free elements on the reg_obstack and
bitmap_default_obstack remain unused. So on the machine I use for
testing (cfarm gcc17), the memory footprint is reduced by >2GB (~25%)
with this hack:

Index: ira.c
===================================================================
--- ira.c       (revision 190442)
+++ ira.c       (working copy)
@@ -4132,6 +4132,12 @@
   int max_regno_before_ira, ira_max_point_before_emit;
   int rebuild_p;

+  /* There shouldn't be anything on these obstacks.  */
+  bitmap_obstack_release (NULL);
+  bitmap_obstack_initialize (NULL);
+  bitmap_obstack_release (&reg_obstack);
+  bitmap_obstack_initialize (&reg_obstack);
+
   if (flag_caller_saves)
     init_caller_save ();


There is in general a lot of BITMAP_ALLOC(NULL) abuse in the compiler.
I have patches to address the cases in tree-ssa-live.c and dse.c, and
I intend to look at the tree-ssa-ter and cfgexpand cases this weekend.
Comment 51 rguenther@suse.de 2012-08-16 14:06:06 UTC
On Thu, 16 Aug 2012, stevenb.gcc at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146
> 
> --- Comment #50 from stevenb.gcc at gmail dot com <stevenb.gcc at gmail dot com> 2012-08-16 13:55:40 UTC ---
> On Thu, Aug 16, 2012 at 2:10 PM, rguenth at gcc dot gnu.org
> <gcc-bugzilla@gcc.gnu.org> wrote:
> > bitmap stats are confusing because they show leaks for bitmaps we free
> > by releasing their obstack.  DF and PTA bitmaps are largest.
> 
> The bitmap obstack stats are not very reliable anyway, I think. I've
> been using my own stats for this (with a size_t version of
> obstack_memory_used:
> 
> +static size_t
> +obstack_memory_used2 (struct obstack *h)
> +{
> +  struct _obstack_chunk* lp;
> +  size_t nbytes = 0;
> +
> +  for (lp = h->chunk; lp != 0; lp = lp->prev)
> +    {
> +      nbytes += (size_t) (lp->limit - (char *) lp);
> +    }
> +  return nbytes;
> +}
> +
> 
> so that also the "freed" bitmap elements are counted). With that, you
> can show that the reg_obstack and bitmap_default_obstack grow up to
> several GB that isn't released between passes. An added problem is

Hum, I thought we release those obstacks after each pass ...
Comment 52 Richard Biener 2012-08-16 14:27:59 UTC
Author: rguenth
Date: Thu Aug 16 14:27:51 2012
New Revision: 190445

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190445
Log:
2012-08-16  Richard Guenther  <rguenther@suse.de>

	PR middle-end/54146
	* tree-ssa-loop-niter.c (find_loop_niter_by_eval): Free the
	exit vector.
	* ipa-pure-const.c (analyze_function): Use FOR_EACH_LOOP_BREAK.
	* cfgloop.h (FOR_EACH_LOOP_BREAK): Fix.
	* tree-ssa-structalias.c (handle_lhs_call): Properly free rhsc.
	* tree-into-ssa.c (get_ssa_name_ann): Allocate info only when
	needed.
	* tree-ssa-loop-im.c (analyze_memory_references): Adjust.
	(tree_ssa_lim_finalize): Free all mem_refs.
	* tree-ssa-sccvn.c (extract_and_process_scc_for_name): Free
	scc when bailing out.
	* modulo-sched.c (sms_schedule): Use FOR_EACH_LOOP_BREAK.
	* ira-build.c (loop_with_complex_edge_p): Free loop exit vector.
	* graphite-sese-to-poly.c (scop_ivs_can_be_represented): Use
	FOR_EACH_LOOP_BREAK.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgloop.h
    trunk/gcc/graphite-sese-to-poly.c
    trunk/gcc/ipa-pure-const.c
    trunk/gcc/ira-build.c
    trunk/gcc/modulo-sched.c
    trunk/gcc/tree-into-ssa.c
    trunk/gcc/tree-ssa-loop-im.c
    trunk/gcc/tree-ssa-loop-niter.c
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-ssa-structalias.c
Comment 53 Steven Bosscher 2012-08-16 23:04:04 UTC
Created attachment 28039 [details]
More dedicated obstacks

This, together with Richi's leak plugs from earlier today, brings peak memory down to 5GB. That's 5GB now, down from 9GB this morning. Not bad for a day's work.
Comment 54 Steven Bosscher 2012-08-17 09:42:15 UTC
Author: steven
Date: Fri Aug 17 09:42:06 2012
New Revision: 190475

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190475
Log:
	PR middle-end/54146
	* tree-ssa-loop-im.c (lim_bitmap_obstack): New bitmap_obstack.
	(memref_free): Don't free the bitmaps individually here.
	(mem_ref_alloc): Allocate the bitmaps on the new bitmap obstack.
	(analyze_memory_references): Likewise.
	(tree_ssa_lim_initialize): Initialize the new bitmap obstack.
	(tree_ssa_lim_finalize): Release it.
	* dse.c (dse_bitmap_obstack): New bitmap obstack.
	(dse_obstack): New obstack.
	(get_group_info): Allocate the bitmaps on the new bitmap obstack.
	(dse_step0): Allocate the scratch bitmap on reg_obstack.  Initialize
	the new bitmap obstack and normal obstack.  Use XNEWVEC for bb_table.
	(record_store): Allocate regs_set on reg_obstack.
	(dse_step1): Allocate regs_live on reg_obstack.
	(dse_step2_init): Allocate offset_map_n and offset_map_p on the new
	obstack.
	(dse_step3_scan): Allocate bitmaps on the new bitmap obstack.
	(dse_step3): Likewise.
	(dse_confluence_0): Likewise.
	(dse_confluence_n): Likewise.
	(dse_transfer_function): Likewise.
	(dse_step7): Destroy the new obstacks, and everything allocated on
	them, in one big sweep.
	(rest_of_handle_dse): Update.
	* cfgexpand.c (stack_var_bitmap_obstack): New bitmap obstack.
	(add_stack_var_conflict): Allocate bitmaps on it.
	(add_scope_conflicts_1): Likewise.
	(add_scope_conflicts): Likewise.
	(update_alias_info_with_stack_vars): Likewise.
	(init_vars_expansion): Move TREE_USED fiddling expand_used_vars.
	Initialize the new bitmap obstack.
	(fini_vars_expansion): Release it.
	(estimated_stack_frame_size): Use init_vars_expansion to set things up
	and always clean up at the end.
	(expand_used_vars): Do the TREE_USED trickery here.  Always call
	fini_vars_expansion.
	* tree-ssa-live.h (struct tree_live_info_d): Make livein and liveout
	arrays of bitmap_head to avoid one indirection per bitmap access.
	(live_on_entry, live_on_exit, live_var_map, live_merge_and_clear,
	make_live_on_entry): Update.
	* tree-ssa-live.c (partition_view_bitmap): Don't double-free 'used'.
	(liveness_bitmap_obstack): New bitmap obstack.
	(remove_unused_locals): Use it to allocate all bitmaps on.  Update
	for livein/liveout changes in tree-ssa-live.h.
	(delete_tree_live_info): Release the bitmap obstack.
	(loe_visit_block, live_worklist, set_var_live_on_entry,
	calculate_live_on_exit, dump_live_info): Update.
	(calculate_live_ranges): Initialize the bitmap.
	* tree-ssa-ter.c (ter_bitmap_obstack): New bitmap obstack.
	(new_temp_expr_table): Allocate bitmap on it.
	(make_dependent_on_partition, add_to_partition_kill_list,
	add_dependence, process_replaceable): Likewise.
	(find_replaceable_exprs): Initialize and release the new obstack here.
	* df-problems.c (df_lr_add_problem): Allocate persistent bitmap
	for out_of_date_transfer_functions on df_bitmap_obstack.
	(df_live_add_problem): Likewise.
	(df_chain_add_problem): Likewise.
	(df_word_lr_add_problem): Likewise.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/df-problems.c
    trunk/gcc/dse.c
    trunk/gcc/tree-ssa-live.c
    trunk/gcc/tree-ssa-live.h
    trunk/gcc/tree-ssa-loop-im.c
    trunk/gcc/tree-ssa-loop-manip.c
    trunk/gcc/tree-ssa-ter.c
Comment 55 Steven Bosscher 2012-08-20 19:30:34 UTC
The remaining problem areas are all liveness calculation routines that are essentially inherent quadratic problems: DF liveness, IRA liveness, and out-of-SSA liveness. 

I think it would be good to deprecate the flatten attribute...
Comment 56 rguenther@suse.de 2012-08-21 07:55:14 UTC
On Mon, 20 Aug 2012, steven at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146
> 
> Steven Bosscher <steven at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |RESOLVED
>          Resolution|                            |WONTFIX
> 
> --- Comment #55 from Steven Bosscher <steven at gcc dot gnu.org> 2012-08-20 19:30:34 UTC ---
> The remaining problem areas are all liveness calculation routines that are
> essentially inherent quadratic problems: DF liveness, IRA liveness, and
> out-of-SSA liveness. 
> 
> I think it would be good to deprecate the flatten attribute...

It still can be useful I think, if only for creating testcases
with arbitrary large functions ;)
Comment 57 Richard Biener 2012-08-21 13:34:28 UTC
Author: rguenth
Date: Tue Aug 21 13:34:19 2012
New Revision: 190562

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190562
Log:
2012-08-21  Richard Guenther  <rguenther@suse.de>

	Backport from mainline
	2012-08-16  Richard Guenther  <rguenther@suse.de>

	PR middle-end/54146
	* tree-ssa-loop-niter.c (find_loop_niter_by_eval): Free the
	exit vector.
	* ipa-pure-const.c (analyze_function): Use FOR_EACH_LOOP_BREAK.
	* cfgloop.h (FOR_EACH_LOOP_BREAK): Fix.
	* tree-ssa-structalias.c (handle_lhs_call): Properly free rhsc.
	* tree-ssa-loop-im.c (analyze_memory_references): Adjust.
	(tree_ssa_lim_finalize): Free all mem_refs.
	* tree-ssa-sccvn.c (extract_and_process_scc_for_name): Free
	scc when bailing out.
	* modulo-sched.c (sms_schedule): Use FOR_EACH_LOOP_BREAK.
	* ira-build.c (loop_with_complex_edge_p): Free loop exit vector.
	* graphite-sese-to-poly.c (scop_ivs_can_be_represented): Use
	FOR_EACH_LOOP_BREAK.

	2012-08-17  Richard Guenther  <rguenther@suse.de>

	* tree-sra.c (modify_function): Free redirect_callers vector.
	* ipa-split.c (split_function): Free args_to_pass vector.
	* tree-vect-stmts.c (vectorizable_operation): Do not pre-allocate
	vec_oprnds.
	(new_stmt_vec_info): Do not pre-allocate STMT_VINFO_SAME_ALIGN_REFS.
	* tree-vect-slp.c (vect_free_slp_instance): Free the instance.
	(vect_analyze_slp_instance): Free everything.
	(destroy_bb_vec_info): Free the SLP instances.

	2012-08-17  Richard Guenther  <rguenther@suse.de>
 
	* params.def (integer-share-limit): Decrease from 256 to 251,
	add rationale.

	2012-08-21  Richard Guenther  <rguenther@suse.de>
 
	* tree-ssa-loop-im.c (tree_ssa_lim_finalize): Properly free
	the affine expansion cache.

Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/cfgloop.h
    branches/gcc-4_7-branch/gcc/graphite-sese-to-poly.c
    branches/gcc-4_7-branch/gcc/ipa-pure-const.c
    branches/gcc-4_7-branch/gcc/ipa-split.c
    branches/gcc-4_7-branch/gcc/ira-build.c
    branches/gcc-4_7-branch/gcc/modulo-sched.c
    branches/gcc-4_7-branch/gcc/params.def
    branches/gcc-4_7-branch/gcc/tree-sra.c
    branches/gcc-4_7-branch/gcc/tree-ssa-loop-im.c
    branches/gcc-4_7-branch/gcc/tree-ssa-loop-niter.c
    branches/gcc-4_7-branch/gcc/tree-ssa-sccvn.c
    branches/gcc-4_7-branch/gcc/tree-ssa-structalias.c
    branches/gcc-4_7-branch/gcc/tree-vect-slp.c
    branches/gcc-4_7-branch/gcc/tree-vect-stmts.c
Comment 58 stevenb.gcc@gmail.com 2012-08-21 13:56:27 UTC
FWIW, I think all patches addressing parts of this bug are candidates
for back-porting to release branches. They are all almost trivial.
Comment 59 Steven Bosscher 2012-09-02 22:54:34 UTC
FWIW Martin: SRA blows up this test case's register pressure. Compiling with SRA enabled takes ~900s, but with -fno-tree-sra compile time almost halves. There are extremely long live ranges for SA.* variables created by SRA.
Comment 60 Steven Bosscher 2012-09-04 10:50:47 UTC
Another improvement that can be done for the test case, is to assign pseudo-register numbers such that the density of the DF_LR and DF_LIVE bitmaps increases.

For the density I used the following definition here:
  size = (# list elts) * (# bits per elt)
  cardinality = (# bits set)
  density = cardinality / size

The average density we achieve for the DF_LR_IN sets of the largest function is only 12%, and that number is misleading because there are a few elements with almost all bits set and many elements with only a few bits set. For instance there is one bitmap with the following statistics:

size = 50
cardinality = 940
density = 14.7%

but the per-element statistics show:

bits/elt	count
1		20
2		19
3		2
4		2
124		7

so excluding the 7 dense elements (124 bits set out of 128), the density is only 1.3%!

The dense bits are for a bunch of SRA variables, the sparse ones for ivtmp variables. The ivtmp variables are live in mostly the same blocks as the SRA variables (>100000 blocks where the variables appear in DF_LR_IN of the block) so the packing is extremely inefficient. This is the cause for most of the slowness of DF on the test case: the bitmap chains are large and extremely sparse.
Comment 61 Jan Hubicka 2012-09-10 16:04:38 UTC
> > I think it would be good to deprecate the flatten attribute...
> 
> It still can be useful I think, if only for creating testcases
> with arbitrary large functions ;)

I think it is generally useful and some of the HPC extensions do contain flatten
equivalent.  I do not see anything bad in compiler blowing up your machine when you
explicitely ask him to do so via an attribute ;)

Honza
Comment 62 Steven Bosscher 2012-10-15 21:32:04 UTC
Author: steven
Date: Mon Oct 15 21:31:57 2012
New Revision: 192476

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=192476
Log:
	Backport from trunk (r190222):

	PR tree-optimization/54146
	* ifcvt.c: Include pointer-set.h.
	(cond_move_process_if_block): Change type of then_regs and
	else_regs from alloca'd array to pointer_sets.
	(check_cond_move_block): Update for this change.
	(cond_move_convert_if_block): Likewise.
	* Makefile.in: Fix dependencies for ifcvt.o.


Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/Makefile.in
    branches/gcc-4_7-branch/gcc/ifcvt.c