38474 – compile time explosion in dataflow_set_preserve_mem_locs at -O3

Bug 38474 - compile time explosion in dataflow_set_preserve_mem_locs at -O3

Summary: compile time explosion in dataflow_set_preserve_mem_locs at -O3

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.4.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Richard Biener

URL:
Keywords:	compile-time-hog, memory-hog, patch

Depends on:	38582 38583 38584 38585 38586
Blocks:	47344
	Show dependency tree / graph

Reported:	2008-12-10 15:25 UTC by Joost VandeVondele
Modified:	2023-10-01 03:30 UTC (History)
CC List:	7 users (show)

See Also:	109143
Host:
Target:
Build:
Known to work:
Known to fail:	4.8.0
Last reconfirmed:	2009-12-03 21:05:09

Attachments
testcase (195.26 KB, application/postscript) 2008-12-10 15:25 UTC, Joost VandeVondele	Details
memory measurement tool (1.11 KB, application/octet-stream) 2008-12-10 22:48 UTC, Richard Biener	Details
reduced testcase (70.75 KB, application/octet-stream) 2008-12-16 14:17 UTC, Joost VandeVondele	Details
patch experiment (1.95 KB, patch) 2021-02-11 09:32 UTC, Richard Biener	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Joost VandeVondele 2008-12-10 15:25:10 UTC

With current trunk, the attached testcase (~15Klines) takes about 15min to compile (& 2.3Gb). To reproduce

gfortran -ffree-line-length-512 -g -c testcase.f90

The issue seems the clear from the timing report:

Execution times (seconds)
 garbage collection    :   1.37 ( 0%) usr   0.00 ( 0%) sys   1.37 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.14 ( 0%) usr   0.01 ( 0%) sys   0.15 ( 0%) wall   12498 kB ( 2%) ggc
 callgraph optimization: 202.27 (24%) usr   1.57 (21%) sys 204.09 (24%) wall    2304 kB ( 0%) ggc
 cfg cleanup           :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier          :   1.32 ( 0%) usr   0.01 ( 0%) sys   1.32 ( 0%) wall       0 kB ( 0%) ggc
 trivially dead code   :   0.52 ( 0%) usr   0.00 ( 0%) sys   0.52 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   0.30 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.63 ( 0%) usr   0.01 ( 0%) sys   0.64 ( 0%) wall   25889 kB ( 4%) ggc
 register information  :   0.33 ( 0%) usr   0.01 ( 0%) sys   0.35 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.29 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall    8335 kB ( 1%) ggc
 rebuild jump labels   :   0.34 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   5.26 ( 1%) usr   0.10 ( 1%) sys   5.37 ( 1%) wall   59009 kB ( 9%) ggc
 inline heuristics     : 402.30 (48%) usr   3.15 (42%) sys 406.73 (48%) wall       0 kB ( 0%) ggc
 tree gimplify         :   0.21 ( 0%) usr   0.01 ( 0%) sys   0.21 ( 0%) wall   16835 kB ( 2%) ggc
 tree eh               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     180 kB ( 0%) ggc
 tree CFG cleanup      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree find ref. vars   :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    7875 kB ( 1%) ggc
 tree SSA other        :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree operand scan     :   0.03 ( 0%) usr   0.02 ( 0%) sys   0.05 ( 0%) wall     236 kB ( 0%) ggc
 tree SSA to normal    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA verifier     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier    :   0.57 ( 0%) usr   0.06 ( 1%) sys   0.64 ( 0%) wall       0 kB ( 0%) ggc
 callgraph verifier    :   0.24 ( 0%) usr   0.00 ( 0%) sys   0.24 ( 0%) wall       0 kB ( 0%) ggc
 expand                : 207.53 (25%) usr   2.16 (29%) sys 210.04 (25%) wall  366504 kB (54%) ggc
 integrated RA         :  11.67 ( 1%) usr   0.15 ( 2%) sys  11.84 ( 1%) wall    8700 kB ( 1%) ggc
 reload                :   5.34 ( 1%) usr   0.20 ( 3%) sys   5.53 ( 1%) wall  163863 kB (24%) ggc
 thread pro- & epilogue:   0.50 ( 0%) usr   0.00 ( 0%) sys   0.49 ( 0%) wall     174 kB ( 0%) ggc
 final                 :   1.86 ( 0%) usr   0.10 ( 1%) sys   1.99 ( 0%) wall    7380 kB ( 1%) ggc
 symout                :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    4777 kB ( 1%) ggc
 TOTAL                 : 843.24             7.57           852.70             684846 kB

Comment 1 Joost VandeVondele 2008-12-10 15:25:40 UTC

Created attachment 16873 [details]
testcase

Comment 2 Richard Biener 2008-12-10 15:39:38 UTC

Confirmed.  4.3 is worse (I ran out of memory).

Probably the FE presents us with sth funny.

Comment 3 Joost VandeVondele 2008-12-10 16:13:07 UTC

(In reply to comment #2)
> Confirmed.  4.3 is worse (I ran out of memory).
> 
> Probably the FE presents us with sth funny.
> 
actually, I just got a timing report from 4.3 [4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036]] (on a different machine, but with roughly the same clock speed, and plenty of RAM):

Execution times (seconds)
 garbage collection    :   1.05 ( 0%) usr   0.00 ( 0%) sys   1.05 ( 0%) wall       0 kB ( 0%) ggc
 cfg cleanup           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 trivially dead code   :   0.70 ( 0%) usr   0.00 ( 0%) sys   0.70 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   0.96 ( 0%) usr   0.00 ( 0%) sys   0.96 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   1.11 ( 0%) usr   0.00 ( 0%) sys   1.12 ( 0%) wall   25889 kB ( 4%) ggc
 register information  :   0.90 ( 0%) usr   0.00 ( 0%) sys   0.89 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.82 ( 0%) usr   0.00 ( 0%) sys   0.83 ( 0%) wall    8335 kB ( 1%) ggc
 rebuild jump labels   :   1.10 ( 0%) usr   0.00 ( 0%) sys   1.10 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   3.02 ( 0%) usr   0.12 ( 1%) sys   3.96 ( 0%) wall   75960 kB (12%) ggc
 inline heuristics     :1862.97 (65%) usr   4.94 (59%) sys2078.10 (65%) wall       1 kB ( 0%) ggc
 tree gimplify         :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.47 ( 0%) wall    3446 kB ( 1%) ggc
 tree eh               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    7151 kB ( 1%) ggc
 expand                : 967.65 (34%) usr   2.96 (35%) sys1102.03 (34%) wall  357297 kB (54%) ggc
 local alloc           :   5.22 ( 0%) usr   0.08 ( 1%) sys   5.29 ( 0%) wall    8652 kB ( 1%) ggc
 global alloc          :  12.28 ( 0%) usr   0.27 ( 3%) sys  12.59 ( 0%) wall  163884 kB (25%) ggc
 thread pro- & epilogue:   0.74 ( 0%) usr   0.00 ( 0%) sys   0.75 ( 0%) wall     172 kB ( 0%) ggc
 final                 :   2.91 ( 0%) usr   0.05 ( 1%) sys   2.92 ( 0%) wall     541 kB ( 0%) ggc
 symout                :   0.06 ( 0%) usr   0.01 ( 0%) sys   0.07 ( 0%) wall    5690 kB ( 1%) ggc
 TOTAL                 :2862.03             8.43          3212.90             657217 kB

Comment 4 Richard Biener 2008-12-10 16:57:05 UTC

Could you capture the memory requirements on the 4.3 branch?

Comment 5 Joost VandeVondele 2008-12-10 22:34:27 UTC

(In reply to comment #4)
> Could you capture the memory requirements on the 4.3 branch?

I watched top (for 4.3.1), but can't recall anything more than 3Gb. It's a bit boring to watch top for 45min.... any better approach?

Comment 6 Richard Biener 2008-12-10 22:48:18 UTC

Created attachment 16881 [details]
memory measurement tool

Of course!  Try the attached with just

~/bin/maxmem2.sh gfortan ...

Comment 7 Joost VandeVondele 2008-12-10 22:57:13 UTC

(In reply to comment #6)
> Created an attachment (id=16881) [edit]
> memory measurement tool
> 
> Of course!  Try the attached with just
> 
> ~/bin/maxmem2.sh gfortan ...
> 
ugh how intuitive... but very useful. Will try to run it tomorrow.

Comment 8 Joost VandeVondele 2008-12-11 08:27:21 UTC

(In reply to comment #6)
> Created an attachment (id=16881) [edit]
> memory measurement tool
> 
> Of course!  Try the attached with just
> 
> ~/bin/maxmem2.sh gfortan ...
> 
Hmmm. So this is what is returned:

4.3.3 is GNU Fortran (GCC) 4.3.3 20080912 (prerelease)
trunk is NU Fortran (GCC) 4.4.0 20081206 (experimental) [trunk revision 142525]

4.3.3: 899675 kB (and about 33min)
trunk: 1145308 kB (and about 45min).

this is on the same machine, so times can be compared (module enable checking?).

However, for the memory usage, top (oh no) showed 2.3-2.5Gb, which is quite different from what the script returns?

Comment 9 Richard Biener 2008-12-11 11:33:47 UTC

The script only has received testing on linux systems, so if you are running
somewhere else it is likely that either the regexps do not match or that
you require different/additional syscalls to be traced.  It's not perfect, but
it works reliably for me.

Comment 10 Joost VandeVondele 2008-12-11 12:02:41 UTC

(In reply to comment #9)
> The script only has received testing on linux systems, so if you are running
> somewhere else it is likely that either the regexps do not match or that
> you require different/additional syscalls to be traced.  It's not perfect, but
> it works reliably for me.
> 
no, it is a linux box (actually SUSE as far as I can tell,
> uname -a
 Linux pcihopt3 2.6.16.21-0.8-smp #1 SMP Mon Jul 3 18:25:39 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux
> cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
).

Comment 11 Joost VandeVondele 2008-12-15 19:38:34 UTC

as this file is included in a project compiled normally with '-O3 -march=native -funroll-loops' the timing in that case is also important. As I'm finding out, this becomes unworkable (>6h, and still compiling).

Looking at -fdump-tree-original, the overloaded operators (+,-,..) expand to function call, leading to a subroutine which contains 73000 function calls. So, likely that some stuff is scaling at least quadratically wrt to this parameter.

Comment 12 Steven Bosscher 2008-12-15 21:17:37 UTC

One of the bottlenecks seems to be find_temp_slot_from_address.

Comment 13 Steven Bosscher 2008-12-15 21:27:40 UTC

OK, to elaborate: I'm playing with this test case on ia64-linux, and I reduced the test case by some 8000 lines to make it compilable at all.  With this 8000 lines less, it actually spends more time for me in "expand", in the function "find_temp_slot_from_address (rtx x)".  It spends all of its time...

  for (i = max_slot_level (); i >= 0; i--)
    for (p = *temp_slots_at_level (i); p; p = p->next)
      {
	if (XEXP (p->slot, 0) == x
	    || p->address == x
	    || (GET_CODE (x) == PLUS
		&& XEXP (x, 0) == virtual_stack_vars_rtx
		&& GET_CODE (XEXP (x, 1)) == CONST_INT
		&& INTVAL (XEXP (x, 1)) >= p->base_offset
		&& INTVAL (XEXP (x, 1)) < p->base_offset + p->full_size))
	  return p;

	else if (p->address != 0 && GET_CODE (p->address) == EXPR_LIST)
	  for (next = p->address; next; next = XEXP (next, 1))
	    if (XEXP (next, 0) == x)  /* ...here in  this loop... */
	      return p;

in the "for (next = p->address; ...)" loop. This list in p->address is actually several thousand items long and it is traversed many times:

traversals ~ max_slot_level()*temp_slots_at_level(i)*list length of p->address

which is, at best, cubic behavior.

Comment 14 Steven Bosscher 2008-12-15 21:53:33 UTC

For the inline heuristics, almost all time is also spent in stack slot related stuff. The culprit is estimate_stack_frame_size (or actually, add_alias_set__conflicts) in cfgexpand.c.
(What are we doing in cfgexpand anyway, for inlining?!?)

Comment 15 Steven Bosscher 2008-12-15 21:55:12 UTC

From cfgexpand.c:

static void
add_alias_set_conflicts (void)
{
  size_t i, j, n = stack_vars_num;

  for (i = 0; i < n; ++i)
    {
      tree type_i = TREE_TYPE (stack_vars[i].decl);
      bool aggr_i = AGGREGATE_TYPE_P (type_i);
      bool contains_union;

      contains_union = aggregate_contains_union_type (type_i);
      for (j = 0; j < i; ++j)
	{

Classic example of quadratic algorithm...

Comment 16 Steven Bosscher 2008-12-15 21:56:10 UTC

Oh, and FWIW, for yukawa_gn_full, stack_vars_num == 67551 for me.

Comment 17 Joost VandeVondele 2008-12-16 07:51:02 UTC

(In reply to comment #16)
> Oh, and FWIW, for yukawa_gn_full, stack_vars_num == 67551 for me.

Thanks for the analysis. Detailed enough to have me peak in the gcc code for once.  

This would mean that the array stack_vars_conflict takes about 2.2Gb, since it is  O(stack_vars_num**2/2) (assuming a bool is 8bits, quite consistent with what we see). 

There is already a function (defer_stack_allocation) that decides to give up due to 'the quadratic problem'. Maybe gcc should use some drastic short-cut, including not allocating the stack_vars_conflict array, as soon as ~10000 stack variables are detected ?

BTW, the -O3 compilation is still running (for 17h now).

Comment 18 Joost VandeVondele 2008-12-16 11:58:32 UTC

(In reply to comment #17)
> BTW, the -O3 compilation is still running (for 17h now).
finished successfully after 23h...

Comment 19 Steven Bosscher 2008-12-16 12:45:07 UTC

Re. comment #18, I'd say "brilliant" if it wasn't such a poor performance :-)

Did you manage to get a time report out of that run?

Comment 20 Joost VandeVondele 2008-12-16 12:47:27 UTC

(In reply to comment #19)
> Re. comment #18, I'd say "brilliant" if it wasn't such a poor performance :-)

I agree... quite an achievement not to crash in such a case.

> Did you manage to get a time report out of that run?

no... obviously I can rerun this (numbers tomorrow, of course)?

Comment 21 Joost VandeVondele 2008-12-16 12:48:48 UTC

(In reply to comment #16)
> Oh, and FWIW, for yukawa_gn_full, stack_vars_num == 67551 for me.

btw, that routine only has 3800 user variables, the rests are FE generated temporaries (which should have a limited lifetime).

Comment 22 Steven Bosscher 2008-12-16 13:41:36 UTC

We may be better off with a slightly reduced test case for the -O3 report.  It's not difficult to cut out ~8000 lines (like I did yesterday) and still have a huge test case (and the horendous compile times to go with that).

Comment 23 Joost VandeVondele 2008-12-16 14:17:47 UTC

Created attachment 16913 [details]
reduced testcase

just so we talk about the same file, I've reduced the testcase to more managable sizes. This one compiles in about 1min at -O0. I'll attach time-report output in a  sec.

Comment 24 Joost VandeVondele 2008-12-16 14:20:05 UTC

(In reply to comment #23)
reduced testcase timings at -O0 and -O3. Tree operand scan anybody?

> time gfortran -O0 -ffree-line-length-512 -c -ftime-report testcase_reduced.f90

Execution times (seconds)
 garbage collection    :   0.51 ( 1%) usr   0.00 ( 0%) sys   0.49 ( 1%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.05 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall    4956 kB ( 2%) ggc
 callgraph optimization:   8.13 (18%) usr   0.20 (16%) sys   8.36 (18%) wall    1280 kB ( 1%) ggc
 cfg cleanup           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier          :   0.48 ( 1%) usr   0.02 ( 2%) sys   0.46 ( 1%) wall       0 kB ( 0%) ggc
 trivially dead code   :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.24 ( 1%) usr   0.00 ( 0%) sys   0.23 ( 0%) wall    9445 kB ( 4%) ggc
 register information  :   0.11 ( 0%) usr   0.01 ( 1%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.10 ( 0%) usr   0.01 ( 1%) sys   0.10 ( 0%) wall    4239 kB ( 2%) ggc
 rebuild jump labels   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   1.07 ( 2%) usr   0.05 ( 4%) sys   1.12 ( 2%) wall   22673 kB ( 9%) ggc
 inline heuristics     :  16.30 (36%) usr   0.41 (33%) sys  16.75 (36%) wall       0 kB ( 0%) ggc
 tree gimplify         :   0.06 ( 0%) usr   0.01 ( 1%) sys   0.08 ( 0%) wall    6435 kB ( 3%) ggc
 tree CFG construction :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     180 kB ( 0%) ggc
 tree find ref. vars   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    3231 kB ( 1%) ggc
 tree SSA rewrite      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      63 kB ( 0%) ggc
 tree SSA other        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree operand scan     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     236 kB ( 0%) ggc
 tree SSA to normal    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA verifier     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier    :   0.20 ( 0%) usr   0.01 ( 1%) sys   0.22 ( 0%) wall       0 kB ( 0%) ggc
 callgraph verifier    :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 expand                :  10.86 (24%) usr   0.38 (30%) sys  11.26 (24%) wall  132856 kB (52%) ggc
 integrated RA         :   4.08 ( 9%) usr   0.05 ( 4%) sys   4.13 ( 9%) wall    4604 kB ( 2%) ggc
 reload                :   1.88 ( 4%) usr   0.07 ( 6%) sys   1.97 ( 4%) wall   59269 kB (23%) ggc
 thread pro- & epilogue:   0.17 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall     175 kB ( 0%) ggc
 final                 :   0.63 ( 1%) usr   0.03 ( 2%) sys   0.66 ( 1%) wall    3790 kB ( 1%) ggc
 TOTAL                 :  45.42             1.25            46.73             253684 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

real    0m47.298s
user    0m45.923s
sys     0m1.316s

> time gfortran -march=native -O3 -ffree-line-length-512 -c -ftime-report testcase_reduced.f90

Execution times (seconds)
 garbage collection    :   1.48 ( 1%) usr   0.01 ( 0%) sys   1.50 ( 1%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.03 ( 0%) usr   0.01 ( 0%) sys   0.05 ( 0%) wall    4955 kB ( 1%) ggc
 callgraph optimization:   6.27 ( 3%) usr   0.15 ( 7%) sys   6.46 ( 4%) wall    2366 kB ( 0%) ggc
 ipa cp                :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.06 ( 0%) wall      34 kB ( 0%) ggc
 cfg cleanup           :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier          :   1.41 ( 1%) usr   0.00 ( 0%) sys   1.33 ( 1%) wall       0 kB ( 0%) ggc
 trivially dead code   :   0.62 ( 0%) usr   0.00 ( 0%) sys   0.66 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   0.69 ( 0%) usr   0.01 ( 0%) sys   0.67 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   1.86 ( 1%) usr   0.00 ( 0%) sys   1.86 ( 1%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.93 ( 1%) usr   0.00 ( 0%) sys   0.94 ( 1%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   1.33 ( 1%) usr   0.04 ( 2%) sys   1.38 ( 1%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.92 ( 1%) usr   0.00 ( 0%) sys   0.96 ( 1%) wall   13469 kB ( 3%) ggc
 register information  :   0.44 ( 0%) usr   0.00 ( 0%) sys   0.43 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   1.05 ( 1%) usr   0.00 ( 0%) sys   1.05 ( 1%) wall   24068 kB ( 5%) ggc
 register scan         :   0.20 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall      18 kB ( 0%) ggc
 rebuild jump labels   :   0.31 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   1.16 ( 1%) usr   0.03 ( 1%) sys   1.21 ( 1%) wall   22673 kB ( 5%) ggc
 inline heuristics     :  15.83 ( 9%) usr   0.40 (20%) sys  16.25 ( 9%) wall     138 kB ( 0%) ggc
 integration           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     885 kB ( 0%) ggc
 tree gimplify         :   0.06 ( 0%) usr   0.01 ( 0%) sys   0.07 ( 0%) wall    6434 kB ( 1%) ggc
 tree CFG construction :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     179 kB ( 0%) ggc
 tree CFG cleanup      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       8 kB ( 0%) ggc
 tree VRP              :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     448 kB ( 0%) ggc
 tree copy propagation :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     159 kB ( 0%) ggc
 tree find ref. vars   :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall    3229 kB ( 1%) ggc
 tree PTA              :   1.29 ( 1%) usr   0.03 ( 1%) sys   1.34 ( 1%) wall     540 kB ( 0%) ggc
 tree alias analysis   :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall      57 kB ( 0%) ggc
 tree call clobbering  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      19 kB ( 0%) ggc
 tree flow sensitive alias:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      60 kB ( 0%) ggc
 tree flow insensitive alias:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree memory partitioning:   0.09 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    8391 kB ( 2%) ggc
 tree SSA other        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      21 kB ( 0%) ggc
 tree operand scan     :  98.14 (55%) usr   0.03 ( 1%) sys  98.31 (54%) wall    4048 kB ( 1%) ggc
 dominator optimization:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      73 kB ( 0%) ggc
 tree CCP              :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     119 kB ( 0%) ggc
 tree PRE              :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      62 kB ( 0%) ggc
 tree FRE              :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      33 kB ( 0%) ggc
 tree forward propagate:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       3 kB ( 0%) ggc
 tree conservative DCE :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.02 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall      12 kB ( 0%) ggc
 tree loop init        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      44 kB ( 0%) ggc
 tree SSA to normal    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       4 kB ( 0%) ggc
 tree rename SSA copies:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA verifier     :   0.95 ( 1%) usr   0.00 ( 0%) sys   0.97 ( 1%) wall       0 kB ( 0%) ggc
 tree STMT verifier    :   2.05 ( 1%) usr   0.04 ( 2%) sys   2.11 ( 1%) wall       0 kB ( 0%) ggc
 callgraph verifier    :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 expand                :  12.80 ( 7%) usr   0.33 (16%) sys  13.09 ( 7%) wall  131225 kB (27%) ggc
 lower subreg          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 forward prop          :   0.91 ( 1%) usr   0.01 ( 0%) sys   0.91 ( 1%) wall    9021 kB ( 2%) ggc
 CSE                   :   2.70 ( 2%) usr   0.01 ( 0%) sys   2.70 ( 1%) wall    8941 kB ( 2%) ggc
 dead code elimination :   0.37 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   0.59 ( 0%) usr   0.02 ( 1%) sys   0.62 ( 0%) wall   13140 kB ( 3%) ggc
 dead store elim2      :   0.77 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall   13219 kB ( 3%) ggc
 CSE 2                 :   2.04 ( 1%) usr   0.00 ( 0%) sys   2.04 ( 1%) wall    3477 kB ( 1%) ggc
 combiner              :   0.77 ( 0%) usr   0.00 ( 0%) sys   0.79 ( 0%) wall    6633 kB ( 1%) ggc
 if-conversion         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      50 kB ( 0%) ggc
 regmove               :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall      28 kB ( 0%) ggc
 integrated RA         :   9.17 ( 5%) usr   0.71 (35%) sys   9.92 ( 5%) wall   25558 kB ( 5%) ggc
 reload                :   3.36 ( 2%) usr   0.10 ( 5%) sys   3.47 ( 2%) wall  101799 kB (21%) ggc
 reload CSE regs       :   1.76 ( 1%) usr   0.00 ( 0%) sys   1.75 ( 1%) wall   27970 kB ( 6%) ggc
 load CSE after reload :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue:   0.18 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall     231 kB ( 0%) ggc
 peephole 2            :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall      29 kB ( 0%) ggc
 rename registers      :   1.03 ( 1%) usr   0.00 ( 0%) sys   1.04 ( 1%) wall       0 kB ( 0%) ggc
 scheduling 2          :   3.82 ( 2%) usr   0.03 ( 1%) sys   3.82 ( 2%) wall   53812 kB (11%) ggc
 machine dep reorg     :   0.41 ( 0%) usr   0.00 ( 0%) sys   0.41 ( 0%) wall       0 kB ( 0%) ggc
 reorder blocks        :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       3 kB ( 0%) ggc
 final                 :   0.64 ( 0%) usr   0.02 ( 1%) sys   0.65 ( 0%) wall    3824 kB ( 1%) ggc
 TOTAL                 : 179.47             2.03           181.70             492168 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

real    3m2.238s
user    2m59.927s
sys     0m2.128s

Comment 25 Joost VandeVondele 2008-12-16 16:17:44 UTC

doing some more profiling, the -O0 problem is to a large extend due to compute_inline_parameters and estimate_stack_frame_size. Spending 10-30min just on estimating the stack_frame_size on something that can't be reasonably inlined anyways seems a waste.

Comment 26 Steven Bosscher 2008-12-16 16:26:45 UTC

I am going to work on the -O0 problems a bit.

The operand scanner is the problem at -O3. Richi, this is one you may want to try on the alias improvements branch, if most of the time is spent on virtual SSA names (I haven't checked, but it's likely with so many aggregate-typed variables).

Comment 27 Joost VandeVondele 2008-12-16 16:29:47 UTC

the slow routines at -O3 are related to compute_may_aliases, at the point I interupted the profiling, this routine had called add_virtual_operand 200M times.

Comment 28 Andrew Pinski 2008-12-16 19:32:39 UTC

The stack heuristic is new for 4.3.

Comment 29 Joost VandeVondele 2008-12-17 06:50:38 UTC

doing the original testcase again at -O3 has been a useful exercise i think. 13.5h is spent in rename_registers, 2h in tree operand scan, ~1h in inline heuristics, and 20min in expand. (Note that this is a 4.3 based compiler, maybe I should redo this with a 4.4 or has nothing changed there despite ira?).

gfortran-4.3 -ffree-line-length-512 -g -fopenmp -ffree-form -D__T_C_G0 -ftime-report -c -O3 -march=native -funroll-loops mpfr_yukawa.f

Execution times (seconds)
 garbage collection    :   3.96 ( 0%) usr   0.00 ( 0%) sys   3.95 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.42 ( 0%) usr   0.01 ( 0%) sys   0.44 ( 0%) wall   10751 kB ( 1%) ggc
 callgraph optimization:   0.78 ( 0%) usr   0.02 ( 0%) sys   0.82 ( 0%) wall   14320 kB ( 1%) ggc
 ipa reference         :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 cfg cleanup           :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 trivially dead code   :   5.36 ( 0%) usr   0.00 ( 0%) sys   5.38 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   6.56 ( 0%) usr   0.07 ( 1%) sys   6.60 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :  16.93 ( 0%) usr   0.00 ( 0%) sys  16.96 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   7.39 ( 0%) usr   0.00 ( 0%) sys   7.38 ( 0%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:  14.49 ( 0%) usr   0.02 ( 0%) sys  14.52 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:  10.94 ( 0%) usr   0.00 ( 0%) sys  10.96 ( 0%) wall   37217 kB ( 3%) ggc
 register information  :   4.97 ( 0%) usr   0.00 ( 0%) sys   4.97 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   8.84 ( 0%) usr   0.01 ( 0%) sys   8.86 ( 0%) wall   49164 kB ( 3%) ggc
 register scan         :   1.78 ( 0%) usr   0.00 ( 0%) sys   1.80 ( 0%) wall       0 kB ( 0%) ggc
 rebuild jump labels   :   3.93 ( 0%) usr   0.00 ( 0%) sys   3.93 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   3.24 ( 0%) usr   0.10 ( 1%) sys   3.35 ( 0%) wall   60522 kB ( 4%) ggc
 inline heuristics     :2516.42 ( 4%) usr   4.20 (35%) sys2542.00 ( 4%) wall       0 kB ( 0%) ggc
 tree gimplify         :   0.55 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall    3453 kB ( 0%) ggc
 tree eh               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.02 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall    7185 kB ( 1%) ggc
 tree CFG cleanup      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       1 kB ( 0%) ggc
 tree VRP              :   0.25 ( 0%) usr   0.00 ( 0%) sys   0.23 ( 0%) wall      36 kB ( 0%) ggc
 tree copy propagation :   0.63 ( 0%) usr   0.00 ( 0%) sys   0.60 ( 0%) wall       6 kB ( 0%) ggc
 tree find ref. vars   :   0.12 ( 0%) usr   0.01 ( 0%) sys   0.12 ( 0%) wall    8202 kB ( 1%) ggc
 tree PTA              :   3.61 ( 0%) usr   0.07 ( 1%) sys   3.64 ( 0%) wall      96 kB ( 0%) ggc
 tree alias analysis   :   6.71 ( 0%) usr   0.11 ( 1%) sys   6.74 ( 0%) wall       3 kB ( 0%) ggc
 tree call clobbering  :   0.39 ( 0%) usr   0.00 ( 0%) sys   0.37 ( 0%) wall       0 kB ( 0%) ggc
 tree flow sensitive alias:   0.11 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       9 kB ( 0%) ggc
 tree flow insensitive alias:   3.54 ( 0%) usr   0.00 ( 0%) sys   3.53 ( 0%) wall       0 kB ( 0%) ggc
 tree memory partitioning:  18.88 ( 0%) usr   0.03 ( 0%) sys  18.93 ( 0%) wall       0 kB ( 0%) ggc
 tree PHI insertion    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       1 kB ( 0%) ggc
 tree SSA rewrite      :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall   25972 kB ( 2%) ggc
 tree SSA other        :   0.02 ( 0%) usr   0.03 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   0.82 ( 0%) usr   0.00 ( 0%) sys   0.82 ( 0%) wall       9 kB ( 0%) ggc
 tree operand scan     :6681.62 (11%) usr   0.55 ( 5%) sys6698.35 (11%) wall   19727 kB ( 1%) ggc
 dominator optimization:   0.26 ( 0%) usr   0.00 ( 0%) sys   0.25 ( 0%) wall      11 kB ( 0%) ggc
 tree SRA              :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 tree STORE-CCP        :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall       1 kB ( 0%) ggc
 tree CCP              :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall       3 kB ( 0%) ggc
 tree reassociation    :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       2 kB ( 0%) ggc
 tree PRE              :   0.37 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall    2289 kB ( 0%) ggc
 tree FRE              :   0.38 ( 0%) usr   0.01 ( 0%) sys   0.38 ( 0%) wall    2297 kB ( 0%) ggc
 tree linearize phis   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree forward propagate:   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree conservative DCE :   0.53 ( 0%) usr   0.00 ( 0%) sys   0.56 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA to normal    :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       3 kB ( 0%) ggc
 tree rename SSA copies:   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 expand                :1279.48 ( 2%) usr   2.74 (23%) sys1285.18 ( 2%) wall  440026 kB (31%) ggc
 lower subreg          :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 forward prop          :   1.74 ( 0%) usr   0.03 ( 0%) sys   1.73 ( 0%) wall    8458 kB ( 1%) ggc
 CSE                   :   9.02 ( 0%) usr   0.04 ( 0%) sys   9.04 ( 0%) wall   27164 kB ( 2%) ggc
 dead code elimination :   3.63 ( 0%) usr   0.00 ( 0%) sys   3.62 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   5.64 ( 0%) usr   0.05 ( 0%) sys   5.69 ( 0%) wall   37803 kB ( 3%) ggc
 dead store elim2      :   7.64 ( 0%) usr   0.01 ( 0%) sys   7.65 ( 0%) wall   38112 kB ( 3%) ggc
 web                   :   1.38 ( 0%) usr   0.02 ( 0%) sys   1.41 ( 0%) wall       0 kB ( 0%) ggc
 CSE 2                 :   4.26 ( 0%) usr   0.02 ( 0%) sys   4.29 ( 0%) wall   10087 kB ( 1%) ggc
 branch prediction     :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       3 kB ( 0%) ggc
 combiner              :   5.02 ( 0%) usr   0.03 ( 0%) sys   5.06 ( 0%) wall   17823 kB ( 1%) ggc
 regmove               :   4.83 ( 0%) usr   0.00 ( 0%) sys   4.83 ( 0%) wall       1 kB ( 0%) ggc
 local alloc           :  11.33 ( 0%) usr   0.08 ( 1%) sys  11.36 ( 0%) wall   64779 kB ( 5%) ggc
 global alloc          :  21.86 ( 0%) usr   0.69 ( 6%) sys  22.55 ( 0%) wall  275939 kB (20%) ggc
 reload CSE regs       :  10.28 ( 0%) usr   0.00 ( 0%) sys  10.29 ( 0%) wall   79246 kB ( 6%) ggc
 load CSE after reload :   0.84 ( 0%) usr   0.00 ( 0%) sys   0.85 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue:   0.89 ( 0%) usr   0.00 ( 0%) sys   0.89 ( 0%) wall      13 kB ( 0%) ggc
 peephole 2            :   1.46 ( 0%) usr   0.00 ( 0%) sys   1.46 ( 0%) wall       0 kB ( 0%) ggc
 rename registers      :49100.30 (82%) usr   2.30 (19%) sys49376.54 (82%) wall   16842 kB ( 1%) ggc
 scheduling 2          :  21.80 ( 0%) usr   0.60 ( 5%) sys  22.39 ( 0%) wall  150933 kB (11%) ggc
 machine dep reorg     :   1.99 ( 0%) usr   0.00 ( 0%) sys   2.00 ( 0%) wall       0 kB ( 0%) ggc
 reorder blocks        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       1 kB ( 0%) ggc
 final                 :   3.83 ( 0%) usr   0.04 ( 0%) sys   3.88 ( 0%) wall     738 kB ( 0%) ggc
 symout                :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall    5418 kB ( 0%) ggc
 variable tracking     :   1.86 ( 0%) usr   0.02 ( 0%) sys   1.87 ( 0%) wall      17 kB ( 0%) ggc
 TOTAL                 :59825.47            11.92          60151.75            1415015 kB

Comment 30 Steven Bosscher 2008-12-17 07:01:38 UTC

I think redoing this with 4.4.0 would be useful, to check if new code (like IRA) uses this kind of non-linear algorithms.  But the register renaming patch hasn't changed between 4.3 and 4.4, so I would compile with -fno-rename-registers ;-)

Comment 31 Joost VandeVondele 2008-12-17 08:36:16 UTC

(In reply to comment #30)
> I think redoing this with 4.4.0 would be useful, to check if new code (like
> IRA) uses this kind of non-linear algorithms.  But the register renaming patch
> hasn't changed between 4.3 and 4.4, so I would compile with
> -fno-rename-registers ;-)
> 

thanks for the '-fno-rename-registers' trick, something like that is not obvious to beginners. I tried with 4.4, but virtual memory usage went to 9Gb, so I'll have to run it on a larger machine. I can't remember seeing this with 4.3, but I'll test.

Comment 32 Joost VandeVondele 2008-12-17 12:58:23 UTC

The 9.3Gb for 4.4 is confirmed. I attached gdb to the process at that point (after about 70min of compilation), and that is the backtrace:

#0  0x0000000000b48a9a in bucket_allocno_compare_func (v1p=0x7fffe3592d98, v2p=0x7fffe3592d90)
    at /data04/vondele/gcc_trunk/gcc/gcc/ira-color.c:746
#1  0x0000000000b4abbd in push_allocno_to_stack (allocno=<value optimized out>)
    at /data04/vondele/gcc_trunk/gcc/gcc/ira-color.c:803
#2  0x0000000000b4e4d2 in color_allocnos () at /data04/vondele/gcc_trunk/gcc/gcc/ira-color.c:989
#3  0x0000000000b4f614 in color_pass (loop_tree_node=<value optimized out>)
    at /data04/vondele/gcc_trunk/gcc/gcc/ira-color.c:1936
#4  0x0000000000b3ef2a in ira_traverse_loop_tree (bb_p=0 '\0', loop_node=0x7fffe3592d90,
    preorder_func=0x337d54f8, postorder_func=0) at /data04/vondele/gcc_trunk/gcc/gcc/ira-build.c:1381
#5  0x0000000000b4a320 in ira_color () at /data04/vondele/gcc_trunk/gcc/gcc/ira-color.c:2080
#6  0x0000000000b3d7eb in rest_of_handle_ira () at /data04/vondele/gcc_trunk/gcc/gcc/ira.c:1926
#7  0x000000000069e48d in execute_one_pass (pass=0x10980e0) at /data04/vondele/gcc_trunk/gcc/gcc/passes.c:1279
#8  0x000000000069e6d5 in execute_pass_list (pass=0x10980e0) at /data04/vondele/gcc_trunk/gcc/gcc/passes.c:1328
#9  0x000000000069e6ed in execute_pass_list (pass=0x1093060) at /data04/vondele/gcc_trunk/gcc/gcc/passes.c:1329
#10 0x0000000000794ddc in tree_rest_of_compilation (fndecl=0x7f45da99eb00)
    at /data04/vondele/gcc_trunk/gcc/gcc/tree-optimize.c:419

apart from this, 4.4 is actually a bit faster. This is the time report:

gfortran -ffree-line-length-512 -g -ffree-form -ftime-report -c -O3 -march=native -funroll-loops -fno-rename-registers testcase.f90

Execution times (seconds)
 garbage collection    :   7.00 ( 0%) usr   0.02 ( 0%) sys   7.05 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.25 ( 0%) usr   0.01 ( 0%) sys   0.25 ( 0%) wall   12496 kB ( 1%) ggc
 callgraph optimization: 387.87 ( 8%) usr   1.54 (11%) sys 389.44 ( 8%) wall    4414 kB ( 0%) ggc
 ipa cp                :   0.30 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall      34 kB ( 0%) ggc
 ipa reference         :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 cfg cleanup           :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       1 kB ( 0%) ggc
 CFG verifier          :  10.67 ( 0%) usr   0.01 ( 0%) sys  10.74 ( 0%) wall       0 kB ( 0%) ggc
 trivially dead code   :   4.31 ( 0%) usr   0.00 ( 0%) sys   4.30 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   4.51 ( 0%) usr   0.02 ( 0%) sys   4.54 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :  14.18 ( 0%) usr   0.00 ( 0%) sys  14.22 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   6.79 ( 0%) usr   0.00 ( 0%) sys   6.82 ( 0%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:  20.22 ( 0%) usr   0.05 ( 0%) sys  20.27 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   6.55 ( 0%) usr   0.01 ( 0%) sys   6.53 ( 0%) wall   36992 kB ( 3%) ggc
 register information  :   2.68 ( 0%) usr   0.00 ( 0%) sys   2.71 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   6.93 ( 0%) usr   0.00 ( 0%) sys   6.92 ( 0%) wall   46600 kB ( 4%) ggc
 register scan         :   1.32 ( 0%) usr   0.00 ( 0%) sys   1.33 ( 0%) wall      18 kB ( 0%) ggc
 rebuild jump labels   :   2.14 ( 0%) usr   0.00 ( 0%) sys   2.15 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   6.23 ( 0%) usr   0.09 ( 1%) sys   6.30 ( 0%) wall   59009 kB ( 4%) ggc
 inline heuristics     :1328.68 (28%) usr   3.27 (24%) sys1331.94 (28%) wall     138 kB ( 0%) ggc
 tree gimplify         :   0.36 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall   16833 kB ( 1%) ggc
 tree eh               :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     179 kB ( 0%) ggc
 tree CFG cleanup      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       8 kB ( 0%) ggc
 tree VRP              :   0.21 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall     448 kB ( 0%) ggc
 tree copy propagation :   0.41 ( 0%) usr   0.00 ( 0%) sys   0.38 ( 0%) wall     159 kB ( 0%) ggc
 tree find ref. vars   :   0.07 ( 0%) usr   0.02 ( 0%) sys   0.08 ( 0%) wall    7873 kB ( 1%) ggc
 tree PTA              :  19.23 ( 0%) usr   0.06 ( 0%) sys  19.31 ( 0%) wall     540 kB ( 0%) ggc
 tree alias analysis   :   0.47 ( 0%) usr   0.04 ( 0%) sys   0.53 ( 0%) wall      75 kB ( 0%) ggc
 tree call clobbering  :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall      36 kB ( 0%) ggc
 tree flow sensitive alias:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      61 kB ( 0%) ggc
 tree flow insensitive alias:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree memory partitioning:   0.61 ( 0%) usr   0.00 ( 0%) sys   0.60 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite      :   0.09 ( 0%) usr   0.01 ( 0%) sys   0.10 ( 0%) wall   20578 kB ( 2%) ggc
 tree SSA other        :   0.06 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   0.21 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall      21 kB ( 0%) ggc
 tree operand scan     :2339.81 (49%) usr   0.26 ( 2%) sys2340.13 (49%) wall    9840 kB ( 1%) ggc
 dominator optimization:   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall      73 kB ( 0%) ggc
 tree SRA              :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      78 kB ( 0%) ggc
 tree CCP              :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall     119 kB ( 0%) ggc
 tree reassociation    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      50 kB ( 0%) ggc
 tree PRE              :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall      62 kB ( 0%) ggc
 tree FRE              :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall      33 kB ( 0%) ggc
 tree linearize phis   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      17 kB ( 0%) ggc
 tree forward propagate:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       3 kB ( 0%) ggc
 tree conservative DCE :   0.30 ( 0%) usr   0.00 ( 0%) sys   0.31 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall      12 kB ( 0%) ggc
 tree buildin call DCE :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      10 kB ( 0%) ggc
 complete unrolling    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      46 kB ( 0%) ggc
 tree loop init        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      44 kB ( 0%) ggc
 tree SSA uncprop      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA to normal    :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       4 kB ( 0%) ggc
 tree rename SSA copies:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA verifier     :   7.53 ( 0%) usr   0.01 ( 0%) sys   7.55 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier    :  11.14 ( 0%) usr   0.01 ( 0%) sys  11.19 ( 0%) wall       0 kB ( 0%) ggc
 callgraph verifier    :   1.01 ( 0%) usr   0.00 ( 0%) sys   1.02 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 expand                : 421.25 ( 9%) usr   2.03 (15%) sys 423.13 ( 9%) wall  365559 kB (28%) ggc
 lower subreg          :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 forward prop          :   5.34 ( 0%) usr   0.03 ( 0%) sys   5.36 ( 0%) wall   24812 kB ( 2%) ggc
 CSE                   :  10.24 ( 0%) usr   0.07 ( 1%) sys  10.30 ( 0%) wall   24343 kB ( 2%) ggc
 dead code elimination :   2.74 ( 0%) usr   0.00 ( 0%) sys   2.75 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   2.46 ( 0%) usr   0.05 ( 0%) sys   2.52 ( 0%) wall   36487 kB ( 3%) ggc
 dead store elim2      :   3.81 ( 0%) usr   0.00 ( 0%) sys   3.82 ( 0%) wall   36569 kB ( 3%) ggc
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     109 kB ( 0%) ggc
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 web                   :   4.48 ( 0%) usr   0.01 ( 0%) sys   4.49 ( 0%) wall       7 kB ( 0%) ggc
 CSE 2                 :   4.96 ( 0%) usr   0.02 ( 0%) sys   4.99 ( 0%) wall    9448 kB ( 1%) ggc
 branch prediction     :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      46 kB ( 0%) ggc
 combiner              :   3.41 ( 0%) usr   0.00 ( 0%) sys   3.41 ( 0%) wall   18359 kB ( 1%) ggc
 regmove               :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.26 ( 0%) wall      28 kB ( 0%) ggc
 integrated RA         :  71.46 ( 1%) usr   5.20 (38%) sys  96.42 ( 2%) wall   66797 kB ( 5%) ggc
 reload                :  14.31 ( 0%) usr   0.19 ( 1%) sys  14.49 ( 0%) wall  284193 kB (21%) ggc
 reload CSE regs       :   8.05 ( 0%) usr   0.00 ( 0%) sys   8.04 ( 0%) wall   77580 kB ( 6%) ggc
 load CSE after reload :   0.62 ( 0%) usr   0.00 ( 0%) sys   0.61 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue:   0.79 ( 0%) usr   0.00 ( 0%) sys   0.79 ( 0%) wall     216 kB ( 0%) ggc
 peephole 2            :   0.99 ( 0%) usr   0.00 ( 0%) sys   0.99 ( 0%) wall      29 kB ( 0%) ggc
 rename registers      :   3.99 ( 0%) usr   0.00 ( 0%) sys   4.02 ( 0%) wall       0 kB ( 0%) ggc
 scheduling 2          :  18.25 ( 0%) usr   0.54 ( 4%) sys  18.77 ( 0%) wall  147959 kB (11%) ggc
 machine dep reorg     :   1.87 ( 0%) usr   0.00 ( 0%) sys   1.88 ( 0%) wall       2 kB ( 0%) ggc
 reorder blocks        :   0.36 ( 0%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall      17 kB ( 0%) ggc
 final                 :   2.93 ( 0%) usr   0.09 ( 1%) sys   3.01 ( 0%) wall   11989 kB ( 1%) ggc
 symout                :   0.04 ( 0%) usr   0.01 ( 0%) sys   0.05 ( 0%) wall    5340 kB ( 0%) ggc
 variable tracking     :   1.39 ( 0%) usr   0.06 ( 0%) sys   1.44 ( 0%) wall      63 kB ( 0%) ggc
 TOTAL                 :4777.54            13.76          4811.10            1328156 kB

while on the same machine, 4.3 has this time report:

gfortran-4.3 -ffree-line-length-512 -g -ffree-form -ftime-report -c -O3 -march=native -funroll-loops -fno-rename-registers testcase.f90

Execution times (seconds)
 garbage collection    :   2.22 ( 0%) usr   0.01 ( 0%) sys   2.26 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.24 ( 0%) usr   0.05 ( 0%) sys   0.30 ( 0%) wall   10435 kB ( 1%) ggc
 callgraph optimization:   0.88 ( 0%) usr   0.02 ( 0%) sys   0.94 ( 0%) wall   13970 kB ( 1%) ggc
 ipa reference         :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall       3 kB ( 0%) ggc
 ipa pure const        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 cfg cleanup           :   0.06 ( 0%) usr   0.02 ( 0%) sys   0.15 ( 0%) wall       1 kB ( 0%) ggc
 trivially dead code   :   3.72 ( 0%) usr   0.02 ( 0%) sys   3.76 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   4.71 ( 0%) usr   0.06 ( 0%) sys   4.76 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :  14.35 ( 0%) usr   0.00 ( 0%) sys  14.35 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   6.28 ( 0%) usr   0.05 ( 0%) sys   6.41 ( 0%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:  12.90 ( 0%) usr   0.03 ( 0%) sys  12.96 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   6.17 ( 0%) usr   0.01 ( 0%) sys   6.19 ( 0%) wall   35719 kB ( 3%) ggc
 register information  :   2.84 ( 0%) usr   0.01 ( 0%) sys   2.91 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   6.60 ( 0%) usr   0.02 ( 0%) sys   6.69 ( 0%) wall   46609 kB ( 3%) ggc
 register scan         :   1.24 ( 0%) usr   0.00 ( 0%) sys   1.28 ( 0%) wall       8 kB ( 0%) ggc
 rebuild jump labels   :   2.33 ( 0%) usr   0.00 ( 0%) sys   2.32 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   2.77 ( 0%) usr   0.86 ( 6%) sys   5.64 ( 0%) wall   58659 kB ( 4%) ggc
 inline heuristics     :1683.93 (21%) usr   4.08 (29%) sys1689.76 (21%) wall     136 kB ( 0%) ggc
 integration           :   0.00 ( 0%) usr   0.02 ( 0%) sys   0.12 ( 0%) wall     824 kB ( 0%) ggc
 tree gimplify         :   0.43 ( 0%) usr   0.01 ( 0%) sys   0.43 ( 0%) wall    3446 kB ( 0%) ggc
 tree eh               :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall    7150 kB ( 1%) ggc
 tree CFG cleanup      :   0.02 ( 0%) usr   0.04 ( 0%) sys   0.21 ( 0%) wall       8 kB ( 0%) ggc
 tree VRP              :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall     479 kB ( 0%) ggc
 tree copy propagation :   0.51 ( 0%) usr   0.00 ( 0%) sys   0.55 ( 0%) wall     159 kB ( 0%) ggc
 tree find ref. vars   :   0.11 ( 0%) usr   0.02 ( 0%) sys   0.12 ( 0%) wall    7876 kB ( 1%) ggc
 tree PTA              :   3.88 ( 0%) usr   0.07 ( 1%) sys   3.98 ( 0%) wall     380 kB ( 0%) ggc
 tree alias analysis   :   8.07 ( 0%) usr   0.83 ( 6%) sys  10.31 ( 0%) wall      49 kB ( 0%) ggc
 tree call clobbering  :   0.32 ( 0%) usr   0.00 ( 0%) sys   0.31 ( 0%) wall       0 kB ( 0%) ggc
 tree flow sensitive alias:   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall      62 kB ( 0%) ggc
 tree flow insensitive alias:   4.79 ( 0%) usr   0.00 ( 0%) sys   4.79 ( 0%) wall       0 kB ( 0%) ggc
 tree memory partitioning:  23.61 ( 0%) usr   0.01 ( 0%) sys  23.65 ( 0%) wall       2 kB ( 0%) ggc
 tree PHI insertion    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite      :   0.15 ( 0%) usr   0.02 ( 0%) sys   0.19 ( 0%) wall   25410 kB ( 2%) ggc
 tree SSA other        :   0.15 ( 0%) usr   0.52 ( 4%) sys   2.07 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   0.63 ( 0%) usr   0.02 ( 0%) sys   0.69 ( 0%) wall      23 kB ( 0%) ggc
 tree operand scan     :5025.85 (64%) usr   3.00 (22%) sys5036.21 (64%) wall   19771 kB ( 1%) ggc
 dominator optimization:   0.19 ( 0%) usr   0.01 ( 0%) sys   0.18 ( 0%) wall     381 kB ( 0%) ggc
 tree SRA              :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall      71 kB ( 0%) ggc
 tree STORE-CCP        :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall      39 kB ( 0%) ggc
 tree CCP              :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall      81 kB ( 0%) ggc
 tree PHI const/copy prop:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree reassociation    :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall      50 kB ( 0%) ggc
 tree PRE              :   0.37 ( 0%) usr   0.00 ( 0%) sys   0.36 ( 0%) wall    2286 kB ( 0%) ggc
 tree FRE              :   0.31 ( 0%) usr   0.01 ( 0%) sys   0.34 ( 0%) wall    2268 kB ( 0%) ggc
 tree code sinking     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      25 kB ( 0%) ggc
 tree linearize phis   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      10 kB ( 0%) ggc
 tree forward propagate:   0.04 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       1 kB ( 0%) ggc
 tree conservative DCE :   0.44 ( 0%) usr   0.00 ( 0%) sys   0.47 ( 0%) wall       4 kB ( 0%) ggc
 tree aggressive DCE   :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall       4 kB ( 0%) ggc
 PHI merge             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree loop optimization:   0.00 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 loop invariant motion :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree canonical iv     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree loop unswitching :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 complete unrolling    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 predictive commoning  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       1 kB ( 0%) ggc
 tree copy headers     :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall      26 kB ( 0%) ggc
 tree SSA uncprop      :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA to normal    :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       1 kB ( 0%) ggc
 tree NRV optimization :   0.00 ( 0%) usr   0.02 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree rename SSA copies:   0.03 ( 0%) usr   0.01 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 dominance frontiers   :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.00 ( 0%) usr   0.02 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 expand                : 959.04 (12%) usr   2.65 (19%) sys 964.23 (12%) wall  422621 kB (31%) ggc
 varconst              :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 lower subreg          :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 jump                  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 forward prop          :   1.26 ( 0%) usr   0.04 ( 0%) sys   1.34 ( 0%) wall    8174 kB ( 1%) ggc
 CSE                   :   7.64 ( 0%) usr   0.04 ( 0%) sys   7.71 ( 0%) wall   26097 kB ( 2%) ggc
 dead code elimination :   2.69 ( 0%) usr   0.00 ( 0%) sys   2.70 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   4.05 ( 0%) usr   0.10 ( 1%) sys   4.21 ( 0%) wall   36200 kB ( 3%) ggc
 dead store elim2      :   5.95 ( 0%) usr   0.01 ( 0%) sys   5.98 ( 0%) wall   36506 kB ( 3%) ggc
 loop analysis         :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall     111 kB ( 0%) ggc
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 bypass jumps          :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       6 kB ( 0%) ggc
 web                   :   0.98 ( 0%) usr   0.01 ( 0%) sys   1.00 ( 0%) wall       0 kB ( 0%) ggc
 CSE 2                 :   3.65 ( 0%) usr   0.01 ( 0%) sys   3.70 ( 0%) wall    9710 kB ( 1%) ggc
 branch prediction     :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall      48 kB ( 0%) ggc
 combiner              :   3.34 ( 0%) usr   0.00 ( 0%) sys   3.34 ( 0%) wall   17434 kB ( 1%) ggc
 if-conversion         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      50 kB ( 0%) ggc
 regmove               :   3.07 ( 0%) usr   0.00 ( 0%) sys   3.07 ( 0%) wall      28 kB ( 0%) ggc
 mode switching        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 local alloc           :   9.49 ( 0%) usr   0.07 ( 1%) sys   9.54 ( 0%) wall   62106 kB ( 5%) ggc
 global alloc          :  17.28 ( 0%) usr   0.52 ( 4%) sys  18.12 ( 0%) wall  264214 kB (20%) ggc
 reload CSE regs       :   7.61 ( 0%) usr   0.01 ( 0%) sys   7.62 ( 0%) wall   75867 kB ( 6%) ggc
 load CSE after reload :   0.60 ( 0%) usr   0.00 ( 0%) sys   0.60 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue:   0.68 ( 0%) usr   0.01 ( 0%) sys   0.68 ( 0%) wall     236 kB ( 0%) ggc
 if-conversion 2       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      25 kB ( 0%) ggc
 peephole 2            :   1.08 ( 0%) usr   0.00 ( 0%) sys   1.09 ( 0%) wall      28 kB ( 0%) ggc
 rename registers      :   1.87 ( 0%) usr   0.00 ( 0%) sys   1.90 ( 0%) wall       0 kB ( 0%) ggc
 scheduling 2          :  15.12 ( 0%) usr   0.38 ( 3%) sys  15.51 ( 0%) wall  144759 kB (11%) ggc
 machine dep reorg     :   1.52 ( 0%) usr   0.00 ( 0%) sys   1.51 ( 0%) wall       2 kB ( 0%) ggc
 reorder blocks        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      18 kB ( 0%) ggc
 reg stack             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 final                 :   2.94 ( 0%) usr   0.10 ( 1%) sys   3.29 ( 0%) wall    1267 kB ( 0%) ggc
 symout                :   0.15 ( 0%) usr   0.01 ( 0%) sys   0.16 ( 0%) wall    6928 kB ( 1%) ggc
 variable tracking     :   1.36 ( 0%) usr   0.00 ( 0%) sys   1.37 ( 0%) wall      76 kB ( 0%) ggc
 tree if-combine       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 rest of compilation   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :7873.75            13.93          7906.31            1349263 kB
total: 1271414 kB

Comment 33 Steven Bosscher 2008-12-17 19:40:48 UTC

cfgexpand.c:defer_stack_allocation() has this gem:

  /* Without optimization, *most* variables are allocated from the
     stack, which makes the quadratic problem large exactly when we
     want compilation to proceed as quickly as possible.  On the
     other hand, we don't want the function's stack frame size to
     get completely out of hand.  So we avoid adding scalars and
     "small" aggregates to the list at all.  */
  if (optimize == 0 && tree_low_cst (DECL_SIZE_UNIT (var), 1) < 32)
    return false;

In our case, most variables are of type mpfr_type, which is ... yes, 32 bytes :-)

Comment 34 Joost VandeVondele 2008-12-20 08:58:58 UTC

BTW, should I split this PR in 4 sub PRs, and make them block this one?

1) inline heuristics (4.3/4.4 Regression) 
2) IRA mem explosion (4.4 Regression)
3) rename registers issue (?)
4) may_alias issue (?)

This makes kind of sense according to me.

Comment 35 Steven Bosscher 2008-12-20 09:54:40 UTC

Re comment #34: Good idea, but add:
5) quadratic behaviour in find_temp_slot_from_address.

Comment 36 Joost VandeVondele 2008-12-20 11:30:30 UTC

I've added 

PR38582 : rename registers
PR38583 : ira
PR38584 : inline heuristic
PR38585 : compute_may_aliases
PR38586 : find_temp_slot_from_address

and turned this one in a meta bug.

Comment 37 Richard Biener 2009-01-24 10:21:04 UTC

GCC 4.3.3 is being released, adjusting target milestone.

Comment 38 Richard Biener 2009-08-04 12:29:39 UTC

GCC 4.3.4 is being released, adjusting target milestone.

Comment 39 Joost VandeVondele 2009-11-27 08:52:29 UTC

I've rerun the initial (non-reduced) testcase at -O0, and I'm getting now more reasonable memory usage (2.5Gb), and all time is now in 'expand'. 'expand' is now about 3 times slower than 1year ago, but this is with checking enabled so I don't know if this is relevant:

Execution times (seconds)
 garbage collection    :   2.22 ( 0%) usr   0.00 ( 0%) sys   2.22 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.22 ( 0%) usr   0.02 ( 0%) sys   0.28 ( 0%) wall   12487 kB ( 2%) ggc
 callgraph optimization:   0.23 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall    4370 kB ( 1%) ggc
 cfg cleanup           :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier          :   3.33 ( 0%) usr   0.01 ( 0%) sys   3.36 ( 0%) wall       0 kB ( 0%) ggc
 trivially dead code   :   0.92 ( 0%) usr   0.00 ( 0%) sys   0.91 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   0.62 ( 0%) usr   0.00 ( 0%) sys   0.64 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   1.33 ( 0%) usr   0.02 ( 0%) sys   1.31 ( 0%) wall   19416 kB ( 3%) ggc
 register information  :   0.63 ( 0%) usr   0.01 ( 0%) sys   0.64 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.58 ( 0%) usr   0.01 ( 0%) sys   0.59 ( 0%) wall    8335 kB ( 1%) ggc
 rebuild jump labels   :   0.65 ( 0%) usr   0.00 ( 0%) sys   0.65 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   4.96 ( 1%) usr   0.09 ( 2%) sys   5.06 ( 1%) wall   50732 kB ( 9%) ggc
 inline heuristics     :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 tree gimplify         :   0.72 ( 0%) usr   0.01 ( 0%) sys   0.69 ( 0%) wall   13184 kB ( 2%) ggc
 tree eh               :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     172 kB ( 0%) ggc
 tree find ref. vars   :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall    3263 kB ( 1%) ggc
 tree SSA rewrite      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      46 kB ( 0%) ggc
 tree SSA other        :   0.02 ( 0%) usr   0.02 ( 0%) sys   0.04 ( 0%) wall      18 kB ( 0%) ggc
 tree operand scan     :   0.06 ( 0%) usr   0.02 ( 0%) sys   0.07 ( 0%) wall     118 kB ( 0%) ggc
 tree SSA verifier     :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier    :   1.49 ( 0%) usr   0.03 ( 1%) sys   1.53 ( 0%) wall       0 kB ( 0%) ggc
 callgraph verifier    :   1.22 ( 0%) usr   0.00 ( 0%) sys   1.25 ( 0%) wall       0 kB ( 0%) ggc
 expand                : 737.90 (94%) usr   3.54 (79%) sys 741.44 (94%) wall  309551 kB (55%) ggc
 integrated RA         :  18.91 ( 2%) usr   0.28 ( 6%) sys  19.24 ( 2%) wall    8696 kB ( 2%) ggc
 reload                :   8.08 ( 1%) usr   0.26 ( 6%) sys   8.33 ( 1%) wall  123546 kB (22%) ggc
 thread pro- & epilogue:   0.80 ( 0%) usr   0.00 ( 0%) sys   0.81 ( 0%) wall     239 kB ( 0%) ggc
 final                 :   3.13 ( 0%) usr   0.15 ( 3%) sys   3.30 ( 0%) wall     533 kB ( 0%) ggc
 symout                :   0.08 ( 0%) usr   0.02 ( 0%) sys   0.08 ( 0%) wall    4818 kB ( 1%) ggc
 TOTAL                 : 788.49             4.49           792.98             559736 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.
COLLECT_GCC_OPTIONS='-ffree-line-length-512' '-g' '-ffree-form' '-ftime-report' '-c' '-O0' '-ffree-line-length-512' '-v' '-mtune=generic'
 as -V -Qy -o PR38582.o /tmp/ccfulxg5.s

Comment 40 Joost VandeVondele 2009-11-27 09:00:20 UTC

with the fix for rename registers this now also runs 'fast' at -O3 (see below), and memory is reasonable as well. Most time is in expand as well. This is the time report of -O3:

Execution times (seconds)
 garbage collection    :   7.60 ( 1%) usr   0.03 ( 0%) sys   7.65 ( 1%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.23 ( 0%) usr   0.01 ( 0%) sys   0.25 ( 0%) wall   12524 kB ( 1%) ggc
 callgraph optimization:   0.48 ( 0%) usr   0.03 ( 0%) sys   0.51 ( 0%) wall    4370 kB ( 0%) ggc
 ipa cp                :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall    2061 kB ( 0%) ggc
 ipa reference         :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall       2 kB ( 0%) ggc
 cfg cleanup           :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier          :  11.18 ( 1%) usr   0.04 ( 0%) sys  11.25 ( 1%) wall       0 kB ( 0%) ggc
 trivially dead code   :   2.70 ( 0%) usr   0.01 ( 0%) sys   2.72 ( 0%) wall       0 kB ( 0%) ggc
 df multiple defs      :   3.28 ( 0%) usr   0.00 ( 0%) sys   3.28 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   1.30 ( 0%) usr   0.04 ( 0%) sys   1.33 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :  11.46 ( 1%) usr   0.01 ( 0%) sys  11.47 ( 1%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   6.86 ( 1%) usr   0.02 ( 0%) sys   6.87 ( 1%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   3.87 ( 0%) usr   0.02 ( 0%) sys   3.91 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   9.18 ( 1%) usr   0.01 ( 0%) sys   9.23 ( 1%) wall   28894 kB ( 2%) ggc
 register information  :   3.54 ( 0%) usr   0.02 ( 0%) sys   3.58 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   5.55 ( 1%) usr   0.01 ( 0%) sys   5.60 ( 1%) wall   42254 kB ( 4%) ggc
 alias stmt walking    :   0.23 ( 0%) usr   0.11 ( 1%) sys   0.33 ( 0%) wall       0 kB ( 0%) ggc
 register scan         :   0.70 ( 0%) usr   0.00 ( 0%) sys   0.71 ( 0%) wall       4 kB ( 0%) ggc
 rebuild jump labels   :   1.43 ( 0%) usr   0.00 ( 0%) sys   1.46 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   4.66 ( 1%) usr   0.11 ( 1%) sys   4.78 ( 1%) wall   50732 kB ( 4%) ggc
 inline heuristics     :  40.66 ( 5%) usr   8.08 (51%) sys  48.90 ( 6%) wall     112 kB ( 0%) ggc
 integration           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     951 kB ( 0%) ggc
 tree gimplify         :   0.67 ( 0%) usr   0.00 ( 0%) sys   0.67 ( 0%) wall   13182 kB ( 1%) ggc
 tree eh               :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     172 kB ( 0%) ggc
 tree CFG cleanup      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       1 kB ( 0%) ggc
 tree VRP              :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall     425 kB ( 0%) ggc
 tree copy propagation :   0.26 ( 0%) usr   0.00 ( 0%) sys   0.22 ( 0%) wall     139 kB ( 0%) ggc
 tree find ref. vars   :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall    3262 kB ( 0%) ggc
 tree PTA              :  21.39 ( 3%) usr   0.38 ( 2%) sys  21.76 ( 3%) wall     371 kB ( 0%) ggc
 tree PHI insertion    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    8504 kB ( 1%) ggc
 tree SSA other        :   0.04 ( 0%) usr   0.01 ( 0%) sys   0.05 ( 0%) wall      18 kB ( 0%) ggc
 tree SSA incremental  :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      24 kB ( 0%) ggc
 tree operand scan     :   0.04 ( 0%) usr   0.07 ( 0%) sys   0.10 ( 0%) wall    4721 kB ( 0%) ggc
 dominator optimization:   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      68 kB ( 0%) ggc
 tree SRA              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      86 kB ( 0%) ggc
 tree CCP              :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall     105 kB ( 0%) ggc
 tree reassociation    :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall      48 kB ( 0%) ggc
 tree PRE              :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall     171 kB ( 0%) ggc
 tree FRE              :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall     140 kB ( 0%) ggc
 tree code sinking     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      24 kB ( 0%) ggc
 tree linearize phis   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      14 kB ( 0%) ggc
 tree forward propagate:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       7 kB ( 0%) ggc
 tree conservative DCE :   0.40 ( 0%) usr   0.05 ( 0%) sys   0.46 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.21 ( 0%) usr   0.03 ( 0%) sys   0.19 ( 0%) wall     319 kB ( 0%) ggc
 tree buildin call DCE :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       8 kB ( 0%) ggc
 complete unrolling    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      43 kB ( 0%) ggc
 tree vectorization    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree slp vectorization:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      27 kB ( 0%) ggc
 tree rename SSA copies:   0.04 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA verifier     :   2.85 ( 0%) usr   0.02 ( 0%) sys   2.83 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier    :  13.12 ( 2%) usr   0.06 ( 0%) sys  13.20 ( 2%) wall       0 kB ( 0%) ggc
 callgraph verifier    :   1.85 ( 0%) usr   0.00 ( 0%) sys   1.86 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 expand                : 548.68 (65%) usr   4.27 (27%) sys 552.92 (64%) wall  311209 kB (26%) ggc
 lower subreg          :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 forward prop          :   6.49 ( 1%) usr   0.08 ( 1%) sys   6.57 ( 1%) wall   18623 kB ( 2%) ggc
 CSE                   :   4.60 ( 1%) usr   0.02 ( 0%) sys   4.62 ( 1%) wall   11149 kB ( 1%) ggc
 dead code elimination :   2.60 ( 0%) usr   0.01 ( 0%) sys   2.60 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   3.33 ( 0%) usr   0.22 ( 1%) sys   3.51 ( 0%) wall   27472 kB ( 2%) ggc
 dead store elim2      :   8.94 ( 1%) usr   0.02 ( 0%) sys   8.92 ( 1%) wall   40503 kB ( 3%) ggc
 CPROP                 :   3.82 ( 0%) usr   0.01 ( 0%) sys   3.84 ( 0%) wall      10 kB ( 0%) ggc
 CSE 2                 :   4.43 ( 1%) usr   0.02 ( 0%) sys   4.44 ( 1%) wall    7115 kB ( 1%) ggc
 branch prediction     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      43 kB ( 0%) ggc
 combiner              :   3.60 ( 0%) usr   0.03 ( 0%) sys   3.62 ( 0%) wall   13773 kB ( 1%) ggc
 regmove               :   1.00 ( 0%) usr   0.01 ( 0%) sys   1.00 ( 0%) wall       0 kB ( 0%) ggc
 integrated RA         :  30.06 ( 4%) usr   0.29 ( 2%) sys  30.38 ( 4%) wall   52314 kB ( 4%) ggc
 reload                :  11.54 ( 1%) usr   0.52 ( 3%) sys  12.09 ( 1%) wall  216344 kB (18%) ggc
 reload CSE regs       :   9.15 ( 1%) usr   0.01 ( 0%) sys   9.16 ( 1%) wall   59432 kB ( 5%) ggc
 load CSE after reload :   0.53 ( 0%) usr   0.01 ( 0%) sys   0.53 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue:   0.86 ( 0%) usr   0.00 ( 0%) sys   0.86 ( 0%) wall     302 kB ( 0%) ggc
 if-conversion 2       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      24 kB ( 0%) ggc
 combine stack adjustments:   0.18 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 peephole 2            :   1.07 ( 0%) usr   0.00 ( 0%) sys   1.07 ( 0%) wall      27 kB ( 0%) ggc
 hard reg cprop        :   3.83 ( 0%) usr   0.00 ( 0%) sys   3.85 ( 0%) wall       2 kB ( 0%) ggc
 scheduling 2          :  20.89 ( 2%) usr   0.83 ( 5%) sys  21.75 ( 3%) wall  125198 kB (10%) ggc
 machine dep reorg     :   1.51 ( 0%) usr   0.00 ( 0%) sys   1.53 ( 0%) wall       0 kB ( 0%) ggc
 reorder blocks        :   0.31 ( 0%) usr   0.00 ( 0%) sys   0.30 ( 0%) wall       1 kB ( 0%) ggc
 final                 :   3.47 ( 0%) usr   0.13 ( 1%) sys   3.56 ( 0%) wall    1631 kB ( 0%) ggc
 symout                :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall    4315 kB ( 0%) ggc
 variable tracking     :  15.85 ( 2%) usr   0.03 ( 0%) sys  15.90 ( 2%) wall  133442 kB (11%) ggc
 TOTAL                 : 844.13            15.69           860.19            1197120 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.
COLLECT_GCC_OPTIONS='-ffree-line-length-512' '-g' '-ffree-form' '-ftime-report' '-c' '-O3' '-ffree-line-length-512' '-v' '-mtune=generic'
 as -V -Qy -o PR38582.o /tmp/ccoKMKzI.s

Comment 41 Richard Biener 2009-11-27 10:50:28 UTC

Micha - we still spend most of the time in expand_used_vars even at -O0.
Maybe you want to have a look.

 expand                : 555.46 (92%) usr   4.88 (77%) sys 579.14 (92%) wall  310089 kB (56%) ggc
 integrated RA         :  13.46 ( 2%) usr   0.14 ( 2%) sys  13.57 ( 2%) wall    8685 kB ( 2%) ggc
 reload                :  14.29 ( 2%) usr   0.71 (11%) sys  15.04 ( 2%) wall  123548 kB (22%) ggc
 TOTAL                 : 605.85             6.35           631.31             552067 kB

We also still peak at 2.4GB for this testcase...  the detailed memory
report is as follows (just the biggest pieces):

Kind                   Nodes      Bytes
---------------------------------------
decls                 145133   24456664
exprs                 490000   28838464
random kinds          226360    9054736
---------------------------------------
Total                 923230   66015565

GIMPLE statements
Kind                   Stmts      Bytes
---------------------------------------
assignments              361      29112
phi nodes                  1        240
conditionals              61       4880
sequences               3874      92976
everything else        78134   10240768
---------------------------------------
Total                  82431   10367976

(would probably interesting to separately count calls)

RTX Kind               Count      Bytes
---------------------------------------
expr_list            1113875   26733000
insn                 1106192   79645824
set                  1106448   26554752
reg                   829905   26556960
mem                  2508095   60194280
plus                 2894284   69462816
---------------------------------------
Total                10615719  311545008

DF as usual is a big memory consumer, even at -O0 ...

source location                                     Garbage            Freed             Leak         Overhead            Times
emit-rtl.c:907 (gen_reg_rtx)                       11781120: 2.3%   11769792:33.0%          0: 0.0%    6519744:60.7%         18
rtl.c:285 (copy_rtx)                               21352344: 4.1%          0: 0.0%          0: 0.0%          0: 0.0%     889681
emit-rtl.c:425 (gen_raw_REG)                       26543328: 5.1%          0: 0.0%      13600: 0.1%          0: 0.0%     829904
reload1.c:2622 (eliminate_regs_1)                  26589096: 5.1%          0: 0.0%          0: 0.0%          0: 0.0%    1107879
emit-rtl.c:640 (gen_rtx_MEM)                       59946576:11.5%          0: 0.0%     247704: 1.2%          0: 0.0%    2508095
emit-rtl.c:5457 (copy_insn_1)                      61260256:11.8%          0: 0.0%          0: 0.0%          0: 0.0%    2608317
emit-rtl.c:3610 (make_insn_raw)                    79645680:15.3%          0: 0.0%          0: 0.0%          0: 0.0%    1106190
Total                                             520012383         35665696         20430230         10739229         16135674
source location                                     Garbage            Freed             Leak         Overhead            Times


So most of the memory is used in the RTL parts of the compiler.

Comment 42 Michael Matz 2009-12-03 13:36:49 UTC

Subject: Bug 38474

Author: matz
Date: Thu Dec  3 13:36:32 2009
New Revision: 154945

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=154945
Log:
	PR middle-end/38474
	* cfgexpand.c (struct stack_var): Add conflicts member.
	(stack_vars_conflict, stack_vars_conflict_alloc,
	n_stack_vars_conflict): Remove.
	(add_stack_var): Initialize conflicts member.
	(triangular_index, resize_stack_vars_conflict): Remove.
	(add_stack_var_conflict, stack_var_conflict_p): Rewrite in
	terms of new member.
	(union_stack_vars): Only run over the conflicts.
	(partition_stack_vars): Remove special case.
	(expand_used_vars_for_block): Don't call resize_stack_vars_conflict,
	don't create self-conflicts.
	(account_used_vars_for_block): Don't create any conflicts.
	(fini_vars_expansion): Free bitmaps, don't free or clear removed
	globals.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c

Comment 43 Joost VandeVondele 2009-12-03 17:47:55 UTC

(In reply to comment #42)
> Subject: Bug 38474
> 
> Author: matz
> Date: Thu Dec  3 13:36:32 2009
> New Revision: 154945

looks like the initial testcase now runs with 1.3Gb, and with the following timings (so mem/time both better by a factor of two):
expand                : 386.46 (89%) usr   0.81 (48%) sys 387.21 (89%) wall  309554 kB (56%) ggc
 integrated RA         :  17.97 ( 4%) usr   0.26 (15%) sys  18.28 ( 4%) wall    8696 kB ( 2%) ggc
 reload                :   7.78 ( 2%) usr   0.25 (15%) sys   8.07 ( 2%) wall  123546 kB (22%) ggc
 thread pro- & epilogue:   0.74 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall     239 kB ( 0%) ggc
 final                 :   2.84 ( 1%) usr   0.12 ( 7%) sys   2.95 ( 1%) wall      20 kB ( 0%) ggc
 TOTAL                 : 434.29             1.70           436.00             553866 kB

Comment 44 Michael Matz 2009-12-03 21:05:09 UTC

I'm glad.  I plan to work on also the other slow part of expand, which is the
temp slot goo, but a full solution requires touching very old and stable parts
of GCC, hence is IMO nothing for stage 3.  I have an obvious band aid patch
giving at least some further improvements that I plan to submit for 4.5.

Comment 45 Michael Matz 2009-12-08 13:56:22 UTC

Subject: Bug 38474

Author: matz
Date: Tue Dec  8 13:56:06 2009
New Revision: 155087

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155087
Log:
	PR middle-end/38474
	* function.c (free_temp_slots): Only walk the temp slot
	addresses and combine slots if we actually changes something.
	(pop_temp_slots): Ditto.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/function.c

Comment 46 Richard Biener 2010-05-22 18:12:55 UTC

GCC 4.3.5 is being released, adjusting target milestone.

Comment 47 Joost VandeVondele 2010-05-23 06:31:22 UTC

all dependencies are fixed, and so is this bug.

Comment 48 Richard Biener 2010-05-23 20:08:35 UTC

Nope.  See comment#44.

Comment 49 Steven Bosscher 2010-05-23 21:02:57 UTC

Let's change the bug type at least, from a meta bug to a normal bug.

Comment 50 Steven Bosscher 2011-12-02 13:06:12 UTC

Michael, you were working on this. Did your patches resolve this bug completely by now?

Comment 51 Michael Matz 2011-12-02 13:23:57 UTC

Nope, I don't have more than a couple hacks to try different approaches
as of right now.  I should dust them off for next stage1.

Comment 52 Steven Bosscher 2012-05-27 23:15:19 UTC

Patch set posted by matz.

Comment 53 Joost VandeVondele 2012-05-29 07:45:36 UTC

For the original testcase I have for trunk (gcc version 4.8.0 20120516 (experimental) [trunk revision 187595] (GCC)) very reasonable times (1min) at -O0, but pretty slow (20min) at -O2. At -O2, all time goes to 'alias stmt walking      : 826.02' in the latter case. Time reports below:

gfortran -ftime-report -ffree-line-length-512 -g -c testcase.f90

Execution times (seconds)
 phase setup             :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     243 kB ( 0%) ggc
 phase parsing           :   3.59 ( 6%) usr   0.05 ( 5%) sys   3.64 ( 6%) wall   47592 kB ( 7%) ggc
 phase cgraph            :  60.02 (94%) usr   0.90 (95%) sys  60.94 (94%) wall  649547 kB (93%) ggc
 phase generate          :  60.03 (94%) usr   0.90 (95%) sys  60.95 (94%) wall  649948 kB (93%) ggc
 garbage collection      :   1.04 ( 2%) usr   0.00 ( 0%) sys   1.04 ( 2%) wall       0 kB ( 0%) ggc
 callgraph construction  :   0.18 ( 0%) usr   0.01 ( 1%) sys   0.20 ( 0%) wall   15909 kB ( 2%) ggc
 callgraph optimization  :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     201 kB ( 0%) ggc
 cfg construction        :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       7 kB ( 0%) ggc
 cfg cleanup             :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier            :   1.16 ( 2%) usr   0.00 ( 0%) sys   1.18 ( 2%) wall       0 kB ( 0%) ggc
 trivially dead code     :   0.34 ( 1%) usr   0.00 ( 0%) sys   0.35 ( 1%) wall       0 kB ( 0%) ggc
 df scan insns           :   1.00 ( 2%) usr   0.25 (26%) sys   1.23 ( 2%) wall      11 kB ( 0%) ggc
 df live regs            :   0.46 ( 1%) usr   0.00 ( 0%) sys   0.49 ( 1%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.45 ( 1%) usr   0.01 ( 1%) sys   0.47 ( 1%) wall   19416 kB ( 3%) ggc
 register information    :   0.20 ( 0%) usr   0.01 ( 1%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall    8336 kB ( 1%) ggc
 rebuild jump labels     :   0.22 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
 parser (global)         :   3.59 ( 6%) usr   0.05 ( 5%) sys   3.64 ( 6%) wall   47587 kB ( 7%) ggc
 inline heuristics       :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall      54 kB ( 0%) ggc
 tree gimplify           :   0.48 ( 1%) usr   0.01 ( 1%) sys   0.49 ( 1%) wall   26304 kB ( 4%) ggc
 tree eh                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      39 kB ( 0%) ggc
 tree CFG construction   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     190 kB ( 0%) ggc
 tree find ref. vars     :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    3263 kB ( 0%) ggc
 tree PHI insertion      :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      43 kB ( 0%) ggc
 tree SSA other          :   0.04 ( 0%) usr   0.02 ( 2%) sys   0.01 ( 0%) wall      18 kB ( 0%) ggc
 tree operand scan       :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.06 ( 0%) wall     118 kB ( 0%) ggc
 tree SSA verifier       :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier      :   0.58 ( 1%) usr   0.06 ( 6%) sys   0.62 ( 1%) wall       0 kB ( 0%) ggc
 callgraph verifier      :   0.28 ( 0%) usr   0.00 ( 0%) sys   0.29 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 expand vars             :  21.72 (34%) usr   0.02 ( 2%) sys  21.74 (34%) wall   10086 kB ( 1%) ggc
 expand                  :   6.18 (10%) usr   0.15 (16%) sys   6.31 (10%) wall  251886 kB (36%) ggc
 post expand cleanups    :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall    1744 kB ( 0%) ggc
 integrated RA           :  10.75 (17%) usr   0.16 (17%) sys  10.87 (17%) wall  128826 kB (18%) ggc
 reload                  :   5.72 ( 9%) usr   0.15 (16%) sys   5.92 ( 9%) wall  123587 kB (18%) ggc
 thread pro- & epilogue  :   2.51 ( 4%) usr   0.00 ( 0%) sys   2.50 ( 4%) wall     198 kB ( 0%) ggc
 machine dep reorg       :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   2.61 ( 4%) usr   0.04 ( 4%) sys   2.65 ( 4%) wall    7227 kB ( 1%) ggc
 symout                  :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    4914 kB ( 1%) ggc
 rest of compilation     :   2.36 ( 4%) usr   0.00 ( 0%) sys   2.35 ( 4%) wall   47578 kB ( 7%) ggc
 verify RTL sharing      :   1.02 ( 2%) usr   0.00 ( 0%) sys   1.04 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                 :  63.65             0.95            64.62             697784 kB


gfortran -ftime-report -ffree-line-length-512 -O2 -g -c testcase.f90

Execution times (seconds)
 phase setup             :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     243 kB ( 0%) ggc
 phase parsing           :   3.59 ( 0%) usr   0.07 ( 5%) sys   3.66 ( 0%) wall   47596 kB ( 7%) ggc
 phase cgraph            :1031.34 (100%) usr   1.36 (93%) sys1032.77 (100%) wall  630545 kB (91%) ggc
 phase generate          :1031.85 (100%) usr   1.39 (95%) sys1033.30 (100%) wall  643621 kB (93%) ggc
 garbage collection      :   1.74 ( 0%) usr   0.01 ( 1%) sys   1.74 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction  :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall   15908 kB ( 2%) ggc
 callgraph optimization  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     201 kB ( 0%) ggc
 ipa cp                  :   0.40 ( 0%) usr   0.12 ( 8%) sys   0.59 ( 0%) wall    6174 kB ( 1%) ggc
 ipa reference           :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 cfg construction        :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 cfg cleanup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       1 kB ( 0%) ggc
 CFG verifier            :   1.84 ( 0%) usr   0.01 ( 1%) sys   1.83 ( 0%) wall       0 kB ( 0%) ggc
 trivially dead code     :   0.46 ( 0%) usr   0.00 ( 0%) sys   0.47 ( 0%) wall       0 kB ( 0%) ggc
 df scan insns           :   0.75 ( 0%) usr   0.08 ( 5%) sys   0.83 ( 0%) wall      11 kB ( 0%) ggc
 df multiple defs        :   0.49 ( 0%) usr   0.00 ( 0%) sys   0.49 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs        :   1.28 ( 0%) usr   0.02 ( 1%) sys   1.30 ( 0%) wall       0 kB ( 0%) ggc
 df live regs            :   2.22 ( 0%) usr   0.00 ( 0%) sys   2.22 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   1.18 ( 0%) usr   0.00 ( 0%) sys   1.19 ( 0%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   0.46 ( 0%) usr   0.01 ( 1%) sys   0.46 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   2.26 ( 0%) usr   0.00 ( 0%) sys   2.29 ( 0%) wall   15547 kB ( 2%) ggc
 register information    :   1.16 ( 0%) usr   0.00 ( 0%) sys   1.15 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.60 ( 0%) usr   0.00 ( 0%) sys   0.59 ( 0%) wall   18809 kB ( 3%) ggc
 alias stmt walking      : 826.02 (80%) usr   0.21 (14%) sys 826.21 (80%) wall       0 kB ( 0%) ggc
 register scan           :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall      10 kB ( 0%) ggc
 rebuild jump labels     :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 parser (global)         :   3.59 ( 0%) usr   0.07 ( 5%) sys   3.66 ( 0%) wall   47591 kB ( 7%) ggc
 inline heuristics       :   0.38 ( 0%) usr   0.01 ( 1%) sys   0.35 ( 0%) wall     161 kB ( 0%) ggc
 integration             :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.04 ( 0%) wall     285 kB ( 0%) ggc
 tree gimplify           :   0.48 ( 0%) usr   0.02 ( 1%) sys   0.49 ( 0%) wall   26299 kB ( 4%) ggc
 tree eh                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      39 kB ( 0%) ggc
 tree CFG cleanup        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       4 kB ( 0%) ggc
 tree tail merge         :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree VRP                :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall     394 kB ( 0%) ggc
 tree copy propagation   :   0.20 ( 0%) usr   0.00 ( 0%) sys   0.23 ( 0%) wall     175 kB ( 0%) ggc
 tree find ref. vars     :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    3262 kB ( 0%) ggc
 tree PTA                :  26.17 ( 3%) usr   0.25 (17%) sys  26.41 ( 3%) wall   36473 kB ( 5%) ggc
 tree SSA rewrite        :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall   15685 kB ( 2%) ggc
 tree SSA other          :   0.01 ( 0%) usr   0.02 ( 1%) sys   0.05 ( 0%) wall      18 kB ( 0%) ggc
 tree SSA incremental    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      14 kB ( 0%) ggc
 tree operand scan       :   0.09 ( 0%) usr   0.01 ( 1%) sys   0.12 ( 0%) wall    8291 kB ( 1%) ggc
 dominator optimization  :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      46 kB ( 0%) ggc
 tree SRA                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree CCP                :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall     107 kB ( 0%) ggc
 tree reassociation      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      23 kB ( 0%) ggc
 tree PRE                :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall     180 kB ( 0%) ggc
 tree FRE                :   0.21 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall     313 kB ( 0%) ggc
 tree linearize phis     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      12 kB ( 0%) ggc
 tree forward propagate  :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 tree conservative DCE   :   0.19 ( 0%) usr   0.03 ( 2%) sys   0.25 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE     :   0.11 ( 0%) usr   0.02 ( 1%) sys   0.10 ( 0%) wall     295 kB ( 0%) ggc
 tree buildin call DCE   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree rename SSA copies  :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA verifier       :   3.58 ( 0%) usr   0.00 ( 0%) sys   3.64 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier      :   8.91 ( 1%) usr   0.03 ( 2%) sys   8.87 ( 1%) wall       0 kB ( 0%) ggc
 tree strlen optimization:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 callgraph verifier      :   0.48 ( 0%) usr   0.00 ( 0%) sys   0.46 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa              :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 expand vars             :  96.25 ( 9%) usr   0.00 ( 0%) sys  96.26 ( 9%) wall    9993 kB ( 1%) ggc
 expand                  :   1.34 ( 0%) usr   0.06 ( 4%) sys   1.38 ( 0%) wall  119840 kB (17%) ggc
 post expand cleanups    :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     109 kB ( 0%) ggc
 lower subreg            :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 forward prop            :   3.47 ( 0%) usr   0.02 ( 1%) sys   3.48 ( 0%) wall    6588 kB ( 1%) ggc
 CSE                     :   0.99 ( 0%) usr   0.01 ( 1%) sys   1.01 ( 0%) wall   13067 kB ( 2%) ggc
 dead code elimination   :   0.41 ( 0%) usr   0.00 ( 0%) sys   0.41 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1        :   0.65 ( 0%) usr   0.02 ( 1%) sys   0.68 ( 0%) wall    1535 kB ( 0%) ggc
 dead store elim2        :   2.52 ( 0%) usr   0.00 ( 0%) sys   2.53 ( 0%) wall    4393 kB ( 1%) ggc
 CPROP                   :   1.83 ( 0%) usr   0.00 ( 0%) sys   1.83 ( 0%) wall      10 kB ( 0%) ggc
 PRE                     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       1 kB ( 0%) ggc
 CSE 2                   :   0.78 ( 0%) usr   0.01 ( 1%) sys   0.78 ( 0%) wall    7470 kB ( 1%) ggc
 branch prediction       :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      41 kB ( 0%) ggc
 combiner                :   0.47 ( 0%) usr   0.01 ( 1%) sys   0.48 ( 0%) wall      12 kB ( 0%) ggc
 if-conversion           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      22 kB ( 0%) ggc
 regmove                 :   0.40 ( 0%) usr   0.00 ( 0%) sys   0.41 ( 0%) wall       0 kB ( 0%) ggc
 integrated RA           :  13.95 ( 1%) usr   0.12 ( 8%) sys  14.05 ( 1%) wall   67921 kB (10%) ggc
 reload                  :   3.03 ( 0%) usr   0.04 ( 3%) sys   3.04 ( 0%) wall   52942 kB ( 8%) ggc
 reload CSE regs         :   1.24 ( 0%) usr   0.00 ( 0%) sys   1.25 ( 0%) wall   10704 kB ( 2%) ggc
 ree                     :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall       1 kB ( 0%) ggc
 thread pro- & epilogue  :   0.46 ( 0%) usr   0.00 ( 0%) sys   0.45 ( 0%) wall     256 kB ( 0%) ggc
 if-conversion 2         :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      22 kB ( 0%) ggc
 combine stack adjustments:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 peephole 2              :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall      15 kB ( 0%) ggc
 hard reg cprop          :   1.12 ( 0%) usr   0.00 ( 0%) sys   1.11 ( 0%) wall       3 kB ( 0%) ggc
 scheduling 2            :   3.33 ( 0%) usr   0.11 ( 8%) sys   3.44 ( 0%) wall   39009 kB ( 6%) ggc
 machine dep reorg       :   0.21 ( 0%) usr   0.00 ( 0%) sys   0.22 ( 0%) wall       0 kB ( 0%) ggc
 reorder blocks          :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       2 kB ( 0%) ggc
 final                   :   0.87 ( 0%) usr   0.02 ( 1%) sys   0.92 ( 0%) wall   11744 kB ( 2%) ggc
 symout                  :   0.74 ( 0%) usr   0.06 ( 4%) sys   0.78 ( 0%) wall  111089 kB (16%) ggc
 variable tracking       :   0.61 ( 0%) usr   0.00 ( 0%) sys   0.61 ( 0%) wall   29367 kB ( 4%) ggc
 var-tracking dataflow   :   3.52 ( 0%) usr   0.00 ( 0%) sys   3.52 ( 0%) wall       0 kB ( 0%) ggc
 var-tracking emit       :   4.20 ( 0%) usr   0.00 ( 0%) sys   4.20 ( 0%) wall    5616 kB ( 1%) ggc
 rest of compilation     :   1.16 ( 0%) usr   0.02 ( 1%) sys   1.27 ( 0%) wall    2733 kB ( 0%) ggc
 remove unused locals    :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 address taken           :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.62 ( 0%) usr   0.03 ( 2%) sys   0.66 ( 0%) wall       0 kB ( 0%) ggc
 verify RTL sharing      :   3.87 ( 0%) usr   0.00 ( 0%) sys   3.89 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :1035.46             1.46          1036.99             691461 kB

Comment 54 Michael Matz 2012-05-29 12:47:29 UTC

Yes, only the expand vars problem is attacked by my patch.  The alias walking
seems to come from an IPA analysis via ipa_compute_jump_functions.

detect_type_change uses the walker from all call statements, and that's
used by compute_known_type_jump_func, from compute_scalar_jump_functions,
from ipa_compute_jump_functions_for_edge.  And the latter is called for each
callee.  The yukawa_gn_full function has very many calls, so this
seems to make out for an quadratic problem.

Comment 55 Michael Matz 2012-05-29 13:08:52 UTC

FWIW the node->callees list in yukawa_gn_full has 25076 entries.

Comment 56 Jan Hubicka 2012-05-29 14:57:30 UTC

That functions is Martin's. Martin, can you please take a look?

Comment 57 Martin Jambor 2012-05-29 15:08:39 UTC

(In reply to comment #56)
> That functions is Martin's. Martin, can you please take a look?

I will, on Monday.

Comment 58 Michael Matz 2012-06-15 14:56:33 UTC

Author: matz
Date: Fri Jun 15 14:56:26 2012
New Revision: 188667

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188667
Log:
	PR middle-end/38474
	* cfgexpand.c (add_alias_set_conflicts): Remove.
	(expand_used_vars): Don't call it.
	(aggregate_contains_union_type): Remove.
	* function.c (n_temp_slots_in_use): New static data.
	(make_slot_available, assign_stack_temp_for_type): Update it.
	(init_temp_slots): Zero it.
	(remove_unused_temp_slot_addresses): Use it for quicker removal.
	(remove_unused_temp_slot_addresses_1): Use htab_clear_slot.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/function.c

Comment 59 Michael Matz 2012-06-15 15:12:59 UTC

There should be no compile performance problems in expand anymore.
The alias stmt walker as used from IPA remains a problem, though.

Comment 60 Joost VandeVondele 2012-06-15 15:26:20 UTC

(In reply to comment #59)
> There should be no compile performance problems in expand anymore.
> The alias stmt walker as used from IPA remains a problem, though.

Thanks... expand is now indeed essentially gone from the timing report.

> gfortran -ftime-report -ffree-line-length-512 -g -c testcase.f90

Execution times (seconds)
 phase setup             :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     243 kB ( 0%) ggc
 phase parsing           :   3.57 ( 9%) usr   0.06 ( 7%) sys   3.63 ( 9%) wall   47592 kB ( 7%) ggc
 phase cgraph            :  36.49 (91%) usr   0.86 (93%) sys  37.34 (91%) wall  647436 kB (93%) ggc
 phase generate          :  36.50 (91%) usr   0.86 (93%) sys  37.36 (91%) wall  647838 kB (93%) ggc
 garbage collection      :   1.04 ( 3%) usr   0.00 ( 0%) sys   1.04 ( 3%) wall       0 kB ( 0%) ggc
 callgraph construction  :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall   15909 kB ( 2%) ggc
 callgraph optimization  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     201 kB ( 0%) ggc
 cfg construction        :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       7 kB ( 0%) ggc
 cfg cleanup             :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 CFG verifier            :   1.26 ( 3%) usr   0.00 ( 0%) sys   1.25 ( 3%) wall       0 kB ( 0%) ggc
 trivially dead code     :   0.43 ( 1%) usr   0.00 ( 0%) sys   0.41 ( 1%) wall       0 kB ( 0%) ggc
 df scan insns           :   0.98 ( 2%) usr   0.24 (26%) sys   1.24 ( 3%) wall      11 kB ( 0%) ggc
 df live regs            :   0.58 ( 1%) usr   0.01 ( 1%) sys   0.57 ( 1%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.43 ( 1%) usr   0.01 ( 1%) sys   0.45 ( 1%) wall   19416 kB ( 3%) ggc
 register information    :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall    8337 kB ( 1%) ggc
 rebuild jump labels     :   0.22 ( 1%) usr   0.00 ( 0%) sys   0.21 ( 1%) wall       0 kB ( 0%) ggc
 parser (global)         :   3.57 ( 9%) usr   0.06 ( 7%) sys   3.63 ( 9%) wall   47587 kB ( 7%) ggc
 inline heuristics       :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall      54 kB ( 0%) ggc
 tree gimplify           :   0.51 ( 1%) usr   0.01 ( 1%) sys   0.51 ( 1%) wall   26304 kB ( 4%) ggc
 tree eh                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      39 kB ( 0%) ggc
 tree CFG construction   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     190 kB ( 0%) ggc
 tree CFG cleanup        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree find ref. vars     :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    3263 kB ( 0%) ggc
 tree PHI insertion      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA other          :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.02 ( 0%) wall      18 kB ( 0%) ggc
 tree operand scan       :   0.03 ( 0%) usr   0.03 ( 3%) sys   0.05 ( 0%) wall     118 kB ( 0%) ggc
 tree SSA verifier       :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree STMT verifier      :   0.56 ( 1%) usr   0.05 ( 5%) sys   0.63 ( 2%) wall       0 kB ( 0%) ggc
 callgraph verifier      :   0.25 ( 1%) usr   0.00 ( 0%) sys   0.27 ( 1%) wall       0 kB ( 0%) ggc
 out of ssa              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 expand vars             :   1.02 ( 3%) usr   0.02 ( 2%) sys   1.03 ( 3%) wall   10086 kB ( 1%) ggc
 expand                  :   2.03 ( 5%) usr   0.12 (13%) sys   2.18 ( 5%) wall  249774 kB (36%) ggc
 post expand cleanups    :   0.14 ( 0%) usr   0.01 ( 1%) sys   0.14 ( 0%) wall    1744 kB ( 0%) ggc
 integrated RA           :  10.75 (27%) usr   0.15 (16%) sys  10.93 (27%) wall  128826 kB (19%) ggc
 reload                  :   5.56 (14%) usr   0.16 (17%) sys   5.77 (14%) wall  123587 kB (18%) ggc
 thread pro- & epilogue  :   2.65 ( 7%) usr   0.00 ( 0%) sys   2.64 ( 6%) wall     198 kB ( 0%) ggc
 machine dep reorg       :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   3.11 ( 8%) usr   0.04 ( 4%) sys   3.15 ( 8%) wall    7227 kB ( 1%) ggc
 symout                  :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    4914 kB ( 1%) ggc
 rest of compilation     :   2.46 ( 6%) usr   0.00 ( 0%) sys   2.39 ( 6%) wall   47578 kB ( 7%) ggc
 unaccounted todo        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 verify RTL sharing      :   1.49 ( 4%) usr   0.00 ( 0%) sys   1.48 ( 4%) wall       0 kB ( 0%) ggc
 TOTAL                 :  40.09             0.92            41.02             695674 kB

Comment 61 Martin Jambor 2012-06-26 14:26:34 UTC

(In reply to comment #57)
> 
> I will, on Monday.

And by Monday I obviously meant yesterday ;-)

Anyway, on the machine where are debugged this, compilation at -O3
took over 16 seconds which dropped to about 13.5 seconds when I also
added -fno-devirtualize (-ftime-report showed that alias stmt walking
dropped from 82% to 75%).  This is mainly due to calls to
detect_type_change from compute_known_type_jump_func, there are 36454
of them and all are of course completely pointless because we do not
devirtualize in Fortran.

Looking into the code, it is apparent that I even attempted to avoid
such situations but somehow was not paying enough attention.  The
rather obvious patch below fixes that.  With it, the compile time at
-O3 drops to 13.5 without any additional options (~50 calls to
detect_type_change_ssa and detect_type change from other places remain
but those are not a big problem here, they are not so easy to get rid
of and I hope to eventually remove the type detection machinery from
IPA altogether so I'll keep those for later).

I'll bootstrap and test the patch and post it to the mailing list
soon.

Index: gcc/ipa-prop.c
===================================================================
--- gcc/ipa-prop.c      (revision 188931)
+++ gcc/ipa-prop.c      (working copy)
@@ -912,8 +912,8 @@ compute_known_type_jump_func (tree op, s
       || is_global_var (base))
     return;

-  if (detect_type_change (op, base, call, jfunc, offset)
-      || !TYPE_BINFO (TREE_TYPE (base)))
+  if (!TYPE_BINFO (TREE_TYPE (base))
+      || detect_type_change (op, base, call, jfunc, offset))
     return;

Comment 62 Michael Matz 2012-06-26 14:44:58 UTC

(In reply to comment #61)
> (In reply to comment #57)
> 
> Anyway, on the machine where are debugged this, compilation at -O3
> took over 16 seconds which dropped to about 13.5 seconds when I also

What?  Must be a future machine.  On everything I have access to the reduced
testcase (6309 lines) takes about 800 to 1000 seconds.  Do you build without
any checking?

In any case, the proposed patch does reduce the time to basically nothing for
the alias tree walker, so: thanks :)

Comment 63 Richard Biener 2012-06-26 14:58:31 UTC

(In reply to comment #61)
> (In reply to comment #57)
> > 
> > I will, on Monday.
> 
> And by Monday I obviously meant yesterday ;-)
> 
> Anyway, on the machine where are debugged this, compilation at -O3
> took over 16 seconds which dropped to about 13.5 seconds when I also
> added -fno-devirtualize (-ftime-report showed that alias stmt walking
> dropped from 82% to 75%).  This is mainly due to calls to
> detect_type_change from compute_known_type_jump_func, there are 36454
> of them and all are of course completely pointless because we do not
> devirtualize in Fortran.
> 
> Looking into the code, it is apparent that I even attempted to avoid
> such situations but somehow was not paying enough attention.  The
> rather obvious patch below fixes that.  With it, the compile time at
> -O3 drops to 13.5 without any additional options (~50 calls to
> detect_type_change_ssa and detect_type change from other places remain
> but those are not a big problem here, they are not so easy to get rid
> of and I hope to eventually remove the type detection machinery from
> IPA altogether so I'll keep those for later).
> 
> I'll bootstrap and test the patch and post it to the mailing list
> soon.
> 
> Index: gcc/ipa-prop.c
> ===================================================================
> --- gcc/ipa-prop.c      (revision 188931)
> +++ gcc/ipa-prop.c      (working copy)
> @@ -912,8 +912,8 @@ compute_known_type_jump_func (tree op, s
>        || is_global_var (base))
>      return;
> 
> -  if (detect_type_change (op, base, call, jfunc, offset)
> -      || !TYPE_BINFO (TREE_TYPE (base)))
> +  if (!TYPE_BINFO (TREE_TYPE (base))
> +      || detect_type_change (op, base, call, jfunc, offset))
>      return;

That change qualifies for a backport to all branches it applies to ...

Comment 64 Martin Jambor 2012-06-26 15:01:28 UTC

(In reply to comment #62)
> (In reply to comment #61)
> > (In reply to comment #57)
> > 
> > Anyway, on the machine where are debugged this, compilation at -O3
> > took over 16 seconds which dropped to about 13.5 seconds when I also
> 
> What?  Must be a future machine.  On everything I have access to the reduced
> testcase (6309 lines) takes about 800 to 1000 seconds.  Do you build without
> any checking?

Minutes! Of course I meant minutes, the drop is thus from ~1000
seconds to ~810 seconds.  I forgot I was using bash time instead of
/usr/bin/time -f%U which I was regularly using only a few days ago.

> 
> In any case, the proposed patch does reduce the time to basically nothing for
> the alias tree walker, so: thanks :)

I've experimentally disabled the walker in
is_parm_modified_before_stmt and am now waiting for results but I
guess it won't have any measurable impact.

Comment 65 Martin Jambor 2012-06-29 14:34:34 UTC

I have posted the patch to the mailing list:

http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01928.html

along with an equivalent one for the 4.6 branch:

http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01929.html

Comment 66 Martin Jambor 2012-07-02 15:28:17 UTC

Author: jamborm
Date: Mon Jul  2 15:28:11 2012
New Revision: 189163

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189163
Log:
2012-07-02  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/38474
	* ipa-prop.c (compute_known_type_jump_func): Put BINFO check before a
	dynamic type change check.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-prop.c

Comment 67 Martin Jambor 2012-07-02 15:44:01 UTC

Author: jamborm
Date: Mon Jul  2 15:43:56 2012
New Revision: 189164

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189164
Log:
2012-07-02  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/38474
	* ipa-prop.c (compute_known_type_jump_func): Put BINFO check before a
	dynamic type change check.


Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/ipa-prop.c

Comment 68 Martin Jambor 2012-07-02 15:53:29 UTC

Author: jamborm
Date: Mon Jul  2 15:53:21 2012
New Revision: 189165

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189165
Log:
2012-07-02  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/38474
	* ipa-prop.c (compute_known_type_jump_func): Check for a BINFO before
	checking for a dynamic type change.


Modified:
    branches/gcc-4_6-branch/gcc/ChangeLog
    branches/gcc-4_6-branch/gcc/ipa-prop.c

Comment 69 Steven Bosscher 2012-08-28 08:25:06 UTC

Is there still a problem here?

Comment 70 Joost VandeVondele 2012-08-28 11:28:06 UTC

(In reply to comment #69)
> Is there still a problem here?

for current trunk and the original testcase, timings are reasonable at -O0 -O1 -O2, but very long at -O3 (>60min):

report.O0.txt: TOTAL                 :  38.78             0.89            39.67             691166 kB
report.O1.txt: TOTAL                 :  70.04             1.13            71.22             634523 kB
report.O2.txt: TOTAL                 : 204.51             1.16           205.71             691522 kB

the biggest consumers are

-O0:  
====
integrated RA           :  10.36 
reload                  :   5.16;

-O1:
====
tree PTA                :   7.77
integrated RA           :  13.36

-O2:
====
expand vars             :  83.15
tree PTA                :  35.04

-O3: (also needs about 4Gb of memory)
====
??? not yet finished (>60min)

Comment 71 Joost VandeVondele 2012-08-28 14:54:54 UTC

The -O3 compile is 3h later still running and needs >20Gb of RAM. The issue seems now to be variable_tracking_main

#0  0x0000000000b7b8ce in dataflow_set_preserve_mem_locs(void**, void*) ()
#1  0x0000000000e76168 in htab_traverse_noresize ()
#2  0x0000000000b770e0 in dataflow_set_clear_at_call(dataflow_set_def*) ()
#3  0x0000000000b7c613 in vt_emit_notes() ()
#4  0x0000000000b847ea in variable_tracking_main() ()
#5  0x00000000008e8acf in execute_one_pass(opt_pass*) ()

Comment 72 Steven Bosscher 2012-08-28 15:06:56 UTC

Thus, another var-tracking issue.

Comment 73 Richard Biener 2013-03-07 10:31:53 UTC

On trunk with the reduced testcase I now see PTA taking 90% of compile-time ...

Argh.

Comment 74 Richard Biener 2013-03-07 14:55:27 UTC

(In reply to comment #73)
> On trunk with the reduced testcase I now see PTA taking 90% of compile-time ...
> 
> Argh.

I can speed it up by

@@ -1631,7 +1619,20 @@ do_sd_constraint (constraint_graph_t gra
            flag |= bitmap_set_bit (sol, escaped_id);
          else if (v->may_have_pointers
                   && add_graph_edge (graph, lhs, t))
-           flag |= bitmap_ior_into (sol, get_varinfo (t)->solution);
+           {
+             /* For transitive closures, x = *(x + UNKNOWN), delay
+                propagation of the solution across the added edges
+                by marking sources as changed.  */
+             if (lhs == c->rhs.var)
+               {
+                 bitmap_set_bit (changed, t);
+                 flag |= true;
+               }
+             /* Else speedup solving by doing that here to save
+                iterations.  */
+             else
+               flag |= bitmap_ior_into (sol, get_varinfo (t)->solution);
+           }
 
          /* If the variable is not exactly at the requested offset
             we have to include the next one.  */

as repeatedly walking all of 'sol' for 'sol = *(sol + UNKNOWN)' when
adding the solution of one of delta(sol)'s member is slow.  Using
a temporary bitmap to collect all changes doesn't speed it up though,
so the only effect of the above is that the forwarding is delayed until
the next solver iteration (where eventually we discover and eliminate
more indirect cycles).

Comment 75 Richard Biener 2013-12-06 13:43:34 UTC

On trunk with the reduced testcase and -O2 (no -g):

 ipa inlining heuristics :   9.85 ( 5%) usr   0.00 ( 0%) sys   9.93 ( 5%) wall    1448 kB ( 0%) ggc
 tree PTA                : 161.26 (78%) usr   0.30 (45%) sys 162.00 (78%) wall   42484 kB ( 8%) ggc
 expand vars             :   3.06 ( 1%) usr   0.01 ( 2%) sys   3.06 ( 1%) wall   16074 kB ( 3%) ggc
 integrated RA           :   4.46 ( 2%) usr   0.11 (17%) sys   4.56 ( 2%) wall   87144 kB (17%) ggc
 TOTAL                 : 205.49             0.66           206.80             513245 kB

-O3 is in the same ballpark.  I'll look at this again.

Comment 76 Richard Biener 2013-12-06 14:19:58 UTC

There are a lot of calls with fnspec, almost all constraints look like

D.12770.0+16 = allalltmp
D.12770.64+128 = allalltmp
D.12770.192+64 = allalltmp
callarg = &READONLY
callarg = *callarg
callarg = callarg + UNKNOWN
CALLUSED = callarg
ESCAPED = &NONLOCAL
allalltmp = CALLUSED
allalltmp = NONLOCAL
D.12771.0+16 = allalltmp
D.12771.64+128 = allalltmp
D.12771.192+64 = allalltmp
callarg = &D.12770.0+16
callarg = *callarg
callarg = callarg + UNKNOWN

They get unified pretty quickly though.

Still we end up with many very large sets that include ESCAPED but also some
members of ESCAPED explicitely (that's redundant).  I have some idea on how
to mitigate this which eventually should speed things up (or at least reduce
memory usage).

Like

Index: tree-ssa-structalias.c
===================================================================
--- tree-ssa-structalias.c      (revision 205739)
+++ tree-ssa-structalias.c      (working copy)
@@ -1600,6 +1600,14 @@ do_sd_constraint (constraint_graph_t gra
       goto done;
     }

+  /* If the solution of Y contains escaped then filter all bits from
+     that from the delta to reduce work.  */
+  if (bitmap_bit_p (delta, escaped_id))
+    {
+      bitmap_and_compl_into (delta, get_varinfo (find (escaped_id))->solution);
+      flag |= bitmap_set_bit (sol, escaped_id);
+    }
+  
   /* If we do not know at with offset the rhs is dereferenced compute
      the reachability set of DELTA, conservatively assuming it is
      dereferenced at all valid offsets.  */

will check next week.

Comment 77 Richard Biener 2013-12-09 15:13:10 UTC

Author: rguenth
Date: Mon Dec  9 15:13:07 2013
New Revision: 205808

URL: http://gcc.gnu.org/viewcvs?rev=205808&root=gcc&view=rev
Log:
2013-12-09  Richard Biener  <rguenther@suse.de>

	PR middle-end/38474
	* tree-ssa-structalias.c (set_union_with_increment): Remove
	unreachable code.
	(do_complex_constraint): Call set_union_with_increment with
	the solution delta, not the full solution.
	(make_transitive_closure_constraints): Merge the two
	constraints.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c

Comment 78 Richard Biener 2013-12-10 12:31:42 UTC

Author: rguenth
Date: Tue Dec 10 12:31:39 2013
New Revision: 205857

URL: http://gcc.gnu.org/viewcvs?rev=205857&root=gcc&view=rev
Log:
2013-12-10  Richard Biener  <rguenther@suse.de>

	PR middle-end/38474
	* tree-ssa-structalias.c (solution_set_expand): Expand into
	a different possibly cached bitmap and return the result.
	(set_union_with_increment): Pass in a shared expanded bitmap
	and adjust.
	(do_sd_constraint): Likewise.
	(do_ds_constraint): Likewise.
	(do_complex_constraint): Likewise.
	(solve_graph): Manage the shared expanded bitmap.

	* gcc.dg/ipa/ipa-pta-14.c: Un-XFAIL.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-pta-14.c
    trunk/gcc/tree-ssa-structalias.c

Comment 79 Richard Biener 2018-11-06 12:52:12 UTC

On the GCC 8 branch with -g -O0 (x86_64-linux) I get

 TOTAL                              :   9.59          0.49         10.09         551543 kB
10.53user 0.55system 0:11.08elapsed 99%CPU (0avgtext+0avgdata 1230908maxresident)k
0inputs+68384outputs (0major+379474minor)pagefaults 0swaps

On trunk with mem-stats the biggest offender is (as reported)

df_scan mw_reg                  alloc-pool.h:487 (df_scan_alloc)                  1136         0 :  0.0%      483M       38M: 92.4%          32

going up to -Ofast makes compile-time explode again.  For the reduced testcase
it's still inlining and PTA:

 ipa inlining heuristics            :  41.55 ( 36%)   0.01 (  1%)  41.60 ( 36%)    6333 kB (  2%)
 alias stmt walking                 :  14.73 ( 13%)   0.14 ( 18%)  14.98 ( 13%)     164 kB (  0%)
 tree PTA                           :  37.93 ( 33%)   0.23 ( 30%)  38.16 ( 33%)   31921 kB (  8%)
 TOTAL                              : 115.67          0.77        116.52         411925 kB
115.83user 0.81system 1:56.71elapsed 99%CPU (0avgtext+0avgdata 1034364maxresident)k
1624inputs+12624outputs (1major+308881minor)pagefaults 0swaps

trunk seems to behave similar:

 ipa inlining heuristics            :  53.74 ( 41%)   0.02 (  2%)  53.75 ( 40%)    5428 kB (  1%)
 alias stmt walking                 :  14.98 ( 11%)   0.19 ( 23%)  15.16 ( 11%)     165 kB (  0%)
 tree PTA                           :  39.70 ( 30%)   0.24 ( 29%)  39.92 ( 30%)   31896 kB (  8%)
 TOTAL                              : 132.01          0.83        132.85         407617 kB
132.01user 0.86system 2:12.88elapsed 99%CPU (0avgtext+0avgdata 1034096maxresident)k
0inputs+8224outputs (0major+301553minor)pagefaults 0swaps

flat perf profile:

Samples: 510K of event 'instructions:p', Event count (approx.): 715615147320    
Overhead       Samples  Command  Shared Object     Symbol                       
   8.08%         95243  f951     f951              [.] bitmap_ior_into
   7.21%         25966  f951     f951              [.] sreal::operator*
   5.43%         19353  f951     f951              [.] hash_table<hash_map<int_h
   5.20%         23167  f951     f951              [.] get_ref_base_and_extent
   4.93%         17947  f951     f951              [.] profile_count::to_sreal_s
   4.37%         15865  f951     f951              [.] sreal::operator/
   3.45%         30532  f951     f951              [.] bitmap_set_bit
   3.41%         12159  f951     f951              [.] hash_table<hash_map<int_h
   3.08%         11034  f951     f951              [.] default_binds_local_p_3
   3.08%         11146  f951     f951              [.] hash_table<hash_map<int_h
   2.21%          7877  f951     f951              [.] want_inline_small_functio
   1.93%          6874  f951     f951              [.] edge_badness
   1.87%          6675  f951     f951              [.] compute_inlined_call_time

the ipa_fn_summary hash and edge_growth_cache / call_summary hashes are
oddly on top of the profile...

Comment 80 Jan Hubicka 2018-11-06 12:58:56 UTC

> 
> flat perf profile:
> 
> Samples: 510K of event 'instructions:p', Event count (approx.): 715615147320    
> Overhead       Samples  Command  Shared Object     Symbol                       
>    8.08%         95243  f951     f951              [.] bitmap_ior_into
>    7.21%         25966  f951     f951              [.] sreal::operator*
>    5.43%         19353  f951     f951              [.]
> hash_table<hash_map<int_h
>    5.20%         23167  f951     f951              [.] get_ref_base_and_extent
>    4.93%         17947  f951     f951              [.]
> profile_count::to_sreal_s
>    4.37%         15865  f951     f951              [.] sreal::operator/
>    3.45%         30532  f951     f951              [.] bitmap_set_bit
>    3.41%         12159  f951     f951              [.]
> hash_table<hash_map<int_h
>    3.08%         11034  f951     f951              [.] default_binds_local_p_3
>    3.08%         11146  f951     f951              [.]
> hash_table<hash_map<int_h
>    2.21%          7877  f951     f951              [.]
> want_inline_small_functio
>    1.93%          6874  f951     f951              [.] edge_badness
>    1.87%          6675  f951     f951              [.]
> compute_inlined_call_time
> 
> the ipa_fn_summary hash and edge_growth_cache / call_summary hashes are
> oddly on top of the profile...

Yep, this is because they used to be arrays indexed by symbol UIDs which
Martin converted to hash tables.  Inliner happily calls summary_get each
time it needs the summary.  I have some patches to speed this up which I
will push out after the type changes (while they add bit of extra
functionality by teaching ipa-predicates abou value range I hope they
are OK for early stage3).

Comment 81 rguenther@suse.de 2018-11-06 13:14:34 UTC

On Tue, 6 Nov 2018, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474
> 
> --- Comment #80 from Jan Hubicka <hubicka at ucw dot cz> ---
> > 
> > flat perf profile:
> > 
> > Samples: 510K of event 'instructions:p', Event count (approx.): 715615147320    
> > Overhead       Samples  Command  Shared Object     Symbol                       
> >    8.08%         95243  f951     f951              [.] bitmap_ior_into
> >    7.21%         25966  f951     f951              [.] sreal::operator*
> >    5.43%         19353  f951     f951              [.]
> > hash_table<hash_map<int_h
> >    5.20%         23167  f951     f951              [.] get_ref_base_and_extent
> >    4.93%         17947  f951     f951              [.]
> > profile_count::to_sreal_s
> >    4.37%         15865  f951     f951              [.] sreal::operator/
> >    3.45%         30532  f951     f951              [.] bitmap_set_bit
> >    3.41%         12159  f951     f951              [.]
> > hash_table<hash_map<int_h
> >    3.08%         11034  f951     f951              [.] default_binds_local_p_3
> >    3.08%         11146  f951     f951              [.]
> > hash_table<hash_map<int_h
> >    2.21%          7877  f951     f951              [.]
> > want_inline_small_functio
> >    1.93%          6874  f951     f951              [.] edge_badness
> >    1.87%          6675  f951     f951              [.]
> > compute_inlined_call_time
> > 
> > the ipa_fn_summary hash and edge_growth_cache / call_summary hashes are
> > oddly on top of the profile...
> 
> Yep, this is because they used to be arrays indexed by symbol UIDs which
> Martin converted to hash tables.  Inliner happily calls summary_get each
> time it needs the summary.  I have some patches to speed this up which I
> will push out after the type changes (while they add bit of extra
> functionality by teaching ipa-predicates abou value range I hope they
> are OK for early stage3).

Awww.  I guess with no longer re-using UIDs we then get bad hashing
behavior as well :/  I hope the hash function is _not_ simply the UID?

Comment 82 Jan Hubicka 2018-11-06 13:23:43 UTC

> > Yep, this is because they used to be arrays indexed by symbol UIDs which
> > Martin converted to hash tables.  Inliner happily calls summary_get each
> > time it needs the summary.  I have some patches to speed this up which I
> > will push out after the type changes (while they add bit of extra
> > functionality by teaching ipa-predicates abou value range I hope they
> > are OK for early stage3).
> 
> Awww.  I guess with no longer re-using UIDs we then get bad hashing
> behavior as well :/  I hope the hash function is _not_ simply the UID?

Yep, UID reuse was there precisely to make the arrays reasonably
compact.  hash is int_hash which IMO returns simply uid.

Comment 83 Richard Biener 2021-02-10 14:52:05 UTC

Meh.  On trunk (GCC 11) we now have for the reduced testcase

> ./f951 -quiet testcase_reduced.f90 -ffree-line-length-512 -ftime-report -O3

Time variable                                   usr           sys          wall           GGC
...
 callgraph ipa passes               :  28.09 (  8%)   0.23 ( 38%)  28.33 (  8%)    68M ( 12%)
 ipa inlining heuristics            :   5.13 (  1%)   0.01 (  2%)   5.13 (  1%)    14M (  3%)
 alias stmt walking                 :   7.03 (  2%)   0.09 ( 15%)   7.15 (  2%)   277k (  0%)
 tree PTA                           :  26.20 (  7%)   0.17 ( 28%)  26.39 (  7%)    25M (  5%)
 store merging                      : 308.60 ( 84%)   0.01 (  2%) 308.70 ( 84%)  3858k (  1%)
 TOTAL                              : 365.68          0.61        366.42          557M

so store-merging goes bollocks.  I will try to dig into it a bit but I'm not very familiar with the code.  GCC 10 behaves similar here but not as bad:

 store merging                      : 232.10 ( 82%)   0.02 (  4%) 232.19 ( 82%)    3837 kB (  1%)
 TOTAL                              : 283.51          0.45        284.05         582957 kB

while GCC 9 is sane:

 store merging                      :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)    2700 kB (  1%)
 TOTAL                              :  88.59          0.70         89.34         521364 kB

Comment 84 Richard Biener 2021-02-10 15:03:58 UTC

So it's the usual (quadratic) culprit:

Samples: 1M of event 'cycles:u', Event count (approx.): 1675893461671                                                                
Overhead       Samples  Command  Shared Object     Symbol                                                                            
  20.61%        316521  f951     f951              [.] get_ref_base_and_extent
  14.42%        221221  f951     f951              [.] (anonymous namespace)::pass_store_merging::terminate_all_aliasing_chains
   5.77%         88586  f951     f951              [.] special_function_p

I'll see whether I can do some surgery.

Comment 85 Richard Biener 2021-02-10 15:46:25 UTC

Starting new chain with statement:
D.31414 ={v} {CLOBBER};
The base object is:
&D.31414
Starting new chain with statement:
D.31415 ={v} {CLOBBER};
The base object is:
&D.31415
...

but those are all the last use of the base object so they just add up and
are never invalidated, but lengthening the m_stores_head list and thus
making terminate_all_aliasing_chains more expensive.

Jakub, were the clobbers ever supposed to _start_ a chain?

With

diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
index f0f4a068de5..fa9a092d544 100644
--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -5175,6 +5175,9 @@ pass_store_merging::process_store (gimple *stmt)
 
   /* Store aliases any existing chain?  */
   ret |= terminate_all_aliasing_chains (NULL, stmt);
+  /* Do not start a new chain from a CLOBBER.  */
+  if (gimple_clobber_p (stmt))
+    return ret;
   /* Start a new chain.  */
   class imm_store_chain_info *new_chain
     = new imm_store_chain_info (m_stores_head, base_addr);

compile-time gets down to

 store merging                      :   1.18 (  2%)   0.00 (  0%)   1.18 (  2%)  3858k (  1%)
 TOTAL                              :  59.84          0.57         60.43          557M

I'm checking if it has any testsuite fallout.

Comment 86 Richard Biener 2021-02-10 15:47:39 UTC

OK, so clobber handling was added as a fix for PR92038

Comment 87 Jakub Jelinek 2021-02-10 15:51:52 UTC

At least for PR92038 it is important to see CLOBBERs in the chain, including the first position in there.

Comment 88 rguenther@suse.de 2021-02-10 15:55:10 UTC

On Wed, 10 Feb 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38474
> 
> --- Comment #87 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> At least for PR92038 it is important to see CLOBBERs in the chain, including
> the first position in there.

Hmm, OK.  I'll look closer tomorrow but can you try to explain why
it's ever important at the first position?

Comment 89 Richard Biener 2021-02-10 16:02:40 UTC

Fallout includes

FAIL: g++.dg/opt/store-merging-1.C  scan-tree-dump store-merging "New sequence of [12] stores to replace old one of 2 stores"

which shows

Starting new chain with statement:
s ={v} {CLOBBER};
The base object is:
&s
Recording immediate store from stmt:
s.a = 0;
Recording immediate store from stmt:
s.b = 0;
stmt causes chain termination:
foo (s);

and the CLOBBER allows us to use zeros for padding:

Store 0:
bitsize:64 bitpos:0 val:{CLOBBER}
Store 1:
bitsize:32 bitpos:0 val:0
Store 2:
bitsize:8 bitpos:32 val:0
After writing {CLOBBER} of size 64 at position 0
  the merged value contains 00 00 00 00 00 00 00 00
  the merged mask contains  00 00 00 00 00 00 00 00
After writing 0 of size 32 at position 0
  the merged value contains 00 00 00 00 00 00 00 00
  the merged mask contains  00 00 00 00 00 00 00 00
After writing 0 of size 8 at position 32
  the merged value contains 00 00 00 00 00 00 00 00
  the merged mask contains  00 00 00 00 00 00 00 00
Coalescing successful!
Merged into 1 stores
New sequence of 1 stores to replace old one of 2 stores
# .MEM_6 = VDEF <.MEM_5>
MEM <unsigned long> [(void *)&s] = 0;
Merging successful!

Comment 90 Jakub Jelinek 2021-02-10 16:06:07 UTC

Because it says that the whole range is uninitialized, so the store merging code doesn't need to care about pre-existing content in any gaps between the stored values.  So say when the whole var is clobbered and then the code stores to every second bitfield, we don't need to read the old content, mask it, or with the stored bits and store that, but can just put some suitable value into the gaps (0 or all ones or whatever is best).
For quadratic behavior, I wonder if we just shouldn't see how many chains are we tracking currently and if we have too many (some param), terminate all of them.

Comment 91 Richard Biener 2021-02-10 16:06:19 UTC

So the other simple idea I have is to limit the number of active store groups
and force-terminate in either a LRU or FIFO manner.

For the testcase at hand the decls we start the chain for are all only
used in full but knowing that would require some pre-analysis of the IL
similar to what SRA does for example (collecting all accesses).  It's
then also still easy to "break" such heuristic so limiting is in the end
the only (and I guess best) option.

Comment 92 Richard Biener 2021-02-10 16:28:48 UTC

Simple and stupid like the below works and does

 store merging                      :   0.42 (  1%)   0.00 (  0%)   0.43 (  1%)  3858k (  1%)
 TOTAL                              :  56.86          0.56         57.45          557M

we have a limit of 64 stores in a single chain, so the product of both limits limit the number of alias queries done in terminate_all_aliasing_chains.  I'll
polish it up tomorrow (and will refrain from trying to avoid the linear
walk here and keeping a counter or even a pointer to the last element).

diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
index f0f4a068de5..c6ec6b2cbce 100644
--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -5175,6 +5175,19 @@ pass_store_merging::process_store (gimple *stmt)
 
   /* Store aliases any existing chain?  */
   ret |= terminate_all_aliasing_chains (NULL, stmt);
+  unsigned cnt = 0;
+  imm_store_chain_info **e = &m_stores_head;
+  while (*e)
+    if (++cnt > 16)
+      {
+       if (dump_file && (dump_flags & TDF_DETAILS))
+         fprintf (dump_file, "Too many chains, terminating oldest before "
+                  "starting a new chain.\n");
+       terminate_and_process_chain (*e);
+      }
+    else
+      e = &(*e)->next;
+
   /* Start a new chain.  */
   class imm_store_chain_info *new_chain
     = new imm_store_chain_info (m_stores_head, base_addr);

Comment 93 Jakub Jelinek 2021-02-10 16:49:08 UTC

I think I'd go for more chains by default, at least 64 or even 256, with a param and tracking on how many we have in a counter.  The class has a ctor/dtor, so the increment/decrement of the counter can be done there.
And I think it is doubly-linked, so tail should be prev on the head.

Comment 94 Richard Biener 2021-02-11 09:32:21 UTC

Created attachment 50165 [details]
patch experiment

(In reply to Jakub Jelinek from comment #93)
> I think I'd go for more chains by default, at least 64 or even 256, with a
> param and tracking on how many we have in a counter.  The class has a
> ctor/dtor, so the increment/decrement of the counter can be done there.
> And I think it is doubly-linked, so tail should be prev on the head.

So the list is not circular but adding a counter is easy and avoids the
linked list walk in the usual cases when we do not hit the limit.

Note that the number of alias queries we do is
(param_max_store_chains_to_track * param_max_stores_to_merge) squared
with a default of 64 that's already 64 ^ 4.  So a less restrictive option
might be to limit the product of both numbers.

Now, I was thinking that we eventually can get rid of most alias queries
by instead of ambiguating against all stores in each chain only
ambiguate against the whole base of each chain, effectively making it
param_max_store_chains_to_track squared then.  That might lose some odd
cases.  The profile also shows that the caching of ao_refs (and thus
get_ref_base_and_extent calls) is imperfect - for the store chains
we could record ao_refs for example (it effectively already records
all necessary info anyway).

A simple patch caching the ao_ref in store_immediate_info improves compile
time to

 store merging                      : 265.07 ( 83%)   0.00 (  0%) 265.14 ( 82%)  3858k (  1%)
 TOTAL                              : 321.25          0.53        321.88          557M

shaving off some 50s and thus possibly worth the extra memory cost here.

Btw, I've done some statistics and for this testcase we indeed mostly
have chains with a single store (the clobber) and thus the idea of
reducing the number of alias queries by globbing all stores of a chain
into a single effective ao_ref wouldn't help too much
(but it would save us from caching the ao_ref in each store to a single
ao_ref per chain):

  36422 Terminating chain with 1 stores
  10889 Terminating chain with 3 stores

and the largest number of active chains is 32661.

The idea of limiting the number of overall tracked stores instead of the
number of chains would be similarly simple and put a limit on the overall
alias queries (but then we have the limit on the number of stores in a
single chain, possibly because of non-linearities in chain processing).

Still limiting the number of tracked chain is more obvious in behavior
and thus IMHO superior for users (even if it means we have to put a
more conservative default on that).

Statistics on the param defaults:

 64    1.31 (  2%) 
128    2.64 (  4%)
256    4.86 (  8%)

but given the above statistics the testcase isn't a good example to tune
these with just either one or three stores on each chain.  Enabling
the ao_ref caching only marginally improves the testcase with the
limiting in place:

256    4.28 (  7%)

so any comments?

Comment 95 GCC Commits 2021-02-12 08:57:43 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:95d94b52ea8478334fb92cca545f0bd904bd0034

commit r11-7205-g95d94b52ea8478334fb92cca545f0bd904bd0034
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Feb 11 11:13:47 2021 +0100

    tree-optimization/38474 - fix store-merging compile-time regression
    
    The following puts a limit on the number of alias tests we do in
    terminate_all_aliasing_chains which is quadratic in the number of
    overall stores currentrly tracked.  There is already a limit in
    place on the maximum number of stores in a single chain so the
    following adds a limit on the number of chains tracked.  The
    worst number of overall stores tracked from the defaults (64 and 64)
    is then 4096 which when imposed as the sole limit for the testcase
    still causes
    
     store merging                      :  71.65 ( 56%)
    
    because the testcase is somewhat degenerate with most chains
    consisting only of a single store (and 25% of exactly three stores).
    The single stores are all CLOBBERs at the point variables go out of
    scope.  Note unpatched we have
    
     store merging                      : 308.60 ( 84%)
    
    Limiting the number of chains to 64 brings this down to
    
     store merging                      :   1.52 (  3%)
    
    which is more reasonable.  There are ideas on how to make
    terminate_all_aliasing_chains cheaper but for this degenerate case
    they would not have any effect so I'll defer for GCC 12 for those.
    
    I'm not sure we want to have both --params, just keeping the
    more to-the-point max-stores-to-track works but makes the
    degenerate case above slower.
    I made the current default 1024 which for the testcasse
    (without limiting chains) results in 25% compile time and 20s
    putting it in the same ballpart as the next offender (which is PTA).
    
    This is a regression on trunk and the GCC 10 branch btw.
    
    2021-02-11  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/38474
            * params.opt (-param=max-store-chains-to-track=): New param.
            (-param=max-stores-to-track=): Likewise.
            * doc/invoke.texi (max-store-chains-to-track): Document.
            (max-stores-to-track): Likewise.
            * gimple-ssa-store-merging.c (pass_store_merging::m_n_chains):
            New.
            (pass_store_merging::m_n_stores): Likewise.
            (pass_store_merging::terminate_and_process_chain): Update
            m_n_stores and m_n_chains.
            (pass_store_merging::process_store): Likewise.   Terminate
            oldest chains if the number of stores or chains get too large.
            (imm_store_chain_info::terminate_and_process_chain): Dump
            chain length.

Comment 96 Richard Biener 2021-02-12 10:29:12 UTC

The full testcase on trunk (g:95d94b52ea8478334fb92cca545f0bd904bd0034) at -O0 -g
now takes 9s to compile and uses 1GB ram.

With -O1 -g we have

Time variable                                   usr           sys          wall           GGC
 callgraph functions expansion      :  13.41 ( 12%)   0.21 ( 60%)  13.63 ( 12%)   439M ( 73%)
 callgraph ipa passes               :  94.79 ( 86%)   0.13 ( 37%)  94.95 ( 86%)    75M ( 13%)
 ipa function summary               :  91.46 ( 83%)   0.02 (  6%)  91.53 ( 83%)    17M (  3%)
 tree PTA                           :   5.78 (  5%)   0.05 ( 14%)   5.85 (  5%)    23M (  4%)
 TOTAL                              : 109.96          0.35        110.37          597M
109.97user 0.37system 1:50.38elapsed 99%CPU (0avgtext+0avgdata 1110568maxresident)k
0inputs+0outputs (0major+350549minor)pagefaults 0swaps

where perf shows

Samples: 448K of event 'cycles:u', Event count (approx.): 483237005145                                                     
Overhead       Samples  Command   Shared Object     Symbol                                                                 
  17.26%         77187  f951      f951              [.] get_ref_base_and_extent                                           #
   8.36%         37385  f951      f951              [.] stmt_may_clobber_ref_p_1                                          #
   7.16%         32045  f951      f951              [.] default_binds_local_p_3                                           #
   6.40%         28628  f951      f951              [.] bitmap_bit_p                                                      #
   6.39%         28557  f951      f951              [.] determine_known_aggregate_parts                                   #
   5.92%         26464  f951      f951              [.] pt_solution_includes_1                                            #
   4.66%         20834  f951      f951              [.] call_may_clobber_ref_p_1                                          #
   3.44%         15406  f951      f951              [.] flags_from_decl_or_type                                           #
   3.35%         14971  f951      f951              [.] refs_may_alias_p_1                                                #
   3.05%         13667  f951      f951              [.] gimple_call_flags                                                 #
   2.55%         11387  f951      f951              [.] cgraph_node::get_availability                                     #
   2.40%         10739  f951      libc-2.26.so      [.] __strncmp_sse42                                                   #
   2.32%         10372  f951      f951              [.] check_fnspec                                                      #
   1.89%          8411  f951      f951              [.] bitmap_set_bit                                                    #
   1.71%          7635  f951      f951              [.] private_lookup_attribute                                          #
   1.68%          7512  f951      f951              [.] get_modref_function_summary                                       #
   1.52%          6805  f951      f951              [.] decl_binds_to_current_def_p                                       #
   1.46%          6512  f951      f951              [.] gimple_call_fnspec                                                #
   1.26%          5582  f951      f951              [.] bitmap_clear_bit                                                  #
   0.94%          4212  f951      f951              [.] cgraph_node::function_or_virtual_thunk_symbol       

we need to do sth about the IPA fnsummary cost, it looks unreasonable compared
to all the rest, at least for -O1.  Cutting down --param ipa-max-aa-steps
doesn't seem to help but it looks accounting is simply broken.

And with -O2 or -O3 we have

Time variable                                   usr           sys          wall           GGC
 callgraph functions expansion      : 201.23 ( 20%)   0.77 ( 46%) 202.05 ( 20%)  1230M ( 82%)
 callgraph ipa passes               : 807.58 ( 80%)   0.86 ( 52%) 808.75 ( 80%)   201M ( 13%)
 ipa inlining heuristics            :  40.25 (  4%)   0.01 (  1%)  40.24 (  4%)    41M (  3%)
 alias stmt walking                 :  21.48 (  2%)   0.20 ( 12%)  21.72 (  2%)   601k (  0%)
 tree PTA                           : 788.36 ( 78%)   0.76 ( 46%) 789.43 ( 78%)   101M (  7%)
 tree slp vectorization             :  13.97 (  1%)   0.04 (  2%)  14.01 (  1%)   225M ( 15%)
 expand vars                        :  92.66 (  9%)   0.00 (  0%)  92.72 (  9%)    63M (  4%)
 TOTAL                              :1010.42          1.66       1012.46         1509M
1010.42user 1.73system 16:52.53elapsed 99%CPU (0avgtext+0avgdata 4764428maxresident)k
0inputs+0outputs (0major+1199966minor)pagefaults 0swaps

surprisingly the IPA fnsummary issue is -O1 only but maybe it's an accounting
issue.  perf with callgraph points to (if I interpret correctly) the
determine_known_aggregate_parts function which, while accounting alias
queries done via get_continuation_for_phi, does not account those done
by walking the VDEF chain itself.  I'm testing a fix.

Comment 97 Richard Biener 2021-02-12 10:35:43 UTC

So fixing that makes GCC 11 compile the full testcase at -O1 -g in 18 seconds
using about 1GB of memory.

That leaves PTA at -O2+ as the biggest offender (it also shows up with the reduced testcase).

Comment 98 GCC Commits 2021-02-12 11:42:49 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:6cc886bf4279461b8931c4ca544185a85cd69f26

commit r11-7208-g6cc886bf4279461b8931c4ca544185a85cd69f26
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 12 11:13:36 2021 +0100

    middle-end/38474 - fix alias walk budget accounting in IPA analysis
    
    The walk_aliased_vdef calls do not update the walking budget until
    it is hit by a single call (and then in one case it resumes with
    no limit at all).  The following rectifies this in multiple places.
    It also makes the updates more consistend and fixes
    determine_known_aggregate_parts to account its own alias queries.
    
    2021-02-12  Richard Biener  <rguenther@suse.de>
    
            PR middle-end/38474
            * ipa-fnsummary.c (unmodified_parm_1): Only walk when
            fbi->aa_walk_budget is bigger than zero.  Update
            fbi->aa_walk_budget.
            (param_change_prob): Likewise.
            * ipa-prop.c (detect_type_change_from_memory_writes):
            Properly account walk_aliased_vdefs.
            (parm_preserved_before_stmt_p): Canonicalize updates.
            (parm_ref_data_preserved_p): Likewise.
            (parm_ref_data_pass_through_p): Likewise.
            (determine_known_aggregate_parts): Account own alias queries.

Comment 99 Richard Biener 2021-02-12 14:40:50 UTC

Just a short brain-dump for the PTA issue:

--param max-fields-for-field-sensitive=1 helps, so some magic limit and auto-degrading might be a good idea.

Solver stats are not so bad:

 Total vars:               21507
Non-pointer vars:          16
Statically unified vars:  6800
Dynamically unified vars: 0
Iterations:               4
Number of edges:          43380
Number of implicit edges: 23056

but varmap "compression" happens before unifying those 6800 vars which
means bitmaps are less dense than possible.  That there's nothing
dynamically unified also says that likely iteration order is sub-optimal.
We don't have entries of the forward graph and so likely do multile
DFSs starting from somewhere inside for example.  Given we add both
succ and pred edges during solving itself makes itegrating the DFS
into the iteration itself look attractive eventually.

More stats are needed to judge iteration order tweaks.

We have IL like

  D.335748 = __result_mpfr_division_mp_mp;
  __result_mpfr_division_mp_mp ={v} {CLOBBER};
  D.76250 = D.335748;
  D.335748 ={v} {CLOBBER};
...
  mpfr_add (&__result_mpfr_addition_mp_mp, &D.76250, &D.76256, 0);

that just generates a lot of initial constraints and variables.  D.335748
becomes live here, so does D.76250.  This happens early also, like with

__attribute__((fn spec (". r r ")))
struct mpfr_type mpfr_division_mp_mp (struct mpfr_type & restrict a1, struct mpfr_type & restrict a2)
{
  struct mpfr_type __result_mpfr_division_mp_mp;
  integer(kind=4) retval;

  <bb 2> :
  mpfr_init (&__result_mpfr_division_mp_mp);
  retval_6 = mpfr_div (&__result_mpfr_division_mp_mp, a1_3(D), a2_4(D), 0);
  <retval> = __result_mpfr_division_mp_mp;
  __result_mpfr_division_mp_mp ={v} {CLOBBER};
  return <retval>;

having some early pass after inlining clean up the result would be nice
[simply renaming and eliding 1:1 copies here].  It takes until SRA to fix this
and the PTA pass after it (run as part of PRE) is fast:

 tree PTA                           :  20.26 (  7%)

another thing to notice would be not splitting vars that just occur
"bare" in the IL but that would require some pre-scanning of the IL
to note interesting (and uninteresting) variables.  That's something
we might need anyway though for improved allocated array handling for
example.

Comment 100 GCC Commits 2021-02-16 12:38:15 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:3f16a1678156035bbe73b217fbce4d9c27d1d559

commit r11-7254-g3f16a1678156035bbe73b217fbce4d9c27d1d559
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Feb 16 12:42:26 2021 +0100

    tree-optimization/38474 - improve PTA varinfo sorting
    
    This improves a previous heuristic to sort address-taken variables
    first (because those appear in points-to bitmaps) by tracking which
    variables appear in ADDRESSOF constraints (there's also
    graph->address_taken but that's computed only later).
    
    This shaves off 30s worth of compile-time for the full testcase in
    PR38474 (which then still takes 965s to compile at -O2).
    
    2021-02-16  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/38474
            * tree-ssa-structalias.c (variable_info::address_taken): New.
            (new_var_info): Initialize address_taken.
            (process_constraint): Set address_taken.
            (solve_constraints): Use the new address_taken flag rather
            than is_reg_var for sorting variables.
            (dump_constraint): Dump the variable number if the name
            is just NULL.

Comment 101 GCC Commits 2021-04-29 08:04:10 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:c57a8aea0c3ab8394f7dbfa417ee27b4613f63b7

commit r12-280-gc57a8aea0c3ab8394f7dbfa417ee27b4613f63b7
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Apr 29 08:32:00 2021 +0200

    middle-end/38474 - speedup PTA constraint solving
    
    In testcases like PR38474 and PR99912 we're seeing very slow
    PTA solving.  One can observe an excessive amount of forwarding,
    mostly during sd constraint solving.  The way we solve the graph
    does not avoid forwarding the same bits through multiple paths,
    and especially when such alternate path involves ESCAPED as
    intermediate this causes the ESCAPED solution to be expanded
    in receivers.
    
    The following adds heuristic to add_graph_edge which adds
    forwarding edges but also guards the initial solution forwarding
    (which is the expensive part) to detect the case of ESCAPED
    receiving the same set and the destination already containing
    ESCAPED.
    
    This speeds up the PTA solving process by more than 50%.
    
    2021-04-29  Richard Biener  <rguenther@suse.de>
    
            PR middle-end/38474
            * tree-ssa-structalias.c (add_graph_edge): Avoid direct
            forwarding when indirect forwarding through ESCAPED
            alread happens.