Bug 34683 - SSA rewriting in the loop unroller causes quadratic behavior
Summary: SSA rewriting in the loop unroller causes quadratic behavior
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.3.0
Priority: P3  Severity: normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog, memory-hog
Depends on: 33870 33974 34450
Blocks:
 
Reported: 2008-01-05 23:34 UTC by Janus Weil
Modified: 2008-01-12 03:59 UTC
4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2008-01-09 09:56:00


Attachments
test case (2.17 KB, text/plain)
2008-01-05 23:36 UTC, Janus Weil
Details
hack to disable rewriting of VOPs (1.34 KB, text/plain)
2008-01-10 14:36 UTC, Richard Biener
Details

Description Janus Weil 2008-01-05 23:34:09 UTC
The attached Fortran file takes unusually long to compile with full optimization on the latest GCC trunk builds. I tried many of the optimization flags and found that -fstrict-aliasing seems to trigger the long compile time.

Without this flag the compile time is perfectly reasonable, as you can see from the following numbers (using trunk revision 131325 on i686-pc-linux-gnu, Pentium M 1.7 GHz with 1 GB of memory):

-O0: 0.9s
-O1: 4.5s
-O2 -fno-strict-aliasing:  8.0s
-O3 -fno-strict-aliasing: 10.4s

But once I use -fstrict-aliasing, compile-time explodes:

-O1 -fstrict-aliasing: 1m10s
-O2: 2m16s
-O3: 1m43s

(-fstrict-aliasing is included in -O2 and -O3, but not in -O1)
I don't know why -O3 is faster than -O2 here (which seems funny), but anyway all three numbers are way too large in my opinion. I mean: compile-time rises by a factor of roughly 15, just due to this one flag! I don't even know what the flag is supposed to do for this code. Maybe someone can explain this to me?

I didn't examine memory consumption very carefully, but this also seems unusually high, as a similar file failed to compile on an old machine (with only 256MB) due to lack of memory(!).

Am I right to assume that something goes wrong here somehow, or can someone tell me why I should expect this kind of behaviour for this code?
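[Editor's note: as context for the reporter's question, here is a minimal C sketch — not taken from this report — of the assumption that -fstrict-aliasing lets the compiler make: pointers to incompatible types are presumed not to alias, which both enables optimization and changes how much alias bookkeeping the middle end does.]

```c
#include <assert.h>

/* Under -fstrict-aliasing the compiler may assume that a long and a
   float never overlap, so after the store through f it can fold the
   returned load to the constant 1 without re-reading *l.  The caller
   below passes distinct objects, so the result is well defined
   regardless of optimization level. */
static long set_and_read(long *l, float *f)
{
    *l = 1;
    *f = 2.0f;   /* cannot legally modify *l if the types never alias */
    return *l;   /* may be constant-folded to 1 */
}
```

With distinct objects, `set_and_read(&x, &y)` returns 1 whether or not the optimization fires.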
Comment 1 Janus Weil 2008-01-05 23:36:48 UTC
Created attachment 14885 [details]
test case
Comment 2 Andrew Pinski 2008-01-05 23:37:22 UTC
Can you give the output of the compiler when -ftime-report is added?
Comment 3 Janus Weil 2008-01-05 23:46:36 UTC
(In reply to comment #2)
> Can you give the output of the compiler when -ftime-report is added?

Sure thing, thanks for the remark. Here goes:

gfortran-4.3 -c -ftime-report -O1 -fstrict-aliasing Amplitudes.f90

Execution times (seconds)
 garbage collection    :   0.21 ( 0%) usr   0.02 ( 2%) sys   0.24 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     299 kB ( 0%) ggc
 callgraph optimization:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      62 kB ( 0%) ggc
 trivially dead code   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     127 kB ( 0%) ggc
 register information  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     196 kB ( 0%) ggc
 rebuild jump labels   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall    1848 kB ( 0%) ggc
 inline heuristics     :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       1 kB ( 0%) ggc
 tree gimplify         :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     847 kB ( 0%) ggc
 tree CFG construction :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     769 kB ( 0%) ggc
 tree CFG cleanup      :   4.20 ( 5%) usr   0.06 ( 7%) sys   4.60 ( 5%) wall    2550 kB ( 0%) ggc
 tree copy propagation :   0.25 ( 0%) usr   0.01 ( 1%) sys   0.31 ( 0%) wall      24 kB ( 0%) ggc
 tree PTA              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      67 kB ( 0%) ggc
 tree alias analysis   :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.01 ( 0%) wall       2 kB ( 0%) ggc
 tree call clobbering  :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree flow sensitive alias:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      16 kB ( 0%) ggc
 tree memory partitioning:   0.07 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite      :   0.48 ( 1%) usr   0.01 ( 1%) sys   0.53 ( 1%) wall    8455 kB ( 1%) ggc
 tree SSA other        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   3.26 ( 4%) usr   0.08 ( 9%) sys   3.62 ( 4%) wall   19615 kB ( 2%) ggc
 tree operand scan     :  49.67 (63%) usr   0.48 (52%) sys  57.11 (64%) wall  247409 kB (30%) ggc
 dominator optimization:   0.85 ( 1%) usr   0.02 ( 2%) sys   1.57 ( 2%) wall   14019 kB ( 2%) ggc
 tree SRA              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      27 kB ( 0%) ggc
 tree STORE-CCP        :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       1 kB ( 0%) ggc
 tree CCP              :   0.07 ( 0%) usr   0.01 ( 1%) sys   0.08 ( 0%) wall      15 kB ( 0%) ggc
 tree PHI const/copy prop:   0.35 ( 0%) usr   0.00 ( 0%) sys   0.37 ( 0%) wall       0 kB ( 0%) ggc
 tree split crit edges :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    2489 kB ( 0%) ggc
 tree reassociation    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      72 kB ( 0%) ggc
 tree FRE              :  16.12 (20%) usr   0.17 (18%) sys  16.75 (19%) wall  505793 kB (62%) ggc
 tree code sinking     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      41 kB ( 0%) ggc
 tree forward propagate:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       7 kB ( 0%) ggc
 tree conservative DCE :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall    1879 kB ( 0%) ggc
 tree loop bounds      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     160 kB ( 0%) ggc
 loop invariant motion :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       1 kB ( 0%) ggc
 tree canonical iv     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     138 kB ( 0%) ggc
 scev constant prop    :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      68 kB ( 0%) ggc
 complete unrolling    :   0.30 ( 0%) usr   0.01 ( 1%) sys   0.36 ( 0%) wall    2694 kB ( 0%) ggc
 tree iv optimization  :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall    3958 kB ( 0%) ggc
 tree loop init        :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      52 kB ( 0%) ggc
 tree copy headers     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    1280 kB ( 0%) ggc
 tree SSA uncprop      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA to normal    :   0.11 ( 0%) usr   0.01 ( 1%) sys   0.12 ( 0%) wall     274 kB ( 0%) ggc
 tree rename SSA copies:   0.05 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 dominance frontiers   :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 expand                :   0.91 ( 1%) usr   0.02 ( 2%) sys   1.01 ( 1%) wall    3726 kB ( 0%) ggc
 lower subreg          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       8 kB ( 0%) ggc
 CSE                   :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall      49 kB ( 0%) ggc
 dead code elimination :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     249 kB ( 0%) ggc
 dead store elim2      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     214 kB ( 0%) ggc
 loop analysis         :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall      83 kB ( 0%) ggc
 branch prediction     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     154 kB ( 0%) ggc
 combiner              :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall     485 kB ( 0%) ggc
 if-conversion         :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      58 kB ( 0%) ggc
 local alloc           :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     217 kB ( 0%) ggc
 global alloc          :   0.26 ( 0%) usr   0.01 ( 1%) sys   0.28 ( 0%) wall    1050 kB ( 0%) ggc
 reload CSE regs       :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall     258 kB ( 0%) ggc
 thread pro- & epilogue:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       5 kB ( 0%) ggc
 rename registers      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 machine dep reorg     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 reg stack             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      34 kB ( 0%) ggc
 final                 :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :  79.11             0.92            89.22             822312 kB

So we have 30% for "tree operand scan" and 62% for "tree FRE". I have no idea what that means though. I hope you do ;)
Comment 4 Janus Weil 2008-01-05 23:51:12 UTC
But 800MB of memory is definitely *a lot* for such a small file!
Comment 5 Jerry DeLisle 2008-01-06 00:36:29 UTC
Confirmed. Some comparative timings with the Intel and Sun compilers, which also use very little memory:

$ time gfc -c -O2 janus_slow.f90 

real    2m53.448s
user    2m51.425s
sys     0m1.878s
$ time ifort -c -O2 janus_slow.f90 

real    0m10.489s
user    0m10.264s
sys     0m0.185s
$ time sunf95 -c -O2 janus_slow.f90 

real    0m24.029s
user    0m23.339s
sys     0m0.685s
Comment 6 Richard Biener 2008-01-06 09:56:16 UTC
The problem is that we have loads of unpartitionable SFTs.  There are possible
workarounds that might only affect the performance of the generated code: for
example, you can use --param salias-max-array-elements=0, which will reduce
the number of unpartitionable SFTs somewhat.  Or, as a bigger hammer, disable
generating SFTs completely with --param max-fields-for-field-sensitive=0.
Comment 7 Richard Biener 2008-01-06 10:07:58 UTC
Note that this shouldn't really depend on whether strict aliasing is enabled.
Also note that the FE-generated IL causes the first alias pass to emit overly
conservative alias info:

  # VUSE <MPT.5965_208230>
  D.6049_4652 = atmp.1110.data;
  D.6050_4653 = (complex(kind=4)[0:] *) D.6049_4652;
  D.6051_4654 = S.1113_289 + D.3340_4634;
  # VUSE <SFT.......>
  D.6052_4655 = (*D.6050_4653)[D.6051_4654];

I suppose the atmp.1110.data type is something like (void *), so the
cast is required.  But this really pessimizes the middle-end IL and
it looks like the FE knows it will be complex(kind=4)[4] at the point
of creation.  Note fixing this will also improve optimization and thus
runtime performance.

I also see the FE creates lots of array temporaries:

  struct array2_complex(kind=4) atmp.1093;
  complex(kind=4) A.1094[4];
  struct array2_complex(kind=4) atmp.1095;
  complex(kind=4) A.1096[4];
  struct array2_complex(kind=4) atmp.1100;
  complex(kind=4) A.1101[4];
  struct array2_complex(kind=4) atmp.1102;
  complex(kind=4) A.1103[4];
  struct array2_complex(kind=4) atmp.1106;
  complex(kind=4) A.1107[4];
  real(kind=4) D.3326;
...

instead of re-using a single one.  This also causes the internal
representation to blow up.

So, to sum up, the situation could be significantly improved by improving
the FE.  (Like I noticed for I/O statements, where I provided a patch -
that was not accepted - to share temporary variables created for the I/O
context)

As we already have bugs for the middle-end partitioning problems, I make
this one a Fortran FE bug instead.
Comment 8 Tobias Burnus 2008-01-06 10:24:10 UTC
> So, to sum up, the situation could be significantly improved by improving
> the FE.  (Like I noticed for I/O statements, where I provided a patch -
> that was not accepted - to share temporary variables created for the I/O
> context)

Could you add a link to that patch? I quickly searched my mail archive, but could not find it.
Comment 9 Richard Biener 2008-01-06 10:34:43 UTC
Basically, not re-using array descriptor variables is as bad as not re-using
I/O descriptor variables.  I also wonder what this _gfortran_internal_{un,}pack
is about?

For

   taux=gn(p2+p4)

gfortran generates

    real(kind=4) A.1[4];
    struct array1_real(kind=4) atmp.0;

    atmp.0.dtype = 281;
    atmp.0.dim[0].stride = 1;
    atmp.0.dim[0].lbound = 0;
    atmp.0.dim[0].ubound = 3;
    atmp.0.data = (void *) &A.1;
    atmp.0.offset = 0;
    {
      integer(kind=4) S.2;

      S.2 = 0;
      while (1)
        {
          if (S.2 > 3) goto L.1;
          (*(real(kind=4)[0:] *) atmp.0.data)[S.2] = (*p2)[S.2] + (*p4)[S.2];
          S.2 = S.2 + 1;
        }
      L.1:;
    }
    D.531 = _gfortran_internal_pack (&atmp.0);
    D.533 = gn (D.531);
    if ((real(kind=4)[0:] *) D.531 != (real(kind=4)[0:] *) atmp.0.data)
      {
        _gfortran_internal_unpack (&atmp.0, D.531);
        {
          void * D.532;

          D.532 = D.531;
          if (D.532 != 0B)
            {
              __builtin_free (D.532);
            }
        }
      }

why does it not know the result of

    if ((real(kind=4)[0:] *) D.531 != (real(kind=4)[0:] *) atmp.0.data)

at the point it creates the IL?  I suppose this is about arrays with
stride != 1?
Comment 10 Richard Biener 2008-01-06 10:38:41 UTC
This one:

http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html

basically it mentiones async I/O as a reason to not do it, but instead the
callee can copy the I/O structure in that case instead (it probably needs
to anyway, as currently those structures live on the stack, so with async
I/O you'd need otherwise to guarantee that the I/O finished before the
current frame is left).
Comment 11 Richard Biener 2008-01-06 10:50:02 UTC
So, currently you can build up arbitrary large chains of virtual clobbers with

  write (*,*) 'hello'
  write (*,*) 'hello'
  ...

as each of those writes creates two un-partitionable SFTs.  The same holds for
array temporaries (no idea what the simplest statement to create one is).
Without the fix for PR34093 this would even reliably ICE the compiler (and it
of course creates compile-time and memory-usage problems, because the growth
of the IL is non-linear).
Comment 12 Richard Biener 2008-01-06 11:09:16 UTC
Actually it isn't that bad for I/O as long as there are no pointers to the
dt_parm structs in the IL (function arguments are ok).
Comment 13 Richard Biener 2008-01-06 11:13:56 UTC
For array temporaries the pinning of SFTs happens because we have the address of
the actual array _data_ in the IL:

  # SFT.10_52 = VDEF <SFT.10_51(D)>
  atmp.0.data = &A.1;

the array descriptor itself is not the problem (the redundant descriptors still
consume memory, but should not cause compile-time problems where observed).
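[Editor's note: a minimal C illustration — not from this report — of why storing the address of the array data pins the variable, as described above: once a variable's address escapes into a pointer, every indirect store may modify it, so the compiler must track it through virtual (memory) operands instead of plain SSA registers.]

```c
static int through_pointer(void)
{
    int a = 1;
    int *p = &a; /* address taken: a can no longer live purely in SSA form */
    *p = 2;      /* indirect store: a virtual definition of a */
    return a;    /* must observe the store through p; returns 2 */
}
```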
Comment 14 Richard Biener 2008-01-06 11:46:07 UTC
As for optimization, the conversion sequence

  atmp.0.data = &A.1;
  D.540_5 = atmp.0.data;
  D.541_6 = (real(kind=4)[0:] *) D.540_5;
...
  (*D.541_6)[S.5_1] = D.545_14;

can be optimized to

  A.1[S.5_1] = D.545_14;

with the following patch that makes sure we use (real(kind=4)[4] *) instead
of the unknown size array type.

Index: trans-types.c
===================================================================
--- trans-types.c	(revision 131336)
+++ trans-types.c	(working copy)
@@ -1511,10 +1511,12 @@ gfc_get_array_type_bounds (tree etype, i
   /* TODO: known offsets for descriptors.  */
   GFC_TYPE_ARRAY_OFFSET (fat_type) = NULL_TREE;
 
-  /* We define data as an unknown size array. Much better than doing
+  /* We define data as an array with the correct size. Much better than doing
      pointer arithmetic.  */
   arraytype =
-    build_array_type (etype, gfc_array_range_type);
+    build_array_type (etype, build_range_type (gfc_array_index_type,
+	gfc_index_zero_node, int_const_binop (MINUS_EXPR, stride,
+	integer_one_node, 0)));
   arraytype = build_pointer_type (arraytype);
   GFC_TYPE_ARRAY_DATAPTR_TYPE (fat_type) = arraytype;
 
(the patch needs to be adjusted for the cases where the stride is not the
actual array size, but you should get the idea)
Comment 15 Richard Biener 2008-01-06 12:07:51 UTC
Note that the large FRE compile-time is a dup of PR34450, try experimenting
with --param sccvn-max-scc-size=N and find a N where the problem vanishes.
Comment 16 Janus Weil 2008-01-06 22:08:30 UTC
I've done some experimenting with older GCC versions I have floating around on my machines:

Compiling the test case with both 4.1.2 and 4.2.1 gives an ICE, so I guess we can't exactly call this a regression.

But then compiling with older 4.3 trunk builds works much better:
I have a version from August 24 of last year, which runs the test case at full -O3 in a perfect 8sec, with 32MB memory usage (that is with --enable-checking=release). And I have two other builds from October 8 and November 9, which both run the test case in about 38s with 85MB mem-usage (they are built with --enable-checking=debug, so this is probably just debugging overhead). At least none of them takes the crazy 800MB of recent builds.

To sum up: trunk versions up to at least November 9 seem to work fine on this test case.
Comment 17 rguenther@suse.de 2008-01-06 22:12:11 UTC
Subject: Re:  Fortran FE generated IL pessimizes middle-end
 IL and analysis

On Sun, 6 Jan 2008, jaydub66 at gmail dot com wrote:

> ------- Comment #16 from jaydub66 at gmail dot com  2008-01-06 22:08 -------
> I've done some experimenting with older GCC versions I have floating around on
> my machines:
> 
> Compiling the test case with both 4.1.2 and 4.2.1 gives an ICE, so I guess we
> can't exactly call this a regression.
> 
> But then compiling with older 4.3 trunk builds works much better:
> I have a version from August 24 of last year, which runs the test case at full
> -O3 in a perfect 8sec, with 32MB memory usage (that is with
> --enable-checking=release). And I have two other builds from October 8 and
> November 9, which both run the test case in about 38s with 85MB mem-usage (they
> are built with --enable-checking=debug, so this is probably just debugging
> overhead). At least none of them takes the crazy 800MB of recent builds.
> 
> To sum up: It seems like trunk versions up to at least November 9 seem to work
> fine on this test case.

Well, we know what introduced this regression on the trunk.  It was
the fix for PR33870.

Richard.
Comment 18 Richard Biener 2008-01-07 13:25:48 UTC
Most of the operand scanner time is caused by cfg_cleanup:

 tree CFG cleanup      :  93.71 (79%) usr   1.16 (54%) sys  95.03 (78%) wall  474427 kB (28%) ggc

globbing CFG cleanup (and operand scanner) time to the passes that trigger it reveals:

 tree PHI const/copy prop:  27.30 (16%) usr   0.27 ( 9%) sys  31.33 (16%) wall       9 kB ( 0%) ggc

 tree FRE              :  25.01 (15%) usr   0.75 (25%) sys  27.63 (14%) wall 1061857 kB (62%) ggc

 complete unrolling    : 100.00 (59%) usr   1.46 (49%) sys 111.81 (58%) wall  482221 kB (28%) ggc

wheee ;)

Looking at where we spend most of the time, it seems to be building tree lists
in ssa_redirect_edge.  I have a hack that makes this a single allocation of a
VEC on the heap; still, in cfg_cleanup's remove_forwarder_block_with_phi there
is a linear search(!) in the list of pending PHI args.

Profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 48.75     52.85    52.85  2734252     0.00     0.00  ggc_alloc_stat
 11.07     64.85    12.00 35486689     0.00     0.00  add_vars_for_offset
  4.22     69.42     4.57 327924414     0.00     0.00  var_ann
  3.89     73.64     4.22   138084     0.00     0.00  finalize_ssa_stmt_operands
  3.22     77.13     3.49 44525266     0.00     0.00  add_virtual_operand
...

the add_vars_for_offset is PR33870 related.
Comment 19 Richard Biener 2008-01-07 14:32:04 UTC
-fno-tree-loop-optimize reduces the compile time of this testcase from 117s to
35s with -O -fstrict-aliasing.  try_unroll_loop_completely calls update_ssa for
every loop unrolled (added by r98599), which seems to be necessary, mainly
because of the interesting PHI-redirect re-use of PENDING_STMT.
Comment 20 Richard Biener 2008-01-07 14:50:22 UTC
Subject: Bug 34683

Author: rguenth
Date: Mon Jan  7 14:49:36 2008
New Revision: 131375

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131375
Log:
2008-01-07  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34683
	* tree-ssa-sccvn.c (vuses_to_vec): Pre-allocate the vector of
	VOPs of the needed size to save memory.  Use VEC_quick_push
	to save compile-time.
	(vdefs_to_vec): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-sccvn.c

Comment 21 Richard Biener 2008-01-07 15:08:09 UTC
Without the loop optimizer the profile looks like:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
  9.77      2.96     2.96 14743868     0.00     0.00  add_vars_for_offset
  8.00      5.38     2.42                             vuses_compare
  7.06      7.51     2.14 44262948     0.00     0.00  add_virtual_operand
  6.65      9.52     2.01 235204852     0.00     0.00  var_ann
  4.83     10.98     1.46 68814811     0.00     0.00  iterative_hash_expr
  4.66     12.39     1.41      172     0.01     0.04  DFS
  4.63     13.79     1.40 69248754     0.00     0.00  htab_find_with_hash
  4.28     15.09     1.30  1229783     0.00     0.00  ggc_alloc_stat
  4.13     16.34     1.25 206399974     0.00     0.00  VN_INFO
  4.00     17.55     1.21 19029637     0.00     0.00  set_bb_for_stmt
  3.74     18.68     1.13    12610     0.00     0.00  get_call_expr_operands
  2.88     19.55     0.87   266239     0.00     0.00  valueize_vuses
  2.31     20.25     0.70 68677290     0.00     0.00  uid_decl_map_eq
  2.25     20.93     0.68   129473     0.00     0.00  visit_reference_op_store
  1.90     21.50     0.58 114366877     0.00     0.00  is_gimple_reg
  1.49     21.95     0.45   147496     0.00     0.00  vec_gc_o_reserve_1
  1.41     22.38     0.43 68653081     0.00     0.00  referenced_var_lookup

and most of the time (cfgcleanup and operand scanner globbed into passes) is
spent in

 tree PHI const/copy prop:  27.82 (42%) usr   0.05 ( 5%) sys  27.95 (41%) wall       8 kB ( 0%) ggc
 tree FRE              :  24.94 (37%) usr   0.78 (73%) sys  25.73 (38%) wall  508759 kB (80%) ggc
Comment 22 Richard Biener 2008-01-07 15:56:00 UTC
A histogram over the number of VUSEs/VDEFs that are being sorted by sort_vuses
reveals:

    180 2
 281964 256
   1498 4

that is, we call qsort 281964 times with 256 virtual operands in the VEC.  Now,
curiously(?), virtual operands are already sorted, but by DECL_UID, not
SSA_NAME_VERSION (see operand_build_sort_virtual).

Adjusting the tree-vn sorting order to that of the operand scanner lets us
improve these numbers to

     27 2
  28665 256
    128 4

a reduction by a factor of 10.  Profile after that:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 10.37      2.91     2.91 14769893     0.00     0.00  add_vars_for_offset
  7.91      5.13     2.22 44264182     0.00     0.00  add_virtual_operand
  7.16      7.14     2.01 235231611     0.00     0.00  var_ann
  5.24      8.61     1.47 206399974     0.00     0.00  VN_INFO
  5.20     10.07     1.46      172     0.01     0.04  DFS
  5.17     11.52     1.45 68814811     0.00     0.00  iterative_hash_expr
  4.92     12.90     1.38    12610     0.00     0.00  get_call_expr_operands
  4.74     14.23     1.33  1439173     0.00     0.00  ggc_alloc_stat
  4.63     15.53     1.30 69275355     0.00     0.00  htab_find_with_hash
  3.88     16.62     1.09 19029637     0.00     0.00  set_bb_for_stmt
  3.56     17.62     1.00   266239     0.00     0.00  valueize_vuses
  2.21     18.24     0.62   129473     0.00     0.00  visit_reference_op_store
  1.96     18.79     0.55 68703417     0.00     0.00  uid_decl_map_eq
  1.92     19.33     0.54                             operand_build_cmp
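[Editor's note: the companion fix committed in comment 33 ("Do not sort already sorted VUSEs") amounts to a cheap linear pre-check before calling qsort.  A generic C sketch — illustrative, not the actual GCC code:]

```c
#include <stdlib.h>

/* Return 1 if v[0..n-1] is already sorted according to cmp. */
static int already_sorted(const int *v, size_t n,
                          int (*cmp)(const void *, const void *))
{
    for (size_t i = 1; i < n; ++i)
        if (cmp(&v[i - 1], &v[i]) > 0)
            return 0;
    return 1;
}

static int int_cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Sort only when needed: when most 256-element vectors arrive already
   sorted (as the histogram above shows after adjusting the sort order),
   this O(n) check skips most of the O(n log n) qsort calls. */
static void sort_if_needed(int *v, size_t n)
{
    if (!already_sorted(v, n, int_cmp))
        qsort(v, n, sizeof *v, int_cmp);
}
```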
Comment 23 Richard Biener 2008-01-07 17:08:05 UTC
Zdenek, can you have a look whether there is something low-hanging to fix in
the tree-level loop unroller?
Comment 24 Richard Biener 2008-01-08 09:46:35 UTC
The patch in comment #14 reduces compile time from 120s to 80s.  I'll try to
get it into shape.
Comment 25 Richard Biener 2008-01-08 09:55:01 UTC
The unroll problem is that if you have multiple loops like

   for (i=0; i<3; ++i)
     a[i] = i;
   ...
   for (i=0; i<3; ++i)
     a[i] = i;

then even though we are in loop-closed SSA form, we rename all variables
in the loop BBs, after duplicating them, for *every loop* we unroll.  This
makes the complexity of renaming quadratic (you can easily see that for
this testcase the number of SSA_NAMEs allocated by the renamer grows
quadratically with the number of loops in the testcase).  Now what makes
it so bad in this testcase is that we mostly deal with virtual operands
during renaming.
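[Editor's note: a toy cost model — my illustration, not GCC code — of the quadratic behavior described above: if unrolling loop i adds k new virtual operands and update_ssa then revisits every name created so far, the total work over n loops is k + 2k + ... + nk = k*n*(n+1)/2.]

```c
/* Simulate the renaming cost: each unrolled loop adds k names, and the
   renamer then walks all names created so far. */
static long rename_cost(int n, int k)
{
    long names = 0, work = 0;
    for (int i = 0; i < n; ++i) {
        names += k;      /* new names from this unrolled loop */
        work += names;   /* renamer revisits everything so far */
    }
    return work;         /* = k * n * (n + 1) / 2, i.e. O(n^2) */
}
```

Doubling the number of loops roughly quadruples the work: `rename_cost(100, 1)` is 5050 while `rename_cost(200, 1)` is 20100.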
Comment 26 Richard Biener 2008-01-08 11:06:22 UTC
We do a _lot_ of calls to var_ann during the compilation:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 39.79     30.12    30.12  1311048     0.00     0.00  ggc_alloc_stat
 12.62     39.67     9.55 34186814     0.00     0.00  add_vars_for_offset
  5.77     44.04     4.37 331206978     0.00     0.00  var_ann

so var_ann is not inlined despite being declared as such.

                0.00    0.00   35522/331206978     add_to_addressable_set [318]
                1.58    0.00 119702754/331206978     add_virtual_operand [12]
                2.79    0.00 211468702/331206978     add_vars_for_offset [17]
[39]     5.8    4.37    0.00 331206978         var_ann [39]
                0.00    0.00  119615/96446282     htab_find_with_hash <cycle 34> [46]

So the add_vars_for_offset loop is responsible for most of the calls
(unsurprisingly), and most of the calls are not for global vars.

Making sure this call gets inlined brings another 5s of compile-time.
Comment 27 Richard Biener 2008-01-08 12:13:48 UTC
Subject: Bug 34683

Author: rguenth
Date: Tue Jan  8 12:12:56 2008
New Revision: 131393

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131393
Log:
2008-01-08  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34683
	* tree-flow-inline.h (var_ann): Remove overzealous asserts.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-flow-inline.h

Comment 28 Steven Bosscher 2008-01-08 12:17:53 UTC
Re. comment #26

More proof that assert checking _is_ expensive.  I wonder how much compile time can be blamed overall on release checking (i.e. checking disabled vs. release checking).

Have you ever compared e.g. build times of cc1-i files or a linux kernel with a non-checking compiler vs. a release-checking compiler?
Comment 30 Richard Biener 2008-01-08 12:30:22 UTC
No, I didn't.  But yes, overzealous asserts in the simple inline accessor
functions are probably counter-productive.  For example things like

static inline bool
gimple_aliases_computed_p (const struct function *fun)
{
  gcc_assert (fun && fun->gimple_df);
  return fun->gimple_df->aliases_computed_p;
} 

would segfault anyway if fun or fun->gimple_df were NULL.  Or

static inline function_ann_t
function_ann (const_tree t)
{
  gcc_assert (t);
  gcc_assert (TREE_CODE (t) == FUNCTION_DECL);
  gcc_assert (!t->base.ann
              || t->base.ann->common.type == FUNCTION_ANN);

  return (function_ann_t) t->base.ann;
}

multiple asserts generate multiple diagnostic calls.  That said, neither of
these functions is on a hot code path.
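[Editor's note: a simplified sketch — not the actual GCC code — of the ENABLE_CHECKING pattern alluded to in the following comments: the expensive consistency asserts are compiled only into checking builds, so release builds pay nothing for them.]

```c
#include <assert.h>
#include <stddef.h>

/* Compile the assert only when ENABLE_CHECKING is defined. */
#ifdef ENABLE_CHECKING
# define checking_assert(x) assert(x)
#else
# define checking_assert(x) ((void) 0)
#endif

/* Cut-down stand-ins for the structures used by the accessor. */
struct gimple_df { int aliases_computed_p; };
struct function  { struct gimple_df *gimple_df; };

static inline int
gimple_aliases_computed_p (const struct function *fun)
{
  /* Free in release builds; a segfault on NULL would point at the
     same spot anyway.  */
  checking_assert (fun != NULL && fun->gimple_df != NULL);
  return fun->gimple_df->aliases_computed_p;
}
```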
Comment 31 Richard Biener 2008-01-08 12:55:03 UTC
Another hot assert is

static inline tree
memory_partition (tree sym)
{
  tree tag;

  /* MPTs belong to their own partition.  */
  if (TREE_CODE (sym) == MEMORY_PARTITION_TAG)
    return sym;

  gcc_assert (!is_gimple_reg (sym));


and is_gimple_reg isn't cheap either.  For this testcase two thirds of
the calls are attributed to add_vars_for_offset (through add_v{use,def}):

                0.50    0.01 135234172/199610365     add_vars_for_offset [27]
[114]    1.1    0.74    0.01 199610365         is_gimple_reg [114]

and it really isn't cheap:

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
545326   36.8318  ggc_alloc_stat
310838   20.9943  add_vars_for_offset
85655     5.7852  add_virtual_operand
69156     4.6709  finalize_ssa_stmt_operands
57057     3.8537  bitmap_set_bit
36755     2.4825  htab_find_with_hash
35915     2.4257  get_expr_operands
28207     1.9051  is_gimple_reg
23340     1.5764  referenced_var_lookup

wrapping the assert inside ENABLE_CHECKING buys us 2s or 2.5%.  But this
is too much micro-optimizing for me ;)

I'd rather have the quadratic SSA rewriting in the unroller fixed.
Comment 32 Steven Bosscher 2008-01-08 13:08:23 UTC
You are of course right that for this test case the bigger win is in the algorithmic issues.

On the other hand, for compile time in general, the gcc_asserts are IMHO the worst thing that happened to GCC since the introduction of the garbage allocator ;-)  It is way too easy to write "gcc_assert (blah)" instead of "if (blah)\nabort ();", and people don't realize how expensive these "cheap" asserts are.

This would then be a nice project for the Wiki beginners' projects pages, perhaps: audit the gcc_assert calls in small "inline" functions and see if there is any measurable gain from removing those that don't make sense (like the ones you identified in comment #30).

But that's not relevant for this bug report, of course :-)
Comment 33 Richard Biener 2008-01-08 16:29:58 UTC
Subject: Bug 34683

Author: rguenth
Date: Tue Jan  8 16:29:14 2008
New Revision: 131400

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131400
Log:
2008-01-08  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34683
	* tree-ssa-operands.c (operand_build_cmp): Export.
	* tree-ssa-operands.h (operand_build_cmp): Declare.
	* tree-vn.c (vuses_compare): Remove.
	(sort_vuses): Use operand_build_cmp.
	(sort_vuses_heap): Likewise.
	* tree-ssa-sccvn.c (vuses_to_vec): Use VEC_reserve, not VEC_alloc
	to re-use old VEC if available.  Do not sort already sorted VUSEs.
	(vdefs_to_vec): Do not sort already sorted VDEFs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-operands.c
    trunk/gcc/tree-ssa-operands.h
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-vn.c

Comment 34 Richard Biener 2008-01-08 21:22:14 UTC
Subject: Bug 34683

Author: rguenth
Date: Tue Jan  8 21:21:29 2008
New Revision: 131404

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131404
Log:
2008-01-08  Richard Guenther  <rguenther@suse.de>

	PR fortran/34706
	PR tree-optimization/34683
	* trans-types.c (gfc_get_array_type_bounds): Use an array type
	with known size for accesses if that is known.

Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/trans-types.c

Comment 35 Steven Bosscher 2008-01-09 00:04:42 UTC
How does the trunk compare now to the numbers mentioned in comment #16 ?
Comment 36 Janus Weil 2008-01-09 09:38:20 UTC
> How does the trunk compare now to the numbers mentioned in comment #16 ?

Compiling with rev. 131414 gives:

-O1 -fstrict-aliasing: 33sec, 438197 kB
-O2:                   97sec, 669637 kB
-O3:                   50sec, 392669 kB

This is of course better than a few days ago (thanks to Richard's patches). But it's not even close to the numbers in comment #16.

What worries me about this is not the 33sec compile time (I could live with that if I had to), but the really large amount of memory. On a machine with only 256MB, compiling takes several minutes due to massive swapping.

Richard:
You said that the remaining issue is probably not gonna be fixed for 4.3. I don't quite see the reason for this, since this apparently *was* fixed at some point in the 4.3 trunk (see comment 16).
Also this would really be a shame since it would make GCC 4.3 practically unusable for our project (http://gibuu.physik.uni-giessen.de/GiBUU/).
Comment 37 rguenther@suse.de 2008-01-09 09:45:32 UTC
Subject: Re:  [4.3 Regression] Fortran FE
 generated IL pessimizes middle-end IL and analysis

On Wed, 9 Jan 2008, jaydub66 at gmail dot com wrote:

> ------- Comment #36 from jaydub66 at gmail dot com  2008-01-09 09:38 -------
> 
> > How does the trunk compare now to the numbers mentioned in comment #16 ?
> 
> Compiling with rev. 131414 gives:
> 
> -O1 -fstrict-aliasing: 33sec, 438197 kB
> -O2:                   97sec, 669637 kB
> -O3:                   50sec, 392669 kB
> 
> This is of course better than a few days ago (thanks to Richard's patches). But
> it's not even close to the numbers in comment #16.
> 
> What worries me about this is not the 33sec compile time (I could live with
> that, if I'd have to), but the really large amount of memory. On a machine with
> only 256MB compiling takes several minutes, due to massive swapping.
> 
> Richard:
> You said that the remaining issue is probably not gonna be fixed for 4.3. I
> don't quite see the reason for this, since this apparently *was* fixed at some
> point in the 4.3 trunk (see comment 16).

Yeah, but we won't trade wrong-code bugs for better compile time here.
I have no idea how practical it is to fix the basic-block duplication code
(which is the worst offender for compile time in these circumstances
of too many VOPs).

> Also this would really be a shame since it would make GCC 4.3 practically
> unusable for our project (http://gibuu.physik.uni-giessen.de/GiBUU/).

There is a workaround available - you can use
--param max-fields-for-field-sensitive=0
to disable SFTs, but this may disable useful optimizations.  This is
the route we want to take for 4.4, addressing the missed optimizations in
different ways.

Richard.
Comment 38 Richard Biener 2008-01-10 14:36:22 UTC
Created attachment 14908 [details]
hack to disable rewriting of VOPs

I tried changing the unroller to do only SSA_NAME replacement updates after
each unrolling and to leave rewriting the virtual operands until after the
whole pass, but either I messed up or this even makes things worse ;)  See
the prototype hack.
Comment 39 Richard Biener 2008-01-10 15:01:27 UTC
Hmm, looks like I was completely wrong about the cause of the slowdown.  We
actually run cfg_cleanup after cunroll and merge blocks like

 <BB1>
   ...

 <BB2>
   # SFT.1_2 = PHI <SFT.1_1 (BB1)>
   ...
   # SFT.1000_2 = PHI <SFT.1000_1 (BB1)>

   # SFT.1_3 = VDEF <SFT.1_2>
   ...
   # SFT.1000_3 = VDEF <SFT.1000_2>
   *mem = x;

and in merging the blocks we do (tree_merge_blocks):

  /* Remove all single-valued PHI nodes from block B of the form
     V_i = PHI <V_j> by propagating V_j to all the uses of V_i.  */ 
  for (phi = phi_nodes (b); phi; phi = phi_nodes (b))
    {
...
          replace_uses_by (def, use);
          remove_phi_node (phi, NULL, true);

BUT!  For _each_ PHI node it replaces, replace_uses_by updates the
target stmt -- and folds it!  We can do better with VOPs.

Preparing a patch.
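The cost described above can be sketched with a toy model (not GCC code; the counts are illustrative assumptions). Block B carries one single-valued PHI per virtual operand, all feeding a single store; removing the PHIs one at a time, with a full operand re-scan and fold of the target statement after each replacement, is quadratic in the number of VOPs, while a batched substitution is linear:

```python
# Toy model of tree_merge_blocks + replace_uses_by: with n virtual
# operands, per-PHI replacement re-scans the n-operand target statement
# n times (O(n^2) operand visits); batching all substitutions and
# updating the statement once visits O(n) operands.

def merge_phis_one_at_a_time(num_vops):
    """Replace each PHI separately, re-scanning the store each time."""
    work = 0
    for _ in range(num_vops):
        work += num_vops  # full update + fold of the target statement
    return work

def merge_phis_batched(num_vops):
    """Substitute all PHI results, then update the statement once."""
    return num_vops

print(merge_phis_one_at_a_time(1000))  # 1000000
print(merge_phis_batched(1000))        # 1000
```

With the ~1000 SFTs seen in this testcase, that is a factor of 1000 in operand visits per merged block pair.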
Comment 40 Richard Biener 2008-01-10 15:32:12 UTC
After fixing this, we're back to SCCVN being slow with lots of virtual
operands.  For -O2 (which apparently is the worst case) we now have the
following profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 10.33      3.08     3.08 313563133     0.00     0.00  VN_INFO
  8.85      5.72     2.64      365     0.01     0.04  DFS
  7.17      7.86     2.14 96368081     0.00     0.00  iterative_hash_expr
  5.77      9.58     1.72   472085     0.00     0.00  valueize_vuses
  5.43     11.20     1.62   216407     0.00     0.00  visit_reference_op_store
  4.93     12.67     1.47  2093493     0.00     0.00  ggc_alloc_stat
  3.15     13.61     0.94 17042636     0.00     0.00  set_bb_for_stmt
  2.63     14.40     0.79  2442685     0.00     0.00  add_vars_for_offset
  1.74     14.92     0.52                             operand_build_cmp
  1.54     15.38     0.46   226140     0.00     0.00  shared_vuses_from_stmt
  1.48     15.82     0.44   245792     0.00     0.00  vn_reference_lookup
  1.27     16.20     0.38  5272581     0.00     0.00  mark_all_vars_used_1
  1.24     16.57     0.37 16115332     0.00     0.00  bitmap_set_bit
  1.14     16.91     0.34       58     0.01     0.02  remove_unused_locals
  1.14     17.25     0.34  6878571     0.00     0.00  htab_find_slot_with_hash
  1.04     17.56     0.31  4286747     0.00     0.00  add_virtual_operand
  1.04     17.87     0.31   255362     0.00     0.00  vec_gc_o_reserve_1
...
Comment 41 Richard Biener 2008-01-10 16:59:50 UTC
Subject: Bug 34683

Author: rguenth
Date: Thu Jan 10 16:59:06 2008
New Revision: 131446

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131446
Log:
2008-01-10  Richard Guenther  <rguenther@suse.de>

	PR middle-end/34683
	* tree-cfg.c (tree_merge_blocks): Do not go through the
	full-blown folding and stmt updating path if we just deal
	with virtual operands.
	* tree-ssa-copy.c (may_propagate_copy): Do not short-cut
	test for abnormal SSA_NAMEs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-cfg.c
    trunk/gcc/tree-ssa-copy.c

Comment 42 Richard Biener 2008-01-10 17:02:19 UTC
This is probably all we can get for now - maybe another few % with minor tweaks,
but nothing earth-shattering.  After all, we _do_ have a much larger IL due
to the number of VOPs not partitioned.
Comment 43 Richard Biener 2008-01-10 17:05:59 UTC
Actually this is not a regression against a released compiler, as both the
4.1 and the 4.2 branches ICE for this testcase:

gfortran-4.2 -S -o /dev/null t.f90 -O -fstrict-aliasing
t.f90: In function 'ampli':
t.f90:83: internal compiler error: in gfc_trans_array_constructor, at fortran/trans-array.c:1466
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugs.opensuse.org> for instructions.

So, can we declare this as fixed (we have separate PRs for the huge number
of VOPs)?
Comment 44 Steven Bosscher 2008-01-10 17:39:13 UTC
Yes, fixed.
Comment 45 Janus Weil 2008-01-10 20:14:41 UTC
(In reply to comment #42)
> This is probably all we can get for now - maybe another few % with minor
> tweaks,
> but nothing earth-shattering.  After all, we _do_ have a much larger IL due
> to the number of VOPs not partitioned.

Well, it may not be earth-shattering, but it's still a solid improvement - of more than a few %! With rev. 131446 I get the following numbers:

-O1 -fstrict-aliasing: 11sec, 201825 kB
-O2:                   22sec, 392166 kB
-O3:                   24sec, 388802 kB

Compile time is reduced by a factor of 4 for -O2, and memory usage is halved for -O1! Nice job, Richard.

So I'd say compile time really isn't an issue any more. But memory consumption is still a factor of 10 above what it was last year. Most of that comes from 'tree FRE' (~140MB), but all the junk in 'tree operand scan' is gone. For -O2 and -O3 another large chunk appears in 'tree PRE' (~160MB).

Are you sure you want to close this bug? For me memory consumption is still an issue, and I would really like to see this fixed at some point. And if there is another bug which is a duplicate of this, could you point me to it?
Comment 46 Steven Bosscher 2008-01-10 20:21:59 UTC
See e.g. bug 27004 and bug 33974.
Comment 47 Richard Biener 2008-01-11 13:29:17 UTC
Most of the PRE/FRE memory is spent in copied VOPs VECs.  Unfortunately we
cannot move them to heap memory easily as they get shared in the PRE tables...
I tried to be explicit in managing the memory, but it gets really hard (which
is probably the reason we stick it in GC memory in the first place).

So we could use two variants of all routines, one using GC-allocated VECs and
one using heap-allocated VECs, and use the GC-allocated one only for the final
SCC iteration and the PRE tables.

Smells like a task for C++ templates or a really ugly hack in C.

Obviously this is not going to happen for 4.3 :/
Comment 48 Zdenek Dvorak 2008-01-12 03:59:29 UTC
(In reply to comment #47)
> Most of the PRE/FRE memory is spent in copied VOPs VECs.  Unfortunately we
> cannot move them to heap memory easily as they get shared in the PRE tables...
> I tried to be explicit in managing the memory, but it gets really hard (which
> is probably the reason we stick it in GC memory in the first place).

in fact, it copies the VOPS list of each store statement 50 times on average (once for PRE and once for FRE, i.e., about 100 copies in total).  Can't SCCVN be throttled down somehow so that it does not spend so much memory this way?
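The memory blow-up this describes can be sketched with a toy calculation (not GCC code; the store/VOP counts are made-up illustrative values, only the ~100-copies figure comes from the comment above): copying the operand vector per table entry multiplies memory by the copy count, whereas sharing one vector per statement would not.

```python
# Toy model of VOP-vector memory in the PRE/FRE tables: S store
# statements, V virtual operands each, C copies of the vector per
# statement.  Copying yields S*V*C table entries; sharing one vector
# per statement would need only S*V.

def vop_entries(stores, vops_per_store, copies_per_store):
    """Total VOP slots held by the value-numbering tables."""
    return stores * vops_per_store * copies_per_store

copied = vop_entries(stores=2000, vops_per_store=100, copies_per_store=100)
shared = vop_entries(stores=2000, vops_per_store=100, copies_per_store=1)
print(copied)            # 20000000
print(shared)            # 200000
print(copied // shared)  # 100
```

At ~100 copies per statement, the copied representation costs two orders of magnitude more than a shared one, which matches the hundreds of MB attributed to tree FRE/PRE in comment 45.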