63155 – [7 Regression] memory hog

Bug 63155 - [7 Regression] memory hog

Summary: [7 Regression] memory hog

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.9.0

Importance:	P2 normal
Target Milestone:	8.3
Assignee:	Richard Biener

URL:
Keywords:	compile-time-hog, memory-hog

Duplicates (1):	87316 (view as bug list)
Depends on:
Blocks:

Reported:	2014-09-03 11:59 UTC by Matthias Klose
Modified:	2019-09-04 08:55 UTC (History)
CC List:	3 users (show)

See Also:	64928
Host:
Target:
Build:
Known to work:	4.8.3, 8.2.1, 9.0
Known to fail:	4.9.2, 5.0, 7.4.1, 8.2.0
Last reconfirmed:	2018-08-22 00:00:00

Attachments
preprocessed source (9.42 KB, application/gzip) 2014-09-03 11:59 UTC, Matthias Klose	Details
alternate testcase (10.38 KB, text/plain) 2018-08-23 13:20 UTC, Richard Biener	Details
patch that does not help (874 bytes, patch) 2018-09-14 12:36 UTC, Richard Biener	Details \| Diff
patch (1.39 KB, patch) 2018-09-17 13:54 UTC, Richard Biener	Details \| Diff
patch for the SSA propagator issue (872 bytes, patch) 2018-09-19 07:50 UTC, Richard Biener	Details \| Diff
patch for the SSA propagator issue (1.50 KB, patch) 2018-09-19 13:07 UTC, Richard Biener	Details \| Diff
Small testcase more similar to original environment (2.53 KB, text/x-csrc) 2018-10-04 20:57 UTC, Rogério de Souza Moraes	Details
GCC 6.3.0 consolidated patch based on Richard's patches (5.43 KB, patch) 2018-10-17 17:58 UTC, Rogério de Souza Moraes	Details \| Diff
New testcase that shows performance degradation in GCC 9.1 (7.44 KB, text/plain) 2019-07-25 12:31 UTC, Rogério de Souza Moraes	Details
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Matthias Klose 2014-09-03 11:59:54 UTC

Created attachment 33441 [details]
preprocessed source

[forwarded from https://bugs.debian.org/759683]

compiling the attached test case with the 4.9 branch r214759 and trunk r213954 takes about 90sec on x86_64 and 10GB of memory. succeeds with the 4.8 branch in less than a second.

$ gcc -std=c99 -c testunity_Runner.i


from the Debian issue:

  """
  Notice that replacing "_setjmp 
  (Unity.AbortFrame[Unity.CurrentAbortFrame])" in main function by 
  "_setjmp (Unity.AbortFrame[0])", make gcc works normaly.
  After few tests it seems that gcc does not like having a variable in here.
  """

I don't see the crash reported in the Debian issue.

Comment 1 Richard Biener 2014-09-03 14:00:17 UTC

Clearly caused by the correctness fix for setjmp to wire abnormal edges.

For me it is out-of-ssa which uses too much memory while building
the conflict graph.

We have gigantic PHI nodes here:

_10263(ab) = PHI <_109925(D)(ab)(2),...., _10592(ab)(1489)>

it's fast when optimizing.

At -O0 we have a _lot_ more anonymous SSA names.

-O1:

  <bb 4>:
  # _1(ab) = PHI <_1902(3), _2(ab)(5)>
  _1905 = _setjmp (_1(ab));
  if (_1905 == 0)
    goto <bb 6>;
  else
    goto <bb 8>;

  <bb 5>
  # _2(ab) = PHI <_1895(D), .... single gigantic PHI

-O0:

  <bb 4>:
  # _1(ab) = PHI <_398164(3), _2(ab)(5)>
  # _632(ab) = PHI <_397532(D)(ab)(3), _633(ab)(5)>
  # _1263(ab) = PHI <_397533(D)(ab)(3), _1264(ab)(5)>
  # _1894(ab) = PHI <_397534(D)(ab)(3), _1895(ab)(5)>
  # _2525(ab) = PHI <_397535(D)(ab)(3), _2526(ab)(5)>
...
  # _396900(ab) = PHI <_398160(D)(ab)(3), _396901(ab)(5)>
  _398165 = _setjmp (_1(ab));
  if (_398165 == 0)
    goto <bb 6>;
  else
    goto <bb 8>;

  <bb 5>
  # _2(ab) = PHI <_397531(D)(ab)(2)...
....
  # _396901(ab) = PHI <_398160(D)(ab)(2), _3...

gazillion of gigantic PHIs.  And very many PHIs in every block.

It's into-SSA that introduces the difference for the PHI nodes
but already GIMPLIFICATION that introduces very many more
temporaries which is the underlying issue (lookup_tmp_var
!optimize check).

Index: gcc/gimplify.c
===================================================================
--- gcc/gimplify.c      (revision 214810)
+++ gcc/gimplify.c      (working copy)
@@ -476,7 +476,7 @@ lookup_tmp_var (tree val, bool is_formal
      block, which means it will go into memory, causing much extra
      work in reload and final and poorer code generation, outweighing
      the extra memory allocation here.  */
-  if (!optimize || !is_formal || TREE_SIDE_EFFECTS (val))
+  if (!is_formal || TREE_SIDE_EFFECTS (val))
     ret = create_tmp_from_val (val);
   else
     {

fixes it (but it means that changing the testcase to use more distinct
user variables would produce the same issue even when optimizing).

Comment 2 Richard Biener 2014-09-03 14:09:18 UTC

I wonder why we need to explicitely represent abnormal PHIs in the dispatcher.
All incoming edges are abnormal and all SSA names have to be coalesced anyway.
Thus we could instead have

  <bb 5>:
/* Not: # _2(ab) = PHI <_17(D)(ab)(2), _1(ab)(6), _1(ab)(7), _3(ab)(11), _3(ab)(12), _4(ab)(15), _4(ab)(16), _5(ab)(20), _5(ab)(21), _5(ab)(22)> */
  ABNORMAL_DISPATCHER (0);
  _2(ab) = D.12345;

or simply rewrite all must-coalesce vars out-of-SSA?  (or not into SSA
in the first place)

The question is whether accesses to them should be loads/stores (I think so)
and if that will cause other similar issues.

We'd have to factor abnormal edges into a block to a separate forwarder
of course, with a load of all "abnormal" vars.

Anyway, not sure why the gimplify code is disabled for -O0 (or why we
don't re-use formal temps more aggressively as they become anonymous
SSA names later anyway).

Comment 3 Richard Biener 2014-09-03 14:24:40 UTC

So the issue is that the setjmp argument needs two temporaries:

  D.2832 = Unity.CurrentAbortFrame;
  D.2833 = &Unity.AbortFrame[D.2832];

  <bb 18>:
  D.2834 = _setjmp (D.2833);

and the EH edge going into the _setjmp call has to merge those through
the abnormal dispatcher.  And that way it receives all of them.  Hmm.

Huh.  Without the abnormal dispatcher they should just get default defs
everywhere (but still many PHI nodes).  Maybe that would be more light-weight.

Comment 4 Jakub Jelinek 2014-10-30 10:39:03 UTC

GCC 4.9.2 has been released.

Comment 5 Richard Biener 2015-03-06 13:01:30 UTC

I happen to have a (controversical) patch.

Comment 6 Richard Biener 2015-03-06 13:16:52 UTC

https://gcc.gnu.org/ml/gcc-patches/2015-03/msg00364.html

Comment 7 Richard Biener 2015-03-10 11:17:05 UTC

Author: rguenth
Date: Tue Mar 10 11:16:33 2015
New Revision: 221318

URL: https://gcc.gnu.org/viewcvs?rev=221318&root=gcc&view=rev
Log:
2015-03-10  Richard Biener  <rguenther@suse.de>

	PR middle-end/63155
	* tree-ssa-coalesce.h (verify_ssa_coalescing): Declare.
	* tree-ssa-coalesce.c: Include timevar.h.
	(attempt_coalesce): Handle graph being NULL.
	(coalesce_partitions): Call verify_ssa_coalescing if ENABLE_CHECKING.
	Split out abnormal coalescing to ...
	(perform_abnormal_coalescing): ... this function.
	(coalesce_ssa_name): Perform abnormal coalescing without computing
	live/conflict.
	(verify_ssa_coalescing_worker): New function.
	(verify_ssa_coalescing): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-coalesce.c
    trunk/gcc/tree-ssa-coalesce.h

Comment 8 Richard Biener 2015-03-10 11:17:35 UTC

Fixed on trunk sofar.

Comment 9 Richard Biener 2015-03-19 13:36:43 UTC

Reverted.

Comment 10 Richard Biener 2015-03-19 13:36:50 UTC

Author: rguenth
Date: Thu Mar 19 13:36:18 2015
New Revision: 221515

URL: https://gcc.gnu.org/viewcvs?rev=221515&root=gcc&view=rev
Log:
2015-03-19  Richard Biener  <rguenther@suse.de>

	Revert
	2015-03-10  Richard Biener  <rguenther@suse.de>

	PR middle-end/63155
	* tree-ssa-coalesce.h (verify_ssa_coalescing): Declare.
	* tree-ssa-coalesce.c: Include timevar.h.
	(attempt_coalesce): Handle graph being NULL.
	(coalesce_partitions): Call verify_ssa_coalescing if ENABLE_CHECKING.
	Split out abnormal coalescing to ...
	(perform_abnormal_coalescing): ... this function.
	(coalesce_ssa_name): Perform abnormal coalescing without computing
	live/conflict.
	(verify_ssa_coalescing_worker): New function.
	(verify_ssa_coalescing): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-coalesce.c
    trunk/gcc/tree-ssa-coalesce.h

Comment 11 Jakub Jelinek 2015-06-26 19:54:30 UTC

GCC 4.9.3 has been released.

Comment 12 Richard Biener 2016-08-03 11:41:30 UTC

GCC 4.9 branch is being closed

Comment 13 Jakub Jelinek 2017-03-23 09:12:14 UTC

So, shall we instead try to handle this in a mini-pass that would walk the PHIs of the abnormal dispatcher (if any), or maybe just if those PHI have many arguments or there are many such PHIs, create addressable vars for each of the PHIs and store on the edge sources and add load on the abnormal dispatcher successors?  Or any other plans?

Comment 14 Richard Biener 2017-03-23 09:52:49 UTC

The plan is to re-write coalescings conflict graph by using SSA-based liveness queries according to https://hal.inria.fr/inria-00192219

Well, at least I'd like to experiment with that as it might be useful in other
places as well (cost considerations wrt register pressure changes).

Only experiments will show if we trade too much compile-time for the reduction in memory use (not computing liveness / conflict graph).

Comment 15 Jakub Jelinek 2017-10-10 13:28:00 UTC

GCC 5 branch is being closed

Comment 16 Richard Biener 2018-08-22 07:21:13 UTC

Re-confirmed.

Comment 17 Richard Biener 2018-08-23 13:20:03 UTC

Created attachment 44579 [details]
alternate testcase

Comment 18 Richard Biener 2018-09-13 13:18:48 UTC

-ftree-coalesce-vars is a workaround for some testcases (enabled by default with optimization).

Comment 19 Richard Biener 2018-09-13 13:55:15 UTC

Oddly enough

Index: gcc/tree-ssa-coalesce.c
===================================================================
--- gcc/tree-ssa-coalesce.c     (revision 264259)
+++ gcc/tree-ssa-coalesce.c     (working copy)
@@ -620,7 +620,11 @@ ssa_conflicts_merge (ssa_conflicts *ptr,
     {
       bitmap bz = ptr->conflicts[z];
       if (bz)
-       bitmap_set_bit (bz, x);
+       {
+         bool was_there = bitmap_clear_bit (bz, y);
+         gcc_checking_assert (was_there);
+         bitmap_set_bit (bz, x);
+       }
     }
 
   if (bx)

changes at least the 2nd testcase to run faster (albeit memory use stays around the same).

w/o patch

> /usr/bin/time ./cc1 -quiet t2.c
108.14user 1.62system 1:49.78elapsed 99%CPU (0avgtext+0avgdata 5610876maxresident)k
0inputs+440outputs (0major+1442106minor)pagefaults 0swaps

w/ patch

> /usr/bin/time ./cc1 -quiet t2.c
86.53user 1.46system 1:27.99elapsed 99%CPU (0avgtext+0avgdata 5610888maxresident)k
0inputs+440outputs (0major+1440069minor)pagefaults 0swaps

note this is a -O0 "optimized" cc1 binary with checking enabled so ...

It's even so slightly faster with doing

          if (y < x)
            {
              bool was_there = bitmap_clear_bit (bz, y);
              gcc_checking_assert (was_there);
              bitmap_set_bit (bz, x);
            }
          else
            {
              bitmap_set_bit (bz, x);
              bool was_there = bitmap_clear_bit (bz, y);
              gcc_checking_assert (was_there);
            }

but that's probably luck (bitmap caching and path length from start vs.
x / y).  Eventually doing a forward walk also makes prefetchers
happy.

Probably with a lot of coalesces this keeps the conflict bitmaps small
(and thus the bitmap element walks fast).

Timings with release checking and optimized build:

patched:

> /usr/bin/time ../../obj/gcc/cc1 -quiet t2.c
22.91user 1.45system 0:24.38elapsed 99%CPU (0avgtext+0avgdata 5515460maxresident)k
0inputs+440outputs (0major+1378102minor)pagefaults 0swaps

patched, fancy forward walk:

> /usr/bin/time ../../obj/gcc/cc1 -quiet t2.c
22.47user 1.39system 0:23.88elapsed 99%CPU (0avgtext+0avgdata 5515420maxresident)k
0inputs+440outputs (0major+1377586minor)pagefaults 0swaps

unpatched:

> /usr/bin/time ../../obj/gcc/cc1 -quiet t2.c
46.60user 1.43system 0:48.03elapsed 99%CPU (0avgtext+0avgdata 5515380maxresident)k
0inputs+440outputs (0major+1378102minor)pagefaults 0swaps

I'm testing the simple patch now.

Comment 20 Richard Biener 2018-09-14 06:59:53 UTC

Author: rguenth
Date: Fri Sep 14 06:59:21 2018
New Revision: 264304

URL: https://gcc.gnu.org/viewcvs?rev=264304&root=gcc&view=rev
Log:
2018-09-14  Richard Biener  <rguenther@suse.de>

	PR middle-end/63155
	* tree-ssa-coalesce.c (ssa_conflicts_merge): Remove conflict
	bits for the merged partition.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-coalesce.c

Comment 21 Richard Biener 2018-09-14 07:53:52 UTC

So for the alternate testcase we have

Kind                   Nodes      Bytes
ssa names             753988   54287136

GIMPLE statements
Kind                   Stmts      Bytes
phi nodes             747999  381049752

Bitmaps                                                 Leak            Peak            Times  N searches Search iter      Type
tree-ssa-coalesce.c:705 (new_live_track)              91800:  0.0%     91800   6469037:  1.8%     1238519      248500      heap
tree-ssa-coalesce.c:586 (ssa_conflicts_add_one)     9916176:  3.5%4977905016 341148023: 94.8%   364290728   601796535      heap
tree-ssa-live.c:937 (new_tree_live_info)          119177600: 42.5% 119177600   2979440:  0.8%     2230467           0      heap
tree-ssa-live.c:938 (new_tree_live_info)          149077680: 53.1% 149077680   4476926:  1.2%     4467438     1379139      heap

as expected the conflict bitmaps are the issue.

We can shave off some more compile-time by noticing the asymmetry in
live_track_process_def and use bitmap_ior_into for one half.  That should
be more cache friendly as well.

before:

> /usr/bin/time ./cc1 -quiet t2.c
38.98user 1.40system 0:40.39elapsed 99%CPU (0avgtext+0avgdata 5626832maxresident)k
0inputs+440outputs (0major+1448898minor)pagefaults 0swaps

after:

> /usr/bin/time ./cc1 -quiet t2.c
35.81user 1.48system 0:37.30elapsed 99%CPU (0avgtext+0avgdata 5628100maxresident)k
0inputs+440outputs (0major+1440532minor)pagefaults 0swaps

not as much as with the previous patch but still...

Moves the above to

tree-ssa-coalesce.c:812 (live_track_process_def)    9896264:  3.5%4958012960 339784598: 94.4%   301208226   660769345      heap

and then we can of course avoid touching bitmaps with already recorded
conflicts.  Doing that improves things to

33.18user 1.43system 0:34.62elapsed 99%CPU (0avgtext+0avgdata 5626660maxresident)k
0inputs+440outputs (0major+1390642minor)pagefaults 0swaps

tree-ssa-coalesce.c:803 (live_track_process_def)      19912:  0.0%  19892656   1363425:  0.4%      960492     1915158      heap
tree-ssa-coalesce.c:810 (live_track_process_def)    9896264:  3.5%4958012960 339784598: 94.4%   301208226   660769345      heap

note it all seems to be a bit noisy in the end...

@@ -818,8 +794,20 @@ live_track_process_def (live_track *ptr,
   if (bitmap_bit_p (ptr->live_base_var, root))
     {
       b = ptr->live_base_partitions[root];
-      EXECUTE_IF_SET_IN_BITMAP (b, 0, x, bi)
-        ssa_conflicts_add (graph, p, x);
+      bitmap bp = graph->conflicts[p];
+      if (!bp)
+       bp = graph->conflicts[p] = BITMAP_ALLOC (&graph->obstack);
+      /* Add a conflict to P to each live base partition.  */
+      EXECUTE_IF_AND_COMPL_IN_BITMAP (b, bp, 0, x, bi)
+       {
+         gcc_checking_assert (x != (unsigned)p);
+         bitmap bx = graph->conflicts[x];
+         if (!bx)
+           bx = graph->conflicts[x] = BITMAP_ALLOC (&graph->obstack);
+         bitmap_set_bit (bx, p);
+       }
+      /* Add conflicts to each live base partition to P.  */
+      bitmap_ior_into (bp, b);
     }
 }

ideally we'd be able to combine the ior_into with the EXECUTE_IF.. walk.
Manually jump-threading the case of !bp might be worth it as well, since
we are micro-optimizing this...

Comment 22 Richard Biener 2018-09-14 12:36:07 UTC

Created attachment 44692 [details]
patch that does not help

The attached patch changes the partition view so that members of bases are adjacent.  This should improve the locality of the initial conflict set
bits (we only record conflicts within same base variables).

This doesn't help memory usage for either testcase.

For the 2nd testcase we have 187988 partitions and 500 bases.

The conflict bitmaps still end up very sparse (but large).  There are also
a lot of duplicate bitmaps (if you'd add self-conflicts).

Restricting anonymous coalesces to abnormal coalescing only increases the
number of bases to 747 and doesn't help memory use significantly.

There may be cases where the patch helps since it should improve locality.

Comment 23 Richard Biener 2018-09-14 12:59:48 UTC

(In reply to Richard Biener from comment #22)
> Created attachment 44692 [details]
> patch that does not help
> 
> The attached patch changes the partition view so that members of bases are
> adjacent.  This should improve the locality of the initial conflict set
> bits (we only record conflicts within same base variables).
> 
> This doesn't help memory usage for either testcase.
> 
> For the 2nd testcase we have 187988 partitions and 500 bases.
> 
> The conflict bitmaps still end up very sparse (but large).  There are also
> a lot of duplicate bitmaps (if you'd add self-conflicts).
> 
> Restricting anonymous coalesces to abnormal coalescing only increases the
> number of bases to 747 and doesn't help memory use significantly.

So doing this^^^ more conservatively somehow, choosing different bases for
different set of final coalesced abnormal partitions, would shrink the
size of the largest base significantly, reducing the quadraticness.

So perform the abnormal part of coalesce_partitions implicitely, computing
sets of coalesced partitions and pre-assigning a base to each of them.
The only complication is in rejecting further coalescing to the base
members (thus adjusting gimple_can_coalesce_p).

> There may be cases where the patch helps since it should improve locality.

Comment 24 Richard Biener 2018-09-17 10:50:09 UTC

*** Bug 87316 has been marked as a duplicate of this bug. ***

Comment 25 Richard Biener 2018-09-17 13:54:48 UTC

Created attachment 44705 [details]
patch

This "simple" one helps.  It builds partition bases similar to the -ftree-coalesce-vars case.  Correctness needs to be verified still but I think it errs on the safe side (computing too many dependences at most).  In particular it avoids computing dependences between things we'll later not try coalescing anyways.

Comment 26 Richard Biener 2018-09-18 13:26:37 UTC

Author: rguenth
Date: Tue Sep 18 13:26:05 2018
New Revision: 264388

URL: https://gcc.gnu.org/viewcvs?rev=264388&root=gcc&view=rev
Log:
2018-09-18  Richard Biener  <rguenther@suse.de>

	PR middle-end/63155
	* tree-ssa-coalesce.c (tree_int_map_hasher): Remove.
	(compute_samebase_partition_bases): Likewise.
	(coalesce_ssa_name): Always use compute_optimized_partition_bases.
	(gimple_can_coalesce_p): Simplify.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-coalesce.c

Comment 27 Richard Biener 2018-09-19 07:37:44 UTC

TOT now for the alternate testcase at -O1+ exhibits

Time variable                                   usr           sys          wall               GGC
...
 tree CCP                           :   9.21 ( 64%)   0.00 (  0%)   9.22 ( 63%)     244 kB (  0%)
...
 TOTAL                              :  14.35          0.22         14.57         521240 kB

never seen CCP take that much time.

All time is spent in bitmap_set_bit called by add_ssa_edge from ssa_propagation_engine::simulate_stmt.

It sets stmt UIDs in the ssa_edge_worklist.

Note on that testcase CCP really does a lot (disabling CCP makes things only slower), which means the testcase is quite artificial.  There are
752492 stmts at the point the SSA propagator starts to run and as out-of-SSA
coalescing told us bits are spread in an unfortunate way here.

Sorting the uses we walk in add_ssa_edge after gimple_uid doesn't help much.
We're still processing bits like 4, 1509, 3014, 4519, 6024, 7529, 9034, ...

Assigning UIDs lazily in that very function improves things (as expected),
but we still have

 tree CCP                           :   3.82 ( 43%)   0.00 (  0%)   3.83 ( 42%)     244 kB (  0%)

after that.  I'll dig a bit further later.

Comment 28 Richard Biener 2018-09-19 07:50:06 UTC

Created attachment 44724 [details]
patch for the SSA propagator issue

Using an sbitmap helps.  I am testing the attached.

Comment 29 Richard Biener 2018-09-19 13:06:14 UTC

(In reply to Richard Biener from comment #28)
> Created attachment 44724 [details]
> patch for the SSA propagator issue
> 
> Using an sbitmap helps.  I am testing the attached.

Note there's

void
ssa_propagation_engine::process_ssa_edge_worklist (void)
{
  /* Process the next entry from the worklist.  */
  unsigned stmt_uid = bitmap_first_set_bit (ssa_edge_worklist);

which will then be a O(n) operation.  Likewise the bitmap_empty_p test
isn't efficient (can be avoided by re-using the above walk).

For this particular testcase we only
end up here 1494 times (compared to many more bitmap_set_bit operations).
Still this makes using a sbitmap not the best idea.  A sparse-set fits a bit
better but we want to iterate over the set in UID order as well which
the sparse-set doesn't allow - it only does choose_one in O(1) which might
not be the optimal propagation order (but eventually this detail isn't
too important...?).  Iff only our sparse bitmap implementation would
be O(log N) in bit finding...

Note all of the same issues in theory apply to the CFG BB worklist.

Comment 30 Richard Biener 2018-09-19 13:07:05 UTC

Created attachment 44726 [details]
patch for the SSA propagator issue

Comment 31 Richard Biener 2018-09-20 11:07:44 UTC

So we end up with an almost fully set ssa_edge_worklist because we add all the PHIs from blocks we already processed by having visited the backedge defs.  But
the backedge isn't actually yet executable and avoiding to set those bits
brings down the set to a maximum size of 2985 (compared to ~700000 before).

So a simpler patch for this particular issue is

Index: gcc/tree-ssa-propagate.c
===================================================================
--- gcc/tree-ssa-propagate.c    (revision 264438)
+++ gcc/tree-ssa-propagate.c    (working copy)
@@ -168,15 +170,26 @@ add_ssa_edge (tree var)
   FOR_EACH_IMM_USE_FAST (use_p, iter, var)
     {
       gimple *use_stmt = USE_STMT (use_p);
+      basic_block use_bb = gimple_bb (use_stmt);
 
       /* If we did not yet simulate the block wait for this to happen
          and do not add the stmt to the SSA edge worklist.  */
-      if (! (gimple_bb (use_stmt)->flags & BB_VISITED))
+      if (! (use_bb->flags & BB_VISITED))
        continue;
 
+      /* If this is a use on a not yet executable edge do not bother to
+         queue it.  */
+      if (gimple_code (use_stmt) == GIMPLE_PHI
+         && !(EDGE_PRED (use_bb, PHI_ARG_INDEX_FROM_USE (use_p))->flags
+              & EDGE_EXECUTABLE))
+       return;
+
       if (prop_simulate_again_p (use_stmt)
          && bitmap_set_bit (ssa_edge_worklist, gimple_uid (use_stmt)))
        {

Comment 32 Richard Biener 2018-09-24 07:08:55 UTC

Author: rguenth
Date: Mon Sep 24 07:08:24 2018
New Revision: 264523

URL: https://gcc.gnu.org/viewcvs?rev=264523&root=gcc&view=rev
Log:
2018-09-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
	the worklist when the edge of the respective argument isn't
	executable.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-propagate.c

Comment 33 Richard Biener 2018-09-25 11:30:45 UTC

So on trunk for the original testcase I now see

> /usr/bin/time ./cc1 -quiet testunity_Runner.i -std=c99
2.70user 0.16system 0:02.86elapsed 100%CPU (0avgtext+0avgdata 427672maxresident)k
0inputs+504outputs (0major+106295minor)pagefaults 0swaps

while on the same machine using GCC 4.8:

> /usr/bin/time /space/rguenther/install/gcc-4.8.5/bin/gcc -S testunity_Runner.i -std=c99
0.24user 0.01system 0:00.60elapsed 41%CPU (0avgtext+0avgdata 39424maxresident)k
30960inputs+504outputs (37major+8516minor)pagefaults 0swaps

so we've come a long way but still regressed which is somehow not avoidable
because of the correctness fix that started this.

For reference GCC 8.2 numbers are

> /usr/bin/time /space/rguenther/install/gcc-8.2/bin/gcc -S testunity_Runner.i -std=c99
94.31user 2.46system 1:36.79elapsed 99%CPU (0avgtext+0avgdata 10172916maxresident)k
0inputs+504outputs (0major+2535422minor)pagefaults 0swaps

So overall I consider this issue fixed for trunk.

Comment 34 David 2018-09-26 09:03:12 UTC

My primary concern in 87316 was about memory usage and this patch definitely helps a lot with that.  Thanks! 

Using -ftree-coalesce-vars helps on >= 4.9 versions and does not seem to have an adverse effect on test coverage.

Comment 35 Rogério de Souza Moraes 2018-10-04 20:57:22 UTC

Created attachment 44791 [details]
Small testcase more similar to original environment

Hi Richard,

this is a new testcase, based on another file in the original environment. It’s quite small (7000 lines, 240 setjmp calls).

This code with a little complex but still simplified control structure represents state machine implementation, which is very widely used by our customers. Another new factor is the nested setjmp calls. Of course, original testcase is more complex and takes even more time with more difference.

You can run it using the following commands:


time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O0 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o

time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o


Results :

GCC: 4.8.5 (From RHEL 7.5)

real	0m0.349s
user	0m0.255s
sys	0m0.083s

real	0m0.193s
user	0m0.163s
sys	0m0.023s

GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it)

real	0m32.235s
user	0m30.486s
sys	0m1.622s

real	3m34.203s
user	3m33.726s
sys	0m0.292s

The performance difference is relevant in this test.

Regards,
--
Rogerio

Comment 36 rguenther@suse.de 2018-10-05 08:01:51 UTC

On Thu, 4 Oct 2018, rogerio.souza at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155
> 
> --- Comment #35 from Rogério de Souza Moraes <rogerio.souza at gmail dot com> ---
> Created attachment 44791 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44791&action=edit
> Small testcase more similar to original environment
> 
> Hi Richard,
> 
> this is a new testcase, based on another file in the original environment. It’s
> quite small (7000 lines, 240 setjmp calls).
> 
> This code with a little complex but still simplified control structure
> represents state machine implementation, which is very widely used by our
> customers. Another new factor is the nested setjmp calls. Of course, original
> testcase is more complex and takes even more time with more difference.
> 
> You can run it using the following commands:
> 
> 
> time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O0
> -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4
> gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o
> 
> time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O
> -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4
> gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o
> 
> 
> Results :
> 
> GCC: 4.8.5 (From RHEL 7.5)
> 
> real    0m0.349s
> user    0m0.255s
> sys     0m0.083s
> 
> real    0m0.193s
> user    0m0.163s
> sys     0m0.023s
> 
> GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it)
> 
> real    0m32.235s
> user    0m30.486s
> sys     0m1.622s
> 
> real    3m34.203s
> user    3m33.726s
> sys     0m0.292s
> 
> The performance difference is relevant in this test.

Thanks for the more realistic testcase.  I can confirm the above
and I also see a slowdown in GCC 9 compared to GCC 8 at -O1:

> /usr//bin/time gcc-8 -S t.c -O -fwrapv -mstackrealign 
-mpreferred-stack-boundary=4 -m32
157.48user 0.24system 2:37.78elapsed 99%CPU (0avgtext+0avgdata 
888036maxresident)k
47704inputs+152outputs (8major+240936minor)pagefaults 0swaps

> /usr//bin/time gcc-9 -S t.c -O -fwrapv -mstackrealign 
-mpreferred-stack-boundary=4 -m32
197.61user 0.39system 3:18.08elapsed 99%CPU (0avgtext+0avgdata 
890628maxresident)k
0inputs+184outputs (0major+259016minor)pagefaults 0swaps

Somehow it's still CCP that makes things slow:

 tree CCP                           : 178.52 ( 89%)   0.01 (  2%) 178.55 ( 
89%)     646 kB (  0%)

perf tells me it's

-   96.33%    29.55%         14801  cc1      cc1               [.] 
ccp_propagate::visit_phi                                            ▒
     ccp_propagate::visit_phi                                                                                                          
▒
   - ssa_propagation_engine::simulate_stmt                                                                                             
▒
      + 49.51% ssa_propagation_engine::simulate_block                                                                                  
▒
      + 46.82% ssa_propagation_engine::ssa_propagate                                                 

-   37.06%    28.98%         12421  cc1      cc1               [.] 
ccp_lattice_meet                                                    ▒
   - ccp_lattice_meet                                                                                                                  
▒
      + 37.02% ccp_propagate::visit_phi                                                                                                
▒
      + 0.03% set_lattice_value                  

-    5.17%     5.17%          1949  cc1      cc1               [.] 
wi::bit_or<generic_wide_int<fixed_wide_int_storage<192> >, generic_w▒
     wi::bit_or<generic_wide_int<fixed_wide_int_storage<192> >, 
generic_wide_int<fixed_wide_int_storage<192> > >                       ▒
   - ccp_lattice_meet                                                                                                                  
▒
      + 5.16% ccp_propagate::visit_phi                                                                                                 
▒
      + 0.01% set_lattice_value                                                      

-    4.02%     4.02%          1509  cc1      cc1               [.] 
canonicalize_value                                                  ▒
   - canonicalize_value                                                                                                                
▒
      + 4.02% get_value_for_expr                                                                                                       
▒
      + 0.00% ccp_folder::get_value                  

-    2.90%     2.89%          1083  cc1      cc1               [.] 
wi::eq_p<generic_wide_int<fixed_wide_int_storage<192> >, int>       ▒
     wi::eq_p<generic_wide_int<fixed_wide_int_storage<192> >, int>                                                                     
▒
   - ccp_lattice_meet                                                                                                                  
▒
      + 2.89% ccp_propagate::visit_phi                                                                                                 
▒
      + 0.00% set_lattice_value                   

As said, thanks for the testcase.

Comment 37 Richard Biener 2018-10-05 12:55:22 UTC

Author: rguenth
Date: Fri Oct  5 12:54:51 2018
New Revision: 264869

URL: https://gcc.gnu.org/viewcvs?rev=264869&root=gcc&view=rev
Log:
2018-10-05  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
	vertical space in dumpfiles.
	* tree-ssa-propagate.h
	(ssa_propagation_engine::process_ssa_edge_worklist): Remove.
	* tree-ssa-propagate.c (cfg_blocks_back): New global.
	(ssa_edge_worklist_back): Likewise.
	(curr_order): Likewise.
	(cfg_blocks_get): Remove abstraction.
	(cfg_blocks_add): Likewise.
	(cfg_blocks_empty_p): Likewise.
	(add_ssa_edge): Add to current or next worklist based on
	RPO index.
	(add_control_edge): Likewise.
	(ssa_propagation_engine::process_ssa_edge_worklist): Fold
	into ...
	(ssa_propagation_engine::ssa_propagate): ... here.  Unify
	iteration from CFG and SSA edge worklist so we process
	everything in RPO order, prioritizing forward progress
	over iteration.
	(ssa_prop_init): Allocate new worklists, do not dump
	immediate uses.
	(ssa_prop_fini): Free new worklists.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-ccp.c
    trunk/gcc/tree-ssa-propagate.c
    trunk/gcc/tree-ssa-propagate.h

Comment 38 Richard Biener 2018-10-05 13:14:11 UTC

For the last testcase the compile-time on trunk is now 25s at -O1:

 tree PTA                           :   3.37 ( 13%)   0.10 ( 30%)   3.46 ( 13%)   12445 kB (  2%)
 tree CCP                           :   4.61 ( 18%)   0.00 (  0%)   4.62 ( 18%)     646 kB (  0%)
 tree FRE                           :   2.21 (  9%)   0.01 (  3%)   2.21 (  9%)     116 kB (  0%)
 tree backward propagate            :   5.03 ( 20%)   0.00 (  0%)   5.04 ( 20%)       0 kB (  0%)
 out of ssa                         :   3.05 ( 12%)   0.00 (  0%)   3.05 ( 12%)       0 kB (  0%)
 TOTAL                              :  25.39          0.33         25.72         573954 kB

and perf:

Samples: 9K of event 'instructions', Event count (approx.): 107285199390                                                      
Overhead       Samples  Command  Shared Object     Symbol                                                                    ◆
  18.06%          1195  cc1      cc1               [.] (anonymous namespace)::backprop::process_var                          ▒
   5.58%           560  cc1      cc1               [.] visit_phi                                                             ▒
   5.21%           476  cc1      cc1               [.] inchash::add_expr                                                     ▒
   5.21%           671  cc1      cc1               [.] VN_INFO                                                               ▒
   5.14%           493  cc1      cc1               [.] bitmap_set_bit                                                        ▒
   3.13%           296  cc1      cc1               [.] hash_table<vn_ssa_aux_hasher, xcallocator>::find_with_hash            ▒
   2.99%           287  cc1      cc1               [.] vn_phi_lookup                                                         ▒
   2.39%           229  cc1      cc1               [.] bitmap_ior_into                                                       ▒
   1.77%           165  cc1      cc1               [.] do_rpo_vn

Comment 39 Richard Biener 2018-10-05 13:21:50 UTC

Oh, and backprop is really intersect_uses () with

  FOR_EACH_IMM_USE_STMT (stmt, iter, var)
    {

being quadratic due to its stupid implementation (we really have many uses
of vars).  If the pass can deal with duplicate stmt uses just fine using
FOR_EACH_IMM_USE_FAST is going to be faster.

Comment 40 Richard Biener 2018-10-08 07:17:00 UTC

Author: rguenth
Date: Mon Oct  8 07:16:28 2018
New Revision: 264912

URL: https://gcc.gnu.org/viewcvs?rev=264912&root=gcc&view=rev
Log:
2018-10-08  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
	(ssa_propagation_engine::ssa_propagate): Remove redundant
	bitmap bit clearing.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-propagate.c

Comment 41 Richard Sandiford 2018-10-08 08:08:52 UTC

(In reply to Richard Biener from comment #39)
> Oh, and backprop is really intersect_uses () with
> 
>   FOR_EACH_IMM_USE_STMT (stmt, iter, var)
>     {
> 
> being quadratic due to its stupid implementation (we really have many uses
> of vars).

Ouch, hadn't realised the difference between them was that severe.

> If the pass can deal with duplicate stmt uses just fine using
> FOR_EACH_IMM_USE_FAST is going to be faster.

Yeah, should be fine here, since the function is just gathering
information.  Testing a patch...

Comment 42 Richard Sandiford 2018-10-08 18:59:30 UTC

Author: rsandifo
Date: Mon Oct  8 18:58:59 2018
New Revision: 264941

URL: https://gcc.gnu.org/viewcvs?rev=264941&root=gcc&view=rev
Log:
Use FOR_EACH_IMM_USE_FAST in gimple-ssa-backprop.c

As pointed out by Richard in PR63155.  It speeds up the testcase a few %.

2018-10-08  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	PR middle-end/63155
	* gimple-ssa-backprop.c (backprop::intersect_uses): Use
	FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-backprop.c

Comment 43 Richard Biener 2018-10-09 08:06:18 UTC

We're now down to

 tree PTA                           :   3.92 ( 16%)   0.12 ( 36%)   4.02 ( 16%)   12445 kB (  2%)
 tree CCP                           :   7.43 ( 30%)   0.02 (  6%)   7.44 ( 29%)     646 kB (  0%)
 tree FRE                           :   2.34 (  9%)   0.00 (  0%)   2.35 (  9%)     116 kB (  0%)
 tree backward propagate            :   0.62 (  2%)   0.00 (  0%)   0.62 (  2%)       0 kB (  0%)
 out of ssa                         :   3.01 ( 12%)   0.00 (  0%)   3.01 ( 12%)       0 kB (  0%)
 TOTAL                              :  24.91          0.33         25.26         573769 kB

notice the tree backward propagate improvement.  This makes CCP the main
offender again but as said the rectification would probably mean pulling
back the SSA SCC discovery code from SCCVN and use that in the SSA
propagator somehow.

The out of SSA time is what was originally topic of this bug.

The tree PTA time is "new" and related to the number of PHI nodes
and edges.  You can disable PTA via -fno-tree-pta.

The tree FRE time is PHI lookups/inserts, some refactoring can speed this up
a bit.

Comment 44 Richard Biener 2018-10-09 09:27:01 UTC

(In reply to Richard Biener from comment #43)
> This makes CCP the main
> offender again but as said the rectification would probably mean pulling
> back the SSA SCC discovery code from SCCVN and use that in the SSA
> propagator somehow.

I take that back.  SCC processing is quite fundamentally incompatible
with the way SSA propagation works.

But what would be possible is to add a non-optimistic mode to the SSA
propagator removing the need to iterate at all.  That's some non-trivial
work though, possibly better spent teaching value-numbering the bits
of CCP that it doesn't do (bit-value tracking, UNDEF handling) and then
kill off CCP altogether.

Comment 45 Richard Biener 2018-10-09 11:45:57 UTC

Author: rguenth
Date: Tue Oct  9 11:43:46 2018
New Revision: 264956

URL: https://gcc.gnu.org/viewcvs?rev=264956&root=gcc&view=rev
Log:
2018-10-09  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-structalias.c: Include tree-ssa.h.
	(get_constraint_for_ssa_var): For undefs return nothing_id.
	(find_func_aliases): Cleanup PHI handling.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c

Comment 46 Richard Biener 2018-10-16 11:23:54 UTC

Author: rguenth
Date: Tue Oct 16 11:23:22 2018
New Revision: 265189

URL: https://gcc.gnu.org/viewcvs?rev=265189&root=gcc&view=rev
Log:
2018-10-16  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2018-09-18  Richard Biener  <rguenther@suse.de>

	PR middle-end/63155
	* tree-ssa-coalesce.c (tree_int_map_hasher): Remove.
	(compute_samebase_partition_bases): Likewise.
	(coalesce_ssa_name): Always use compute_optimized_partition_bases.
	(gimple_can_coalesce_p): Simplify.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/tree-ssa-coalesce.c

Comment 47 Richard Biener 2018-10-16 13:24:30 UTC

Author: rguenth
Date: Tue Oct 16 13:23:56 2018
New Revision: 265193

URL: https://gcc.gnu.org/viewcvs?rev=265193&root=gcc&view=rev
Log:
2018-10-16  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2018-10-08  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
	(ssa_propagation_engine::ssa_propagate): Remove redundant
	bitmap bit clearing.

	2018-10-05  Richard Biener  <rguenther@suse.de>
 
	PR tree-optimization/63155
	* tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
	vertical space in dumpfiles.
	* tree-ssa-propagate.h
	(ssa_propagation_engine::process_ssa_edge_worklist): Remove.
	* tree-ssa-propagate.c (cfg_blocks_back): New global.
	(ssa_edge_worklist_back): Likewise.
	(curr_order): Likewise.
	(cfg_blocks_get): Remove abstraction.
	(cfg_blocks_add): Likewise.
	(cfg_blocks_empty_p): Likewise.
	(add_ssa_edge): Add to current or next worklist based on
	RPO index.
	(add_control_edge): Likewise.
	(ssa_propagation_engine::process_ssa_edge_worklist): Fold
	into ...
	(ssa_propagation_engine::ssa_propagate): ... here.  Unify
	iteration from CFG and SSA edge worklist so we process
	everything in RPO order, prioritizing forward progress
	over iteration.
	(ssa_prop_init): Allocate new worklists, do not dump
	immediate uses.
	(ssa_prop_fini): Free new worklists.

	2018-09-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
	the worklist when the edge of the respective argument isn't
	executable.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/tree-ssa-propagate.c

Comment 48 Richard Biener 2018-10-17 07:02:01 UTC

Author: rguenth
Date: Wed Oct 17 07:01:28 2018
New Revision: 265231

URL: https://gcc.gnu.org/viewcvs?rev=265231&root=gcc&view=rev
Log:
2018-10-17  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2018-10-08  Richard Sandiford  <richard.sandiford@arm.com>

	PR middle-end/63155
	* gimple-ssa-backprop.c (backprop::intersect_uses): Use
	FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/gimple-ssa-backprop.c

Comment 49 Richard Biener 2018-10-17 08:49:33 UTC

Author: rguenth
Date: Wed Oct 17 08:49:00 2018
New Revision: 265235

URL: https://gcc.gnu.org/viewcvs?rev=265235&root=gcc&view=rev
Log:
2018-10-16  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2018-10-08  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
	(ssa_propagation_engine::ssa_propagate): Remove redundant
	bitmap bit clearing.

	2018-10-05  Richard Biener  <rguenther@suse.de>
 
	PR tree-optimization/63155
	* tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
	vertical space in dumpfiles.
	* tree-ssa-propagate.h
	(ssa_propagation_engine::process_ssa_edge_worklist): Remove.
	* tree-ssa-propagate.c (cfg_blocks_back): New global.
	(ssa_edge_worklist_back): Likewise.
	(curr_order): Likewise.
	(cfg_blocks_get): Remove abstraction.
	(cfg_blocks_add): Likewise.
	(cfg_blocks_empty_p): Likewise.
	(add_ssa_edge): Add to current or next worklist based on
	RPO index.
	(add_control_edge): Likewise.
	(ssa_propagation_engine::process_ssa_edge_worklist): Fold
	into ...
	(ssa_propagation_engine::ssa_propagate): ... here.  Unify
	iteration from CFG and SSA edge worklist so we process
	everything in RPO order, prioritizing forward progress
	over iteration.
	(ssa_prop_init): Allocate new worklists, do not dump
	immediate uses.
	(ssa_prop_fini): Free new worklists.

	2018-09-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
	the worklist when the edge of the respective argument isn't
	executable.

Modified:
    branches/gcc-8-branch/gcc/tree-ssa-ccp.c
    branches/gcc-8-branch/gcc/tree-ssa-propagate.h

Comment 50 Rogério de Souza Moraes 2018-10-17 17:58:38 UTC

Created attachment 44848 [details]
GCC 6.3.0 consolidated patch based on Richard's patches

The patch attached is a backport based on Richard's patches to GCC v6.3.0. If any issues, please let me know.

Regards,
--
Rogerio

Comment 51 Richard Biener 2018-10-22 13:54:55 UTC

Author: rguenth
Date: Mon Oct 22 13:54:23 2018
New Revision: 265390

URL: https://gcc.gnu.org/viewcvs?rev=265390&root=gcc&view=rev
Log:
2018-10-22  Steven Bosscher <steven@gcc.gnu.org>
	Richard Biener  <rguenther@suse.de>

	* bitmap.h: Update data structure documentation, including a
	description of bitmap views as either linked-lists or splay trees.
	(struct bitmap_element_def): Update comments for splay tree bitmaps.
	(struct bitmap_head_def): Likewise.
	(bitmap_list_view, bitmap_tree_view): New prototypes.
	(bitmap_initialize_stat): Initialize a bitmap_head's indx and
	tree_form fields.
	(bmp_iter_set_init): Assert the iterated bitmaps are in list form.
	(bmp_iter_and_init, bmp_iter_and_compl_init): Likewise.
	* bitmap.c (bitmap_elem_to_freelist): Unregister overhead of a
	released bitmap element here.
	(bitmap_element_free): Remove.
	(bitmap_elt_clear_from): Work on splay tree bitmaps.
	(bitmap_list_link_element): Renamed from bitmap_element_link.  Move
	this function similar ones such that linked-list bitmap implementation
	functions are grouped.
	(bitmap_list_unlink_element): Renamed from bitmap_element_unlink,
	and moved for grouping.
	(bitmap_list_insert_element_after): Renamed from
	bitmap_elt_insert_after, and moved for grouping.
	(bitmap_list_find_element): New function spliced from bitmap_find_bit.
	(bitmap_tree_link_left, bitmap_tree_link_right,
	bitmap_tree_rotate_left, bitmap_tree_rotate_right, bitmap_tree_splay,
	bitmap_tree_link_element, bitmap_tree_unlink_element,
	bitmap_tree_find_element): New functions for splay-tree bitmap
	implementation.
	(bitmap_element_link, bitmap_element_unlink, bitmap_elt_insert_after):
	Renamed and moved, see above entries.
	(bitmap_tree_listify_from): New function to convert part of a splay
	tree bitmap to a linked-list bitmap.
	(bitmap_list_view): Convert a splay tree bitmap to linked-list form.
	(bitmap_tree_view): Convert a linked-list bitmap to splay tree form.
	(bitmap_find_bit): Remove.
	(bitmap_clear, bitmap_clear_bit, bitmap_set_bit,
	bitmap_single_bit_set_p, bitmap_first_set_bit, bitmap_last_set_bit):
	Handle splay tree bitmaps.
	(bitmap_copy, bitmap_count_bits, bitmap_and, bitmap_and_into,
	bitmap_elt_copy, bitmap_and_compl, bitmap_and_compl_into,
	bitmap_compl_and_into, bitmap_elt_ior, bitmap_ior, bitmap_ior_into,
	bitmap_xor, bitmap_xor_into, bitmap_equal_p, bitmap_intersect_p,
	bitmap_intersect_compl_p, bitmap_ior_and_compl,
	bitmap_ior_and_compl_into, bitmap_set_range, bitmap_clear_range,
	bitmap_hash): Reject trying to act on splay tree bitmaps.  Make
	corresponding changes to use linked-list specific bitmap_element
	manipulation functions as applicable for efficiency.
	(bitmap_tree_to_vec): New function.
	(debug_bitmap_elt_file): New function split out from ...
	(debug_bitmap_file): ... here.  Handle splay tree bitmaps.
	(bitmap_print): Likewise.

	PR tree-optimization/63155
	* tree-ssa-propagate.c (ssa_prop_init): Use tree-view for the
	SSA edge worklists.
	* tree-ssa-coalesce.c (coalesce_ssa_name): Populate used_in_copies
	in tree-view.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/bitmap.c
    trunk/gcc/bitmap.h
    trunk/gcc/tree-ssa-coalesce.c
    trunk/gcc/tree-ssa-propagate.c

Comment 52 Jakub Jelinek 2018-10-26 10:13:02 UTC

GCC 6 branch is being closed

Comment 53 Rogério de Souza Moraes 2019-07-25 12:31:31 UTC

Created attachment 46627 [details]
New testcase that shows performance degradation in GCC 9.1

This new testcase that shows performance degradation in GCC 9.1, the performance gets worst as we use a newer GCC version to build it.

Example how to compile:

gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o

Comment 54 Rogério de Souza Moraes 2019-07-25 12:53:10 UTC

Hi everyone,

we continue to get performance degradation as we use a new compiler version to build the source code 'testcase-v1.c" that is available in the attachment above.

Could you please check if the test case that I attached reproduces the same issue, or it is a different issue?

Below are the results that I got:

# GCC 4.4.7
-bash-4.1$ gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o

Execution times (seconds)
 callgraph construction:   0.02 ( 0%) usr   0.01 ( 4%) sys   0.03 ( 0%) wall    2535 kB ( 1%) ggc
 callgraph optimization:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      11 kB ( 0%) ggc
 cfg cleanup           :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.07 ( 1%) wall     151 kB ( 0%) ggc
 trivially dead code   :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   0.40 ( 4%) usr   0.01 ( 4%) sys   0.40 ( 4%) wall       0 kB ( 0%) ggc
 df live regs          :   0.33 ( 3%) usr   0.00 ( 0%) sys   0.37 ( 4%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.08 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 1%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   0.17 ( 2%) usr   0.00 ( 0%) sys   0.16 ( 2%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.25 ( 3%) usr   0.00 ( 0%) sys   0.22 ( 2%) wall    2519 kB ( 1%) ggc
 register information  :   0.13 ( 1%) usr   0.00 ( 0%) sys   0.11 ( 1%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%) wall    1952 kB ( 1%) ggc
 register scan         :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 rebuild jump labels   :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 preprocessing         :   0.02 ( 0%) usr   0.02 ( 7%) sys   0.02 ( 0%) wall     466 kB ( 0%) ggc
 lexical analysis      :   0.03 ( 0%) usr   0.02 ( 7%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   0.06 ( 1%) usr   0.07 (26%) sys   0.19 ( 2%) wall   18164 kB ( 9%) ggc
 inline heuristics     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree gimplify         :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall   19793 kB (10%) ggc
 tree CFG construction :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    4445 kB ( 2%) ggc
 tree CFG cleanup      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall    1209 kB ( 1%) ggc
 tree VRP              :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall   12984 kB ( 6%) ggc
 tree copy propagation :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 1%) wall       9 kB ( 0%) ggc
 tree find ref. vars   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1190 kB ( 1%) ggc
 tree PTA              :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall      33 kB ( 0%) ggc
 tree alias analysis   :   0.04 ( 0%) usr   0.01 ( 4%) sys   0.03 ( 0%) wall     308 kB ( 0%) ggc
 tree flow sensitive alias:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree PHI insertion    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1349 kB ( 1%) ggc
 tree SSA rewrite      :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%) wall   16607 kB ( 8%) ggc
 tree SSA other        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   0.11 ( 1%) usr   0.01 ( 4%) sys   0.06 ( 1%) wall    5142 kB ( 3%) ggc
 tree operand scan     :   0.11 ( 1%) usr   0.03 (11%) sys   0.12 ( 1%) wall   10620 kB ( 5%) ggc
 dominator optimization:   0.08 ( 1%) usr   0.00 ( 0%) sys   0.11 ( 1%) wall    2130 kB ( 1%) ggc
 tree CCP              :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 1%) wall       6 kB ( 0%) ggc
 tree split crit edges :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    6801 kB ( 3%) ggc
 tree reassociation    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       2 kB ( 0%) ggc
 tree PRE              :   2.94 (31%) usr   0.02 ( 7%) sys   2.99 (30%) wall    4955 kB ( 2%) ggc
 tree FRE              :   0.16 ( 2%) usr   0.01 ( 4%) sys   0.17 ( 2%) wall    1933 kB ( 1%) ggc
 tree code sinking     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       1 kB ( 0%) ggc
 tree forward propagate:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree conservative DCE :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     228 kB ( 0%) ggc
 PHI merge             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 complete unrolling    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       2 kB ( 0%) ggc
 tree SSA uncprop      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA to normal    :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.09 ( 1%) wall   15109 kB ( 7%) ggc
 tree rename SSA copies:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree switch initialization conversion:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance frontiers   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 control dependences   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 expand                :   0.49 ( 5%) usr   0.02 ( 7%) sys   0.50 ( 5%) wall   34138 kB (17%) ggc
 jump                  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 forward prop          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    1556 kB ( 1%) ggc
 CSE                   :   0.13 ( 1%) usr   0.01 ( 4%) sys   0.16 ( 2%) wall     606 kB ( 0%) ggc
 dead code elimination :   0.09 ( 1%) usr   0.00 ( 0%) sys   0.07 ( 1%) wall       0 kB ( 0%) ggc
 dead store elim1      :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.09 ( 1%) wall    1706 kB ( 1%) ggc
 dead store elim2      :   0.12 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 1%) wall    4207 kB ( 2%) ggc
 CSE 2                 :   0.11 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%) wall     303 kB ( 0%) ggc
 branch prediction     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       2 kB ( 0%) ggc
 combiner              :   0.15 ( 2%) usr   0.01 ( 4%) sys   0.14 ( 1%) wall    4423 kB ( 2%) ggc
 if-conversion         :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     154 kB ( 0%) ggc
 regmove               :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 integrated RA         :   0.90 ( 9%) usr   0.00 ( 0%) sys   0.89 ( 9%) wall    1788 kB ( 1%) ggc
 reload                :   0.39 ( 4%) usr   0.00 ( 0%) sys   0.38 ( 4%) wall    6285 kB ( 3%) ggc
 reload CSE regs       :   0.20 ( 2%) usr   0.01 ( 4%) sys   0.20 ( 2%) wall    8333 kB ( 4%) ggc
 load CSE after reload :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue:   0.06 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall      13 kB ( 0%) ggc
 peephole 2            :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 rename registers      :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%) wall      75 kB ( 0%) ggc
 scheduling 2          :   0.35 ( 4%) usr   0.01 ( 4%) sys   0.35 ( 4%) wall       0 kB ( 0%) ggc
 machine dep reorg     :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall     301 kB ( 0%) ggc
 reorder blocks        :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 1%) wall    3708 kB ( 2%) ggc
 final                 :   0.11 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 1%) wall    3977 kB ( 2%) ggc
 TOTAL                 :   9.48             0.27             9.83             203289 kB

####################################################################################
#GCC 6.3.1
bash-4.1$ gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     892 kB ( 0%) ggc
 phase parsing           :   0.21 ( 0%) usr   0.09 ( 3%) sys   0.32 ( 0%) wall   20284 kB ( 1%) ggc
 phase opt and generate  : 120.50 (100%) usr   3.43 (97%) sys 125.05 (100%) wall 3091168 kB (99%) ggc
 garbage collection      :   2.50 ( 2%) usr   0.02 ( 1%) sys   2.90 ( 2%) wall       0 kB ( 0%) ggc
 callgraph construction  :   1.81 ( 1%) usr   0.08 ( 2%) sys   2.33 ( 2%) wall    8824 kB ( 0%) ggc
 callgraph optimization  :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       8 kB ( 0%) ggc
 ipa dead code removal   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.99 ( 1%) usr   0.00 ( 0%) sys   0.99 ( 1%) wall     512 kB ( 0%) ggc
 ipa inlining heuristics :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       9 kB ( 0%) ggc
 ipa profile             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa icf                 :   5.44 ( 5%) usr   0.00 ( 0%) sys   5.44 ( 4%) wall      26 kB ( 0%) ggc
 cfg cleanup             :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      18 kB ( 0%) ggc
 trivially dead code     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 df scan insns           :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 df multiple defs        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs        :   0.09 ( 0%) usr   0.01 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 df live regs            :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 df must-initialized regs:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall     665 kB ( 0%) ggc
 register information    :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall    1090 kB ( 0%) ggc
 alias stmt walking      :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall       1 kB ( 0%) ggc
 rebuild jump labels     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 preprocessing           :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall    3645 kB ( 0%) ggc
 lexical analysis        :   0.07 ( 0%) usr   0.03 ( 1%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 parser (global)         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     643 kB ( 0%) ggc
 parser function body    :   0.11 ( 0%) usr   0.04 ( 1%) sys   0.19 ( 0%) wall   15822 kB ( 1%) ggc
 parser inl. func. body  :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall     110 kB ( 0%) ggc
 inline parameters       :   0.13 ( 0%) usr   0.02 ( 1%) sys   0.09 ( 0%) wall    1530 kB ( 0%) ggc
 tree gimplify           :   0.06 ( 0%) usr   0.02 ( 1%) sys   0.06 ( 0%) wall   12544 kB ( 0%) ggc
 tree eh                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree CFG construction   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall   10956 kB ( 0%) ggc
 tree CFG cleanup        :   0.45 ( 0%) usr   0.01 ( 0%) sys   0.46 ( 0%) wall       0 kB ( 0%) ggc
 tree VRP                :  25.14 (21%) usr   0.56 (16%) sys  25.73 (21%) wall  115035 kB ( 4%) ggc
 tree copy propagation   :   0.57 ( 0%) usr   0.00 ( 0%) sys   0.57 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                :  18.90 (16%) usr   0.63 (18%) sys  19.58 (16%) wall     488 kB ( 0%) ggc
 tree PHI insertion      :   3.90 ( 3%) usr   1.67 (47%) sys   5.58 ( 4%) wall 2912441 kB (94%) ggc
 tree SSA rewrite        :  12.59 (10%) usr   0.01 ( 0%) sys  12.62 (10%) wall    4709 kB ( 0%) ggc
 tree SSA other          :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental    :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall      94 kB ( 0%) ggc
 tree operand scan       :   0.04 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall    3523 kB ( 0%) ggc
 dominator optimization  :   1.49 ( 1%) usr   0.00 ( 0%) sys   1.50 ( 1%) wall     378 kB ( 0%) ggc
 isolate eroneous paths  :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree CCP                :   9.87 ( 8%) usr   0.03 ( 1%) sys   9.94 ( 8%) wall      12 kB ( 0%) ggc
 tree PHI const/copy prop:   0.26 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall       0 kB ( 0%) ggc
 tree split crit edges   :   0.47 ( 0%) usr   0.00 ( 0%) sys   0.48 ( 0%) wall     568 kB ( 0%) ggc
 tree reassociation      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree PRE                :   0.21 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall       6 kB ( 0%) ggc
 tree FRE                :   1.47 ( 1%) usr   0.00 ( 0%) sys   1.48 ( 1%) wall       6 kB ( 0%) ggc
 tree backward propagate :  11.39 ( 9%) usr   0.00 ( 0%) sys  11.40 ( 9%) wall       0 kB ( 0%) ggc
 tree forward propagate  :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree phiprop            :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree conservative DCE   :   0.78 ( 1%) usr   0.01 ( 0%) sys   0.80 ( 1%) wall       0 kB ( 0%) ggc
 tree aggressive DCE     :   2.56 ( 2%) usr   0.00 ( 0%) sys   2.53 ( 2%) wall      12 kB ( 0%) ggc
 tree DSE                :   0.67 ( 1%) usr   0.00 ( 0%) sys   0.68 ( 1%) wall       0 kB ( 0%) ggc
 PHI merge               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree slp vectorization  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     283 kB ( 0%) ggc
 tree SSA uncprop        :   0.83 ( 1%) usr   0.02 ( 1%) sys   0.85 ( 1%) wall       0 kB ( 0%) ggc
 tree strlen optimization:   0.13 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa              :  10.30 ( 9%) usr   0.24 ( 7%) sys  10.60 ( 8%) wall      47 kB ( 0%) ggc
 expand vars             :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     428 kB ( 0%) ggc
 expand                  :   0.22 ( 0%) usr   0.02 ( 1%) sys   0.26 ( 0%) wall    5452 kB ( 0%) ggc
 lower subreg            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 forward prop            :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     246 kB ( 0%) ggc
 CSE                     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     145 kB ( 0%) ggc
 dead code elimination   :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1        :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall     720 kB ( 0%) ggc
 dead store elim2        :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall    1042 kB ( 0%) ggc
 loop init               :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      17 kB ( 0%) ggc
 CSE 2                   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      56 kB ( 0%) ggc
 branch prediction       :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 combiner                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall     965 kB ( 0%) ggc
 integrated RA           :   0.15 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall    2130 kB ( 0%) ggc
 LRA non-specific        :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     746 kB ( 0%) ggc
 LRA virtuals elimination:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     461 kB ( 0%) ggc
 LRA reload inheritance  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     132 kB ( 0%) ggc
 LRA create live ranges  :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall      87 kB ( 0%) ggc
 LRA hard reg assignment :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 LRA rematerialization   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall    1865 kB ( 0%) ggc
 load CSE after reload   :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall     138 kB ( 0%) ggc
 ree                     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       4 kB ( 0%) ggc
 if-conversion 2         :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 combine stack adjustments:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      37 kB ( 0%) ggc
 peephole 2              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     348 kB ( 0%) ggc
 hard reg cprop          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 scheduling 2            :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall     320 kB ( 0%) ggc
 reorder blocks          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     528 kB ( 0%) ggc
 shorten branches        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    1302 kB ( 0%) ggc
 straight-line strength reduction:   0.14 ( 0%) usr   0.01 ( 0%) sys   0.15 ( 0%) wall       0 kB ( 0%) ggc
 rest of compilation     :   0.12 ( 0%) usr   0.01 ( 0%) sys   0.16 ( 0%) wall      33 kB ( 0%) ggc
 remove unused locals    :   3.42 ( 3%) usr   0.01 ( 0%) sys   3.46 ( 3%) wall       0 kB ( 0%) ggc
 address taken           :   1.19 ( 1%) usr   0.01 ( 0%) sys   1.21 ( 1%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall      40 kB ( 0%) ggc
 TOTAL                 : 120.71             3.52           125.38            3112354 kB

####################################################################################

# GCC 8.3.1
bash-4.1$ gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.00 (  0%)   0.01 (  0%)   0.01 (  0%)    1047 kB (  0%)
 phase parsing                      :   0.21 (  0%)   0.08 (  0%)   0.32 (  0%)   20551 kB (  0%)
 phase opt and generate             : 189.32 (100%)  24.44 (100%) 367.57 (100%) 5277730 kB (100%)
 garbage collection                 :   7.52 (  4%)   2.16 (  9%)  32.80 (  9%)       0 kB (  0%)
 dump files                         :   0.00 (  0%)   0.01 (  0%)   0.02 (  0%)       0 kB (  0%)
 callgraph construction             :   4.62 (  2%)   4.29 ( 17%)  59.37 ( 16%)    9442 kB (  0%)
 callgraph optimization             :   0.28 (  0%)   0.04 (  0%)   0.60 (  0%)       0 kB (  0%)
 ipa function summary               :   0.13 (  0%)   0.22 (  1%)   2.63 (  1%)      11 kB (  0%)
 ipa dead code removal              :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa cp                             :   2.11 (  1%)   2.48 ( 10%)  30.08 (  8%)       2 kB (  0%)
 ipa inlining heuristics            :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa pure const                     :   0.02 (  0%)   0.01 (  0%)   0.05 (  0%)       0 kB (  0%)
 ipa icf                            :  12.01 (  6%)   4.13 ( 17%)  60.34 ( 16%)      29 kB (  0%)
 ipa free inline summary            :   0.00 (  0%)   0.01 (  0%)   0.07 (  0%)       0 kB (  0%)
 cfg cleanup                        :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)      18 kB (  0%)
 trivially dead code                :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 df scan insns                      :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 df multiple defs                   :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 df reaching defs                   :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)       0 kB (  0%)
 df live regs                       :   0.11 (  0%)   0.00 (  0%)   0.12 (  0%)       0 kB (  0%)
 df live&initialized regs           :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 df must-initialized regs           :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 df use-def / def-use chains        :   0.05 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   0.11 (  0%)   0.00 (  0%)   0.12 (  0%)     684 kB (  0%)
 register information               :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 alias analysis                     :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)    1090 kB (  0%)
 alias stmt walking                 :   0.13 (  0%)   0.02 (  0%)   0.36 (  0%)       1 kB (  0%)
 register scan                      :   0.02 (  0%)   0.00 (  0%)   0.00 (  0%)     140 kB (  0%)
 rebuild jump labels                :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 preprocessing                      :   0.04 (  0%)   0.01 (  0%)   0.11 (  0%)    3722 kB (  0%)
 lexical analysis                   :   0.04 (  0%)   0.02 (  0%)   0.09 (  0%)       0 kB (  0%)
 parser (global)                    :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     716 kB (  0%)
 parser struct body                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      61 kB (  0%)
 parser function body               :   0.13 (  0%)   0.05 (  0%)   0.10 (  0%)   15942 kB (  0%)
 early inlining heuristics          :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 inline parameters                  :   0.16 (  0%)   0.19 (  1%)   0.47 (  0%)    1538 kB (  0%)
 tree gimplify                      :   0.06 (  0%)   0.01 (  0%)   0.07 (  0%)   14496 kB (  0%)
 tree eh                            :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 tree CFG construction              :   0.04 (  0%)   0.00 (  0%)   0.06 (  0%)   11884 kB (  0%)
 tree CFG cleanup                   :   0.64 (  0%)   0.00 (  0%)   0.65 (  0%)       0 kB (  0%)
 tree tail merge                    :   1.50 (  1%)   0.00 (  0%)   1.50 (  0%)       0 kB (  0%)
 tree VRP                           :   9.73 (  5%)   2.02 (  8%)  12.15 (  3%)  165762 kB (  3%)
 tree Early VRP                     :  13.07 (  7%)   0.23 (  1%)  13.32 (  4%)  496545 kB (  9%)
 tree copy propagation              :   0.88 (  0%)   0.00 (  0%)   0.89 (  0%)       0 kB (  0%)
 tree PTA                           :  28.88 ( 15%)   4.22 ( 17%)  38.06 ( 10%)   46436 kB (  1%)
 tree PHI insertion                 :   5.78 (  3%)   1.96 (  8%)   7.83 (  2%) 4254415 kB ( 80%)
 tree SSA rewrite                   :  15.12 (  8%)   0.02 (  0%)  15.13 (  4%)    3780 kB (  0%)
 tree SSA other                     :   0.21 (  0%)   0.16 (  1%)   0.34 (  0%)       0 kB (  0%)
 tree SSA incremental               :   0.14 (  0%)   0.00 (  0%)   0.16 (  0%)      94 kB (  0%)
 tree operand scan                  :   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)    3619 kB (  0%)
 dominator optimization             :   5.51 (  3%)   0.10 (  0%)   5.66 (  2%)  165892 kB (  3%)
 backwards jump threading           :   0.01 (  0%)   0.01 (  0%)   0.01 (  0%)       0 kB (  0%)
 isolate eroneous paths             :   0.10 (  0%)   0.00 (  0%)   0.09 (  0%)       0 kB (  0%)
 tree CCP                           :  14.39 (  8%)   1.75 (  7%)  16.89 (  5%)      75 kB (  0%)
 tree PHI const/copy prop           :   0.43 (  0%)   0.00 (  0%)   0.43 (  0%)       0 kB (  0%)
 tree split crit edges              :   0.40 (  0%)   0.01 (  0%)   0.44 (  0%)     274 kB (  0%)
 tree reassociation                 :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 tree PRE                           :   3.41 (  2%)   0.14 (  1%)   3.55 (  1%)     281 kB (  0%)
 tree FRE                           :  20.53 ( 11%)   0.04 (  0%)  20.61 (  6%)       5 kB (  0%)
 tree linearize phis                :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       6 kB (  0%)
 tree backward propagate            :   1.41 (  1%)   0.04 (  0%)   1.92 (  1%)       0 kB (  0%)
 tree forward propagate             :   3.69 (  2%)   0.00 (  0%)   3.70 (  1%)       0 kB (  0%)
 tree phiprop                       :   0.10 (  0%)   0.00 (  0%)   0.09 (  0%)       0 kB (  0%)
 tree conservative DCE              :   1.63 (  1%)   0.00 (  0%)   1.66 (  0%)       0 kB (  0%)
 tree aggressive DCE                :   3.75 (  2%)   0.01 (  0%)   3.77 (  1%)      12 kB (  0%)
 tree DSE                           :   1.09 (  1%)   0.00 (  0%)   1.07 (  0%)       0 kB (  0%)
 PHI merge                          :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 tree slp vectorization             :   0.02 (  0%)   0.00 (  0%)   0.04 (  0%)     747 kB (  0%)
 tree SSA uncprop                   :   1.44 (  1%)   0.00 (  0%)   1.44 (  0%)       0 kB (  0%)
 gimple widening/fma detection      :   0.13 (  0%)   0.00 (  0%)   0.13 (  0%)       0 kB (  0%)
 tree strlen optimization           :   0.19 (  0%)   0.00 (  0%)   0.19 (  0%)       0 kB (  0%)
 dominance computation              :   0.03 (  0%)   0.01 (  0%)   0.13 (  0%)       0 kB (  0%)
 control dependences                :   0.02 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 out of ssa                         :  17.90 (  9%)   0.06 (  0%)  17.97 (  5%)      47 kB (  0%)
 expand vars                        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     444 kB (  0%)
 expand                             :   0.32 (  0%)   0.01 (  0%)   0.33 (  0%)    5520 kB (  0%)
 forward prop                       :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)     246 kB (  0%)
 CSE                                :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)     145 kB (  0%)
 dead code elimination              :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 dead store elim1                   :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)     720 kB (  0%)
 dead store elim2                   :   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)    1070 kB (  0%)
 loop init                          :   0.01 (  0%)   0.00 (  0%)   0.04 (  0%)      22 kB (  0%)
 CSE 2                              :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)      56 kB (  0%)
 branch prediction                  :   0.05 (  0%)   0.01 (  0%)   0.05 (  0%)       0 kB (  0%)
 combiner                           :   0.07 (  0%)   0.00 (  0%)   0.06 (  0%)     965 kB (  0%)
 integrated RA                      :   0.12 (  0%)   0.00 (  0%)   0.24 (  0%)    2193 kB (  0%)
 LRA non-specific                   :   0.07 (  0%)   0.00 (  0%)   0.07 (  0%)     818 kB (  0%)
 LRA virtuals elimination           :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)     460 kB (  0%)
 LRA reload inheritance             :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)     132 kB (  0%)
 LRA create live ranges             :   0.18 (  0%)   0.00 (  0%)   0.18 (  0%)      87 kB (  0%)
 LRA hard reg assignment            :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 LRA rematerialization              :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 reload CSE regs                    :   0.06 (  0%)   0.00 (  0%)   0.06 (  0%)    1865 kB (  0%)
 load CSE after reload              :   0.11 (  0%)   0.00 (  0%)   0.11 (  0%)     138 kB (  0%)
 ree                                :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 thread pro- & epilogue             :   0.02 (  0%)   0.01 (  0%)   0.03 (  0%)       6 kB (  0%)
 combine stack adjustments          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      37 kB (  0%)
 peephole 2                         :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     348 kB (  0%)
 hard reg cprop                     :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 scheduling 2                       :   0.13 (  0%)   0.00 (  0%)   0.13 (  0%)     377 kB (  0%)
 reorder blocks                     :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)     518 kB (  0%)
 shorten branches                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 final                              :   0.03 (  0%)   0.00 (  0%)   0.09 (  0%)    1302 kB (  0%)
 straight-line strength reduction   :   0.24 (  0%)   0.00 (  0%)   0.24 (  0%)       0 kB (  0%)
 initialize rtl                     :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       9 kB (  0%)
 rest of compilation                :   1.18 (  1%)   0.06 (  0%)   1.28 (  0%)   82789 kB (  2%)
 remove unused locals               :   4.82 (  3%)   0.00 (  0%)   5.02 (  1%)       0 kB (  0%)
 address taken                      :   1.88 (  1%)   0.00 (  0%)   1.91 (  1%)       0 kB (  0%)
 TOTAL                              : 189.53         24.53        367.91        5299338 kB

####################################################################################

As you can see, in each newer GCC version the performance is even more degraded.

F.Y.I I also tried the same code using LLVM v8.0.0 which complied similar to GCC 4.4.7.

Best regards,
--
Rogerio

Comment 55 Richard Biener 2019-07-25 13:18:41 UTC

(In reply to Rogério de Souza Moraes from comment #53)
> Created attachment 46627 [details]
> New testcase that shows performance degradation in GCC 9.1
> 
> This new testcase that shows performance degradation in GCC 9.1, the
> performance gets worst as we use a newer GCC version to build it.
> 
> Example how to compile:
> 
> gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign
> -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o

Let's open a new bugreport for this testcase, it looks different enough to
warrant this.  I notice you use -O3 which might explain the larger
tree SSA rewrite time.

PR91257.

Comment 56 Richard Biener 2019-09-04 08:54:49 UTC

Not going to backport further, fixed in GCC 8+.