Created attachment 33441 [details] preprocessed source [forwarded from https://bugs.debian.org/759683] compiling the attached test case with the 4.9 branch r214759 and trunk r213954 takes about 90sec on x86_64 and 10GB of memory. succeeds with the 4.8 branch in less than a second. $ gcc -std=c99 -c testunity_Runner.i from the Debian issue: """ Notice that replacing "_setjmp (Unity.AbortFrame[Unity.CurrentAbortFrame])" in main function by "_setjmp (Unity.AbortFrame[0])", make gcc works normaly. After few tests it seems that gcc does not like having a variable in here. """ I don't see the crash reported in the Debian issue.
Clearly caused by the correctness fix for setjmp to wire abnormal edges. For me it is out-of-ssa which uses too much memory while building the conflict graph. We have gigantic PHI nodes here: _10263(ab) = PHI <_109925(D)(ab)(2),...., _10592(ab)(1489)> it's fast when optimizing. At -O0 we have a _lot_ more anonymous SSA names. -O1: <bb 4>: # _1(ab) = PHI <_1902(3), _2(ab)(5)> _1905 = _setjmp (_1(ab)); if (_1905 == 0) goto <bb 6>; else goto <bb 8>; <bb 5> # _2(ab) = PHI <_1895(D), .... single gigantic PHI -O0: <bb 4>: # _1(ab) = PHI <_398164(3), _2(ab)(5)> # _632(ab) = PHI <_397532(D)(ab)(3), _633(ab)(5)> # _1263(ab) = PHI <_397533(D)(ab)(3), _1264(ab)(5)> # _1894(ab) = PHI <_397534(D)(ab)(3), _1895(ab)(5)> # _2525(ab) = PHI <_397535(D)(ab)(3), _2526(ab)(5)> ... # _396900(ab) = PHI <_398160(D)(ab)(3), _396901(ab)(5)> _398165 = _setjmp (_1(ab)); if (_398165 == 0) goto <bb 6>; else goto <bb 8>; <bb 5> # _2(ab) = PHI <_397531(D)(ab)(2)... .... # _396901(ab) = PHI <_398160(D)(ab)(2), _3... gazillion of gigantic PHIs. And very many PHIs in every block. It's into-SSA that introduces the difference for the PHI nodes but already GIMPLIFICATION that introduces very many more temporaries which is the underlying issue (lookup_tmp_var !optimize check). Index: gcc/gimplify.c =================================================================== --- gcc/gimplify.c (revision 214810) +++ gcc/gimplify.c (working copy) @@ -476,7 +476,7 @@ lookup_tmp_var (tree val, bool is_formal block, which means it will go into memory, causing much extra work in reload and final and poorer code generation, outweighing the extra memory allocation here. */ - if (!optimize || !is_formal || TREE_SIDE_EFFECTS (val)) + if (!is_formal || TREE_SIDE_EFFECTS (val)) ret = create_tmp_from_val (val); else { fixes it (but it means that changing the testcase to use more distinct user variables would produce the same issue even when optimizing).
I wonder why we need to explicitely represent abnormal PHIs in the dispatcher. All incoming edges are abnormal and all SSA names have to be coalesced anyway. Thus we could instead have <bb 5>: /* Not: # _2(ab) = PHI <_17(D)(ab)(2), _1(ab)(6), _1(ab)(7), _3(ab)(11), _3(ab)(12), _4(ab)(15), _4(ab)(16), _5(ab)(20), _5(ab)(21), _5(ab)(22)> */ ABNORMAL_DISPATCHER (0); _2(ab) = D.12345; or simply rewrite all must-coalesce vars out-of-SSA? (or not into SSA in the first place) The question is whether accesses to them should be loads/stores (I think so) and if that will cause other similar issues. We'd have to factor abnormal edges into a block to a separate forwarder of course, with a load of all "abnormal" vars. Anyway, not sure why the gimplify code is disabled for -O0 (or why we don't re-use formal temps more aggressively as they become anonymous SSA names later anyway).
So the issue is that the setjmp argument needs two temporaries: D.2832 = Unity.CurrentAbortFrame; D.2833 = &Unity.AbortFrame[D.2832]; <bb 18>: D.2834 = _setjmp (D.2833); and the EH edge going into the _setjmp call has to merge those through the abnormal dispatcher. And that way it receives all of them. Hmm. Huh. Without the abnormal dispatcher they should just get default defs everywhere (but still many PHI nodes). Maybe that would be more light-weight.
GCC 4.9.2 has been released.
I happen to have a (controversical) patch.
https://gcc.gnu.org/ml/gcc-patches/2015-03/msg00364.html
Author: rguenth Date: Tue Mar 10 11:16:33 2015 New Revision: 221318 URL: https://gcc.gnu.org/viewcvs?rev=221318&root=gcc&view=rev Log: 2015-03-10 Richard Biener <rguenther@suse.de> PR middle-end/63155 * tree-ssa-coalesce.h (verify_ssa_coalescing): Declare. * tree-ssa-coalesce.c: Include timevar.h. (attempt_coalesce): Handle graph being NULL. (coalesce_partitions): Call verify_ssa_coalescing if ENABLE_CHECKING. Split out abnormal coalescing to ... (perform_abnormal_coalescing): ... this function. (coalesce_ssa_name): Perform abnormal coalescing without computing live/conflict. (verify_ssa_coalescing_worker): New function. (verify_ssa_coalescing): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-coalesce.c trunk/gcc/tree-ssa-coalesce.h
Fixed on trunk sofar.
Reverted.
Author: rguenth Date: Thu Mar 19 13:36:18 2015 New Revision: 221515 URL: https://gcc.gnu.org/viewcvs?rev=221515&root=gcc&view=rev Log: 2015-03-19 Richard Biener <rguenther@suse.de> Revert 2015-03-10 Richard Biener <rguenther@suse.de> PR middle-end/63155 * tree-ssa-coalesce.h (verify_ssa_coalescing): Declare. * tree-ssa-coalesce.c: Include timevar.h. (attempt_coalesce): Handle graph being NULL. (coalesce_partitions): Call verify_ssa_coalescing if ENABLE_CHECKING. Split out abnormal coalescing to ... (perform_abnormal_coalescing): ... this function. (coalesce_ssa_name): Perform abnormal coalescing without computing live/conflict. (verify_ssa_coalescing_worker): New function. (verify_ssa_coalescing): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-coalesce.c trunk/gcc/tree-ssa-coalesce.h
GCC 4.9.3 has been released.
GCC 4.9 branch is being closed
So, shall we instead try to handle this in a mini-pass that would walk the PHIs of the abnormal dispatcher (if any), or maybe just if those PHI have many arguments or there are many such PHIs, create addressable vars for each of the PHIs and store on the edge sources and add load on the abnormal dispatcher successors? Or any other plans?
The plan is to re-write coalescings conflict graph by using SSA-based liveness queries according to https://hal.inria.fr/inria-00192219 Well, at least I'd like to experiment with that as it might be useful in other places as well (cost considerations wrt register pressure changes). Only experiments will show if we trade too much compile-time for the reduction in memory use (not computing liveness / conflict graph).
GCC 5 branch is being closed
Re-confirmed.
Created attachment 44579 [details] alternate testcase
-ftree-coalesce-vars is a workaround for some testcases (enabled by default with optimization).
Oddly enough Index: gcc/tree-ssa-coalesce.c =================================================================== --- gcc/tree-ssa-coalesce.c (revision 264259) +++ gcc/tree-ssa-coalesce.c (working copy) @@ -620,7 +620,11 @@ ssa_conflicts_merge (ssa_conflicts *ptr, { bitmap bz = ptr->conflicts[z]; if (bz) - bitmap_set_bit (bz, x); + { + bool was_there = bitmap_clear_bit (bz, y); + gcc_checking_assert (was_there); + bitmap_set_bit (bz, x); + } } if (bx) changes at least the 2nd testcase to run faster (albeit memory use stays around the same). w/o patch > /usr/bin/time ./cc1 -quiet t2.c 108.14user 1.62system 1:49.78elapsed 99%CPU (0avgtext+0avgdata 5610876maxresident)k 0inputs+440outputs (0major+1442106minor)pagefaults 0swaps w/ patch > /usr/bin/time ./cc1 -quiet t2.c 86.53user 1.46system 1:27.99elapsed 99%CPU (0avgtext+0avgdata 5610888maxresident)k 0inputs+440outputs (0major+1440069minor)pagefaults 0swaps note this is a -O0 "optimized" cc1 binary with checking enabled so ... It's even so slightly faster with doing if (y < x) { bool was_there = bitmap_clear_bit (bz, y); gcc_checking_assert (was_there); bitmap_set_bit (bz, x); } else { bitmap_set_bit (bz, x); bool was_there = bitmap_clear_bit (bz, y); gcc_checking_assert (was_there); } but that's probably luck (bitmap caching and path length from start vs. x / y). Eventually doing a forward walk also makes prefetchers happy. Probably with a lot of coalesces this keeps the conflict bitmaps small (and thus the bitmap element walks fast). Timings with release checking and optimized build: patched: > /usr/bin/time ../../obj/gcc/cc1 -quiet t2.c 22.91user 1.45system 0:24.38elapsed 99%CPU (0avgtext+0avgdata 5515460maxresident)k 0inputs+440outputs (0major+1378102minor)pagefaults 0swaps patched, fancy forward walk: > /usr/bin/time ../../obj/gcc/cc1 -quiet t2.c 22.47user 1.39system 0:23.88elapsed 99%CPU (0avgtext+0avgdata 5515420maxresident)k 0inputs+440outputs (0major+1377586minor)pagefaults 0swaps unpatched: > /usr/bin/time ../../obj/gcc/cc1 -quiet t2.c 46.60user 1.43system 0:48.03elapsed 99%CPU (0avgtext+0avgdata 5515380maxresident)k 0inputs+440outputs (0major+1378102minor)pagefaults 0swaps I'm testing the simple patch now.
Author: rguenth Date: Fri Sep 14 06:59:21 2018 New Revision: 264304 URL: https://gcc.gnu.org/viewcvs?rev=264304&root=gcc&view=rev Log: 2018-09-14 Richard Biener <rguenther@suse.de> PR middle-end/63155 * tree-ssa-coalesce.c (ssa_conflicts_merge): Remove conflict bits for the merged partition. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-coalesce.c
So for the alternate testcase we have Kind Nodes Bytes ssa names 753988 54287136 GIMPLE statements Kind Stmts Bytes phi nodes 747999 381049752 Bitmaps Leak Peak Times N searches Search iter Type tree-ssa-coalesce.c:705 (new_live_track) 91800: 0.0% 91800 6469037: 1.8% 1238519 248500 heap tree-ssa-coalesce.c:586 (ssa_conflicts_add_one) 9916176: 3.5%4977905016 341148023: 94.8% 364290728 601796535 heap tree-ssa-live.c:937 (new_tree_live_info) 119177600: 42.5% 119177600 2979440: 0.8% 2230467 0 heap tree-ssa-live.c:938 (new_tree_live_info) 149077680: 53.1% 149077680 4476926: 1.2% 4467438 1379139 heap as expected the conflict bitmaps are the issue. We can shave off some more compile-time by noticing the asymmetry in live_track_process_def and use bitmap_ior_into for one half. That should be more cache friendly as well. before: > /usr/bin/time ./cc1 -quiet t2.c 38.98user 1.40system 0:40.39elapsed 99%CPU (0avgtext+0avgdata 5626832maxresident)k 0inputs+440outputs (0major+1448898minor)pagefaults 0swaps after: > /usr/bin/time ./cc1 -quiet t2.c 35.81user 1.48system 0:37.30elapsed 99%CPU (0avgtext+0avgdata 5628100maxresident)k 0inputs+440outputs (0major+1440532minor)pagefaults 0swaps not as much as with the previous patch but still... Moves the above to tree-ssa-coalesce.c:812 (live_track_process_def) 9896264: 3.5%4958012960 339784598: 94.4% 301208226 660769345 heap and then we can of course avoid touching bitmaps with already recorded conflicts. Doing that improves things to 33.18user 1.43system 0:34.62elapsed 99%CPU (0avgtext+0avgdata 5626660maxresident)k 0inputs+440outputs (0major+1390642minor)pagefaults 0swaps tree-ssa-coalesce.c:803 (live_track_process_def) 19912: 0.0% 19892656 1363425: 0.4% 960492 1915158 heap tree-ssa-coalesce.c:810 (live_track_process_def) 9896264: 3.5%4958012960 339784598: 94.4% 301208226 660769345 heap note it all seems to be a bit noisy in the end... @@ -818,8 +794,20 @@ live_track_process_def (live_track *ptr, if (bitmap_bit_p (ptr->live_base_var, root)) { b = ptr->live_base_partitions[root]; - EXECUTE_IF_SET_IN_BITMAP (b, 0, x, bi) - ssa_conflicts_add (graph, p, x); + bitmap bp = graph->conflicts[p]; + if (!bp) + bp = graph->conflicts[p] = BITMAP_ALLOC (&graph->obstack); + /* Add a conflict to P to each live base partition. */ + EXECUTE_IF_AND_COMPL_IN_BITMAP (b, bp, 0, x, bi) + { + gcc_checking_assert (x != (unsigned)p); + bitmap bx = graph->conflicts[x]; + if (!bx) + bx = graph->conflicts[x] = BITMAP_ALLOC (&graph->obstack); + bitmap_set_bit (bx, p); + } + /* Add conflicts to each live base partition to P. */ + bitmap_ior_into (bp, b); } } ideally we'd be able to combine the ior_into with the EXECUTE_IF.. walk. Manually jump-threading the case of !bp might be worth it as well, since we are micro-optimizing this...
Created attachment 44692 [details] patch that does not help The attached patch changes the partition view so that members of bases are adjacent. This should improve the locality of the initial conflict set bits (we only record conflicts within same base variables). This doesn't help memory usage for either testcase. For the 2nd testcase we have 187988 partitions and 500 bases. The conflict bitmaps still end up very sparse (but large). There are also a lot of duplicate bitmaps (if you'd add self-conflicts). Restricting anonymous coalesces to abnormal coalescing only increases the number of bases to 747 and doesn't help memory use significantly. There may be cases where the patch helps since it should improve locality.
(In reply to Richard Biener from comment #22) > Created attachment 44692 [details] > patch that does not help > > The attached patch changes the partition view so that members of bases are > adjacent. This should improve the locality of the initial conflict set > bits (we only record conflicts within same base variables). > > This doesn't help memory usage for either testcase. > > For the 2nd testcase we have 187988 partitions and 500 bases. > > The conflict bitmaps still end up very sparse (but large). There are also > a lot of duplicate bitmaps (if you'd add self-conflicts). > > Restricting anonymous coalesces to abnormal coalescing only increases the > number of bases to 747 and doesn't help memory use significantly. So doing this^^^ more conservatively somehow, choosing different bases for different set of final coalesced abnormal partitions, would shrink the size of the largest base significantly, reducing the quadraticness. So perform the abnormal part of coalesce_partitions implicitely, computing sets of coalesced partitions and pre-assigning a base to each of them. The only complication is in rejecting further coalescing to the base members (thus adjusting gimple_can_coalesce_p). > There may be cases where the patch helps since it should improve locality.
*** Bug 87316 has been marked as a duplicate of this bug. ***
Created attachment 44705 [details] patch This "simple" one helps. It builds partition bases similar to the -ftree-coalesce-vars case. Correctness needs to be verified still but I think it errs on the safe side (computing too many dependences at most). In particular it avoids computing dependences between things we'll later not try coalescing anyways.
Author: rguenth Date: Tue Sep 18 13:26:05 2018 New Revision: 264388 URL: https://gcc.gnu.org/viewcvs?rev=264388&root=gcc&view=rev Log: 2018-09-18 Richard Biener <rguenther@suse.de> PR middle-end/63155 * tree-ssa-coalesce.c (tree_int_map_hasher): Remove. (compute_samebase_partition_bases): Likewise. (coalesce_ssa_name): Always use compute_optimized_partition_bases. (gimple_can_coalesce_p): Simplify. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-coalesce.c
TOT now for the alternate testcase at -O1+ exhibits Time variable usr sys wall GGC ... tree CCP : 9.21 ( 64%) 0.00 ( 0%) 9.22 ( 63%) 244 kB ( 0%) ... TOTAL : 14.35 0.22 14.57 521240 kB never seen CCP take that much time. All time is spent in bitmap_set_bit called by add_ssa_edge from ssa_propagation_engine::simulate_stmt. It sets stmt UIDs in the ssa_edge_worklist. Note on that testcase CCP really does a lot (disabling CCP makes things only slower), which means the testcase is quite artificial. There are 752492 stmts at the point the SSA propagator starts to run and as out-of-SSA coalescing told us bits are spread in an unfortunate way here. Sorting the uses we walk in add_ssa_edge after gimple_uid doesn't help much. We're still processing bits like 4, 1509, 3014, 4519, 6024, 7529, 9034, ... Assigning UIDs lazily in that very function improves things (as expected), but we still have tree CCP : 3.82 ( 43%) 0.00 ( 0%) 3.83 ( 42%) 244 kB ( 0%) after that. I'll dig a bit further later.
Created attachment 44724 [details] patch for the SSA propagator issue Using an sbitmap helps. I am testing the attached.
(In reply to Richard Biener from comment #28) > Created attachment 44724 [details] > patch for the SSA propagator issue > > Using an sbitmap helps. I am testing the attached. Note there's void ssa_propagation_engine::process_ssa_edge_worklist (void) { /* Process the next entry from the worklist. */ unsigned stmt_uid = bitmap_first_set_bit (ssa_edge_worklist); which will then be a O(n) operation. Likewise the bitmap_empty_p test isn't efficient (can be avoided by re-using the above walk). For this particular testcase we only end up here 1494 times (compared to many more bitmap_set_bit operations). Still this makes using a sbitmap not the best idea. A sparse-set fits a bit better but we want to iterate over the set in UID order as well which the sparse-set doesn't allow - it only does choose_one in O(1) which might not be the optimal propagation order (but eventually this detail isn't too important...?). Iff only our sparse bitmap implementation would be O(log N) in bit finding... Note all of the same issues in theory apply to the CFG BB worklist.
Created attachment 44726 [details] patch for the SSA propagator issue
So we end up with an almost fully set ssa_edge_worklist because we add all the PHIs from blocks we already processed by having visited the backedge defs. But the backedge isn't actually yet executable and avoiding to set those bits brings down the set to a maximum size of 2985 (compared to ~700000 before). So a simpler patch for this particular issue is Index: gcc/tree-ssa-propagate.c =================================================================== --- gcc/tree-ssa-propagate.c (revision 264438) +++ gcc/tree-ssa-propagate.c (working copy) @@ -168,15 +170,26 @@ add_ssa_edge (tree var) FOR_EACH_IMM_USE_FAST (use_p, iter, var) { gimple *use_stmt = USE_STMT (use_p); + basic_block use_bb = gimple_bb (use_stmt); /* If we did not yet simulate the block wait for this to happen and do not add the stmt to the SSA edge worklist. */ - if (! (gimple_bb (use_stmt)->flags & BB_VISITED)) + if (! (use_bb->flags & BB_VISITED)) continue; + /* If this is a use on a not yet executable edge do not bother to + queue it. */ + if (gimple_code (use_stmt) == GIMPLE_PHI + && !(EDGE_PRED (use_bb, PHI_ARG_INDEX_FROM_USE (use_p))->flags + & EDGE_EXECUTABLE)) + return; + if (prop_simulate_again_p (use_stmt) && bitmap_set_bit (ssa_edge_worklist, gimple_uid (use_stmt))) {
Author: rguenth Date: Mon Sep 24 07:08:24 2018 New Revision: 264523 URL: https://gcc.gnu.org/viewcvs?rev=264523&root=gcc&view=rev Log: 2018-09-24 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to the worklist when the edge of the respective argument isn't executable. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-propagate.c
So on trunk for the original testcase I now see > /usr/bin/time ./cc1 -quiet testunity_Runner.i -std=c99 2.70user 0.16system 0:02.86elapsed 100%CPU (0avgtext+0avgdata 427672maxresident)k 0inputs+504outputs (0major+106295minor)pagefaults 0swaps while on the same machine using GCC 4.8: > /usr/bin/time /space/rguenther/install/gcc-4.8.5/bin/gcc -S testunity_Runner.i -std=c99 0.24user 0.01system 0:00.60elapsed 41%CPU (0avgtext+0avgdata 39424maxresident)k 30960inputs+504outputs (37major+8516minor)pagefaults 0swaps so we've come a long way but still regressed which is somehow not avoidable because of the correctness fix that started this. For reference GCC 8.2 numbers are > /usr/bin/time /space/rguenther/install/gcc-8.2/bin/gcc -S testunity_Runner.i -std=c99 94.31user 2.46system 1:36.79elapsed 99%CPU (0avgtext+0avgdata 10172916maxresident)k 0inputs+504outputs (0major+2535422minor)pagefaults 0swaps So overall I consider this issue fixed for trunk.
My primary concern in 87316 was about memory usage and this patch definitely helps a lot with that. Thanks! Using -ftree-coalesce-vars helps on >= 4.9 versions and does not seem to have an adverse effect on test coverage.
Created attachment 44791 [details] Small testcase more similar to original environment Hi Richard, this is a new testcase, based on another file in the original environment. It’s quite small (7000 lines, 240 setjmp calls). This code with a little complex but still simplified control structure represents state machine implementation, which is very widely used by our customers. Another new factor is the nested setjmp calls. Of course, original testcase is more complex and takes even more time with more difference. You can run it using the following commands: time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O0 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o Results : GCC: 4.8.5 (From RHEL 7.5) real 0m0.349s user 0m0.255s sys 0m0.083s real 0m0.193s user 0m0.163s sys 0m0.023s GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it) real 0m32.235s user 0m30.486s sys 0m1.622s real 3m34.203s user 3m33.726s sys 0m0.292s The performance difference is relevant in this test. Regards, -- Rogerio
On Thu, 4 Oct 2018, rogerio.souza at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155 > > --- Comment #35 from Rogério de Souza Moraes <rogerio.souza at gmail dot com> --- > Created attachment 44791 [details] > --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44791&action=edit > Small testcase more similar to original environment > > Hi Richard, > > this is a new testcase, based on another file in the original environment. It’s > quite small (7000 lines, 240 setjmp calls). > > This code with a little complex but still simplified control structure > represents state machine implementation, which is very widely used by our > customers. Another new factor is the nested setjmp calls. Of course, original > testcase is more complex and takes even more time with more difference. > > You can run it using the following commands: > > > time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O0 > -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 > gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o > > time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0 -m32 -m32 -w -c -O > -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 > gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o > > > Results : > > GCC: 4.8.5 (From RHEL 7.5) > > real 0m0.349s > user 0m0.255s > sys 0m0.083s > > real 0m0.193s > user 0m0.163s > sys 0m0.023s > > GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it) > > real 0m32.235s > user 0m30.486s > sys 0m1.622s > > real 3m34.203s > user 3m33.726s > sys 0m0.292s > > The performance difference is relevant in this test. Thanks for the more realistic testcase. I can confirm the above and I also see a slowdown in GCC 9 compared to GCC 8 at -O1: > /usr//bin/time gcc-8 -S t.c -O -fwrapv -mstackrealign -mpreferred-stack-boundary=4 -m32 157.48user 0.24system 2:37.78elapsed 99%CPU (0avgtext+0avgdata 888036maxresident)k 47704inputs+152outputs (8major+240936minor)pagefaults 0swaps > /usr//bin/time gcc-9 -S t.c -O -fwrapv -mstackrealign -mpreferred-stack-boundary=4 -m32 197.61user 0.39system 3:18.08elapsed 99%CPU (0avgtext+0avgdata 890628maxresident)k 0inputs+184outputs (0major+259016minor)pagefaults 0swaps Somehow it's still CCP that makes things slow: tree CCP : 178.52 ( 89%) 0.01 ( 2%) 178.55 ( 89%) 646 kB ( 0%) perf tells me it's - 96.33% 29.55% 14801 cc1 cc1 [.] ccp_propagate::visit_phi ▒ ccp_propagate::visit_phi ▒ - ssa_propagation_engine::simulate_stmt ▒ + 49.51% ssa_propagation_engine::simulate_block ▒ + 46.82% ssa_propagation_engine::ssa_propagate - 37.06% 28.98% 12421 cc1 cc1 [.] ccp_lattice_meet ▒ - ccp_lattice_meet ▒ + 37.02% ccp_propagate::visit_phi ▒ + 0.03% set_lattice_value - 5.17% 5.17% 1949 cc1 cc1 [.] wi::bit_or<generic_wide_int<fixed_wide_int_storage<192> >, generic_w▒ wi::bit_or<generic_wide_int<fixed_wide_int_storage<192> >, generic_wide_int<fixed_wide_int_storage<192> > > ▒ - ccp_lattice_meet ▒ + 5.16% ccp_propagate::visit_phi ▒ + 0.01% set_lattice_value - 4.02% 4.02% 1509 cc1 cc1 [.] canonicalize_value ▒ - canonicalize_value ▒ + 4.02% get_value_for_expr ▒ + 0.00% ccp_folder::get_value - 2.90% 2.89% 1083 cc1 cc1 [.] wi::eq_p<generic_wide_int<fixed_wide_int_storage<192> >, int> ▒ wi::eq_p<generic_wide_int<fixed_wide_int_storage<192> >, int> ▒ - ccp_lattice_meet ▒ + 2.89% ccp_propagate::visit_phi ▒ + 0.00% set_lattice_value As said, thanks for the testcase.
Author: rguenth Date: Fri Oct 5 12:54:51 2018 New Revision: 264869 URL: https://gcc.gnu.org/viewcvs?rev=264869&root=gcc&view=rev Log: 2018-10-05 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess vertical space in dumpfiles. * tree-ssa-propagate.h (ssa_propagation_engine::process_ssa_edge_worklist): Remove. * tree-ssa-propagate.c (cfg_blocks_back): New global. (ssa_edge_worklist_back): Likewise. (curr_order): Likewise. (cfg_blocks_get): Remove abstraction. (cfg_blocks_add): Likewise. (cfg_blocks_empty_p): Likewise. (add_ssa_edge): Add to current or next worklist based on RPO index. (add_control_edge): Likewise. (ssa_propagation_engine::process_ssa_edge_worklist): Fold into ... (ssa_propagation_engine::ssa_propagate): ... here. Unify iteration from CFG and SSA edge worklist so we process everything in RPO order, prioritizing forward progress over iteration. (ssa_prop_init): Allocate new worklists, do not dump immediate uses. (ssa_prop_fini): Free new worklists. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-ccp.c trunk/gcc/tree-ssa-propagate.c trunk/gcc/tree-ssa-propagate.h
For the last testcase the compile-time on trunk is now 25s at -O1: tree PTA : 3.37 ( 13%) 0.10 ( 30%) 3.46 ( 13%) 12445 kB ( 2%) tree CCP : 4.61 ( 18%) 0.00 ( 0%) 4.62 ( 18%) 646 kB ( 0%) tree FRE : 2.21 ( 9%) 0.01 ( 3%) 2.21 ( 9%) 116 kB ( 0%) tree backward propagate : 5.03 ( 20%) 0.00 ( 0%) 5.04 ( 20%) 0 kB ( 0%) out of ssa : 3.05 ( 12%) 0.00 ( 0%) 3.05 ( 12%) 0 kB ( 0%) TOTAL : 25.39 0.33 25.72 573954 kB and perf: Samples: 9K of event 'instructions', Event count (approx.): 107285199390 Overhead Samples Command Shared Object Symbol ◆ 18.06% 1195 cc1 cc1 [.] (anonymous namespace)::backprop::process_var ▒ 5.58% 560 cc1 cc1 [.] visit_phi ▒ 5.21% 476 cc1 cc1 [.] inchash::add_expr ▒ 5.21% 671 cc1 cc1 [.] VN_INFO ▒ 5.14% 493 cc1 cc1 [.] bitmap_set_bit ▒ 3.13% 296 cc1 cc1 [.] hash_table<vn_ssa_aux_hasher, xcallocator>::find_with_hash ▒ 2.99% 287 cc1 cc1 [.] vn_phi_lookup ▒ 2.39% 229 cc1 cc1 [.] bitmap_ior_into ▒ 1.77% 165 cc1 cc1 [.] do_rpo_vn
Oh, and backprop is really intersect_uses () with FOR_EACH_IMM_USE_STMT (stmt, iter, var) { being quadratic due to its stupid implementation (we really have many uses of vars). If the pass can deal with duplicate stmt uses just fine using FOR_EACH_IMM_USE_FAST is going to be faster.
Author: rguenth Date: Mon Oct 8 07:16:28 2018 New Revision: 264912 URL: https://gcc.gnu.org/viewcvs?rev=264912&root=gcc&view=rev Log: 2018-10-08 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-propagate.c (add_ssa_edge): Do cheap check first. (ssa_propagation_engine::ssa_propagate): Remove redundant bitmap bit clearing. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-propagate.c
(In reply to Richard Biener from comment #39) > Oh, and backprop is really intersect_uses () with > > FOR_EACH_IMM_USE_STMT (stmt, iter, var) > { > > being quadratic due to its stupid implementation (we really have many uses > of vars). Ouch, hadn't realised the difference between them was that severe. > If the pass can deal with duplicate stmt uses just fine using > FOR_EACH_IMM_USE_FAST is going to be faster. Yeah, should be fine here, since the function is just gathering information. Testing a patch...
Author: rsandifo Date: Mon Oct 8 18:58:59 2018 New Revision: 264941 URL: https://gcc.gnu.org/viewcvs?rev=264941&root=gcc&view=rev Log: Use FOR_EACH_IMM_USE_FAST in gimple-ssa-backprop.c As pointed out by Richard in PR63155. It speeds up the testcase a few %. 2018-10-08 Richard Sandiford <richard.sandiford@arm.com> gcc/ PR middle-end/63155 * gimple-ssa-backprop.c (backprop::intersect_uses): Use FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT. Modified: trunk/gcc/ChangeLog trunk/gcc/gimple-ssa-backprop.c
We're now down to tree PTA : 3.92 ( 16%) 0.12 ( 36%) 4.02 ( 16%) 12445 kB ( 2%) tree CCP : 7.43 ( 30%) 0.02 ( 6%) 7.44 ( 29%) 646 kB ( 0%) tree FRE : 2.34 ( 9%) 0.00 ( 0%) 2.35 ( 9%) 116 kB ( 0%) tree backward propagate : 0.62 ( 2%) 0.00 ( 0%) 0.62 ( 2%) 0 kB ( 0%) out of ssa : 3.01 ( 12%) 0.00 ( 0%) 3.01 ( 12%) 0 kB ( 0%) TOTAL : 24.91 0.33 25.26 573769 kB notice the tree backward propagate improvement. This makes CCP the main offender again but as said the rectification would probably mean pulling back the SSA SCC discovery code from SCCVN and use that in the SSA propagator somehow. The out of SSA time is what was originally topic of this bug. The tree PTA time is "new" and related to the number of PHI nodes and edges. You can disable PTA via -fno-tree-pta. The tree FRE time is PHI lookups/inserts, some refactoring can speed this up a bit.
(In reply to Richard Biener from comment #43) > This makes CCP the main > offender again but as said the rectification would probably mean pulling > back the SSA SCC discovery code from SCCVN and use that in the SSA > propagator somehow. I take that back. SCC processing is quite fundamentally incompatible with the way SSA propagation works. But what would be possible is to add a non-optimistic mode to the SSA propagator removing the need to iterate at all. That's some non-trivial work though, possibly better spent teaching value-numbering the bits of CCP that it doesn't do (bit-value tracking, UNDEF handling) and then kill off CCP altogether.
Author: rguenth Date: Tue Oct 9 11:43:46 2018 New Revision: 264956 URL: https://gcc.gnu.org/viewcvs?rev=264956&root=gcc&view=rev Log: 2018-10-09 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-structalias.c: Include tree-ssa.h. (get_constraint_for_ssa_var): For undefs return nothing_id. (find_func_aliases): Cleanup PHI handling. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-structalias.c
Author: rguenth Date: Tue Oct 16 11:23:22 2018 New Revision: 265189 URL: https://gcc.gnu.org/viewcvs?rev=265189&root=gcc&view=rev Log: 2018-10-16 Richard Biener <rguenther@suse.de> Backport from mainline 2018-09-18 Richard Biener <rguenther@suse.de> PR middle-end/63155 * tree-ssa-coalesce.c (tree_int_map_hasher): Remove. (compute_samebase_partition_bases): Likewise. (coalesce_ssa_name): Always use compute_optimized_partition_bases. (gimple_can_coalesce_p): Simplify. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/tree-ssa-coalesce.c
Author: rguenth Date: Tue Oct 16 13:23:56 2018 New Revision: 265193 URL: https://gcc.gnu.org/viewcvs?rev=265193&root=gcc&view=rev Log: 2018-10-16 Richard Biener <rguenther@suse.de> Backport from mainline 2018-10-08 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-propagate.c (add_ssa_edge): Do cheap check first. (ssa_propagation_engine::ssa_propagate): Remove redundant bitmap bit clearing. 2018-10-05 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess vertical space in dumpfiles. * tree-ssa-propagate.h (ssa_propagation_engine::process_ssa_edge_worklist): Remove. * tree-ssa-propagate.c (cfg_blocks_back): New global. (ssa_edge_worklist_back): Likewise. (curr_order): Likewise. (cfg_blocks_get): Remove abstraction. (cfg_blocks_add): Likewise. (cfg_blocks_empty_p): Likewise. (add_ssa_edge): Add to current or next worklist based on RPO index. (add_control_edge): Likewise. (ssa_propagation_engine::process_ssa_edge_worklist): Fold into ... (ssa_propagation_engine::ssa_propagate): ... here. Unify iteration from CFG and SSA edge worklist so we process everything in RPO order, prioritizing forward progress over iteration. (ssa_prop_init): Allocate new worklists, do not dump immediate uses. (ssa_prop_fini): Free new worklists. 2018-09-24 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to the worklist when the edge of the respective argument isn't executable. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/tree-ssa-propagate.c
Author: rguenth Date: Wed Oct 17 07:01:28 2018 New Revision: 265231 URL: https://gcc.gnu.org/viewcvs?rev=265231&root=gcc&view=rev Log: 2018-10-17 Richard Biener <rguenther@suse.de> Backport from mainline 2018-10-08 Richard Sandiford <richard.sandiford@arm.com> PR middle-end/63155 * gimple-ssa-backprop.c (backprop::intersect_uses): Use FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/gimple-ssa-backprop.c
Author: rguenth Date: Wed Oct 17 08:49:00 2018 New Revision: 265235 URL: https://gcc.gnu.org/viewcvs?rev=265235&root=gcc&view=rev Log: 2018-10-16 Richard Biener <rguenther@suse.de> Backport from mainline 2018-10-08 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-propagate.c (add_ssa_edge): Do cheap check first. (ssa_propagation_engine::ssa_propagate): Remove redundant bitmap bit clearing. 2018-10-05 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess vertical space in dumpfiles. * tree-ssa-propagate.h (ssa_propagation_engine::process_ssa_edge_worklist): Remove. * tree-ssa-propagate.c (cfg_blocks_back): New global. (ssa_edge_worklist_back): Likewise. (curr_order): Likewise. (cfg_blocks_get): Remove abstraction. (cfg_blocks_add): Likewise. (cfg_blocks_empty_p): Likewise. (add_ssa_edge): Add to current or next worklist based on RPO index. (add_control_edge): Likewise. (ssa_propagation_engine::process_ssa_edge_worklist): Fold into ... (ssa_propagation_engine::ssa_propagate): ... here. Unify iteration from CFG and SSA edge worklist so we process everything in RPO order, prioritizing forward progress over iteration. (ssa_prop_init): Allocate new worklists, do not dump immediate uses. (ssa_prop_fini): Free new worklists. 2018-09-24 Richard Biener <rguenther@suse.de> PR tree-optimization/63155 * tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to the worklist when the edge of the respective argument isn't executable. Modified: branches/gcc-8-branch/gcc/tree-ssa-ccp.c branches/gcc-8-branch/gcc/tree-ssa-propagate.h
Created attachment 44848 [details] GCC 6.3.0 consolidated patch based on Richard's patches The patch attached is a backport based on Richard's patches to GCC v6.3.0. If any issues, please let me know. Regards, -- Rogerio
Author: rguenth Date: Mon Oct 22 13:54:23 2018 New Revision: 265390 URL: https://gcc.gnu.org/viewcvs?rev=265390&root=gcc&view=rev Log: 2018-10-22 Steven Bosscher <steven@gcc.gnu.org> Richard Biener <rguenther@suse.de> * bitmap.h: Update data structure documentation, including a description of bitmap views as either linked-lists or splay trees. (struct bitmap_element_def): Update comments for splay tree bitmaps. (struct bitmap_head_def): Likewise. (bitmap_list_view, bitmap_tree_view): New prototypes. (bitmap_initialize_stat): Initialize a bitmap_head's indx and tree_form fields. (bmp_iter_set_init): Assert the iterated bitmaps are in list form. (bmp_iter_and_init, bmp_iter_and_compl_init): Likewise. * bitmap.c (bitmap_elem_to_freelist): Unregister overhead of a released bitmap element here. (bitmap_element_free): Remove. (bitmap_elt_clear_from): Work on splay tree bitmaps. (bitmap_list_link_element): Renamed from bitmap_element_link. Move this function similar ones such that linked-list bitmap implementation functions are grouped. (bitmap_list_unlink_element): Renamed from bitmap_element_unlink, and moved for grouping. (bitmap_list_insert_element_after): Renamed from bitmap_elt_insert_after, and moved for grouping. (bitmap_list_find_element): New function spliced from bitmap_find_bit. (bitmap_tree_link_left, bitmap_tree_link_right, bitmap_tree_rotate_left, bitmap_tree_rotate_right, bitmap_tree_splay, bitmap_tree_link_element, bitmap_tree_unlink_element, bitmap_tree_find_element): New functions for splay-tree bitmap implementation. (bitmap_element_link, bitmap_element_unlink, bitmap_elt_insert_after): Renamed and moved, see above entries. (bitmap_tree_listify_from): New function to convert part of a splay tree bitmap to a linked-list bitmap. (bitmap_list_view): Convert a splay tree bitmap to linked-list form. (bitmap_tree_view): Convert a linked-list bitmap to splay tree form. (bitmap_find_bit): Remove. (bitmap_clear, bitmap_clear_bit, bitmap_set_bit, bitmap_single_bit_set_p, bitmap_first_set_bit, bitmap_last_set_bit): Handle splay tree bitmaps. (bitmap_copy, bitmap_count_bits, bitmap_and, bitmap_and_into, bitmap_elt_copy, bitmap_and_compl, bitmap_and_compl_into, bitmap_compl_and_into, bitmap_elt_ior, bitmap_ior, bitmap_ior_into, bitmap_xor, bitmap_xor_into, bitmap_equal_p, bitmap_intersect_p, bitmap_intersect_compl_p, bitmap_ior_and_compl, bitmap_ior_and_compl_into, bitmap_set_range, bitmap_clear_range, bitmap_hash): Reject trying to act on splay tree bitmaps. Make corresponding changes to use linked-list specific bitmap_element manipulation functions as applicable for efficiency. (bitmap_tree_to_vec): New function. (debug_bitmap_elt_file): New function split out from ... (debug_bitmap_file): ... here. Handle splay tree bitmaps. (bitmap_print): Likewise. PR tree-optimization/63155 * tree-ssa-propagate.c (ssa_prop_init): Use tree-view for the SSA edge worklists. * tree-ssa-coalesce.c (coalesce_ssa_name): Populate used_in_copies in tree-view. Modified: trunk/gcc/ChangeLog trunk/gcc/bitmap.c trunk/gcc/bitmap.h trunk/gcc/tree-ssa-coalesce.c trunk/gcc/tree-ssa-propagate.c
GCC 6 branch is being closed
Created attachment 46627 [details] New testcase that shows performance degradation in GCC 9.1 This new testcase that shows performance degradation in GCC 9.1, the performance gets worst as we use a newer GCC version to build it. Example how to compile: gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o
Hi everyone, we continue to get performance degradation as we use a new compiler version to build the source code 'testcase-v1.c" that is available in the attachment above. Could you please check if the test case that I attached reproduces the same issue, or it is a different issue? Below are the results that I got: # GCC 4.4.7 -bash-4.1$ gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o Execution times (seconds) callgraph construction: 0.02 ( 0%) usr 0.01 ( 4%) sys 0.03 ( 0%) wall 2535 kB ( 1%) ggc callgraph optimization: 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 11 kB ( 0%) ggc cfg cleanup : 0.05 ( 1%) usr 0.00 ( 0%) sys 0.07 ( 1%) wall 151 kB ( 0%) ggc trivially dead code : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc df reaching defs : 0.40 ( 4%) usr 0.01 ( 4%) sys 0.40 ( 4%) wall 0 kB ( 0%) ggc df live regs : 0.33 ( 3%) usr 0.00 ( 0%) sys 0.37 ( 4%) wall 0 kB ( 0%) ggc df live&initialized regs: 0.08 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall 0 kB ( 0%) ggc df use-def / def-use chains: 0.17 ( 2%) usr 0.00 ( 0%) sys 0.16 ( 2%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 0.25 ( 3%) usr 0.00 ( 0%) sys 0.22 ( 2%) wall 2519 kB ( 1%) ggc register information : 0.13 ( 1%) usr 0.00 ( 0%) sys 0.11 ( 1%) wall 0 kB ( 0%) ggc alias analysis : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 1%) wall 1952 kB ( 1%) ggc register scan : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc rebuild jump labels : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc preprocessing : 0.02 ( 0%) usr 0.02 ( 7%) sys 0.02 ( 0%) wall 466 kB ( 0%) ggc lexical analysis : 0.03 ( 0%) usr 0.02 ( 7%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc parser : 0.06 ( 1%) usr 0.07 (26%) sys 0.19 ( 2%) wall 18164 kB ( 9%) ggc inline heuristics : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree gimplify : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 19793 kB (10%) ggc tree CFG construction : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 4445 kB ( 2%) ggc tree CFG cleanup : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 1209 kB ( 1%) ggc tree VRP : 0.20 ( 2%) usr 0.00 ( 0%) sys 0.20 ( 2%) wall 12984 kB ( 6%) ggc tree copy propagation : 0.10 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 1%) wall 9 kB ( 0%) ggc tree find ref. vars : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 1190 kB ( 1%) ggc tree PTA : 0.05 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 33 kB ( 0%) ggc tree alias analysis : 0.04 ( 0%) usr 0.01 ( 4%) sys 0.03 ( 0%) wall 308 kB ( 0%) ggc tree flow sensitive alias: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree PHI insertion : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1349 kB ( 1%) ggc tree SSA rewrite : 0.05 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 1%) wall 16607 kB ( 8%) ggc tree SSA other : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree SSA incremental : 0.11 ( 1%) usr 0.01 ( 4%) sys 0.06 ( 1%) wall 5142 kB ( 3%) ggc tree operand scan : 0.11 ( 1%) usr 0.03 (11%) sys 0.12 ( 1%) wall 10620 kB ( 5%) ggc dominator optimization: 0.08 ( 1%) usr 0.00 ( 0%) sys 0.11 ( 1%) wall 2130 kB ( 1%) ggc tree CCP : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.10 ( 1%) wall 6 kB ( 0%) ggc tree split crit edges : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 6801 kB ( 3%) ggc tree reassociation : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 2 kB ( 0%) ggc tree PRE : 2.94 (31%) usr 0.02 ( 7%) sys 2.99 (30%) wall 4955 kB ( 2%) ggc tree FRE : 0.16 ( 2%) usr 0.01 ( 4%) sys 0.17 ( 2%) wall 1933 kB ( 1%) ggc tree code sinking : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1 kB ( 0%) ggc tree forward propagate: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc tree conservative DCE : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc tree aggressive DCE : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc tree DSE : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 228 kB ( 0%) ggc PHI merge : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc complete unrolling : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 2 kB ( 0%) ggc tree SSA uncprop : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree SSA to normal : 0.07 ( 1%) usr 0.00 ( 0%) sys 0.09 ( 1%) wall 15109 kB ( 7%) ggc tree rename SSA copies: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree switch initialization conversion: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc dominance frontiers : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc dominance computation : 0.05 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc control dependences : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc expand : 0.49 ( 5%) usr 0.02 ( 7%) sys 0.50 ( 5%) wall 34138 kB (17%) ggc jump : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc forward prop : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1556 kB ( 1%) ggc CSE : 0.13 ( 1%) usr 0.01 ( 4%) sys 0.16 ( 2%) wall 606 kB ( 0%) ggc dead code elimination : 0.09 ( 1%) usr 0.00 ( 0%) sys 0.07 ( 1%) wall 0 kB ( 0%) ggc dead store elim1 : 0.10 ( 1%) usr 0.00 ( 0%) sys 0.09 ( 1%) wall 1706 kB ( 1%) ggc dead store elim2 : 0.12 ( 1%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall 4207 kB ( 2%) ggc CSE 2 : 0.11 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 1%) wall 303 kB ( 0%) ggc branch prediction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 2 kB ( 0%) ggc combiner : 0.15 ( 2%) usr 0.01 ( 4%) sys 0.14 ( 1%) wall 4423 kB ( 2%) ggc if-conversion : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 154 kB ( 0%) ggc regmove : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc integrated RA : 0.90 ( 9%) usr 0.00 ( 0%) sys 0.89 ( 9%) wall 1788 kB ( 1%) ggc reload : 0.39 ( 4%) usr 0.00 ( 0%) sys 0.38 ( 4%) wall 6285 kB ( 3%) ggc reload CSE regs : 0.20 ( 2%) usr 0.01 ( 4%) sys 0.20 ( 2%) wall 8333 kB ( 4%) ggc load CSE after reload : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc thread pro- & epilogue: 0.06 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 13 kB ( 0%) ggc peephole 2 : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc rename registers : 0.06 ( 1%) usr 0.00 ( 0%) sys 0.08 ( 1%) wall 75 kB ( 0%) ggc scheduling 2 : 0.35 ( 4%) usr 0.01 ( 4%) sys 0.35 ( 4%) wall 0 kB ( 0%) ggc machine dep reorg : 0.05 ( 1%) usr 0.00 ( 0%) sys 0.06 ( 1%) wall 301 kB ( 0%) ggc reorder blocks : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 1%) wall 3708 kB ( 2%) ggc final : 0.11 ( 1%) usr 0.00 ( 0%) sys 0.13 ( 1%) wall 3977 kB ( 2%) ggc TOTAL : 9.48 0.27 9.83 203289 kB #################################################################################### #GCC 6.3.1 bash-4.1$ gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o Execution times (seconds) phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 892 kB ( 0%) ggc phase parsing : 0.21 ( 0%) usr 0.09 ( 3%) sys 0.32 ( 0%) wall 20284 kB ( 1%) ggc phase opt and generate : 120.50 (100%) usr 3.43 (97%) sys 125.05 (100%) wall 3091168 kB (99%) ggc garbage collection : 2.50 ( 2%) usr 0.02 ( 1%) sys 2.90 ( 2%) wall 0 kB ( 0%) ggc callgraph construction : 1.81 ( 1%) usr 0.08 ( 2%) sys 2.33 ( 2%) wall 8824 kB ( 0%) ggc callgraph optimization : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 8 kB ( 0%) ggc ipa dead code removal : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc ipa cp : 0.99 ( 1%) usr 0.00 ( 0%) sys 0.99 ( 1%) wall 512 kB ( 0%) ggc ipa inlining heuristics : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 9 kB ( 0%) ggc ipa profile : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc ipa pure const : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc ipa icf : 5.44 ( 5%) usr 0.00 ( 0%) sys 5.44 ( 4%) wall 26 kB ( 0%) ggc cfg cleanup : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 18 kB ( 0%) ggc trivially dead code : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc df scan insns : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc df multiple defs : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc df reaching defs : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.10 ( 0%) wall 0 kB ( 0%) ggc df live regs : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 0%) wall 0 kB ( 0%) ggc df live&initialized regs: 0.03 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc df must-initialized regs: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc df use-def / def-use chains: 0.04 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc df reg dead/unused notes: 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 665 kB ( 0%) ggc register information : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc alias analysis : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 1090 kB ( 0%) ggc alias stmt walking : 0.08 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall 1 kB ( 0%) ggc rebuild jump labels : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc preprocessing : 0.03 ( 0%) usr 0.01 ( 0%) sys 0.01 ( 0%) wall 3645 kB ( 0%) ggc lexical analysis : 0.07 ( 0%) usr 0.03 ( 1%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc parser (global) : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 643 kB ( 0%) ggc parser function body : 0.11 ( 0%) usr 0.04 ( 1%) sys 0.19 ( 0%) wall 15822 kB ( 1%) ggc parser inl. func. body : 0.00 ( 0%) usr 0.01 ( 0%) sys 0.01 ( 0%) wall 110 kB ( 0%) ggc inline parameters : 0.13 ( 0%) usr 0.02 ( 1%) sys 0.09 ( 0%) wall 1530 kB ( 0%) ggc tree gimplify : 0.06 ( 0%) usr 0.02 ( 1%) sys 0.06 ( 0%) wall 12544 kB ( 0%) ggc tree eh : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc tree CFG construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 10956 kB ( 0%) ggc tree CFG cleanup : 0.45 ( 0%) usr 0.01 ( 0%) sys 0.46 ( 0%) wall 0 kB ( 0%) ggc tree VRP : 25.14 (21%) usr 0.56 (16%) sys 25.73 (21%) wall 115035 kB ( 4%) ggc tree copy propagation : 0.57 ( 0%) usr 0.00 ( 0%) sys 0.57 ( 0%) wall 0 kB ( 0%) ggc tree PTA : 18.90 (16%) usr 0.63 (18%) sys 19.58 (16%) wall 488 kB ( 0%) ggc tree PHI insertion : 3.90 ( 3%) usr 1.67 (47%) sys 5.58 ( 4%) wall 2912441 kB (94%) ggc tree SSA rewrite : 12.59 (10%) usr 0.01 ( 0%) sys 12.62 (10%) wall 4709 kB ( 0%) ggc tree SSA other : 0.09 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc tree SSA incremental : 0.09 ( 0%) usr 0.00 ( 0%) sys 0.10 ( 0%) wall 94 kB ( 0%) ggc tree operand scan : 0.04 ( 0%) usr 0.01 ( 0%) sys 0.01 ( 0%) wall 3523 kB ( 0%) ggc dominator optimization : 1.49 ( 1%) usr 0.00 ( 0%) sys 1.50 ( 1%) wall 378 kB ( 0%) ggc isolate eroneous paths : 0.10 ( 0%) usr 0.00 ( 0%) sys 0.10 ( 0%) wall 0 kB ( 0%) ggc tree CCP : 9.87 ( 8%) usr 0.03 ( 1%) sys 9.94 ( 8%) wall 12 kB ( 0%) ggc tree PHI const/copy prop: 0.26 ( 0%) usr 0.00 ( 0%) sys 0.27 ( 0%) wall 0 kB ( 0%) ggc tree split crit edges : 0.47 ( 0%) usr 0.00 ( 0%) sys 0.48 ( 0%) wall 568 kB ( 0%) ggc tree reassociation : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc tree PRE : 0.21 ( 0%) usr 0.00 ( 0%) sys 0.21 ( 0%) wall 6 kB ( 0%) ggc tree FRE : 1.47 ( 1%) usr 0.00 ( 0%) sys 1.48 ( 1%) wall 6 kB ( 0%) ggc tree backward propagate : 11.39 ( 9%) usr 0.00 ( 0%) sys 11.40 ( 9%) wall 0 kB ( 0%) ggc tree forward propagate : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc tree phiprop : 0.11 ( 0%) usr 0.00 ( 0%) sys 0.10 ( 0%) wall 0 kB ( 0%) ggc tree conservative DCE : 0.78 ( 1%) usr 0.01 ( 0%) sys 0.80 ( 1%) wall 0 kB ( 0%) ggc tree aggressive DCE : 2.56 ( 2%) usr 0.00 ( 0%) sys 2.53 ( 2%) wall 12 kB ( 0%) ggc tree DSE : 0.67 ( 1%) usr 0.00 ( 0%) sys 0.68 ( 1%) wall 0 kB ( 0%) ggc PHI merge : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc tree slp vectorization : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 283 kB ( 0%) ggc tree SSA uncprop : 0.83 ( 1%) usr 0.02 ( 1%) sys 0.85 ( 1%) wall 0 kB ( 0%) ggc tree strlen optimization: 0.13 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc dominance computation : 0.05 ( 0%) usr 0.01 ( 0%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc out of ssa : 10.30 ( 9%) usr 0.24 ( 7%) sys 10.60 ( 8%) wall 47 kB ( 0%) ggc expand vars : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 428 kB ( 0%) ggc expand : 0.22 ( 0%) usr 0.02 ( 1%) sys 0.26 ( 0%) wall 5452 kB ( 0%) ggc lower subreg : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc forward prop : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 246 kB ( 0%) ggc CSE : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 145 kB ( 0%) ggc dead code elimination : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc dead store elim1 : 0.03 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 720 kB ( 0%) ggc dead store elim2 : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 1042 kB ( 0%) ggc loop init : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 17 kB ( 0%) ggc CSE 2 : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 56 kB ( 0%) ggc branch prediction : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc combiner : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 965 kB ( 0%) ggc integrated RA : 0.15 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 0%) wall 2130 kB ( 0%) ggc LRA non-specific : 0.08 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall 746 kB ( 0%) ggc LRA virtuals elimination: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 461 kB ( 0%) ggc LRA reload inheritance : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 132 kB ( 0%) ggc LRA create live ranges : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 87 kB ( 0%) ggc LRA hard reg assignment : 0.01 ( 0%) usr 0.01 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc LRA rematerialization : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc reload CSE regs : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall 1865 kB ( 0%) ggc load CSE after reload : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 138 kB ( 0%) ggc ree : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc thread pro- & epilogue : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 4 kB ( 0%) ggc if-conversion 2 : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc combine stack adjustments: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 37 kB ( 0%) ggc peephole 2 : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 348 kB ( 0%) ggc hard reg cprop : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc scheduling 2 : 0.11 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall 320 kB ( 0%) ggc reorder blocks : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 528 kB ( 0%) ggc shorten branches : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc final : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 1302 kB ( 0%) ggc straight-line strength reduction: 0.14 ( 0%) usr 0.01 ( 0%) sys 0.15 ( 0%) wall 0 kB ( 0%) ggc rest of compilation : 0.12 ( 0%) usr 0.01 ( 0%) sys 0.16 ( 0%) wall 33 kB ( 0%) ggc remove unused locals : 3.42 ( 3%) usr 0.01 ( 0%) sys 3.46 ( 3%) wall 0 kB ( 0%) ggc address taken : 1.19 ( 1%) usr 0.01 ( 0%) sys 1.21 ( 1%) wall 0 kB ( 0%) ggc unaccounted todo : 0.08 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 40 kB ( 0%) ggc TOTAL : 120.71 3.52 125.38 3112354 kB #################################################################################### # GCC 8.3.1 bash-4.1$ gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o Time variable usr sys wall GGC phase setup : 0.00 ( 0%) 0.01 ( 0%) 0.01 ( 0%) 1047 kB ( 0%) phase parsing : 0.21 ( 0%) 0.08 ( 0%) 0.32 ( 0%) 20551 kB ( 0%) phase opt and generate : 189.32 (100%) 24.44 (100%) 367.57 (100%) 5277730 kB (100%) garbage collection : 7.52 ( 4%) 2.16 ( 9%) 32.80 ( 9%) 0 kB ( 0%) dump files : 0.00 ( 0%) 0.01 ( 0%) 0.02 ( 0%) 0 kB ( 0%) callgraph construction : 4.62 ( 2%) 4.29 ( 17%) 59.37 ( 16%) 9442 kB ( 0%) callgraph optimization : 0.28 ( 0%) 0.04 ( 0%) 0.60 ( 0%) 0 kB ( 0%) ipa function summary : 0.13 ( 0%) 0.22 ( 1%) 2.63 ( 1%) 11 kB ( 0%) ipa dead code removal : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%) ipa cp : 2.11 ( 1%) 2.48 ( 10%) 30.08 ( 8%) 2 kB ( 0%) ipa inlining heuristics : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%) ipa pure const : 0.02 ( 0%) 0.01 ( 0%) 0.05 ( 0%) 0 kB ( 0%) ipa icf : 12.01 ( 6%) 4.13 ( 17%) 60.34 ( 16%) 29 kB ( 0%) ipa free inline summary : 0.00 ( 0%) 0.01 ( 0%) 0.07 ( 0%) 0 kB ( 0%) cfg cleanup : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 18 kB ( 0%) trivially dead code : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%) df scan insns : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%) df multiple defs : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%) df reaching defs : 0.08 ( 0%) 0.00 ( 0%) 0.08 ( 0%) 0 kB ( 0%) df live regs : 0.11 ( 0%) 0.00 ( 0%) 0.12 ( 0%) 0 kB ( 0%) df live&initialized regs : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%) df must-initialized regs : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) df use-def / def-use chains : 0.05 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%) df reg dead/unused notes : 0.11 ( 0%) 0.00 ( 0%) 0.12 ( 0%) 684 kB ( 0%) register information : 0.03 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%) alias analysis : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 1090 kB ( 0%) alias stmt walking : 0.13 ( 0%) 0.02 ( 0%) 0.36 ( 0%) 1 kB ( 0%) register scan : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 140 kB ( 0%) rebuild jump labels : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) preprocessing : 0.04 ( 0%) 0.01 ( 0%) 0.11 ( 0%) 3722 kB ( 0%) lexical analysis : 0.04 ( 0%) 0.02 ( 0%) 0.09 ( 0%) 0 kB ( 0%) parser (global) : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 716 kB ( 0%) parser struct body : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 61 kB ( 0%) parser function body : 0.13 ( 0%) 0.05 ( 0%) 0.10 ( 0%) 15942 kB ( 0%) early inlining heuristics : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) inline parameters : 0.16 ( 0%) 0.19 ( 1%) 0.47 ( 0%) 1538 kB ( 0%) tree gimplify : 0.06 ( 0%) 0.01 ( 0%) 0.07 ( 0%) 14496 kB ( 0%) tree eh : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%) tree CFG construction : 0.04 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 11884 kB ( 0%) tree CFG cleanup : 0.64 ( 0%) 0.00 ( 0%) 0.65 ( 0%) 0 kB ( 0%) tree tail merge : 1.50 ( 1%) 0.00 ( 0%) 1.50 ( 0%) 0 kB ( 0%) tree VRP : 9.73 ( 5%) 2.02 ( 8%) 12.15 ( 3%) 165762 kB ( 3%) tree Early VRP : 13.07 ( 7%) 0.23 ( 1%) 13.32 ( 4%) 496545 kB ( 9%) tree copy propagation : 0.88 ( 0%) 0.00 ( 0%) 0.89 ( 0%) 0 kB ( 0%) tree PTA : 28.88 ( 15%) 4.22 ( 17%) 38.06 ( 10%) 46436 kB ( 1%) tree PHI insertion : 5.78 ( 3%) 1.96 ( 8%) 7.83 ( 2%) 4254415 kB ( 80%) tree SSA rewrite : 15.12 ( 8%) 0.02 ( 0%) 15.13 ( 4%) 3780 kB ( 0%) tree SSA other : 0.21 ( 0%) 0.16 ( 1%) 0.34 ( 0%) 0 kB ( 0%) tree SSA incremental : 0.14 ( 0%) 0.00 ( 0%) 0.16 ( 0%) 94 kB ( 0%) tree operand scan : 0.03 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 3619 kB ( 0%) dominator optimization : 5.51 ( 3%) 0.10 ( 0%) 5.66 ( 2%) 165892 kB ( 3%) backwards jump threading : 0.01 ( 0%) 0.01 ( 0%) 0.01 ( 0%) 0 kB ( 0%) isolate eroneous paths : 0.10 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 0 kB ( 0%) tree CCP : 14.39 ( 8%) 1.75 ( 7%) 16.89 ( 5%) 75 kB ( 0%) tree PHI const/copy prop : 0.43 ( 0%) 0.00 ( 0%) 0.43 ( 0%) 0 kB ( 0%) tree split crit edges : 0.40 ( 0%) 0.01 ( 0%) 0.44 ( 0%) 274 kB ( 0%) tree reassociation : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%) tree PRE : 3.41 ( 2%) 0.14 ( 1%) 3.55 ( 1%) 281 kB ( 0%) tree FRE : 20.53 ( 11%) 0.04 ( 0%) 20.61 ( 6%) 5 kB ( 0%) tree linearize phis : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 6 kB ( 0%) tree backward propagate : 1.41 ( 1%) 0.04 ( 0%) 1.92 ( 1%) 0 kB ( 0%) tree forward propagate : 3.69 ( 2%) 0.00 ( 0%) 3.70 ( 1%) 0 kB ( 0%) tree phiprop : 0.10 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 0 kB ( 0%) tree conservative DCE : 1.63 ( 1%) 0.00 ( 0%) 1.66 ( 0%) 0 kB ( 0%) tree aggressive DCE : 3.75 ( 2%) 0.01 ( 0%) 3.77 ( 1%) 12 kB ( 0%) tree DSE : 1.09 ( 1%) 0.00 ( 0%) 1.07 ( 0%) 0 kB ( 0%) PHI merge : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) tree slp vectorization : 0.02 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 747 kB ( 0%) tree SSA uncprop : 1.44 ( 1%) 0.00 ( 0%) 1.44 ( 0%) 0 kB ( 0%) gimple widening/fma detection : 0.13 ( 0%) 0.00 ( 0%) 0.13 ( 0%) 0 kB ( 0%) tree strlen optimization : 0.19 ( 0%) 0.00 ( 0%) 0.19 ( 0%) 0 kB ( 0%) dominance computation : 0.03 ( 0%) 0.01 ( 0%) 0.13 ( 0%) 0 kB ( 0%) control dependences : 0.02 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%) out of ssa : 17.90 ( 9%) 0.06 ( 0%) 17.97 ( 5%) 47 kB ( 0%) expand vars : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 444 kB ( 0%) expand : 0.32 ( 0%) 0.01 ( 0%) 0.33 ( 0%) 5520 kB ( 0%) forward prop : 0.03 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 246 kB ( 0%) CSE : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 145 kB ( 0%) dead code elimination : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%) dead store elim1 : 0.03 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 720 kB ( 0%) dead store elim2 : 0.03 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 1070 kB ( 0%) loop init : 0.01 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 22 kB ( 0%) CSE 2 : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 56 kB ( 0%) branch prediction : 0.05 ( 0%) 0.01 ( 0%) 0.05 ( 0%) 0 kB ( 0%) combiner : 0.07 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 965 kB ( 0%) integrated RA : 0.12 ( 0%) 0.00 ( 0%) 0.24 ( 0%) 2193 kB ( 0%) LRA non-specific : 0.07 ( 0%) 0.00 ( 0%) 0.07 ( 0%) 818 kB ( 0%) LRA virtuals elimination : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 460 kB ( 0%) LRA reload inheritance : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 132 kB ( 0%) LRA create live ranges : 0.18 ( 0%) 0.00 ( 0%) 0.18 ( 0%) 87 kB ( 0%) LRA hard reg assignment : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) LRA rematerialization : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) reload CSE regs : 0.06 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 1865 kB ( 0%) load CSE after reload : 0.11 ( 0%) 0.00 ( 0%) 0.11 ( 0%) 138 kB ( 0%) ree : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) thread pro- & epilogue : 0.02 ( 0%) 0.01 ( 0%) 0.03 ( 0%) 6 kB ( 0%) combine stack adjustments : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 37 kB ( 0%) peephole 2 : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 348 kB ( 0%) hard reg cprop : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) scheduling 2 : 0.13 ( 0%) 0.00 ( 0%) 0.13 ( 0%) 377 kB ( 0%) reorder blocks : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 518 kB ( 0%) shorten branches : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%) final : 0.03 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 1302 kB ( 0%) straight-line strength reduction : 0.24 ( 0%) 0.00 ( 0%) 0.24 ( 0%) 0 kB ( 0%) initialize rtl : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 9 kB ( 0%) rest of compilation : 1.18 ( 1%) 0.06 ( 0%) 1.28 ( 0%) 82789 kB ( 2%) remove unused locals : 4.82 ( 3%) 0.00 ( 0%) 5.02 ( 1%) 0 kB ( 0%) address taken : 1.88 ( 1%) 0.00 ( 0%) 1.91 ( 1%) 0 kB ( 0%) TOTAL : 189.53 24.53 367.91 5299338 kB #################################################################################### As you can see, in each newer GCC version the performance is even more degraded. F.Y.I I also tried the same code using LLVM v8.0.0 which complied similar to GCC 4.4.7. Best regards, -- Rogerio
(In reply to Rogério de Souza Moraes from comment #53) > Created attachment 46627 [details] > New testcase that shows performance degradation in GCC 9.1 > > This new testcase that shows performance degradation in GCC 9.1, the > performance gets worst as we use a newer GCC version to build it. > > Example how to compile: > > gcc -ftime-report -m32 -w -c -O3 -pedantic -fwrapv -mstackrealign > -mpreferred-stack-boundary=4 testcase-v1.c -o testcase-v1.o Let's open a new bugreport for this testcase, it looks different enough to warrant this. I notice you use -O3 which might explain the larger tree SSA rewrite time. PR91257.
Not going to backport further, fixed in GCC 8+.