Revision 238005 (richi's fix to path splitting to handle empty else blocks) disables loop unrolling for the test case below (extracted from one of our benchmarks) leading to a performance regression. bergner@genoa:~/gcc/BUGS/$ cat foo.c void foo (double *x, double *a, double *b, long n, double limit) { long i; for (i=0; i < n; i++) if (a[i] < limit) x[i] = b[i]; } bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc -O3 -funroll-loops -S foo.c bergner@genoa:~/gcc/BUGS/$ cat foo.s foo: cmpdi 0,6,0 ble 0,.L1 sldi 6,6,3 li 9,0 .p2align 4,,15 .L3: lfdx 0,4,9 fcmpu 7,0,1 bnl 7,.L8 lfdx 2,5,9 stfdx 2,3,9 addi 9,9,8 cmpld 5,9,6 bne 5,.L3 .L1: blr .p2align 4,,15 .L8: addi 9,9,8 cmpld 1,9,6 bne 1,.L3 blr bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc -O3 -funroll-loops -S foo.c bergner@genoa:~/gcc/BUGS/$ cat foo.s foo: cmpdi 0,6,0 ble 0,.L1 lfd 0,0(4) sldi 6,6,3 addi 8,6,-8 srdi 0,8,3 rldicl 10,0,0,61 fcmpu 7,0,1 blt 7,.L8 .L59: li 9,8 cmpld 1,9,6 beq 1,.L1 cmpdi 5,10,0 beq 5,.L30 cmpdi 6,10,1 beq 6,.L46 cmpdi 0,10,2 beq 0,.L47 cmpdi 7,10,3 beq 7,.L48 cmpdi 1,10,4 beq 1,.L49 cmpdi 5,10,5 beq 5,.L50 cmpdi 6,10,6 beq 6,.L51 lfdx 3,4,9 fcmpu 0,3,1 bnl 0,.L61 lfdx 4,5,9 stfdx 4,3,9 .L61: addi 9,9,8 .L51: lfdx 5,4,9 fcmpu 7,5,1 bnl 7,.L62 lfdx 6,5,9 stfdx 6,3,9 .L62: addi 9,9,8 .L50: lfdx 7,4,9 fcmpu 1,7,1 bnl 1,.L63 lfdx 8,5,9 stfdx 8,3,9 [snip] An executable version of the test case above is: bergner@genoa:~/gcc/BUGS/LTC144447$ cat loop.c #define SIZE 1024*1000 void __attribute__ ((noinline)) foo (double *x, double *a, double *b, long n, double limit) { long i; for (i=0; i < n; i++) if (a[i] < limit) x[i] = b[i]; } double x[SIZE], a[SIZE], b[SIZE]; int main (void) { long i; for (i=0; i < 3000; i++) foo (x, a, b, SIZE, 1.0); return 0; } bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc -O3 -funroll-loops loop.c bergner@genoa:~/gcc/BUGS/$ time ./a.out real 0m3.729s user 0m3.690s sys 0m0.021s bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc -O3 -funroll-loops loop.c bergner@genoa:~/gcc/BUGS/$ time ./a.out real 0m6.939s user 0m6.851s sys 0m0.040s
This was tested on powerpc64le-linux, but given the patch that caused this is target independent, I'm guessing this affects other targets as well. I haven't confirmed that yet though.
For documentation purposes, the upstream patch that caused this is: https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00189.html
I noticed path splitting doing some weird stuff with loops even in GCC 6 (though I don't have a testcase right now that shows it being worse off). Confirmed for this case.
I don't like (and see the point of) path-splitting but the patch was to avoid a regression with PRE code-hoisting. The path-splitting code is incredibly stupid when it comes to cost modeling. That is, it tries to expose CSE but doesn't actually verify there is any likeliness in achieving that. In fact, if the block we duplicate (the latch) doesn't contain any stmts (in this case just the IV increment) then there is _zero_ CSE possibility. Improving path-splitting for this case is welcome. Or just remove that stupid pass and wire it into a pass that can actually (opportunistically) perform the CSE before deciding to duplicate the tail. Note that unrolling likely gives up because of the loop now having two latches? Note that there was a plan to move RTL unrolling (-funroll-loops) to GIMPLE, but that wouldn't affect fallout like SMS not being able to unroll (not sure if that can handle conditional code at all).
Patch to fix the testcase, but will eventually regress the path-splitting testcase again. The IV increment may be combined with a stmt in the path thus more analysis would be required here. I guess if there are no non-virtual PHIs in the joiner in addition to <= 2 stmts then we can be reasonably sure of no CSE possibilities. Index: gcc/gimple-ssa-split-paths.c =================================================================== --- gcc/gimple-ssa-split-paths.c (revision 239560) +++ gcc/gimple-ssa-split-paths.c (working copy) @@ -200,6 +200,12 @@ is_feasible_trace (basic_block bb) } } + /* If the join block has only the conditional and possibly an IV + increment there are no possible CSE/DCE opportunities to expose by + duplicating it. */ + if (num_stmts_in_join <= 2) + return false; + /* We may want something here which looks at dataflow and tries to guess if duplication of BB is likely to result in simplification of instructions in BB in either the original or the duplicate. */ Or simply look at all PHIs (no PHIs == no CSE/DCE possibility anyway) and see if any of its result has a use in the joiner. If not -> fail. Index: gcc/gimple-ssa-split-paths.c =================================================================== --- gcc/gimple-ssa-split-paths.c (revision 239560) +++ gcc/gimple-ssa-split-paths.c (working copy) @@ -32,6 +32,9 @@ along with GCC; see the file COPYING3. #include "tracer.h" #include "predict.h" #include "params.h" +#include "gimple-ssa.h" +#include "tree-phinodes.h" +#include "ssa-iterators.h" /* Given LATCH, the latch block in a loop, see if the shape of the path reaching LATCH is suitable for being split by duplication. @@ -200,6 +203,34 @@ is_feasible_trace (basic_block bb) } } + /* If the joiner has no PHIs with uses inside it there is zero chance + of CSE/DCE possibilities exposed by duplicating it. */ + bool found_phi_with_uses_in_bb = false; + for (gphi_iterator si = gsi_start_phis (bb); ! gsi_end_p (si); + gsi_next (&si)) + { + gphi *phi = si.phi (); + use_operand_p use_p; + imm_use_iterator iter; + FOR_EACH_IMM_USE_FAST (use_p, iter, gimple_phi_result (phi)) + if (gimple_bb (USE_STMT (use_p)) == bb) + { + found_phi_with_uses_in_bb = true; + break; + } + if (found_phi_with_uses_in_bb) + break; + } + if (! found_phi_with_uses_in_bb) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "Block %d is a join that does not expose CSE/DCE " + "opportunities when duplicated.\n", + bb->index); + return false; + } + /* We may want something here which looks at dataflow and tries to guess if duplication of BB is likely to result in simplification of instructions in BB in either the original or the duplicate. */ This one FAILs gcc.dg/tree-ssa/split-path-7.c though. But it looks quite artificial (with empty if blocks, etc.). Looking at the IL definitely only one path is profitable to split. Jeff?
I'm reasonably happy with the PHI logic so I am going to test it. It's quite on the do-more-duplicating side for memory references (just uses the virtual operand PHI to see if there are any loads/stores in the joiner). For the case of if (a) i = i + 1; else ...; j = i + 1; thus CSE opportunities on one/both paths this is something that PRE catches already. So we're not actually looking for CSE opportunities but opportunities for simplification or DCE/DSE.
Created attachment 39469 [details] patch Ok, causes FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-not dom3 "bit_xor" FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 0;" 1 FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 1;" 1 FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_ [0-9]+. with constant .0." 1 FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_ [0-9]+. with constant .1." 1 which is where path-splitting exposes a jump threading opportunity it seems as jump threading is not happy to perform the operation in one go. Also my adjustment of gcc.dg/tree-ssa/split-path-7.c was only good in my dev tree for some reason. Otherwise bootstrapped / tested on x86_64-unknown-linux-gnu.
(In reply to Richard Biener from comment #7) > Also my adjustment of gcc.dg/tree-ssa/split-path-7.c was only good in my dev > tree for some reason. Otherwise bootstrapped / tested on > x86_64-unknown-linux-gnu. Ah, that's because my dev tree has store commoning during sinking which turns if () *dp = ... else *dp = ... if (++dp) goto loop; into if () ... else ... *dp = ... if (++dp) goto loop; making the block artificially interesting.
*** Bug 77366 has been marked as a duplicate of this bug. ***
Testcase from PR77366 for s390x. void foo(unsigned int size, unsigned int *state) { unsigned int i; for(i = 0; i < size; i++) { if(*state & 1) { *state ^= 1; } } }
Any progress on this? The patch fixes the problem for s390x (no performance regressions), but without it we see the regression in SPEC2006's libquantum all the time, I guess the same is true for PowerPC? Any chance for it to go into 7.0? For s390x, disabling the path-splitting pass does not introduce performance regressions either and fixes libquantum but that would only be a very last resort.
Created attachment 40508 [details] updated patch Well, updated patch avoids to regress existing testcases (where they make sense). But it doesn't fix the s390 testcase (added as split-path-9.c) given we have <bb 4> [85.00%]: # i_15 = PHI <i_11(8), 0(3)> # prephitmp_19 = PHI <prephitmp_17(8), pretmp_18(3)> _2 = prephitmp_19 & 1; if (_2 != 0) goto <bb 5>; [50.00%] else goto <bb 6>; [50.00%] <bb 5> [42.50%]: _3 = prephitmp_19 ^ 1; *state_9(D) = _3; <bb 6> [85.00%]: # prephitmp_17 = PHI <prephitmp_19(4), _3(5)> i_11 = i_15 + 1; if (i_11 != size_8(D)) goto <bb 8>; [85.00%] else goto <bb 7>; [15.00%] <bb 8> [72.25%]: goto <bb 4>; [100.00%] where clearly threading is possible (and wanted). Unfortunately we do a pretty bad job as followup (for whatever reason). We fail to transform this to an early exit of the loop. Jump threading fails to thread away the & 1 check as well.
Btw, I'm somewhat fed up with the lack of maintainance shown by the pass submitter. He doesn't have write-after-approval nor a bugzilla account.
And maybe PR77366 is too simplified. The following testcase is "fixed": void foo(unsigned int size, unsigned int *state) { unsigned int i; for(i = 0; i < size; i++) { if(state[i] & 1) state[i] ^= 1; } }
The updated patch fixes libquantum on s390 so PR77366 might indeed be to simplified to check for that, but it was unrolled before r238005. Addressing libquantum is more important, of course.
Patch posted.
Author: rguenth Date: Fri Jan 13 08:11:01 2017 New Revision: 244392 URL: https://gcc.gnu.org/viewcvs?rev=244392&root=gcc&view=rev Log: 2017-01-13 Richard Biener <rguenther@suse.de> PR tree-optimization/77283 * gimple-ssa-split-paths.c: Include gimple-ssa.h, tree-phinodes.h and ssa-iterators.h. (is_feasible_trace): Implement a cost model based on joiner PHI node uses. * gcc.dg/tree-ssa/split-path-7.c: Adjust. * gcc.dg/tree-ssa/split-path-8.c: New testcase. * gcc.dg/tree-ssa/split-path-9.c: Likewise. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-8.c trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-9.c Modified: trunk/gcc/gimple-ssa-split-paths.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-7.c
Author: rguenth Date: Fri Jan 13 08:47:57 2017 New Revision: 244394 URL: https://gcc.gnu.org/viewcvs?rev=244394&root=gcc&view=rev Log: 2017-01-13 Richard Biener <rguenther@suse.de> PR tree-optimization/77283 * gcc.dg/tree-ssa/split-path-9.c: Fix. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-9.c
Author: rguenth Date: Mon Jan 16 09:33:12 2017 New Revision: 244487 URL: https://gcc.gnu.org/viewcvs?rev=244487&root=gcc&view=rev Log: 2017-01-13 Richard Biener <rguenther@suse.de> PR tree-optimization/77283 * gimple-ssa-split-paths.c: Include gimple-ssa.h, tree-phinodes.h and ssa-iterators.h. (is_feasible_trace): Implement a cost model based on joiner PHI node uses. * gcc.dg/tree-ssa/split-path-7.c: Adjust. * gcc.dg/tree-ssa/split-path-8.c: New testcase. * gcc.dg/tree-ssa/split-path-9.c: Likewise. Modified: trunk/gcc/ChangeLog
Fixed.