Bug 77283 - [7 Regression] Revision 238005 disables loop unrolling
Summary: [7 Regression] Revision 238005 disables loop unrolling
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P1 normal
Target Milestone: 7.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
: 77366 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-08-17 20:04 UTC by Peter Bergner
Modified: 2017-01-16 10:36 UTC (History)
5 users (show)

See Also:
Host:
Target: powerpc64*-*-*, s390x-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-08-18 00:00:00


Attachments
patch (976 bytes, text/plain)
2016-08-18 12:11 UTC, Richard Biener
Details
updated patch (1.42 KB, patch)
2017-01-12 10:31 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Bergner 2016-08-17 20:04:50 UTC
Revision 238005 (richi's fix to path splitting to handle empty else blocks) disables loop unrolling for the test case below (extracted from one of our benchmarks) leading to a performance regression.

bergner@genoa:~/gcc/BUGS/$ cat foo.c 
void
foo (double *x, double *a, double *b, long n, double limit)
{
  long i;
  for (i=0; i < n; i++)
    if (a[i] < limit)
      x[i] = b[i];
}
bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc -O3 -funroll-loops -S foo.c 
bergner@genoa:~/gcc/BUGS/$ cat foo.s 

foo:
	cmpdi 0,6,0
	ble 0,.L1
	sldi 6,6,3
	li 9,0
	.p2align 4,,15
.L3:
	lfdx 0,4,9
	fcmpu 7,0,1
	bnl 7,.L8
	lfdx 2,5,9
	stfdx 2,3,9
	addi 9,9,8
	cmpld 5,9,6
	bne 5,.L3
.L1:
	blr
	.p2align 4,,15
.L8:
	addi 9,9,8
	cmpld 1,9,6
	bne 1,.L3
	blr


bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc -O3 -funroll-loops -S foo.c 
bergner@genoa:~/gcc/BUGS/$ cat foo.s

foo:
	cmpdi 0,6,0
	ble 0,.L1
	lfd 0,0(4)
	sldi 6,6,3
	addi 8,6,-8
	srdi 0,8,3
	rldicl 10,0,0,61
	fcmpu 7,0,1
	blt 7,.L8
.L59:
	li 9,8
	cmpld 1,9,6
	beq 1,.L1
	cmpdi 5,10,0
	beq 5,.L30
	cmpdi 6,10,1
	beq 6,.L46
	cmpdi 0,10,2
	beq 0,.L47
	cmpdi 7,10,3
	beq 7,.L48
	cmpdi 1,10,4
	beq 1,.L49
	cmpdi 5,10,5
	beq 5,.L50
	cmpdi 6,10,6
	beq 6,.L51
	lfdx 3,4,9
	fcmpu 0,3,1
	bnl 0,.L61
	lfdx 4,5,9
	stfdx 4,3,9
.L61:
	addi 9,9,8
.L51:
	lfdx 5,4,9
	fcmpu 7,5,1
	bnl 7,.L62
	lfdx 6,5,9
	stfdx 6,3,9
.L62:
	addi 9,9,8
.L50:
	lfdx 7,4,9
	fcmpu 1,7,1
	bnl 1,.L63
	lfdx 8,5,9
	stfdx 8,3,9
[snip]


An executable version of the test case above is:

bergner@genoa:~/gcc/BUGS/LTC144447$ cat loop.c 
#define SIZE 1024*1000

void
__attribute__ ((noinline))
foo (double *x, double *a, double *b, long n, double limit)
{
  long i;
  for (i=0; i < n; i++)
    if (a[i] < limit)
      x[i] = b[i];
}

double x[SIZE], a[SIZE], b[SIZE];

int
main (void)
{
  long i;
  for (i=0; i < 3000; i++)
    foo (x, a, b, SIZE, 1.0);
  return 0;
}

bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238004/gcc -O3 -funroll-loops loop.c 
bergner@genoa:~/gcc/BUGS/$ time ./a.out 

real	0m3.729s
user	0m3.690s
sys	0m0.021s

bergner@genoa:~/gcc/BUGS/$ /home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc/xgcc -B/home/bergner/gcc/build/gcc-fsf-mainline-r238005/gcc -O3 -funroll-loops loop.c 
bergner@genoa:~/gcc/BUGS/$ time ./a.out 

real	0m6.939s
user	0m6.851s
sys	0m0.040s
Comment 1 Peter Bergner 2016-08-17 20:09:57 UTC
This was tested on powerpc64le-linux, but given the patch that caused this is target independent, I'm guessing this affects other targets as well.  I haven't confirmed that yet though.
Comment 2 Peter Bergner 2016-08-17 20:19:23 UTC
For documentation purposes, the upstream patch that caused this is:

https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00189.html
Comment 3 Drea Pinski 2016-08-18 06:04:44 UTC
I noticed path splitting doing some weird stuff with loops even in GCC 6 (though I don't have a testcase right now that shows it being worse off).

Confirmed for this case.
Comment 4 Richard Biener 2016-08-18 08:42:50 UTC
I don't like (and see the point of) path-splitting but the patch was to avoid a regression with PRE code-hoisting.  The path-splitting code is incredibly stupid
when it comes to cost modeling.  That is, it tries to expose CSE but doesn't
actually verify there is any likeliness in achieving that.  In fact, if the
block we duplicate (the latch) doesn't contain any stmts (in this case just
the IV increment) then there is _zero_ CSE possibility.  Improving path-splitting
for this case is welcome.  Or just remove that stupid pass and wire it into
a pass that can actually (opportunistically) perform the CSE before deciding
to duplicate the tail.

Note that unrolling likely gives up because of the loop now having two latches?

Note that there was a plan to move RTL unrolling (-funroll-loops) to GIMPLE,
but that wouldn't affect fallout like SMS not being able to unroll (not sure
if that can handle conditional code at all).
Comment 5 Richard Biener 2016-08-18 08:59:53 UTC
Patch to fix the testcase, but will eventually regress the path-splitting testcase again.  The IV increment may be combined with a stmt in the
path thus more analysis would be required here.  I guess if there are
no non-virtual PHIs in the joiner in addition to <= 2 stmts then we can
be reasonably sure of no CSE possibilities.

Index: gcc/gimple-ssa-split-paths.c
===================================================================
--- gcc/gimple-ssa-split-paths.c        (revision 239560)
+++ gcc/gimple-ssa-split-paths.c        (working copy)
@@ -200,6 +200,12 @@ is_feasible_trace (basic_block bb)
        }
     }
 
+  /* If the join block has only the conditional and possibly an IV
+     increment there are no possible CSE/DCE opportunities to expose by
+     duplicating it.  */
+  if (num_stmts_in_join <= 2)
+    return false;
+
   /* We may want something here which looks at dataflow and tries
      to guess if duplication of BB is likely to result in simplification
      of instructions in BB in either the original or the duplicate.  */


Or simply look at all PHIs
(no PHIs == no CSE/DCE possibility anyway) and see if any of its result
has a use in the joiner.  If not -> fail.

Index: gcc/gimple-ssa-split-paths.c
===================================================================
--- gcc/gimple-ssa-split-paths.c        (revision 239560)
+++ gcc/gimple-ssa-split-paths.c        (working copy)
@@ -32,6 +32,9 @@ along with GCC; see the file COPYING3.
 #include "tracer.h"
 #include "predict.h"
 #include "params.h"
+#include "gimple-ssa.h"
+#include "tree-phinodes.h"
+#include "ssa-iterators.h"
 
 /* Given LATCH, the latch block in a loop, see if the shape of the
    path reaching LATCH is suitable for being split by duplication.
@@ -200,6 +203,34 @@ is_feasible_trace (basic_block bb)
        }
     }
 
+  /* If the joiner has no PHIs with uses inside it there is zero chance
+     of CSE/DCE possibilities exposed by duplicating it.  */
+  bool found_phi_with_uses_in_bb = false;
+  for (gphi_iterator si = gsi_start_phis (bb); ! gsi_end_p (si);
+       gsi_next (&si))
+    {
+      gphi *phi = si.phi ();
+      use_operand_p use_p;
+      imm_use_iterator iter;
+      FOR_EACH_IMM_USE_FAST (use_p, iter, gimple_phi_result (phi))
+       if (gimple_bb (USE_STMT (use_p)) == bb)
+         {
+           found_phi_with_uses_in_bb = true;
+           break;
+         }
+      if (found_phi_with_uses_in_bb)
+       break;
+    }
+  if (! found_phi_with_uses_in_bb)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+       fprintf (dump_file,
+                "Block %d is a join that does not expose CSE/DCE "
+                "opportunities when duplicated.\n",
+                bb->index);
+      return false;
+    }
+
   /* We may want something here which looks at dataflow and tries
      to guess if duplication of BB is likely to result in simplification
      of instructions in BB in either the original or the duplicate.  */


This one FAILs gcc.dg/tree-ssa/split-path-7.c though.  But it looks quite
artificial (with empty if blocks, etc.).  Looking at the IL definitely only
one path is profitable to split.  Jeff?
Comment 6 Richard Biener 2016-08-18 09:33:47 UTC
I'm reasonably happy with the PHI logic so I am going to test it.  It's quite
on the do-more-duplicating side for memory references (just uses the virtual
operand PHI to see if there are any loads/stores in the joiner).

For the case of

  if (a)
    i = i + 1;
  else
    ...;
  j = i + 1;

thus CSE opportunities on one/both paths this is something that PRE catches
already.  So we're not actually looking for CSE opportunities but opportunities
for simplification or DCE/DSE.
Comment 7 Richard Biener 2016-08-18 12:11:55 UTC
Created attachment 39469 [details]
patch

Ok, causes

FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-not dom3 "bit_xor"
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 
0;" 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 
1;" 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_
[0-9]+. with constant .0." 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_
[0-9]+. with constant .1." 1

which is where path-splitting exposes a jump threading opportunity it seems
as jump threading is not happy to perform the operation in one go.

Also my adjustment of gcc.dg/tree-ssa/split-path-7.c was only good in my dev tree for some reason.  Otherwise bootstrapped / tested on x86_64-unknown-linux-gnu.
Comment 8 Richard Biener 2016-08-18 12:24:09 UTC
(In reply to Richard Biener from comment #7)
> Also my adjustment of gcc.dg/tree-ssa/split-path-7.c was only good in my dev
> tree for some reason.  Otherwise bootstrapped / tested on
> x86_64-unknown-linux-gnu.

Ah, that's because my dev tree has store commoning during sinking which turns

  if ()
    *dp = ...
  else
    *dp = ...
  if (++dp)
    goto loop;

into

  if ()
    ...
  else
    ...
  *dp = ...
  if (++dp)
    goto loop;

making the block artificially interesting.
Comment 9 Richard Biener 2016-08-24 14:06:29 UTC
*** Bug 77366 has been marked as a duplicate of this bug. ***
Comment 10 Richard Biener 2016-08-24 14:07:37 UTC
Testcase from PR77366 for s390x.

void
foo(unsigned int size, unsigned int *state)
{
   unsigned int i;

   for(i = 0; i < size; i++)
    {
       if(*state & 1)
         {
           *state ^= 1;
         }
    }
}
Comment 11 rdapp 2017-01-12 09:40:20 UTC
Any progress on this? The patch fixes the problem for s390x (no performance regressions), but without it we see the regression in SPEC2006's libquantum all the time, I guess the same is true for PowerPC? Any chance for it to go into 7.0?

For s390x, disabling the path-splitting pass does not introduce performance regressions either and fixes libquantum but that would only be a very last resort.
Comment 12 Richard Biener 2017-01-12 10:31:21 UTC
Created attachment 40508 [details]
updated patch

Well, updated patch avoids to regress existing testcases (where they make sense).

But it doesn't fix the s390 testcase (added as split-path-9.c) given we have

  <bb 4> [85.00%]:
  # i_15 = PHI <i_11(8), 0(3)>
  # prephitmp_19 = PHI <prephitmp_17(8), pretmp_18(3)>
  _2 = prephitmp_19 & 1;
  if (_2 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 5> [42.50%]:
  _3 = prephitmp_19 ^ 1;
  *state_9(D) = _3;

  <bb 6> [85.00%]:
  # prephitmp_17 = PHI <prephitmp_19(4), _3(5)>
  i_11 = i_15 + 1;
  if (i_11 != size_8(D))
    goto <bb 8>; [85.00%]
  else
    goto <bb 7>; [15.00%]

  <bb 8> [72.25%]:
  goto <bb 4>; [100.00%]

where clearly threading is possible (and wanted).  Unfortunately we do a pretty
bad job as followup (for whatever reason).  We fail to transform this to an
early exit of the loop.  Jump threading fails to thread away the & 1 check
as well.
Comment 13 Richard Biener 2017-01-12 10:36:38 UTC
Btw, I'm somewhat fed up with the lack of maintainance shown by the pass submitter.  He doesn't have write-after-approval nor a bugzilla account.
Comment 14 Richard Biener 2017-01-12 10:38:38 UTC
And maybe PR77366 is too simplified.  The following testcase is "fixed":

void
foo(unsigned int size, unsigned int *state)
{
  unsigned int i;

  for(i = 0; i < size; i++)
    {
      if(state[i] & 1)
        state[i] ^= 1;
    }
}
Comment 15 rdapp 2017-01-12 12:11:03 UTC
The updated patch fixes libquantum on s390 so PR77366 might indeed be to simplified to check for that, but it was unrolled before r238005. Addressing libquantum is more important, of course.
Comment 16 Richard Biener 2017-01-12 15:18:44 UTC
Patch posted.
Comment 17 Richard Biener 2017-01-13 08:11:34 UTC
Author: rguenth
Date: Fri Jan 13 08:11:01 2017
New Revision: 244392

URL: https://gcc.gnu.org/viewcvs?rev=244392&root=gcc&view=rev
Log:
2017-01-13  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/77283
	* gimple-ssa-split-paths.c: Include gimple-ssa.h, tree-phinodes.h
	and ssa-iterators.h.
	(is_feasible_trace): Implement a cost model based on joiner
	PHI node uses.

	* gcc.dg/tree-ssa/split-path-7.c: Adjust.
	* gcc.dg/tree-ssa/split-path-8.c: New testcase.
	* gcc.dg/tree-ssa/split-path-9.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-8.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-9.c
Modified:
    trunk/gcc/gimple-ssa-split-paths.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-7.c
Comment 18 Richard Biener 2017-01-13 08:48:29 UTC
Author: rguenth
Date: Fri Jan 13 08:47:57 2017
New Revision: 244394

URL: https://gcc.gnu.org/viewcvs?rev=244394&root=gcc&view=rev
Log:
2017-01-13  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/77283
	* gcc.dg/tree-ssa/split-path-9.c: Fix.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-9.c
Comment 19 Richard Biener 2017-01-16 09:33:44 UTC
Author: rguenth
Date: Mon Jan 16 09:33:12 2017
New Revision: 244487

URL: https://gcc.gnu.org/viewcvs?rev=244487&root=gcc&view=rev
Log:
2017-01-13  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/77283
	* gimple-ssa-split-paths.c: Include gimple-ssa.h, tree-phinodes.h
	and ssa-iterators.h.
	(is_feasible_trace): Implement a cost model based on joiner
	PHI node uses.

	* gcc.dg/tree-ssa/split-path-7.c: Adjust.
	* gcc.dg/tree-ssa/split-path-8.c: New testcase.
	* gcc.dg/tree-ssa/split-path-9.c: Likewise.

Modified:
    trunk/gcc/ChangeLog
Comment 20 Richard Biener 2017-01-16 10:36:00 UTC
Fixed.