62217 – [4.9 Regression] DOM confuses complete unrolling which in turn causes VRP to warn

Bug 62217 - [4.9 Regression] DOM confuses complete unrolling which in turn causes VRP to warn

Summary: [4.9 Regression] DOM confuses complete unrolling which in turn causes VRP to ...

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.9.2

Importance:	P2 normal
Target Milestone:	4.9.4
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2014-08-21 13:31 UTC by Kirill Yukhin
Modified:	2016-02-11 13:45 UTC (History)
CC List:	5 users (show)

See Also:
Host:
Target:
Build:
Known to work:	4.8.3, 5.0
Known to fail:	4.9.3
Last reconfirmed:	2014-08-22 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Kirill Yukhin 2014-08-21 13:31:32 UTC

Suppose we have a test:

typedef struct { unsigned data; } s1;
s1 g_x[4];

extern void foo (s1 *x1, s1 *x2, int a, int b)
{
  int i;
  for(i = 0; i < a; i++)
    if(i == b)
      g_x[i] = *x1;
    else
      g_x[i] = *x2;
}

$ ./build-x86_64-linux/gcc/xgcc -B./build-x86_64-linux/gcc -O3 -Wall -S x1.c
x1.c: In function ‘foo’:
x1.c:9:10: warning: array subscript is above array bounds [-Warray-bounds]
       g_x[i] = *x1;
          ^
Which come from second pass of VRP:
  # i_14 = PHI <4(14)>
  if (b_6(D) == 4)
    goto <bb 17>;
  else
    goto <bb 18>;

  <bb 17>:
  g_x[4] = *x1_7(D);
  i_11 = 5;
  goto <bb 3>;

  <bb 18>:
  __builtin_unreachable ();

VRP checker fairly warns that g_x[4] is out-of-bounds mem ref.

This code was generated by tree-ssa-loop-ivcanon.c

I do not completely understand logic of cunroll pass,
however dumps say:

Loop 1 iterates at most 4 times.
...
x1.c:7:3: note: loop with 5 iterations completely unrolled
Last iteration exit edge was proved true.
Forced statement unreachable: g_x[i_14] = *x2_9(D);

In function `try_unroll_loop_completely' argument `maxiter'
equals to 4, which is correct. Then (through n_unroll var)
this value is passed to `gimple_duplicate_loop_to_header_edge'
routine, which perform the peeling. This routine as far
as I understand perform peel `n_unroll' copies from original
loop, resulting to (n_unroll+1) overall copies (including original
loop).

I think of 2 solutions:
- Reduce peel number by 1
- Improve remove_exits_and_undefined_stmts (), which correctly
  insert `gcc_unreachable' for `false' part of the if-stmt on
  5-th iteration.

It obviously don't warns, when limit is reduced:
./build-x86_64-linux/gcc/xgcc -B./build-x86_64-linux/gcc -O3 -Wall  -msse3  -c -m32  x1.c --param max-completely-peel-times=3

Comment 1 Andrew Pinski 2014-08-22 04:41:12 UTC

Actually it is a bad interaction between DOM and complete unrolling.
  <bb 20>:
  # i_14 = PHI <i_36(19)>
  if (i_14 == b_6(D))
    goto <bb 21>;
  else
    goto <bb 22>;

  <bb 21>:
  g_x[b_6(D)] = *x1_7(D);
  i_11 = i_14 + 1;
  goto <bb 3>;

  <bb 22>:
  __builtin_unreachable ();

--- CUT ---
Basic Block 21 should have been a __builtin_unreachable (); also.
BB21 was bb5 after dom.

BB5  was changed all the way back in dom to:
  <bb 5>:
  g_x[b_6(D)] = *x1_7(D);
  goto <bb 7>;


It was before DOM:
  <bb 5>:
  g_x[i_14] = *x2_9(D);

Comment 2 Andrew Pinski 2014-08-22 07:00:29 UTC

Confirmed.

Comment 3 Kirill Yukhin 2014-08-22 07:45:02 UTC

As long as I understand `remove_exits_and_undefined_stmts'
iterate loop boundaries set marking `unreachable' stmts w/
impossible bounds.

For the example we have:
- for true edge
  basic block 6, loop depth 1
   pred:       5
  g_x[b_6(D)] = *x1_7(D);
  goto <bb 8>;
   succ:       8
- for false edge
  basic block 7, loop depth 1
   pred:       5
  g_x[i_14] = *x2_9(D);
   succ:       8

I suspect, that the problem is that `b' was propagated along
`true' edge and replaced use of `i'.
This removed reference to g_x from boundaries analysis for that edge:
no IV is used there explicitly, only implicitly as b == i.

Hence this stmt didn't hit boundaries set of the loop and
wasn't marked as unreachable.

BTW: this code survive rest of optimizations:
        movl    (%edx), %edx
        cmpl    $4, %eax
        movl    %edx, g_x+12
        jle     .L1
        movl    (%ebx), %eax
        movl    %eax, g_x+16 ;; REDUNDANT

Looks like not simple warning, but also unnecessary code was
generated.

Comment 4 Jakub Jelinek 2014-10-30 10:36:51 UTC

GCC 4.9.2 has been released.

Comment 5 Jeffrey A. Law 2015-02-12 22:41:58 UTC

Kirill, you are correct WRT propagation of "b" for "i".  Prior to DOM1 we have:

;;   basic block 3, loop depth 1, count 0, freq 9100, maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE)
;;    pred:       7 [91.0%]  (TRUE_VALUE,EXECUTABLE)
  if (i_1 == b_6(D))
    goto <bb 4>;
  else
    goto <bb 5>;
;;    succ:       4 [0.0%]  (TRUE_VALUE,EXECUTABLE)
;;                5 [100.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 4, loop depth 1, count 0, freq 2, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE)
;;    pred:       3 [0.0%]  (TRUE_VALUE,EXECUTABLE)
  g_x[i_1] = *x1_7(D);
  goto <bb 6>;
;;    succ:       6 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 5, loop depth 1, count 0, freq 9098, maybe hot
;;    prev block 4, next block 6, flags: (NEW, REACHABLE)
;;    pred:       3 [100.0%]  (FALSE_VALUE,EXECUTABLE)
  g_x[i_1] = *x2_9(D);
;;    succ:       6 [100.0%]  (FALLTHRU,EXECUTABLE)


DOM records that i_1 and b_6 are equivalent on the edge bb3->bb4.  So in bb4 it replaces i_1 with b_6.  Presumably that's causing problems downstream in the optimization pipeline.  The easiest way to think about this is we record that i_1 can be legitimately replaced with b_6 in that context.  We could just have easily recorded that b_6 can be replaced with i_1.

I don't think we have any heuristics for which of those two equivalences to record, it's strictly based on the order of appearance (which is likely determined elsewhere by canonicalization rules).

If you want to propose some heuristics, I'm all ears.   One might be to put the object with the least number of references on the lhs.  THe idea would be to try and ultimately get that use count to 0/1 which may allow that object to die at the comparison.  There may be some reasonable heuristic around loop depth of the names as well.    ie, do we want to replace uses of a non-loop object with a loop object or vice versa?

Anyway, open to suggestions here...

Comment 6 Richard Biener 2015-02-13 09:16:46 UTC

(In reply to Jeffrey A. Law from comment #5)
> Kirill, you are correct WRT propagation of "b" for "i".  Prior to DOM1 we
> have:
> 
> ;;   basic block 3, loop depth 1, count 0, freq 9100, maybe hot
> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE)
> ;;    pred:       7 [91.0%]  (TRUE_VALUE,EXECUTABLE)
>   if (i_1 == b_6(D))
>     goto <bb 4>;
>   else
>     goto <bb 5>;
> ;;    succ:       4 [0.0%]  (TRUE_VALUE,EXECUTABLE)
> ;;                5 [100.0%]  (FALSE_VALUE,EXECUTABLE)
> 
> ;;   basic block 4, loop depth 1, count 0, freq 2, maybe hot
> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE)
> ;;    pred:       3 [0.0%]  (TRUE_VALUE,EXECUTABLE)
>   g_x[i_1] = *x1_7(D);
>   goto <bb 6>;
> ;;    succ:       6 [100.0%]  (FALLTHRU,EXECUTABLE)
> 
> ;;   basic block 5, loop depth 1, count 0, freq 9098, maybe hot
> ;;    prev block 4, next block 6, flags: (NEW, REACHABLE)
> ;;    pred:       3 [100.0%]  (FALSE_VALUE,EXECUTABLE)
>   g_x[i_1] = *x2_9(D);
> ;;    succ:       6 [100.0%]  (FALLTHRU,EXECUTABLE)
> 
> 
> DOM records that i_1 and b_6 are equivalent on the edge bb3->bb4.  So in bb4
> it replaces i_1 with b_6.  Presumably that's causing problems downstream in
> the optimization pipeline.  The easiest way to think about this is we record
> that i_1 can be legitimately replaced with b_6 in that context.  We could
> just have easily recorded that b_6 can be replaced with i_1.
> 
> I don't think we have any heuristics for which of those two equivalences to
> record, it's strictly based on the order of appearance (which is likely
> determined elsewhere by canonicalization rules).
> 
> If you want to propose some heuristics, I'm all ears.   One might be to put
> the object with the least number of references on the lhs.  THe idea would
> be to try and ultimately get that use count to 0/1 which may allow that
> object to die at the comparison.  There may be some reasonable heuristic
> around loop depth of the names as well.    ie, do we want to replace uses of
> a non-loop object with a loop object or vice versa?
> 
> Anyway, open to suggestions here...

The rule is simple - we should always replace with the more dominating definition because that's what value-numbering would do to be able to
make the other defs unused.

Comment 7 Jeffrey A. Law 2015-02-13 23:29:13 UTC

But replacement with the most dominating name (presumably a default def dominates everything) isn't going to help here.

In many ways we'd be better off if we didn't propagate from those equality comparisons -- unless they allowed some other later simplification.  But we don't have a good way to make that determination.  Which ultimately let to the uncprop pass which we run very late to try and put things back the way they were.

I wonder if running uncprop between DOM1 & the late unroller would be worth the extra pass.  And if so, where does it make the most sense.

Comment 8 rguenther@suse.de 2015-02-16 08:55:50 UTC

On Fri, 13 Feb 2015, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62217
> 
> --- Comment #7 from Jeffrey A. Law <law at redhat dot com> ---
> But replacement with the most dominating name (presumably a default def
> dominates everything) isn't going to help here.
> 
> In many ways we'd be better off if we didn't propagate from those equality
> comparisons -- unless they allowed some other later simplification.  But we
> don't have a good way to make that determination.

That's true.  We can also always choose the most immediate dominating
definition for the reason that it might carry more information.

But that might be a loss in some cases as well.

>  Which ultimately let to the
> uncprop pass which we run very late to try and put things back the way they
> were.
>
> I wonder if running uncprop between DOM1 & the late unroller would be worth the
> extra pass.  And if so, where does it make the most sense.

uncprop is mostly to help out-of-SSA PHI coalescing so it's sth different
(and should better be integrated with the out-of-SSA machinery)

Comment 9 Jeffrey A. Law 2015-02-16 19:05:26 UTC

Yes, any particular choice has the potential to regress in one way or another. These are heuristics after all. We're just looking for a reasonable refinement if we can find one.

Dominance doesn't seem to be the right thing to be looking at to me since the crux of this issue is propagating the "copy" implied by the equality comparison just changes what SSA_NAME we reference and as a result ultimately hides stuff from later passes. It doesn't (in this case) enable further simplifications or for either SSA_NAME to become unused. A dominance test between the args of the equality comparison just doesn't seem helpful here.

In fact, because both names are used in the equality test, these propagations can never cause an SSA_NAME to become unused. At best the propagation will expose some further simplification on one of the paths or it will result in one SSA_NAME having a single use (the comparison). We have no good way of guessing if the former will happen, but we can encourage the latter.

But as I mentioned earlier, I really wonder if we should allow these context sensitive equivalences to be expressed in the gimple if they didn't prove useful. And that was the whole purpose behind uncprop -- to find context sensitive propagations that ultimately didn't prove useful and which result in poor coalescing or unwanted constant initializations and un propagate them.

It wouldn't directly help this problem because the use is in a normal statement, but it's definitely closely related and it wouldn't be hard to repurpose that code. In fact, it might be a good thing to do in general.

Essentially what it's doing is building a map of values back to SSA_NAMEs which hold those values by way of an equality comparison. At each use of the value, we can look at the list of SSA_NAMEs that hold that value and select the one that appears to be least cost based on whatever metrics we see fit.

Comment 10 rguenther@suse.de 2015-02-17 09:44:14 UTC

On Mon, 16 Feb 2015, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62217
> 
> --- Comment #9 from Jeffrey A. Law <law at redhat dot com> --- Yes, any 
> particular choice has the potential to regress in one way or another. 
> These are heuristics after all.  We're just looking for a reasonable 
> refinement if we can find one.

Well, there is also canonicalization for CSE / jump threading.  Consider

if (i == 2)
 {
   if (i != j)
     ...
   else if (j == 2)
     ...

or

y = 2 * i;
if (i == j)
  x = i + j;

but yes, this is followup transforms.  Unfortunately(?) DOM performs
copy/constant propagation for all recorded equalities.

> Dominance doesn't seem to be the right thing to be looking at to me 
> since the crux of this issue is propagating the "copy" implied by the 
> equality comparison just changes what SSA_NAME we reference and as a 
> result ultimately hides stuff from later passes.  It doesn't (in this 
> case) enable further simplifications or for either SSA_NAME to become 
> unused.  A dominance test between the args of the equality comparison 
> just doesn't seem helpful here.
>
> In fact, because both names are used in the equality test, these 
> propagations can never cause an SSA_NAME to become unused.  At best the 
> propagation will expose some further simplification on one of the paths 
> or it will result in one SSA_NAME having a single use (the comparison).  
> We have no good way of guessing if the former will happen, but we can 
> encourage the latter.

As you say, we don't know - we only know that properly canonicalizing
will expose the followup transforms if there are any.  It looks like
we are basically taking the original order of EQ_EXPR operands
(thus eventually rely on tree_swap_operands canonicalization) plus
the "correctness" thing of taking into account loop depth (which is
kind of a dominance relation).

We are also introducing SSA_NAME_VALUE "chains" in record_equality
as we are setting x SSA_NAME_VALUE to y not to SSA_NAME_VALUE (y)
(we seem to do this in multiple places for some odd reason...,
only tree-ssa-threadedge.c:record_temporary_equivalence seems to get
this right).

> But as I mentioned earlier, I really wonder if we should allow these context
> sensitive equivalences to be expressed in the gimple if they didn't prove
> useful.  And that was the whole purpose behind uncprop -- to find context
> sensitive propagations that ultimately didn't prove useful and which result in
> poor coalescing or unwanted constant initializations and un propagate them.

Yes (but on GIMPLE we don't care).

> It wouldn't directly help this problem because the use is in a normal
> statement, but it's definitely closely related and it wouldn't be hard to
> repurpose that code.  In fact, it might be a good thing to do in general.
> 
> Essentially what it's doing is building a map of values back to SSA_NAMEs which
> hold those values by way of an equality comparison.  At each use of the value,
> we can look at the list of SSA_NAMEs that hold that value and select the one
> that appears to be least cost based on whatever metrics we see fit.

Comment 11 Richard Biener 2015-02-17 14:31:32 UTC

Note that the diagnostic part is fixed for GCC 5.  We are still somehow not removing dead code.

Comment 12 Richard Biener 2015-02-17 15:02:24 UTC

Index: gcc/tree-ssa-dom.c
===================================================================
--- gcc/tree-ssa-dom.c  (revision 220755)
+++ gcc/tree-ssa-dom.c  (working copy)
@@ -2291,11 +2291,16 @@ cprop_operand (gimple stmt, use_operand_
       if (!may_propagate_copy (op, val))
        return;
 
-      /* Do not propagate copies into simple IV increment statements.
-         See PR23821 for how this can disturb IV analysis.  */
-      if (TREE_CODE (val) != INTEGER_CST
-         && simple_iv_increment_p (stmt))
-       return;
+      /* Do not propagate copies into BIVs.
+         See PR23821 and PR62217 for how this can disturb IV and
+        number of iteration analysis.  */
+      if (TREE_CODE (val) != INTEGER_CST)
+       {
+         gimple def = SSA_NAME_DEF_STMT (op);
+         if (gimple_code (def) == GIMPLE_PHI
+             && gimple_bb (def)->loop_father->header == gimple_bb (def))
+           return;
+       }
 
       /* Dump details.  */
       if (dump_file && (dump_flags & TDF_DETAILS))


fixes the warning on the branch, not sure yet if the missed-optimization on
the trunk.  It extends an existing heuristic to not replace a BIV use
in an increment to not replace any BIV use (???  Best if we'd know if the
equivalence were temporary only...)

Comment 13 Jeffrey A. Law 2015-02-17 15:39:59 UTC

On 02/17/15 02:44, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62217
>
> --- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
> On Mon, 16 Feb 2015, law at redhat dot com wrote:
>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62217
>>
>> --- Comment #9 from Jeffrey A. Law <law at redhat dot com> --- Yes, any
>> particular choice has the potential to regress in one way or another.
>> These are heuristics after all.  We're just looking for a reasonable
>> refinement if we can find one.
>
> Well, there is also canonicalization for CSE / jump threading.  Consider
>
> if (i == 2)
>   {
>     if (i != j)
>       ...
>     else if (j == 2)
>       ...
>
> or
>
> y = 2 * i;
> if (i == j)
>    x = i + j;
>
> but yes, this is followup transforms.  Unfortunately(?) DOM performs
> copy/constant propagation for all recorded equalities.
Yea, I've been mulling what it would take to instead build equivalence 
classes, then when we find a use walk through the class and see which 
one is "best".  I suspect the vast majority of the time the "best" is 
always going to be the same -- the major exception will be these 
temporary equivalences.   I've also been pondering somehow marking the 
equivalent values with some kind of tag indicating their scope and using 
that to guide propagation.

But I haven't prototyped anything...


>
> As you say, we don't know - we only know that properly canonicalizing
> will expose the followup transforms if there are any.  It looks like
> we are basically taking the original order of EQ_EXPR operands
> (thus eventually rely on tree_swap_operands canonicalization) plus
> the "correctness" thing of taking into account loop depth (which is
> kind of a dominance relation).
Correct, we take it from the original order and thus from the earlier 
canonicalization via tree_swap_operands.

>
> We are also introducing SSA_NAME_VALUE "chains" in record_equality
> as we are setting x SSA_NAME_VALUE to y not to SSA_NAME_VALUE (y)
> (we seem to do this in multiple places for some odd reason...,
> only tree-ssa-threadedge.c:record_temporary_equivalence seems to get
> this right).
Even if we set to SSA_NAME_VALUE (y) we can end up with chains.  But 
this change might eliminate the benefit of the chain walk, which would 
be good.

Jeff

Comment 14 Jeffrey A. Law 2015-02-17 20:01:48 UTC

WRT the patch in c#12, it looks reasonable for the same reasons as we avoid propagating in 23821.  I can confirm that it prevents the unwanted cprop into array reference.   By DOM2 we have the following array references:


  g_x[0] = *x2_9(D);
  g_x[0] = *x1_7(D);
  g_x[1] = *x2_9(D);
  g_x[1] = *x1_7(D);
  g_x[2] = *x2_9(D);
  g_x[2] = *x1_7(D);
  g_x[3] = *x1_7(D);
  g_x[3] = *x2_9(D);


Assuming it bootstraps and regression tests, I'd go with it.

Comment 15 Richard Biener 2015-02-18 09:49:29 UTC

Author: rguenth
Date: Wed Feb 18 09:48:57 2015
New Revision: 220785

URL: https://gcc.gnu.org/viewcvs?rev=220785&root=gcc&view=rev
Log:
2015-02-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/62217
	* tree-ssa-dom.c (cprop_operand): Avoid propagating copies
	into BIVs.

	* gcc.dg/tree-ssa/cunroll-11.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/cunroll-11.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-dom.c

Comment 16 Richard Biener 2015-02-18 09:49:41 UTC

Fixed on trunk sofar.

Comment 17 Jakub Jelinek 2015-06-26 19:53:48 UTC

GCC 4.9.3 has been released.

Comment 18 Richard Biener 2016-02-11 13:41:04 UTC

Author: rguenth
Date: Thu Feb 11 13:40:31 2016
New Revision: 233344

URL: https://gcc.gnu.org/viewcvs?rev=233344&root=gcc&view=rev
Log:
2016-02-11  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2015-02-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/62217
	* tree-ssa-dom.c (cprop_operand): Avoid propagating copies
	into BIVs.

	* gcc.dg/tree-ssa/cunroll-11.c: New testcase.

	2015-06-18  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2015-06-03  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/66375
	* tree-scalar-evolution.c (follow_ssa_edge_binary): First
	add to the evolution before following SSA edges.

	* gcc.dg/torture/pr66375.c: New testcase.

	2015-06-23  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2015-06-09  Richard Biener  <rguenther@suse.de>

	PR middle-end/66413
	* tree-inline.c (insert_init_debug_bind): Unshare value.

	* gcc.dg/torture/pr66413.c: New testcase.

	2015-07-08  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/66794
	* gimple-ssa-isolate-paths.c (gimple_ssa_isolate_erroneous_paths):
	Free post-dominators.

	* gcc.dg/torture/pr66794.c: New testcase.

	2015-07-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/66823
	* tree-if-conv.c (memrefs_read_or_written_unconditionally): Fix
	inverted predicate.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/torture/pr66375.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/torture/pr66413.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/torture/pr66794.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/tree-ssa/cunroll-11.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/gimple-ssa-isolate-paths.c
    branches/gcc-4_9-branch/gcc/gimple.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_9-branch/gcc/tree-if-conv.c
    branches/gcc-4_9-branch/gcc/tree-inline.c
    branches/gcc-4_9-branch/gcc/tree-scalar-evolution.c
    branches/gcc-4_9-branch/gcc/tree-ssa-dom.c

Comment 19 Richard Biener 2016-02-11 13:45:04 UTC

Fixed.