55555 – [4.8 Regression] miscompilation at -O2 (number_of_iterations)

Bug 55555 - [4.8 Regression] miscompilation at -O2 (number_of_iterations)

Summary: [4.8 Regression] miscompilation at -O2 (number_of_iterations)

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.8.0

Importance:	P1 normal
Target Milestone:	4.8.0
Assignee:	Richard Biener

URL:
Keywords:

Depends on:	57511
Blocks:
	Show dependency tree / graph

Reported:	2012-12-01 15:46 UTC by Joost VandeVondele
Modified:	2013-08-28 14:17 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2012-12-01 00:00:00

Attachments
testcase (875 bytes, text/x-fortran) 2012-12-01 15:46 UTC, Joost VandeVondele	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Joost VandeVondele 2012-12-01 15:46:09 UTC

Created attachment 28845 [details]
testcase

The attached testcase is miscompiled at -O2, but runs fine at -O1.

To reproduce

gfortran -O2 bug.f90 ; ./a.out
#3  0x400C7E in MAIN__ at bug.f90:97
Aborted

I believe this is a recent regression on trunk. Tested with

gcc version 4.8.0 20121201 (experimental) [trunk revision 194017] (GCC)

Comment 1 Joost VandeVondele 2012-12-01 15:53:17 UTC

Using -O2 -fno-tree-pre fixes the testcase.
Using -O1 -ftree-pre leads to an infinite loop at runtime.

Comment 2 Dominique d'Humieres 2012-12-01 17:03:37 UTC

Revision 192891 (2012-10-28) is OK.
Revision 193261 (2012-11-06) is not.

Comment 3 H.J. Lu 2012-12-01 19:17:36 UTC

It is caused by revision 193098:

http://gcc.gnu.org/ml/gcc-cvs/2012-11/msg00045.html

Comment 4 Jan Hubicka 2012-12-02 09:59:35 UTC

Hmm, this seems to be caused by
Forced statement unreachable: pretmp_516 = coef_x[pretmp_515];
Forced statement unreachable: pretmp_513 = coef_x[pretmp_512];
Forced statement unreachable: pretmp_479 = coef_x[pretmp_478];

I am not exactly fortran guru. Can someone double check that there are no out of bounds accesses into these?

Honza

Comment 5 Joost VandeVondele 2012-12-02 10:11:34 UTC

(In reply to comment #4)
> Hmm, this seems to be caused by
> Forced statement unreachable: pretmp_516 = coef_x[pretmp_515];
> Forced statement unreachable: pretmp_513 = coef_x[pretmp_512];
> Forced statement unreachable: pretmp_479 = coef_x[pretmp_478];
> 
> I am not exactly fortran guru. Can someone double check that there are no out
> of bounds accesses into these?
> 
> Honza

I'm pretty sure there are no out-of-bounds. In particular coef_x is easy to check, it is only used as coef_x(:,lxp) where lxp is the loop bound 0..lp consistent with its def. Of course maybe the FE does something inconsistent ?

Also this runs fine:

fortran -O0 -fsanitize=address PR55555.f90 ; ./a.out

Comment 6 Jan Hubicka 2012-12-02 11:03:53 UTC

> I'm pretty sure there are no out-of-bounds. In particular coef_x is easy to
> check, it is only used as coef_x(:,lxp) where lxp is the loop bound 0..lp
> consistent with its def. Of course maybe the FE does something inconsistent ?
> 
> Also this runs fine:
> 
> fortran -O0 -fsanitize=address PR55555.f90 ; ./a.out

Hmm,  I saw similar weird cases generated by the frontend. coef_x is array of 8 elements

  real(kind=8) coef_x[8];

in loop analyzis we do:

Induction variable (integer(kind=8)) 4 + 4 * iteration does not wrap in statement pretmp_516 = coef_x[pretmp_515];
 in loop 4.
Statement pretmp_516 = coef_x[pretmp_515];
 is executed at most 0 (bounded by 0) + 1 times in loop 4.


This is true, pretmp_512 would be 8 at the second iteration of the loop.

later we conclude
Loop 4 iterates 1 times.
Loop 4 iterates at most 1 times.

 BB: 9, after_exit: 0
  size:   0 # DEBUG lxp => lxp_4
  size:   0 _137 = (integer(kind=8)) lxp_4;
  size:   1 _140 = _137 + pretmp_508;
  size:   1 _142 = *pol_x_141(D)[_140];
  size:   1 _143 = _137 * 4;
  size:   1 _144 = _143 + -1;
 BB: 10, after_exit: 0
  size:   1 _146 = S.25_279 + _144;
  size:   1 _150 = _142 * prephitmp_520;
  size:   1 _151 = _150 + prephitmp_517;
  size:   1 coef_x[_146] = _151;
  size:   1 S.25_153 = S.25_279 + 1;
  size:   1 ivtmp_162 = ivtmp_91 - 1;
  size:   2 if (ivtmp_162 == 0)
		goto bb12 or bb11
 BB: 11, after_exit: 0
  size:   1 pretmp_515 = _144 + S.25_153;
  size:   1 pretmp_516 = coef_x[pretmp_515];
  size:   1 pretmp_518 = S.25_153 + -1;
  size:   1 pretmp_519 = s[pretmp_518];
 BB: 12, after_exit: 0
  size:   0 # DEBUG lxp => lxp_4 + 1
  size:   1 ivtmp_109 = ivtmp_163 - 1;
  size:   2 if (ivtmp_109 == 0)
		goto bb 13 or exit
 BB: 13, after_exit: 1
  size:   1 lxp_154 = lxp_4 + 1;
  size:   0 pretmp_506 = (integer(kind=8)) lxp_154;
  size:   1 pretmp_509 = pretmp_506 * 4;
  size:   1 pretmp_510 = pretmp_509 + -1;
  size:   1 pretmp_512 = pretmp_510 + 1;
  size:   1 pretmp_513 = coef_x[pretmp_512];

Unrolled loop 4 completely (duplicated 1 times).
Exit condition of peeled iterations was eliminated.
Last iteration exit edge was proved true.

So the curious statements are in bb11.  Adding unreachable calls makes CSE to eventually
turn condition in the second copy of BB10 to always just to BB 12 that seem all right
to me.
Perhaps cascaded unrolling confuse some of the exits...

Honza

Comment 7 Jakub Jelinek 2012-12-07 09:44:40 UTC

I've tried to rewrite this as C, but managed to turn it into something that is miscompiled at a different spot.  The fortran testcase starts having __builtin_unreachable () in it in the *.cunroll pass, this one already in *.cunrolli pass.  Still, I believe it doesn't do any out of bounds access anywhere.  -O2 on x86_64-linux.

double s[4] = { 1.0, 2.0, 3.0, 4.0 }, pol_x[2] = { 5.0, 6.0 };

__attribute__((noinline)) int
foo (void)
{
  double coef_x[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
  int lxp = 0;
  if (lxp <= 1)
    do
      {
	double t = pol_x[lxp];
	long S;
	long l = lxp * 4L - 1;
	for (S = 1; S <= 4; S++)
	  coef_x[S + l] = coef_x[S + l] + s[S - 1] * t;
      }
    while (lxp++ != 1);
  asm volatile ("" : : "r" (coef_x) : "memory");
  for (lxp = 0; lxp < 8; lxp++)
    if (coef_x[lxp] != ((lxp & 3) + 1) * (5.0 + (lxp >= 4)))
      __builtin_abort ();
  return 1;
}

int
main ()
{
  asm volatile ("" : : : "memory");
  if (!foo ())
    __builtin_abort ();
  return 0;
}

Works with r193067, fails with r193100, haven't tried to bisect exactly, but would guess this is r193098 again.  For the outer loop it prints:

Analyzing # of iterations of loop 1
  exit condition [0, + , 1](no_overflow) != 1
  bounds on difference of bases: 1 ... 1
  result:
    # of iterations 1, bounded by 1
Loop 1 iterates 1 times.
Loop 1 iterates at most 1 times.

but that is wrong, the outer loop iterates exactly 2 times.

  <bb 3>:
  # lxp_1 = PHI <0(2), lxp_21(15)>
...
  <bb 6>:
  lxp_21 = lxp_1 + 1;
  if (lxp_1 != 1)
    goto <bb 15>;
  else
    goto <bb 7>;
  
  <bb 15>:
  goto <bb 3>;

If it used lxp_21 in the condition, that would be correct, but it uses the previous value of the IV.

Comment 8 Richard Biener 2012-12-14 13:49:02 UTC

(In reply to comment #7)
> I've tried to rewrite this as C, but managed to turn it into something that is
> miscompiled at a different spot.  The fortran testcase starts having
> __builtin_unreachable () in it in the *.cunroll pass, this one already in
> *.cunrolli pass.  Still, I believe it doesn't do any out of bounds access
> anywhere.  -O2 on x86_64-linux.
> 
> double s[4] = { 1.0, 2.0, 3.0, 4.0 }, pol_x[2] = { 5.0, 6.0 };
> 
> __attribute__((noinline)) int
> foo (void)
> {
>   double coef_x[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
>   int lxp = 0;
>   if (lxp <= 1)
>     do
>       {
>     double t = pol_x[lxp];
>     long S;
>     long l = lxp * 4L - 1;
>     for (S = 1; S <= 4; S++)
>       coef_x[S + l] = coef_x[S + l] + s[S - 1] * t;
>       }
>     while (lxp++ != 1);
>   asm volatile ("" : : "r" (coef_x) : "memory");
>   for (lxp = 0; lxp < 8; lxp++)
>     if (coef_x[lxp] != ((lxp & 3) + 1) * (5.0 + (lxp >= 4)))
>       __builtin_abort ();
>   return 1;
> }
> 
> int
> main ()
> {
>   asm volatile ("" : : : "memory");
>   if (!foo ())
>     __builtin_abort ();
>   return 0;
> }
> 
> Works with r193067, fails with r193100, haven't tried to bisect exactly, but
> would guess this is r193098 again.  For the outer loop it prints:
> 
> Analyzing # of iterations of loop 1
>   exit condition [0, + , 1](no_overflow) != 1
>   bounds on difference of bases: 1 ... 1
>   result:
>     # of iterations 1, bounded by 1
> Loop 1 iterates 1 times.
> Loop 1 iterates at most 1 times.
> 
> but that is wrong, the outer loop iterates exactly 2 times.

That's latch block executions, so one latch block execution is correct.

Comment 9 Richard Biener 2012-12-14 14:11:59 UTC

The unrolling puts __builtin_unreachable ()s into the inner duplicated loops:

First iteration, good:

  <bb 16>:
  # lxp_30 = PHI <0(2)>
  t_32 = pol_x[lxp_30];
  _33 = (long int) lxp_30;
  _34 = _33 * 4;
  l_35 = _34 + -1;

  <bb 17>:
  # S_36 = PHI <1(16), S_45(18)>
  if (S_36 <= 4)
    goto <bb 18>;
  else
    goto <bb 19>;

  <bb 18>:
  _38 = S_36 + l_35;
  _39 = coef_x[_38];
  _40 = S_36 + -1;
  _41 = s[_40];
  _42 = _41 * t_32;
  _43 = _39 + _42;
  coef_x[_38] = _43;
  S_45 = S_36 + 1;
  goto <bb 17>;

  <bb 19>:
  lxp_46 = lxp_30 + 1;

  <bb 20>:

Peeled iteration, bad:

  <bb 3>:
  # lxp_1 = PHI <lxp_46(20)>
  t_9 = pol_x[lxp_1];
  _10 = (long int) lxp_1;
  _11 = _10 * 4;
  l_12 = _11 + -1;
  goto <bb 5>;

  <bb 4>:
  _13 = S_3 + l_12;
  __builtin_unreachable ();
  _14 = coef_x[_13];
  _15 = S_3 + -1;
  _16 = s[_15];
  _17 = _16 * t_9;
  _18 = _14 + _17;
  __builtin_unreachable ();
  coef_x[_13] = _18;
  S_20 = S_3 + 1;

  <bb 5>:
  # S_3 = PHI <1(3), S_20(4)>
  if (S_3 <= 4)
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 6>:
  lxp_21 = lxp_1 + 1;
  if (1 == 0)
    goto <bb 15>;
  else
    goto <bb 7>;

  <bb 15>:

  <bb 22>:
  __builtin_unreachable ();

  <bb 7>:
  __asm__ __volatile__("" :  : "r" &coef_x : "memory");
  goto <bb 13>;

Note that the outer loop body looks ok - it is the inner body that get's
mangled in a bogus way.

That's because the loop bounds derived from the inner loop accesses
are bogus:

remove_exits_and_undefined_stmts (loop=0x7ffff67ec770, npeeled=1)
    at /space/rguenther/src/svn/trunk/gcc/tree-ssa-loop-ivcanon.c:481
481       bool changed = false;
(gdb) n
483       for (elt = loop->bounds; elt; elt = elt->next)
(gdb) 
488           if (!elt->is_exit
(gdb) p elt 
$8 = (nb_iter_bound *) 0x7ffff6925e38
(gdb) p *elt
$9 = {stmt = 0x7ffff6905c80, bound = {low = 0, high = 0}, is_exit = false, 
  next = 0x7ffff6925dc0}
(gdb) p elt->stmt
$10 = (gimple) 0x7ffff6905c80
(gdb) call debug_gimple_stmt ($10)
# .MEM_19 = VDEF <.MEM_6>
coef_x[_13] = _18;

as far as I can see even with lxp == 1 we have at most an index of 4 + 1*4 - 1.

Statement _14 = coef_x[_13];
 is executed at most 0 (bounded by 0) + 1 times in loop 1.

_that's_ bogus.

Comment 10 Richard Biener 2012-12-17 10:14:37 UTC

I'll have a more detailled look.

Comment 11 Richard Biener 2012-12-17 12:14:18 UTC

Ok, I'm confused by the following:

  <bb 3>: (loop 1 header)
  # lxp_1 = PHI <0(2), lxp_24(12)>
  t_9 = pol_x[lxp_1];
  _10 = (long int) lxp_1;
  _11 = _10 * 4;
  l_12 = _11 + -1;
  goto <bb 5>;


  <bb 4>:
  _15 = S_2 + l_12;
  _16 = coef_x[_15];
  _17 = S_2 + -1;
  _18 = s[_17];
  _19 = _18 * t_9;
  _20 = _16 + _19;
  coef_x[_15] = _20;
  S_22 = S_2 + 1;

  <bb 5>: (loop 2 header)
  # S_2 = PHI <1(3), S_22(4)>
  if (S_2 <= 4)
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 6>:
  lxp_24 = lxp_1 + 1;
  if (lxp_1 != 1)
    goto <bb 12>;
  else
    goto <bb 7>;

  <bb 12>:
  goto <bb 3>;

and SCEV says _15 is {l_12 + 1, +, 1}_2 which looks correct.  But then it
does

(chrec_apply
  (varying_loop = 2)
  (chrec = {l_12 + 1, +, 1}_2)
  (x = 4)
  (res = l_12 + 5))

and magically, via l_12 being {-1, +, 4}_1 (also correct) arrives at

(instantiate_scev
  (instantiate_below = 2)
  (evolution_loop = 1)
  (chrec = {4, +, 4}_1)
  (res = {4, +, 4}_1))

huh?  So to it _15 is {4, +, 4}_1 (not sure what is considered "initial"
in terms of scalar evolution with a value varying in an inner loop).

This is what infer_loop_bounds_from_undefined derives the bogus bound
for loop 1 from.  To my eyes _15 should be {0, + 4}_1!

Maybe it doesn't really make sense to ask for the evolution of something
defined in loop N with respect to an outer loop M?

If we change idx_infer_loop_bounds with

Index: tree-ssa-loop-niter.c
===================================================================
--- tree-ssa-loop-niter.c       (revision 194552)
+++ tree-ssa-loop-niter.c       (working copy)
@@ -2671,7 +2671,12 @@ idx_infer_loop_bounds (tree base, tree *
       upper = false;
     }
 
-  ev = instantiate_parameters (loop, analyze_scalar_evolution (loop, *idx));
+  struct loop *dloop = loop_containing_stmt (data->stmt);
+  if (!dloop)
+    return true;
+
+  ev = analyze_scalar_evolution (dloop, *idx);
+  ev = instantiate_parameters (loop, ev);
   init = initial_condition (ev);
   step = evolution_part_in_loop_num (ev, loop->num);
 
then we obtain via {l_12 + 1, +, 1}_2, {{0, +, 4}_1, +, 1}_2 the correct
solution (init == 0, step == 4).

I am going to bootstrap and regtest that.

Comment 12 Richard Biener 2012-12-18 13:12:39 UTC

Author: rguenth
Date: Tue Dec 18 13:12:34 2012
New Revision: 194578

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=194578
Log:
2012-12-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/55555
	* tree-ssa-loop-niter.c (idx_infer_loop_bounds): Properly
	analyze evolution of the index for the loop it is used in.
	* tree-scalar-evolution.c (instantiate_scev_name): Take
	inner loop we will be creating a chrec for.  Generalize
	fix for PR40281 and prune invalid SCEVs.
	(instantiate_scev_poly): Likewise - pass down inner loop
	we will be creating a chrec for.
	(instantiate_scev_binary): Take and pass through inner loop.
	(instantiate_array_ref): Likewise.
	(instantiate_scev_convert): Likewise.
	(instantiate_scev_not): Likewise.
	(instantiate_scev_3): Likewise.
	(instantiate_scev_2): Likewise.
	(instantiate_scev_1): Likewise.
	(instantiate_scev_r): Likewise.
	(resolve_mixers): Adjust.
	(instantiate_scev): Likewise.

	* gcc.dg/torture/pr55555.c: New testcase.
	* gcc.dg/vect/vect-iv-11.c: Adjust.

Added:
    trunk/gcc/testsuite/gcc.dg/torture/pr55555.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-iv-11.c
    trunk/gcc/tree-scalar-evolution.c
    trunk/gcc/tree-ssa-loop-niter.c

Comment 13 Richard Biener 2012-12-18 13:29:52 UTC

Fixed.