Bug 67681 - Missed vectorization: induction variable used after loop
Summary: Missed vectorization: induction variable used after loop
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 6.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2015-09-22 16:41 UTC by Alan Lawrence
Modified: 2016-07-22 15:30 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-03-10 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Lawrence 2015-09-22 16:41:40 UTC
The inner loop here:
void addlog2 (int *data)
{
  int i = 1;
  for (int j=0; j<=30; j++) {
    int max = 1 << j;
    if (FOO && i>max) break;
 
    for (; i <= max; i++)
      data[i] += j;
  }
}

does not vectorize if the if(FOO...) is present:
$ /work/alalaw01/build-aarch64-none-elf/install/bin/aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=1 | grep vectorized
loop9b.c:1:6: note: not vectorized: inner-loop count not invariant.
loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized ===
loop9b.c:8:5: note: not vectorized: value used after loop.
loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized ===
loop9b.c:8:5: note: not vectorized: value used after loop.
loop9b.c:1:6: note: vectorized 0 loops in function.


$ aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=0 | grep vectorized
loop9b.c:4:3: note: not vectorized: inner-loop count not invariant.
loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized ===
loop9b.c:8:5: note: loop vectorized
loop9b.c:1:6: note: vectorized 1 loops in function.

Same with -O3. Of course clever analysis could figure out that i>max is never true, but even without that, we should be able to get 'i' back afterwards.
Comment 1 Richard Biener 2015-09-23 07:56:51 UTC
The outer loop also has multiple exits.  The testcase is incomplete (FOO is undefined).
Comment 2 Alan Lawrence 2015-09-23 10:17:42 UTC
Being stupid here, but why does the outer loop having multiple exits matter - it's the inner loop that should be vectorized?

FOO was a macro used to selectively make the test i>max disappear (enabling vectorization) - the two commandlines had -DFOO=0 (vectorizes) and -DFOO=1 (doesn't).
Comment 3 Alan Lawrence 2016-03-09 16:34:24 UTC
So in the not-vectorized case (-DFOO=1), we get for the inner loop:

<bb 6>:
  # i_27 = PHI <i_22(5), i_16(7)>
  _8 = (long unsigned int) i_27;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _13 = *_11;
  _14 = _13 + j_23;
  *_11 = _14;
  i_16 = i_27 + 1;
  if (i_16 <= max_24)
    goto <bb 7>;
  else
    goto <bb 8>;

  <bb 7>:
  goto <bb 6>;

  <bb 8>:
  # i_32 = PHI <i_16(6)>

the loop exit phi, i_32=PHI<i_16(6)>, makes i_16=i_27+1 relevant (vec_stmt_relevant_p: used out of loop.), so we go through that on the worklist and then i_27=PHI<i_22(5),i_16(7)>, marking the phi as STMT_VINFO_LIVE_P, and hence "not vectorized: value used after loop". Kind of as expected, FORNOW.

In the -DFOO=0 case, a bunch of loop peeling, header-copying, and other transforms, end up with this input to vectorization:

  <bb 5>: //header of inner loop
  # i_2 = PHI <i_25(4), i_15(6)>
  _8 = (long unsigned int) i_2;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _12 = *_11;
  _13 = _12 + j_26;
  *_11 = _13;
  i_15 = i_2 + 1;
  if (max_7 >= i_15)
    goto <bb 6>;
  else
    goto <bb 7>;

  <bb 6>:
  goto <bb 5>;

  <bb 7>: //bb 5 is only predecessor
  _19 = (unsigned int) i_25;
  _18 = (unsigned int) max_7;
  _17 = (unsigned int) i_25;
  _5 = _18 - _17;
  _4 = _5 + _19;
  _3 = _4 + 1;
  i_21 = (int) _3;

  <bb 8>:
  # i_23 = PHI <i_21(7), i_25(3)>
  //tests outer loop

note bb7 use i_25, not i_2; so neither i_15 nor i_2 escape the loop, and we don't have the problem from above. (Yes bb7 is taking i_25 away from max_7 and then adding it back on again, before adding 1, to give the value of i after the inner loop.)

This arrangement of multiple i's live at the same time, is not present in 107t.ch2. 130t.loopinit introduces i_21, computed by an exit phi on leaving the inner loop. 135t.sccp then changes this to the max_7-i_25+i_25 sequence which removes the dependency on i_15 and allows vectorization.
Comment 4 Alan Lawrence 2016-03-09 17:13:02 UTC
loopinit introduces the exit phi in much the same way for both -DFOO=0 and -DFOO=1, so the difference is in sccp.

In the -DFOO=0 case, sccp does this (removing TODO_cleanup_cfg from pass_data_scev_cprop to make the diff easier, still vectorizes):

 ;; Function addlog2 (addlog2, funcdef_no=0, decl_uid=2749, cgraph_uid=0, symbol_order=0)
 
+
+final value replacement:
+  i_21 = PHI <i_15(4)>
+  with
+  i_21 = (int) _3;
+
...[snip]...
   <bb 10>:
-  # i_21 = PHI <i_15(4)>
+  _19 = (unsigned int) i_25;
+  _18 = (unsigned int) max_7;
+  _17 = (unsigned int) i_25;
+  _5 = _18 - _17;
+  _4 = _5 + _19;
+  _3 = _4 + 1;
+  i_21 = (int) _3;

In the -DFOO=1 case, sccp doesn't do anything; and adding -fno-tree-scev-cprop prevents vectorization of the -DFOO=0 case.
Comment 5 Alan Lawrence 2016-03-09 17:47:31 UTC
In the -DFOO=0 case, we have peeled an extra copy of the inner loop condition, i <= max_7, above the loop. scalar evolution (final_value_replacement_loop) works, because it sees the inner loop goes round niter = (unsigned int) max_7 - (unsigned int) i_25 iterations, and compute_overall_effect_of_inner_loop gives us

(int) (((unsigned int) i_25 + ((unsigned int) max_7 - (unsigned int) i_25)) + 1)

which is not expression_expensive_p, so we do it. Hence the add/subtract above.

When -DFOO=1, we have not done that peeling, so niter = i_22 <= max_24 ? (unsigned int) max_24 - (unsigned int) i_22 : 0, and compute_overall_effect_of_inner_loop gives us

(i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int) i_22) : 0)

which is expression_expensive_p, so we don't do the final value replacement.
Comment 6 Richard Biener 2016-03-10 09:47:53 UTC
I suppose adjusting expression_expensive slightly might be workable, not sure.
SCCP does aggressively replace all final values that are "not expensive"
even if that doesn't yield in any followup transforms (so we eventually
compute things twice for no good reason).  IMHO SCCP should be re-written
as a "tool" to passes can force the transform if it is necessary for them
(and profitable, say for IVOPTs).  One can do this "rewrite" by splitting
analysis, cost compute and transform so that for example the vectorizer
could handle things by providing a different cost analysis and forcing
the transform if vectorization is otherwise possible.
Comment 7 Alan Lawrence 2016-03-10 16:12:12 UTC
Looking at where the peeling happens. In both -DFOO=0 and -DFOO=1 cases, 107.ch2 peels the inner loop header, so there is an i<=max test in the outer loop before the inner loop. However, in the -DFOO=1 case, this is dominated by the extra i>max test (that breaks out of the outer loop), so 110.dom2 removes the peeled i<=max.

Thus, just before sccp, in the -DFOO=0 case, we have:

  <bb 3>:
  # i_25 = PHI <i_23(11), 1(2)>
  # j_26 = PHI <j_16(11), 0(2)>
  max_7 = 1 << j_26;
  if (max_7 >= i_25)
    goto <bb 4>;
  else
    goto <bb 5>; //skip inner loop

  <bb 4>: //inner loop header
  # i_2 = PHI <i_25(7), i_15(9)>
  _8 = (long unsigned int) i_2;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _12 = *_11;
  _13 = _12 + j_26;
  *_11 = _13;
  i_15 = i_2 + 1;
  if (max_7 >= i_15)
    goto <bb 4>; //cleaned, actually via latch
  else
    goto <bb 10>;

note the inner loop exits if !(max_7 >= i_15), and when we hit the inner loop, we know that (max_7 >= i_25). Whereas in the -DFOO=1 case:
  <bb 2>:
  goto <bb 4>;

  <bb 3>: //in outer loop
  max_7 = 1 << j_17;
  if (max_7 < i_32)
    goto <bb 7>;
  else
    goto <bb 4>;

  <bb 4>: //outer loop header
  # max_24 = PHI <max_7(9), 1(2)>
  # i_22 = PHI <i_32(9), 1(2)>
  # j_23 = PHI <j_17(9), 0(2)>

  <bb 5>: //inner loop header
  # i_27 = PHI <i_22(4), i_16(10)>
  _8 = (long unsigned int) i_27;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _13 = *_11;
  _14 = _13 + j_23;
  *_11 = _14;
  i_16 = i_27 + 1;
  if (i_16 <= max_24)
    goto <bb 5>; //cleaned, actually via latch
  else
    goto <bb 6>;

the inner loop exits if !(max_24 >= i_16), but max_24 is defined as PHI<max_7, 1>, and we only have that max_7<i_32 if we came round the outer loop, rather than jumping into the first iteration from bb 2. Hence the complex niter

(i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int) i_22) : 0)

because i_22 <= max_24 has not obviously been tested.

This structure is essentially created by dom2, when it jump-threads (?) the first "if (i>max) break" out of the loop, such that the outer loop now executes "if (i>max) break" after the inner loop (rather than testing "if (i>max) break" before the inner loop, as it still did following 107.ch2). So as an alternative, possibly tweaking the jump-threading/loop-peeling heuristics might help (?).
Comment 8 Alan Lawrence 2016-03-10 16:26:34 UTC
Indeed, the -DFOO=1 case vectorizes with -fno-tree-dominator-opts.
Comment 9 bin cheng 2016-07-22 15:16:26 UTC
I think this has been fixed by Alan Hayward's patch vectorizing loop whose induction is live after loop.
Comment 10 bin cheng 2016-07-22 15:30:59 UTC
Fixed.