The inner loop here: void addlog2 (int *data) { int i = 1; for (int j=0; j<=30; j++) { int max = 1 << j; if (FOO && i>max) break; for (; i <= max; i++) data[i] += j; } } does not vectorize if the if(FOO...) is present: $ /work/alalaw01/build-aarch64-none-elf/install/bin/aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=1 | grep vectorized loop9b.c:1:6: note: not vectorized: inner-loop count not invariant. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: not vectorized: value used after loop. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: not vectorized: value used after loop. loop9b.c:1:6: note: vectorized 0 loops in function. $ aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=0 | grep vectorized loop9b.c:4:3: note: not vectorized: inner-loop count not invariant. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: loop vectorized loop9b.c:1:6: note: vectorized 1 loops in function. Same with -O3. Of course clever analysis could figure out that i>max is never true, but even without that, we should be able to get 'i' back afterwards.
The outer loop also has multiple exits. The testcase is incomplete (FOO is undefined).
Being stupid here, but why does the outer loop having multiple exits matter - it's the inner loop that should be vectorized? FOO was a macro used to selectively make the test i>max disappear (enabling vectorization) - the two commandlines had -DFOO=0 (vectorizes) and -DFOO=1 (doesn't).
So in the not-vectorized case (-DFOO=1), we get for the inner loop: <bb 6>: # i_27 = PHI <i_22(5), i_16(7)> _8 = (long unsigned int) i_27; _9 = _8 * 4; _11 = data_10(D) + _9; _13 = *_11; _14 = _13 + j_23; *_11 = _14; i_16 = i_27 + 1; if (i_16 <= max_24) goto <bb 7>; else goto <bb 8>; <bb 7>: goto <bb 6>; <bb 8>: # i_32 = PHI <i_16(6)> the loop exit phi, i_32=PHI<i_16(6)>, makes i_16=i_27+1 relevant (vec_stmt_relevant_p: used out of loop.), so we go through that on the worklist and then i_27=PHI<i_22(5),i_16(7)>, marking the phi as STMT_VINFO_LIVE_P, and hence "not vectorized: value used after loop". Kind of as expected, FORNOW. In the -DFOO=0 case, a bunch of loop peeling, header-copying, and other transforms, end up with this input to vectorization: <bb 5>: //header of inner loop # i_2 = PHI <i_25(4), i_15(6)> _8 = (long unsigned int) i_2; _9 = _8 * 4; _11 = data_10(D) + _9; _12 = *_11; _13 = _12 + j_26; *_11 = _13; i_15 = i_2 + 1; if (max_7 >= i_15) goto <bb 6>; else goto <bb 7>; <bb 6>: goto <bb 5>; <bb 7>: //bb 5 is only predecessor _19 = (unsigned int) i_25; _18 = (unsigned int) max_7; _17 = (unsigned int) i_25; _5 = _18 - _17; _4 = _5 + _19; _3 = _4 + 1; i_21 = (int) _3; <bb 8>: # i_23 = PHI <i_21(7), i_25(3)> //tests outer loop note bb7 use i_25, not i_2; so neither i_15 nor i_2 escape the loop, and we don't have the problem from above. (Yes bb7 is taking i_25 away from max_7 and then adding it back on again, before adding 1, to give the value of i after the inner loop.) This arrangement of multiple i's live at the same time, is not present in 107t.ch2. 130t.loopinit introduces i_21, computed by an exit phi on leaving the inner loop. 135t.sccp then changes this to the max_7-i_25+i_25 sequence which removes the dependency on i_15 and allows vectorization.
loopinit introduces the exit phi in much the same way for both -DFOO=0 and -DFOO=1, so the difference is in sccp. In the -DFOO=0 case, sccp does this (removing TODO_cleanup_cfg from pass_data_scev_cprop to make the diff easier, still vectorizes): ;; Function addlog2 (addlog2, funcdef_no=0, decl_uid=2749, cgraph_uid=0, symbol_order=0) + +final value replacement: + i_21 = PHI <i_15(4)> + with + i_21 = (int) _3; + ...[snip]... <bb 10>: - # i_21 = PHI <i_15(4)> + _19 = (unsigned int) i_25; + _18 = (unsigned int) max_7; + _17 = (unsigned int) i_25; + _5 = _18 - _17; + _4 = _5 + _19; + _3 = _4 + 1; + i_21 = (int) _3; In the -DFOO=1 case, sccp doesn't do anything; and adding -fno-tree-scev-cprop prevents vectorization of the -DFOO=0 case.
In the -DFOO=0 case, we have peeled an extra copy of the inner loop condition, i <= max_7, above the loop. scalar evolution (final_value_replacement_loop) works, because it sees the inner loop goes round niter = (unsigned int) max_7 - (unsigned int) i_25 iterations, and compute_overall_effect_of_inner_loop gives us (int) (((unsigned int) i_25 + ((unsigned int) max_7 - (unsigned int) i_25)) + 1) which is not expression_expensive_p, so we do it. Hence the add/subtract above. When -DFOO=1, we have not done that peeling, so niter = i_22 <= max_24 ? (unsigned int) max_24 - (unsigned int) i_22 : 0, and compute_overall_effect_of_inner_loop gives us (i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int) i_22) : 0) which is expression_expensive_p, so we don't do the final value replacement.
I suppose adjusting expression_expensive slightly might be workable, not sure. SCCP does aggressively replace all final values that are "not expensive" even if that doesn't yield in any followup transforms (so we eventually compute things twice for no good reason). IMHO SCCP should be re-written as a "tool" to passes can force the transform if it is necessary for them (and profitable, say for IVOPTs). One can do this "rewrite" by splitting analysis, cost compute and transform so that for example the vectorizer could handle things by providing a different cost analysis and forcing the transform if vectorization is otherwise possible.
Looking at where the peeling happens. In both -DFOO=0 and -DFOO=1 cases, 107.ch2 peels the inner loop header, so there is an i<=max test in the outer loop before the inner loop. However, in the -DFOO=1 case, this is dominated by the extra i>max test (that breaks out of the outer loop), so 110.dom2 removes the peeled i<=max. Thus, just before sccp, in the -DFOO=0 case, we have: <bb 3>: # i_25 = PHI <i_23(11), 1(2)> # j_26 = PHI <j_16(11), 0(2)> max_7 = 1 << j_26; if (max_7 >= i_25) goto <bb 4>; else goto <bb 5>; //skip inner loop <bb 4>: //inner loop header # i_2 = PHI <i_25(7), i_15(9)> _8 = (long unsigned int) i_2; _9 = _8 * 4; _11 = data_10(D) + _9; _12 = *_11; _13 = _12 + j_26; *_11 = _13; i_15 = i_2 + 1; if (max_7 >= i_15) goto <bb 4>; //cleaned, actually via latch else goto <bb 10>; note the inner loop exits if !(max_7 >= i_15), and when we hit the inner loop, we know that (max_7 >= i_25). Whereas in the -DFOO=1 case: <bb 2>: goto <bb 4>; <bb 3>: //in outer loop max_7 = 1 << j_17; if (max_7 < i_32) goto <bb 7>; else goto <bb 4>; <bb 4>: //outer loop header # max_24 = PHI <max_7(9), 1(2)> # i_22 = PHI <i_32(9), 1(2)> # j_23 = PHI <j_17(9), 0(2)> <bb 5>: //inner loop header # i_27 = PHI <i_22(4), i_16(10)> _8 = (long unsigned int) i_27; _9 = _8 * 4; _11 = data_10(D) + _9; _13 = *_11; _14 = _13 + j_23; *_11 = _14; i_16 = i_27 + 1; if (i_16 <= max_24) goto <bb 5>; //cleaned, actually via latch else goto <bb 6>; the inner loop exits if !(max_24 >= i_16), but max_24 is defined as PHI<max_7, 1>, and we only have that max_7<i_32 if we came round the outer loop, rather than jumping into the first iteration from bb 2. Hence the complex niter (i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int) i_22) : 0) because i_22 <= max_24 has not obviously been tested. This structure is essentially created by dom2, when it jump-threads (?) the first "if (i>max) break" out of the loop, such that the outer loop now executes "if (i>max) break" after the inner loop (rather than testing "if (i>max) break" before the inner loop, as it still did following 107.ch2). So as an alternative, possibly tweaking the jump-threading/loop-peeling heuristics might help (?).
Indeed, the -DFOO=1 case vectorizes with -fno-tree-dominator-opts.
I think this has been fixed by Alan Hayward's patch vectorizing loop whose induction is live after loop.
Fixed.