This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index
- From: "rsandifo at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 07 Dec 2018 15:56:56 +0000
- Subject: [Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index
- Auto-submitted: auto-generated
- References: <bug-88259-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259
--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> The vectorizer does not like
>
> <bb 3> [local count: 955630224]:
> # best_i_25 = PHI <best_i_11(8), best_i_16(D)(18)>
> # best_26 = PHI <best_13(8), 0(18)>
> # i_27 = PHI <i_20(8), 0(18)>
> _1 = (long unsigned int) i_27;
> _2 = _1 * 4;
> _3 = data_18(D) + _2;
> _4 = *_3;
> best_i_11 = _4 <= best_26 ? best_i_25 : i_27;
> best_13 = MAX_EXPR <_4, best_26>;
> i_20 = i_27 + 1;
> if (n_17(D) > i_20)
>
> because for the best MAX reduction we have an additional use of the
> reduction value in the index reduction. This combination isn't
> magically supported even though in isolation both cases are.
>
> t.c:4:5: note: Analyze phi: best_26 = PHI <best_13(8), 0(18)>
> t.c:4:5: missed: reduction used in loop.
> t.c:4:5: missed: Unknown def-use cycle pattern.
> t.c:4:5: note: Analyze phi: best_i_25 = PHI <best_i_11(8), best_i_16(D)(18)>
> t.c:4:5: note: detected reduction: need to swap operands: best_i_11 = _4 > best_26 ? i_27 : best_i_25;
> t.c:4:5: note: Detected reduction.
>
> if we'd been lucky and had analyzed best_i_25 before best_26 then we could
> probably special-case the case of "reduction used in loop" when that appears
> in other reductions. In general that's of course still not valid I think.
Yeah. Disabling the check for uses in the loop:
  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used outside of the inner loop we cannot handle uses of the reduction
     value.  */
  if ((!nested_in_vect_loop || inner_loop_of_double_reduc)
      && (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1))
gives us something like the vector body we want, modulo some
inefficiency:
.L4:
        ldr     q4, [x2], 16
        mov     v3.16b, v2.16b
        add     v2.4s, v2.4s, v6.4s
        cmge    v5.4s, v0.4s, v4.4s
        cmp     x3, x2
        smax    v0.4s, v0.4s, v4.4s
        bif     v1.16b, v3.16b, v5.16b
        bne     .L4
where v0.4s ends up containing the maximum for each individual
lane and v1.s contains the best_i associated with each member
of v0.4s. We "just" then need to make the epilogue do the
right thing with this information.
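
For reference, the loop body above can be modelled lane by lane in plain C. This is only a sketch under stated assumptions: the function names are hypothetical, best_i is zero-initialized here (it is uninitialized in the GIMPLE), v2 is assumed to hold the current indices and v6 a step of 4 in every lane, and n is assumed to be a multiple of 4. The scalar loop is reconstructed from the GIMPLE in the quoted comment.

```c
#include <assert.h>

#define LANES 4   /* matches the .4s arrangement in the listing */

/* Scalar loop, reconstructed from the quoted GIMPLE.
   Assumption: best_i starts at 0 rather than being uninitialized.  */
void scalar_max_index(const int *data, int n, int *best_out, int *best_i_out)
{
    int best = 0, best_i = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] > best) {
            best = data[i];
            best_i = i;
        }
    }
    *best_out = best;
    *best_i_out = best_i;
}

/* Lane-wise model of the vector body.  best[] plays the role of v0,
   best_i[] of v1 and idx[] of v2 (assumed to start as {0, 1, 2, 3}).  */
void vector_body(const int *data, int n, int best[LANES], int best_i[LANES])
{
    int idx[LANES];

    assert(n % LANES == 0);   /* sketch only: no scalar tail handling */

    for (int l = 0; l < LANES; l++) {
        best[l] = 0;
        best_i[l] = 0;
        idx[l] = l;
    }

    for (int i = 0; i < n; i += LANES) {
        for (int l = 0; l < LANES; l++) {
            int x = data[i + l];          /* ldr  q4            */
            int keep = best[l] >= x;      /* cmge v5, v0, v4    */
            if (x > best[l])              /* smax v0, v0, v4    */
                best[l] = x;
            if (!keep)                    /* bif  v1, v3, v5    */
                best_i[l] = idx[l];
            idx[l] += LANES;              /* add  v2, v2, v6    */
        }
    }
}
```

For data = {3, 9, 2, 7, 1, 4, 8, 5} the model leaves best = {3, 9, 8, 7} and best_i = {0, 1, 6, 3}, while the scalar loop gives best = 9 and best_i = 1.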
Hacking out the condition above (obviously an invalid thing
to do) sets "best" to the maximum of v0.s (good) but also sets
"best_i" to the maximum of v1.s (bad). We need to restrict the
maximum of v1.s to lanes of v0.s that contain "best" (i.e. the
reduction result of v0.s):
        dup     v2.4s, best
        cmeq    v2.4s, v2.4s, v0.4s
        and     v1.16b, v1.16b, v2.16b
and only then take the maximum of v1.4s.
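
The difference between the two epilogues can be sketched in plain C on per-lane arrays. The function names are hypothetical and the arrays stand in for v0 and v1. The sketch assumes non-negative indices, so that a masked-out lane (value 0) can never win the final maximum incorrectly. It also assumes the maximum occurs in only one lane: with ties across lanes, the largest masked index need not be the index the scalar loop would pick.

```c
#include <assert.h>

#define LANES 4

/* What the hacked-out check produces: a plain maximum over the index
   lanes, ignoring which lane actually holds "best".  */
int naive_best_i(const int best_i_lanes[LANES])
{
    int r = best_i_lanes[0];
    for (int l = 1; l < LANES; l++)
        if (best_i_lanes[l] > r)
            r = best_i_lanes[l];
    return r;
}

/* The epilogue described above: reduce the value lanes first, then
   mask the index lanes by "lane value == best" before reducing them.  */
void masked_epilogue(const int best_lanes[LANES],
                     const int best_i_lanes[LANES],
                     int *best_out, int *best_i_out)
{
    /* Reduction of v0: maximum across lanes.  */
    int best = best_lanes[0];
    for (int l = 1; l < LANES; l++)
        if (best_lanes[l] > best)
            best = best_lanes[l];

    /* dup + compare-equal + and: zero the indices of losing lanes.  */
    int masked[LANES];
    int matched = 0;
    for (int l = 0; l < LANES; l++) {
        int mask = (best_lanes[l] == best) ? -1 : 0;
        masked[l] = best_i_lanes[l] & mask;
        matched += (mask != 0);
    }
    assert(matched > 0);   /* the reduction result comes from some lane */

    /* Only now take the maximum of the (masked) index lanes.  */
    int best_i = masked[0];
    for (int l = 1; l < LANES; l++)
        if (masked[l] > best_i)
            best_i = masked[l];

    *best_out = best;
    *best_i_out = best_i;
}
```

With value lanes {3, 9, 8, 7} and index lanes {0, 1, 6, 3}, naive_best_i returns 6, whereas masked_epilogue yields best = 9 and best_i = 1, which is what the scalar loop computes for the corresponding input.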
This requires "best" to come from a reassociative conditional
reduction and would require the "best_i" reduction to be marked
as dependent on the "best" reduction. Might end up being a bit
messy, since we'd have to be careful to retain the uses check
above for all other cases.