| Summary: | multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Peter Cordes <pcordes> |
| Component: | tree-optimization | Assignee: | Richard Biener <rguenth> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | marxin, rguenth |
| Priority: | P3 | Keywords: | missed-optimization, ssemmx |
| Version: | 11.0 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | x86_64-*-*, i?86-*-* |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2021-01-04 00:00:00 |
| Bug Depends on: | | | |
| Bug Blocks: | 53947 | | |
I think this is a missed optimization in SLP reduction vectorization. We're
detecting the optimal vectorization opportunity but:
x.c:6:36: note: ==> examining statement: d2_30 = PHI <d2_25(7), 0.0(6)>
x.c:6:36: note: vect_is_simple_use: operand _11 * _14, type of def: internal
x.c:6:36: note: vect_is_simple_use: operand d2_30 = PHI <d2_25(7), 0.0(6)>, type of def: reduction
x.c:6:36: missed: reduc op not supported by target.
x.c:6:36: missed: in-order unchained SLP reductions not supported.
x.c:1:8: missed: not vectorized: relevant stmt not supported: d2_30 = PHI <d2_25(7), 0.0(6)>
x.c:6:36: note: removing SLP instance operations starting from: d1_24 = _7 + d1_28;
x.c:6:36: missed: unsupported SLP instances
x.c:6:36: note: re-trying with SLP disabled
and end up vectorizing two reductions on interleaved data (ugh).
What we fail to notice is that there's no reduction needed in the
epilogue (VF == 1) and that the reduction is already vectorized in-order;
a sketch of the resulting code shape follows below.
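To make that concrete, here is a rough SSE2-intrinsics sketch (my own
illustration, not GCC output) of the code shape that is legal here: with
VF == 1 the two scalar accumulators d1 and d2 from the dotprod testcase in
this PR each map to one lane of a single vector accumulator, every addition
happens in the same order as in the scalar source, and the only cross-lane
combine is the final d1 + d2 that the source already performs, so no
reassociation is introduced.

#include <emmintrin.h>

double dotprod_sketch (const double *a, const double *b, unsigned long long n)
{
  __m128d acc = _mm_setzero_pd ();                 /* { d1, d2 } */
  for (unsigned long long i = 0; i < n; i += 2)
    {
      __m128d va = _mm_loadu_pd (a + i);           /* { a[i], a[i+1] } */
      __m128d vb = _mm_loadu_pd (b + i);           /* { b[i], b[i+1] } */
      acc = _mm_add_pd (acc, _mm_mul_pd (va, vb)); /* lane-wise d1/d2 updates */
    }
  double d1 = _mm_cvtsd_f64 (acc);                         /* lane 0 */
  double d2 = _mm_cvtsd_f64 (_mm_unpackhi_pd (acc, acc));  /* lane 1 */
  return d1 + d2;      /* same final combine the scalar source does */
}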
if (reduction_type == FOLD_LEFT_REDUCTION
    && slp_node
    && !REDUC_GROUP_FIRST_ELEMENT (stmt_info))
  {
    /* We cannot use in-order reductions in this case because there is
       an implicit reassociation of the operations involved.  */
    if (dump_enabled_p ())
      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                       "in-order unchained SLP reductions not supported.\n");
    return false;
  }
If we allow VF == 1 here we end up with odd (wrong?) code, so I guess we
shouldn't classify a VF == 1 SLP reduction as FOLD_LEFT_REDUCTION in the
first place; but IIRC that classification is done before the VF is fixed.
Mine.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:8837f82e4bab1b5405cf034eab9b3e83afc563ad

commit r11-6434-g8837f82e4bab1b5405cf034eab9b3e83afc563ad
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Jan 4 09:53:11 2021 +0100

    tree-optimization/98291 - allow SLP more vectorization of reductions

    When the VF is one a SLP reduction is in-order and thus we can
    vectorize even when the reduction op is not associative.

    2021-01-04  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/98291
        * tree-vect-loop.c (vectorizable_reduction): Bypass
        associativity check for SLP reductions with VF 1.

        * gcc.dg/vect/slp-reduc-11.c: New testcase.
        * gcc.dg/vect/vect-reduc-in-order-4.c: Adjust.

Fixed on trunk.
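For anyone who wants a standalone sanity check outside the testsuite, here
is my own sketch (not a copy of the new gcc.dg/vect/slp-reduc-11.c, which I
haven't reproduced here), built around the dotprod kernel from this report;
the function names and input values are made up. It checks the property the
fix relies on: at VF == 1 the vectorized reduction must be bit-identical to
the scalar evaluation order even without -ffast-math. Build with something
like -O3 on x86-64.

#include <stdlib.h>

#define N 104   /* even, so the i += 2 loop stays in bounds */

static double a[N], b[N];

/* Same kernel as in the report; noipa keeps it a real out-of-line call.  */
static double __attribute__ ((noipa))
dotprod (const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0, d2 = 0.0;
  for (unsigned long long i = 0; i < n; i += 2)
    {
      d1 += a[i] * b[i];
      d2 += a[i + 1] * b[i + 1];
    }
  return d1 + d2;
}

/* Reference copy with vectorization disabled so it keeps the scalar
   evaluation order.  */
static double __attribute__ ((noipa, optimize ("no-tree-vectorize")))
dotprod_ref (const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0, d2 = 0.0;
  for (unsigned long long i = 0; i < n; i += 2)
    {
      d1 += a[i] * b[i];
      d2 += a[i + 1] * b[i + 1];
    }
  return d1 + d2;
}

int
main (void)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = 1.0 / (i + 1);            /* values where reassociation would show */
      b[i] = (i & 1) ? -3.0 : 7.0;
    }
  if (dotprod (a, b, N) != dotprod_ref (a, b, N))
    abort ();
  return 0;
}

The optimize attribute on the reference copy is only there to keep a
scalar-order baseline in the same translation unit; if the fix behaves as
the commit message describes, the vectorized copy must still match it
bit-exactly.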
An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess,
instead of effectively mapping each scalar to an element of one vector
accumulator. (Unless we use -ffast-math, in which case that happens. clang
gets it right even without -ffast-math.)

double dotprod(const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0;
  double d2 = 0.0;

  for (unsigned long long i = 0; i < n; i += 2) {
    d1 += a[i] * b[i];
    d2 += a[i + 1] * b[i + 1];
  }

  return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math, the nice sane loop we expect:

.L3:
        movupd  (%rsi,%rax), %xmm0
        movupd  (%rdi,%rax), %xmm3
        addq    $1, %rdx
        addq    $16, %rax
        mulpd   %xmm3, %xmm0
        addpd   %xmm0, %xmm1
        cmpq    %rcx, %rdx
        jb      .L3

without:

... main loop:

.L4:
        movupd  (%rcx,%rax), %xmm1       # 16-byte load
        movupd  (%rsi,%rax), %xmm3
        movhpd  16(%rcx,%rax), %xmm1     # overwrite the high half of it!!
        movhpd  16(%rsi,%rax), %xmm3
        mulpd   %xmm3, %xmm1
        movupd  16(%rsi,%rax), %xmm3
        movlpd  8(%rsi,%rax), %xmm3
        addsd   %xmm1, %xmm2
        unpckhpd %xmm1, %xmm1
        addsd   %xmm1, %xmm2
        movupd  16(%rcx,%rax), %xmm1
        movlpd  8(%rcx,%rax), %xmm1
        addq    $32, %rax
        mulpd   %xmm3, %xmm1
        addsd   %xmm1, %xmm0
        unpckhpd %xmm1, %xmm1
        addsd   %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L4

The overall strategy is insane, but even some of the details are insane on
their own. For example, a 16-byte load into XMM1 whose high half is then
overwritten with a different double before it is ever read. That's bad
enough by itself; you'd expect movsd / movhpd to manually gather the 2
doubles, without introducing the possibility of a cache-line-split load for
zero benefit. Similarly, movupd / movlpd should have just loaded in the
other order. (Or since they're contiguous, movupd 8(%rsi,%rax), %xmm3 /
shufpd.)

So beyond the bad overall strategy (which is likely worse than unrolled
scalar), it might be worth checking for some of this smaller-scale insanity
somewhere later in the pipeline, to make the code less bad if other inputs
can trigger similar behaviour. (Detecting the movupd / movhpd pattern and
using movsd / movhpd instead could be a separate bug, but if it's just a
symptom of something that should never happen in the first place then it's
not really its own bug at all.)
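As a side note on the movsd / movhpd point above, here is a minimal
intrinsics rendering (my own illustration, not something GCC emits; the
helper name is made up) of the cheaper two-double gather: a scalar load of
the low element plus a high-half insert, instead of a full 16-byte load
whose high half is immediately overwritten.

#include <emmintrin.h>

/* Gather two doubles that sit two elements (16 bytes) apart, as in the
   interleaved strategy above: p[0] into the low lane, p[2] into the high
   lane.  This is the movsd + movhpd pattern; there is no 16-byte access,
   so no chance of a cache-line-split load just to throw half of it away.  */
static inline __m128d
gather2_stride2 (const double *p)
{
  __m128d lo = _mm_load_sd (p);        /* movsd   (%reg), %xmm  */
  return _mm_loadh_pd (lo, p + 2);     /* movhpd 16(%reg), %xmm */
}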