Analyzing a performance anomaly for spec2000 I found that 168.wupwise is slower with vectorization than without it on x86. The main problem is that gcc does not recognize some special idioms of complex addition and multiplication during loop vectorization. For example, for a simple zaxpy loop icc generates 1.6X faster code than gcc. Here is the assembly for the zaxpy loop produced by icc:

..B1.4:                         # Preds ..B1.2 ..B1.4
        movups    (%rsi,%rdx), %xmm2                            #7.28
        movups    16(%rsi,%rdx), %xmm5                          #7.28
        movups    (%rsi,%rcx), %xmm4                            #7.17
        movups    16(%rsi,%rcx), %xmm7                          #7.17
        movddup   (%rsi,%rdx), %xmm3                            #7.27
        incq      %r8                                           #6.10
        movddup   16(%rsi,%rdx), %xmm6                          #7.27
        unpckhpd  %xmm2, %xmm2                                  #7.27
        unpckhpd  %xmm5, %xmm5                                  #7.27
        mulpd     %xmm1, %xmm3                                  #7.27
        mulpd     %xmm0, %xmm2                                  #7.27
        mulpd     %xmm1, %xmm6                                  #7.27
        mulpd     %xmm0, %xmm5                                  #7.27
        addsubpd  %xmm2, %xmm3                                  #7.27
        addsubpd  %xmm5, %xmm6                                  #7.27
        addpd     %xmm3, %xmm4                                  #7.9
        addpd     %xmm6, %xmm7                                  #7.9
        movups    %xmm4, (%rsi,%rcx)                            #7.9
        movups    %xmm7, 16(%rsi,%rcx)                          #7.9
        addq      $32, %rsi                                     #6.10
        cmpq      %rdi, %r8                                     #6.10
        jb        ..B1.4        # Prob 64%                      #6.10

(I got it with the -xSSE4.2 -O3 options.) For the gcc compiler the following options were used: -m64 -mfpmath=sse -march=corei7 -O3 -ffast-math.
Can you post the zaxpy source here, please? Also please see the list of bugs referenced from PR53947; there is likely a duplicate of this issue.
Created attachment 28455: test reproducer
I looked through the list of all issues related to vectorization but could not find a duplicate.
Thanks.
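For readers without the attachment: zaxpy is the complex AXPY operation y(i) = y(i) + a*x(i) over double-precision complex data. A roughly equivalent C version of the kernel (my own sketch; the actual attachment appears to be Fortran, judging from the t.f dumps below, and may differ):

    #include <complex.h>

    /* y[i] = y[i] + a*x[i] for double-precision complex data;
       this is the loop the vectorizer has to recognize.  */
    void zaxpy (int n, double complex a,
                const double complex *x, double complex *y)
    {
      for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
    }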
Confirmed. GCC vectorizes this using hybrid SLP: it unrolls the loop once to be able to vectorize the two subtractions and two additions resulting from the complex multiplication. The PR is kind of a duplicate of PR37021, where a reduction and a variable stride are also involved, so fixing this bug is required to vectorize PR37021 more efficiently. Note that even this bug has multiple issues that need to be tackled. I happen to be working on them.
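To make the SLP shape concrete: with the complex operands expanded into real and imaginary parts, each iteration of the zaxpy body lowers roughly to the following (a hand-written sketch; the variable names are mine, not from the dump):

    /* Scalar expansion of one y[i] += a*x[i] step on split
       real/imaginary components.  The subtraction feeds the real part
       and the addition the imaginary part, so unrolling the loop once
       gives SLP two subtractions and two additions to pair up into
       vector operations.  */
    static inline void
    zaxpy_step (double ar, double ai, double xr, double xi,
                double *yr, double *yi)
    {
      *yr += ar * xr - ai * xi;
      *yi += ar * xi + ai * xr;
    }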
We would now vectorize this, but the cost model computes the SLP vectorization as never profitable. Generated code with -fno-vect-cost-model (-Ofast -march=corei7):

.L4:
        movupd  (%rdx,%rax), %xmm0
        movapd  %xmm3, %xmm1
        mulpd   %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        addsubpd %xmm0, %xmm1
        movupd  (%rcx,%rax), %xmm0
        addpd   %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L4

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 15
  Vector prologue cost: 10
  Vector epilogue cost: 0
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 0
t.f:6:0: note: cost model: the vector iteration cost = 15 divided by the scalar iteration cost = 14 is greater or equal to the vectorization factor = 1.
t.f:6:0: note: not vectorized: vectorization not profitable.
t.f:6:0: note: not vectorized: vector version will never be profitable.

This is basically equal to basic-block vectorizing the loop body. Note that we pessimistically handle addsubpd as if it were not present and the code would really end up as

  vect__31.16_42 = vect__27.10_49 - vect__28.15_43;
  vect__31.17_41 = vect__27.10_49 + vect__28.15_43;
  _40 = VEC_PERM_EXPR <vect__31.16_42, vect__31.17_41, { 0, 3 }>;

which is what the vectorizer handles this with (two vector_stmt costs plus one vec_perm cost). The x86 vectorizer cost model would need to be adjusted for this.

With the cost model enabled we fall back to vectorization using interleaving:

.L5:
        movupd  (%rax), %xmm5
        addl    $1, %r9d
        addq    $32, %rax
        addq    $32, %r8
        movupd  -32(%r8), %xmm1
        movupd  -16(%r8), %xmm0
        movapd  %xmm5, %xmm2
        movapd  %xmm1, %xmm6
        movupd  -16(%rax), %xmm9
        unpckhpd %xmm0, %xmm1
        unpcklpd %xmm0, %xmm6
        movapd  %xmm6, %xmm0
        mulpd   %xmm8, %xmm0
        unpcklpd %xmm9, %xmm2
        unpckhpd %xmm9, %xmm5
        mulpd   %xmm7, %xmm6
        addpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulpd   %xmm7, %xmm0
        mulpd   %xmm8, %xmm1
        subpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        addpd   %xmm6, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm5, %xmm0
        unpcklpd %xmm0, %xmm1
        unpckhpd %xmm0, %xmm2
        movups  %xmm1, -32(%rax)
        movups  %xmm2, -16(%rax)
        cmpl    %esi, %r9d
        jb      .L5

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 26
  Vector prologue cost: 10
  Vector epilogue cost: 14
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 24
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 6
t.f:6:0: note: Runtime profitability threshold = 5
t.f:6:0: note: Static estimate profitability threshold = 16

Thus this is now a cost model issue.
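For context, SSE3 addsubpd on a V2DF computes { a[0] - b[0], a[1] + b[1] }, i.e. exactly the subtract/add/VEC_PERM-{0,3} sequence above fused into one instruction. A scalar C sketch of the operation (names are mine):

    /* Semantics of addsubpd on a 2-element double vector: the even
       lane subtracts, the odd lane adds.  A target with addsub should
       therefore cost the sub + add + VEC_PERM <..., { 0, 3 }> triple
       as a single statement.  */
    static inline void
    addsub_v2df (const double a[2], const double b[2], double r[2])
    {
      r[0] = a[0] - b[0];
      r[1] = a[1] + b[1];
    }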
With AVX2 we indeed generate

.L4:
        vmovupd (%rdx,%rax), %ymm3
        addl    $1, %r9d
        vpermpd $177, %ymm3, %ymm4
        vmovapd %ymm3, %ymm2
        vmulpd  %ymm6, %ymm4, %ymm4
        vfmsub132pd %ymm5, %ymm4, %ymm2
        vfmadd132pd %ymm5, %ymm4, %ymm3
        vshufpd $10, %ymm3, %ymm2, %ymm2
        vaddpd  (%rcx,%rax), %ymm2, %ymm2
        vmovupd %ymm2, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %r9d
        jb      .L4

thus either there is no addsub for %ymm or there is insufficient pattern support for it. Note that with AVX2 the above is what is generated even with the cost model, as it's now considered a profitable vectorization.
On trunk we get with -Ofast -msse4.2

.L3:
        movupd  (%rdx,%rax), %xmm0
        movupd  (%rdx,%rax), %xmm1
        movupd  (%rcx,%rax), %xmm4
        mulpd   %xmm3, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        addsubpd %xmm0, %xmm1
        addpd   %xmm4, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L3

ICC unrolls the body once more. With -mavx2 we get

.L4:
        vmovupd (%rdx,%rsi), %xmm6
        vinsertf128 $0x1, 16(%rdx,%rsi), %ymm6, %ymm1
        vmovupd (%rcx,%rsi), %xmm7
        vmulpd  %ymm5, %ymm1, %ymm2
        vpermpd $177, %ymm1, %ymm1
        vmulpd  %ymm4, %ymm1, %ymm1
        vaddsubpd %ymm1, %ymm2, %ymm1
        vinsertf128 $0x1, 16(%rcx,%rsi), %ymm7, %ymm2
        vaddpd  %ymm2, %ymm1, %ymm1
        vmovups %xmm1, (%rcx,%rsi)
        vextractf128 $0x1, %ymm1, 16(%rcx,%rsi)
        addq    $32, %rsi
        cmpq    %rsi, %rdi
        jne     .L4

if I add -mfma, for example via -march=core-avx2, I get again

.L4:
        vpermpd $177, (%rdx,%rsi), %ymm2
        vmovapd %ymm4, %ymm1
        vmulpd  %ymm5, %ymm2, %ymm2
        vfmsub132pd (%rdx,%rsi), %ymm2, %ymm1
        vfmadd231pd (%rdx,%rsi), %ymm4, %ymm2
        vshufpd $10, %ymm2, %ymm1, %ymm1
        vaddpd  (%rcx,%rsi), %ymm1, %ymm1
        vmovupd %ymm1, (%rcx,%rsi)
        addq    $32, %rsi
        cmpq    %rsi, %rdi
        jne     .L4

showing that we lack vfmaddsub patterns or those present do not work properly. combine sees

   37: r109:V4DF={r127:V4DF*r105:V4DF+-r108:V4DF}
   38: r110:V4DF={r127:V4DF*r105:V4DF+r108:V4DF}
      REG_DEAD r127:V4DF
      REG_DEAD r108:V4DF
   39: r129:V4DF=vec_select(vec_concat(r109:V4DF,r110:V4DF),parallel)
      REG_DEAD r110:V4DF
      REG_DEAD r109:V4DF

unfortunately the fmaddsub patterns all use UNSPECs with the comment

;; It would be possible to represent these without the UNSPEC as
;;
;; (vec_merge
;;   (fma op1 op2 op3)
;;   (fma op1 op2 (neg op3))
;;   (merge-const))
;;
;; But this doesn't seem useful in practice.

The AVX512 ones do not seem to suffer from this, but using AVX512 via -march=knl also only results in

.L4:
        vmovupd (%rdx,%rax), %zmm1
        vpermpd $177, %zmm1, %zmm2
        vmovapd %zmm1, %zmm0
        vmulpd  %zmm4, %zmm2, %zmm2
        vfmsub132pd %zmm3, %zmm2, %zmm0
        vfmadd132pd %zmm3, %zmm2, %zmm1
        vshufpd $170, %zmm1, %zmm0, %zmm0
        vaddpd  (%rcx,%rax), %zmm0, %zmm0
        vmovupd %zmm0, (%rcx,%rax)
        leaq    64(%rax), %rax
        cmpq    %rax, %rsi
        jne     .L4

but maybe KNL doesn't have fmaddsub. Ah, with AVX512 we end up with vec_merge (like with SSE2) but the AVX256 case above shows concat/select (avx_shufpd256_1). Maybe combine itself can try both variants in case there's a duality between (vec_merge ...) and (vec_select (vec_concat ...))?
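For reference, the operation combine is trying to match is fmaddsub: even lanes compute a*b - c and odd lanes a*b + c, which is what the two FMAs plus vshufpd above reconstruct in three instructions. A scalar C sketch of the semantics (my own, not GCC code):

    /* Semantics of vfmaddsub on an n-element vector: even lanes get a
       fused multiply-subtract, odd lanes a fused multiply-add.  This
       mirrors the vec_merge (fma op1 op2 op3) / (fma op1 op2 (neg op3))
       form mentioned in the sse.md comment above.  */
    static void
    fmaddsub (int n, const double *a, const double *b,
              const double *c, double *r)
    {
      for (int i = 0; i < n; i++)
        r[i] = (i & 1) ? a[i] * b[i] + c[i] : a[i] * b[i] - c[i];
    }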
I think we should declare one of the forms canonical, and make simplify use that one, if possible. If we really want one form in some cases and another in other cases something like change_zero_ext will work.
On Thu, 3 Jan 2019, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939
>
> --- Comment #9 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> I think we should declare one of the forms canonical, and make simplify
> use that one, if possible. If we really want one form in some cases and
> another in other cases something like change_zero_ext will work.

Yeah, the main issue is that targets have existing patterns preferring either form. vec_merge avoids the need for wider intermediate modes, but with vec_select it's easier to allow constrained operands... So IMHO RTL shouldn't have both, but it would be convenient to not need that wider mode... it might ask for a vec_concat_and_select (aka vec_shuffle2). Another issue with vec_merge is the limit of the bitmask operand (CONST_INT), which limits us to 64 vector elements.

That said, kill vec_merge and allow

  (vec_select:V4DF
    (vec_concat:V8DF (reg:V4DF ...) (reg:V4DF ...))
    (parallel [(const_int 0) ...]))

to be written as

  (vec_shuffle:V4DF (reg:V4DF ...) (reg:V4DF ...)
    (parallel [(const_int 0) ...]))

note vec_concat operands are not constrained in any way but vec_shuffle would be? That is, in practice targets can probably get rid of vec_merge by using vec_select/vec_concat "easily", even if that means introducing "fake" larger vector modes.
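To illustrate the duality being discussed: a vec_merge of two V4DF vectors under a bitmask picks each lane from one of the two inputs, and the same result can always be written as a vec_select from their concatenation. Roughly, in C (a sketch with made-up helper names, only to show the equivalence):

    /* Blend two 4-element vectors: lane i takes a[i] when bit i of
       mask is set, else b[i] (the vec_merge form).  Equivalently,
       select index i or i+4 from the 8-element concatenation of a and
       b (the vec_select/vec_concat form).  */
    static void
    merge_v4df (const double a[4], const double b[4], unsigned mask,
                double r[4])
    {
      double concat[8];
      for (int i = 0; i < 4; i++)
        {
          concat[i] = a[i];
          concat[i + 4] = b[i];
        }
      for (int i = 0; i < 4; i++)
        r[i] = concat[((mask >> i) & 1) ? i : i + 4];
    }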
(In reply to rguenther@suse.de from comment #10) > even if that means introducing of "fake" larger vector modes. That would be a good reason to not do this, except many targets already do need double-length modes to describe their permute instructions?
(In reply to Richard Biener from comment #8)
> Maybe combine itself can try both variants in case there's a duality
> between (vec_merge ...) and (vec_select (vec_concat ...))?

Please note that combine splitters are used to help combine create the addsub pattern. Please also see the addsub_vm_operator predicate, which handles the vec_merge form, and addsub_vs_operator, which handles the vec_select/vec_concat combo. Please see PR66560. Probably these predicates and similar splitters can be used to handle the fmaddsub instructions.
On Thu, 3 Jan 2019, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939
>
> --- Comment #11 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #10)
> > even if that means introducing of "fake" larger vector modes.
>
> That would be a good reason to not do this, except many targets already
> do need double-length modes to describe their permute instructions?

Yes. I think there's no good need for vec_merge for those targets. vec_shuffle would of course remove the need for the larger modes.
With SSE4.2 we now get

.L3:
        movupd  (%rdx,%rax), %xmm0
        movupd  (%rcx,%rax), %xmm4
        movapd  %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm3, %xmm1
        mulpd   %xmm2, %xmm0
        addpd   %xmm4, %xmm1
        addsubpd %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L3

with AVX and FMA

.L4:
        vmovupd (%rdx,%rax), %ymm0
        vmovapd %ymm4, %ymm1
        vfmadd213pd (%rcx,%rax), %ymm0, %ymm1
        vpermilpd $5, %ymm0, %ymm0
        vmulpd  %ymm3, %ymm0, %ymm0
        vaddsubpd %ymm0, %ymm1, %ymm1
        vmovupd %ymm1, (%rcx,%rax)
        addq    $32, %rax
        cmpq    %rsi, %rax
        jne     .L4

so I'd say fixed. But. With AVX512 we now get

.L4:
        vmovupd (%rdi,%rax), %zmm0
        vmovapd %zmm7, %zmm2
        vmovapd %zmm4, %zmm6
        vfmadd213pd (%rcx,%rax), %zmm0, %zmm2
        vpermilpd $85, %zmm0, %zmm0
        vfmadd132pd %zmm0, %zmm2, %zmm6
        vfnmadd132pd %zmm4, %zmm2, %zmm0
        vmovapd %zmm6, %zmm0{%k1}
        vmovupd %zmm0, (%rcx,%rax)
        addq    $64, %rax
        cmpq    %rax, %rsi
        jne     .L4

it's odd that this only happens with -mprefer-vector-width=512 though. Do we possibly miss vec_{fm,}{addsub,subadd} for those? Looks like so. Tracking that in PR110767.

The vectorizer side is fixed.