Let's continue our complex dot product series started here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854
This time I have no code generation bugs for your pleasure, just "interesting" optimization issues.

All examples below, unless stated otherwise, were compiled with gcc 10.2 for x86-64 with the following sets of flags:
set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math

The kernel in question is an example of a complex dot product in the so-called "hybrid AoS" layout, a.k.a. AoSoA.
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays
In my experience it is quite rare, in dense complex linear algebra and similar computational fields, for AoSoA *not* to be the optimal internal form. So, practically, I consider these kernels more important than the AoS kernel presented in bug 96854.
More specifically, the layout can be described as

struct { double re[4], im[4]; };

but for simplicity I omitted the type definition from the code examples and coded it directly over flat arrays of doubles.

Part 1.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}

That's how we would like to write it in an ideal world, letting the compiler take care of the dirty details.
In the less ideal world we live in, gcc is not the only compiler that can't cope with it. MSVC (-W4 -O2 -fp:fast -arch:AVX2) also can't vectorize it. Even mighty icc generates code that is not exactly bad, but somewhat suboptimal. So, let it pass. I don't want to blame gcc for not being smart enough; that's just normal.
Except that when we use set2, the code generated by gcc becomes not just non-smart, but quite crazy. I am ignoring it in the hope that it will be magically fixed by the change made by Richard Biener on 2020-08-31.

Part 2.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

Here we are explaining it to the compiler slowly. For icc and MSVC that's enough; they understand. icc generates near-perfect code. I can write it nicer, but I don't expect my variant to be any faster. MSVC generates a near-perfect inner loop and an epilogue that is not great, but not really much slower.
gcc still doesn't get it. It still implements the 4 accumulators literally, as if -ffast-math were not there. But, sad as it is, that's still a case of not being smart enough, so I am not complaining.

Part 3.

static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

An attempt to feed the compiler by teaspoon. That's not the way I want to write code in an HLL.
icc copes, producing about the same code as in Part 1. MSVC doesn't understand such a Kunststück (I am sympathetic) and generates literal scalar code with local arrays on the stack. gcc with set1 is a little better than MSVC: the code is fully scalar, but at least the accumulators are kept in registers.
gcc with set2 is the most interesting. It vectorizes, but how? Here is the inner loop:

.L3:
        vpermpd $27, (%r8,%rax), %ymm2
        vpermpd $27, 32(%rdx,%rax), %ymm3
        vpermpd $27, (%rdx,%rax), %ymm1
        vpermpd $27, 32(%r8,%rax), %ymm0
        vmulpd  %ymm2, %ymm1, %ymm6
        vmulpd  %ymm2, %ymm3, %ymm2
        addq    $64, %rax
        vfnmadd132pd %ymm0, %ymm2, %ymm1
        vfmadd132pd %ymm3, %ymm6, %ymm0
        vaddpd  %ymm1, %ymm5, %ymm5
        vaddpd  %ymm0, %ymm4, %ymm4
        cmpq    %rcx, %rax
        jne     .L3

What is all this vpermpd business about? Shuffling SIMD lanes around just because it's fun? That's the first thing I do want to complain about. Not "not smart enough", but too smart for its own good.

And finally, Part 4.

static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere[4] = {0};
  double acc_imim[4] = {0};
  double acc_reim[4] = {0};
  double acc_imre[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere[k] += a[c*8+k+0]*b[c*8+k+0];
      acc_imim[k] += a[c*8+k+4]*b[c*8+k+4];
      acc_reim[k] += a[c*8+k+0]*b[c*8+k+4];
      acc_imre[k] += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  double acc_re[4];
  double acc_im[4];
  for (int k = 0; k < 4; ++k) {
    acc_re[k] = acc_rere[k]+acc_imim[k];
    acc_im[k] = acc_imre[k]-acc_reim[k];
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

Not just fed by teaspoon, but with the compiler's mouth held open manually, so to speak.
icc, of course, understands and generates pretty much the same good code as in Part 2. MSVC, of course, does not understand and generates arrays on the stack. gcc with set2, of course, continues to enjoy its juggling, doubling or tripling it compared to last time. Inner loop:

.L3:
        vmovupd (%rdx,%rax), %ymm1
        vmovupd 32(%rdx,%rax), %ymm0
        vmovupd 32(%r8,%rax), %ymm5
        vperm2f128 $49, %ymm1, %ymm0, %ymm3
        vinsertf128 $1, %xmm1, %ymm0, %ymm0
        vpermpd $221, %ymm0, %ymm10
        vpermpd $136, %ymm0, %ymm1
        vmovupd (%r8,%rax), %ymm0
        vpermpd $136, %ymm3, %ymm9
        vperm2f128 $49, %ymm5, %ymm0, %ymm2
        vinsertf128 $1, %xmm5, %ymm0, %ymm0
        vpermpd $40, %ymm2, %ymm11
        vpermpd $125, %ymm0, %ymm5
        vpermpd $221, %ymm3, %ymm3
        vpermpd $40, %ymm0, %ymm0
        vpermpd $125, %ymm2, %ymm2
        addq    $64, %rax
        vfmadd231pd %ymm2, %ymm3, %ymm8
        vfmadd231pd %ymm11, %ymm9, %ymm6
        vfmadd231pd %ymm5, %ymm10, %ymm7
        vfmadd231pd %ymm0, %ymm1, %ymm4
        cmpq    %rax, %rcx
        jne     .L3

But this time gcc with set1 was the real star of the show. My only reaction is "What?"
.L4: vmovupd 0(%r13), %ymm5 vmovupd 64(%r13), %ymm7 vmovupd 192(%r13), %ymm4 vmovupd 128(%r13), %ymm6 vunpcklpd 32(%r13), %ymm5, %ymm13 vunpckhpd 32(%r13), %ymm5, %ymm12 vunpckhpd 96(%r13), %ymm7, %ymm1 vunpcklpd 96(%r13), %ymm7, %ymm5 vmovupd 128(%r13), %ymm7 vunpcklpd 224(%r13), %ymm4, %ymm2 vunpcklpd 160(%r13), %ymm6, %ymm6 vunpckhpd 160(%r13), %ymm7, %ymm11 vunpckhpd 224(%r13), %ymm4, %ymm0 vpermpd $216, %ymm13, %ymm13 vpermpd $216, %ymm6, %ymm6 vpermpd $216, %ymm2, %ymm2 vpermpd $216, %ymm5, %ymm5 vunpcklpd %ymm2, %ymm6, %ymm4 vpermpd $216, %ymm1, %ymm1 vpermpd $216, %ymm11, %ymm11 vunpcklpd %ymm5, %ymm13, %ymm9 vpermpd $216, %ymm12, %ymm12 vpermpd $216, %ymm0, %ymm0 vpermpd $216, %ymm4, %ymm3 vpermpd $216, %ymm9, %ymm9 vunpckhpd %ymm2, %ymm6, %ymm4 vunpckhpd %ymm5, %ymm13, %ymm5 vunpcklpd %ymm1, %ymm12, %ymm6 vunpcklpd %ymm0, %ymm11, %ymm2 vunpckhpd %ymm1, %ymm12, %ymm12 vunpckhpd %ymm0, %ymm11, %ymm0 vpermpd $216, %ymm12, %ymm1 vunpcklpd %ymm3, %ymm9, %ymm11 vpermpd $216, %ymm5, %ymm5 vpermpd $216, %ymm4, %ymm4 vpermpd $216, %ymm0, %ymm0 vmovupd 64(%r12), %ymm15 vpermpd $216, %ymm6, %ymm6 vpermpd $216, %ymm11, %ymm8 vpermpd $216, %ymm2, %ymm2 vunpcklpd %ymm4, %ymm5, %ymm11 vunpckhpd %ymm3, %ymm9, %ymm9 vunpckhpd %ymm4, %ymm5, %ymm4 vunpcklpd %ymm0, %ymm1, %ymm5 vpermpd $216, %ymm9, %ymm3 vunpcklpd %ymm2, %ymm6, %ymm9 vunpckhpd %ymm2, %ymm6, %ymm2 vpermpd $216, %ymm5, %ymm6 vunpcklpd 96(%r12), %ymm15, %ymm12 vunpckhpd %ymm0, %ymm1, %ymm0 vmovupd %ymm6, 64(%rsp) vunpckhpd 96(%r12), %ymm15, %ymm6 vmovupd 128(%r12), %ymm15 vpermpd $216, %ymm0, %ymm5 vpermpd $216, %ymm9, %ymm7 vmovupd (%r12), %ymm0 vunpckhpd 160(%r12), %ymm15, %ymm9 vmovupd %ymm5, 96(%rsp) vunpcklpd 160(%r12), %ymm15, %ymm5 vmovupd 192(%r12), %ymm15 vunpcklpd 32(%r12), %ymm0, %ymm1 vpermpd $216, %ymm9, %ymm14 vunpcklpd 224(%r12), %ymm15, %ymm9 vunpckhpd 224(%r12), %ymm15, %ymm13 vunpckhpd 32(%r12), %ymm0, %ymm0 vpermpd $216, %ymm12, %ymm12 vpermpd $216, %ymm9, %ymm9 vpermpd $216, %ymm1, %ymm1 vpermpd $216, %ymm5, %ymm5 vpermpd $216, %ymm6, %ymm6 vunpcklpd %ymm12, %ymm1, %ymm10 vpermpd $216, %ymm0, %ymm0 vpermpd $216, %ymm13, %ymm13 vunpckhpd %ymm12, %ymm1, %ymm1 vunpcklpd %ymm9, %ymm5, %ymm12 vpermpd $216, %ymm12, %ymm12 vpermpd $216, %ymm10, %ymm10 vunpckhpd %ymm9, %ymm5, %ymm5 vunpcklpd %ymm6, %ymm0, %ymm9 vunpckhpd %ymm6, %ymm0, %ymm0 vunpcklpd %ymm13, %ymm14, %ymm6 vunpckhpd %ymm13, %ymm14, %ymm13 vpermpd $216, %ymm13, %ymm14 vunpcklpd %ymm12, %ymm10, %ymm13 vpermpd $216, %ymm13, %ymm13 vmulpd %ymm13, %ymm8, %ymm15 vpermpd $216, %ymm5, %ymm5 vpermpd $216, %ymm6, %ymm6 vpermpd $216, %ymm1, %ymm1 vpermpd $216, %ymm9, %ymm9 vpermpd $216, %ymm0, %ymm0 vunpckhpd %ymm12, %ymm10, %ymm10 vunpcklpd %ymm6, %ymm9, %ymm12 vunpckhpd %ymm6, %ymm9, %ymm9 vunpcklpd %ymm5, %ymm1, %ymm6 vunpckhpd %ymm5, %ymm1, %ymm1 vunpcklpd %ymm14, %ymm0, %ymm5 vunpckhpd %ymm14, %ymm0, %ymm0 vpermpd $216, %ymm0, %ymm0 vmovupd %ymm0, 160(%rsp) vmovq %r9, %xmm0 vaddsd %xmm15, %xmm0, %xmm0 vunpckhpd %xmm15, %xmm15, %xmm14 vpermpd $216, %ymm10, %ymm10 vaddsd %xmm14, %xmm0, %xmm0 vextractf128 $0x1, %ymm15, %xmm14 vmulpd %ymm10, %ymm8, %ymm8 vaddsd %xmm14, %xmm0, %xmm15 vunpckhpd %xmm14, %xmm14, %xmm14 vpermpd $216, %ymm12, %ymm12 vaddsd %xmm14, %xmm15, %xmm0 vmulpd %ymm10, %ymm3, %ymm15 vunpckhpd %xmm8, %xmm8, %xmm10 vmovq %xmm0, %r9 vmovq %rcx, %xmm0 vmulpd %ymm13, %ymm3, %ymm3 vaddsd %xmm15, %xmm0, %xmm0 vunpckhpd %xmm15, %xmm15, %xmm14 vextractf128 $0x1, %ymm15, %xmm15 vaddsd %xmm14, %xmm0, %xmm14 vpermpd $216, %ymm1, %ymm1 vmovupd %ymm1, 128(%rsp) 
vaddsd %xmm15, %xmm14, %xmm14 vunpckhpd %xmm15, %xmm15, %xmm15 vpermpd $216, %ymm2, %ymm2 vaddsd %xmm15, %xmm14, %xmm0 vmovsd 56(%rsp), %xmm14 vpermpd $216, %ymm9, %ymm9 vaddsd %xmm8, %xmm14, %xmm14 vextractf128 $0x1, %ymm8, %xmm8 vmovq %xmm0, %rcx vaddsd %xmm10, %xmm14, %xmm10 vpermpd $216, %ymm6, %ymm6 vpermpd $216, %ymm11, %ymm11 vaddsd %xmm8, %xmm10, %xmm10 vunpckhpd %xmm8, %xmm8, %xmm8 vpermpd $216, %ymm4, %ymm4 vaddsd %xmm8, %xmm10, %xmm0 vmovsd 48(%rsp), %xmm10 vunpckhpd %xmm3, %xmm3, %xmm8 vaddsd %xmm3, %xmm10, %xmm10 vextractf128 $0x1, %ymm3, %xmm3 vmovsd %xmm0, 56(%rsp) vaddsd %xmm8, %xmm10, %xmm8 vmulpd %ymm12, %ymm7, %ymm10 vmulpd %ymm9, %ymm7, %ymm7 vaddsd %xmm3, %xmm8, %xmm8 vunpckhpd %xmm3, %xmm3, %xmm3 vpermpd $216, %ymm5, %ymm5 vaddsd %xmm3, %xmm8, %xmm0 vunpckhpd %xmm10, %xmm10, %xmm3 addq $256, %r12 vmovsd %xmm0, 48(%rsp) vmovq %rdi, %xmm0 vaddsd %xmm10, %xmm0, %xmm8 vextractf128 $0x1, %ymm10, %xmm10 vmovq %rbx, %xmm0 vaddsd %xmm3, %xmm8, %xmm3 vmulpd %ymm9, %ymm2, %ymm8 vmulpd %ymm12, %ymm2, %ymm2 vaddsd %xmm10, %xmm3, %xmm3 vunpckhpd %xmm10, %xmm10, %xmm10 addq $256, %r13 vaddsd %xmm10, %xmm3, %xmm1 vaddsd %xmm8, %xmm0, %xmm10 vunpckhpd %xmm8, %xmm8, %xmm3 vextractf128 $0x1, %ymm8, %xmm8 vaddsd %xmm3, %xmm10, %xmm3 vmovq %xmm1, %rdi vmovq %r11, %xmm1 vaddsd %xmm8, %xmm3, %xmm3 vunpckhpd %xmm8, %xmm8, %xmm8 vmovq %r10, %xmm0 vaddsd %xmm8, %xmm3, %xmm3 vmovsd 40(%rsp), %xmm8 vaddsd %xmm7, %xmm8, %xmm8 vmovq %xmm3, %rbx vunpckhpd %xmm7, %xmm7, %xmm3 vaddsd %xmm3, %xmm8, %xmm3 vextractf128 $0x1, %ymm7, %xmm7 vaddsd %xmm7, %xmm3, %xmm3 vunpckhpd %xmm7, %xmm7, %xmm7 vaddsd %xmm7, %xmm3, %xmm3 vmovsd 32(%rsp), %xmm7 vaddsd %xmm2, %xmm7, %xmm7 vmovsd %xmm3, 40(%rsp) vunpckhpd %xmm2, %xmm2, %xmm3 vaddsd %xmm3, %xmm7, %xmm3 vextractf128 $0x1, %ymm2, %xmm2 vmulpd %ymm6, %ymm11, %ymm7 vaddsd %xmm2, %xmm3, %xmm3 vunpckhpd %xmm2, %xmm2, %xmm2 vaddsd %xmm2, %xmm3, %xmm2 vaddsd %xmm7, %xmm1, %xmm3 vmovupd 128(%rsp), %ymm1 vmovsd %xmm2, 32(%rsp) vunpckhpd %xmm7, %xmm7, %xmm2 vaddsd %xmm2, %xmm3, %xmm2 vextractf128 $0x1, %ymm7, %xmm7 vmulpd %ymm1, %ymm4, %ymm3 vaddsd %xmm7, %xmm2, %xmm2 vunpckhpd %xmm7, %xmm7, %xmm7 vmulpd %ymm1, %ymm11, %ymm1 vaddsd %xmm7, %xmm2, %xmm2 vaddsd %xmm3, %xmm0, %xmm7 vmulpd %ymm6, %ymm4, %ymm4 vmovq %xmm2, %r11 vunpckhpd %xmm3, %xmm3, %xmm2 vaddsd %xmm2, %xmm7, %xmm2 vextractf128 $0x1, %ymm3, %xmm3 vmovupd 64(%rsp), %ymm6 vaddsd %xmm3, %xmm2, %xmm2 vunpckhpd %xmm3, %xmm3, %xmm3 vmovupd 96(%rsp), %ymm7 vaddsd %xmm3, %xmm2, %xmm2 vmovsd 24(%rsp), %xmm3 vmovupd 160(%rsp), %ymm0 vaddsd %xmm1, %xmm3, %xmm3 vmovq %xmm2, %r10 vunpckhpd %xmm1, %xmm1, %xmm2 vaddsd %xmm2, %xmm3, %xmm2 vextractf128 $0x1, %ymm1, %xmm1 vmovq %rbp, %xmm3 vaddsd %xmm1, %xmm2, %xmm2 vunpckhpd %xmm1, %xmm1, %xmm1 vaddsd %xmm1, %xmm2, %xmm2 vunpckhpd %xmm4, %xmm4, %xmm1 vmovsd %xmm2, 24(%rsp) vmovsd 16(%rsp), %xmm2 vaddsd %xmm4, %xmm2, %xmm2 vextractf128 $0x1, %ymm4, %xmm4 vaddsd %xmm1, %xmm2, %xmm1 vaddsd %xmm4, %xmm1, %xmm1 vunpckhpd %xmm4, %xmm4, %xmm4 vaddsd %xmm4, %xmm1, %xmm4 vmovsd %xmm4, 16(%rsp) vmulpd %ymm6, %ymm5, %ymm4 vmulpd %ymm7, %ymm5, %ymm5 vaddsd %xmm4, %xmm3, %xmm1 vunpckhpd %xmm4, %xmm4, %xmm2 vmovq %rsi, %xmm3 vaddsd %xmm2, %xmm1, %xmm2 vextractf128 $0x1, %ymm4, %xmm1 vaddsd %xmm1, %xmm2, %xmm2 vunpckhpd %xmm1, %xmm1, %xmm1 vaddsd %xmm1, %xmm2, %xmm4 vmovq %xmm4, %rbp vmulpd %ymm0, %ymm7, %ymm4 vmulpd %ymm0, %ymm6, %ymm0 vaddsd %xmm4, %xmm3, %xmm1 vunpckhpd %xmm4, %xmm4, %xmm2 vaddsd %xmm2, %xmm1, %xmm2 vextractf128 $0x1, %ymm4, %xmm1 vaddsd %xmm1, %xmm2, %xmm2 vunpckhpd 
%xmm1, %xmm1, %xmm1 vaddsd %xmm1, %xmm2, %xmm4 vmovsd 8(%rsp), %xmm2 vunpckhpd %xmm0, %xmm0, %xmm1 vaddsd %xmm0, %xmm2, %xmm2 vextractf128 $0x1, %ymm0, %xmm0 vmovq %xmm4, %rsi vaddsd %xmm1, %xmm2, %xmm1 vaddsd %xmm0, %xmm1, %xmm1 vunpckhpd %xmm0, %xmm0, %xmm0 vaddsd %xmm0, %xmm1, %xmm6 vmovsd (%rsp), %xmm1 vunpckhpd %xmm5, %xmm5, %xmm0 vaddsd %xmm5, %xmm1, %xmm1 vextractf128 $0x1, %ymm5, %xmm5 vmovsd %xmm6, 8(%rsp) vaddsd %xmm0, %xmm1, %xmm0 vaddsd %xmm5, %xmm0, %xmm0 vunpckhpd %xmm5, %xmm5, %xmm5 vaddsd %xmm5, %xmm0, %xmm5 vmovsd %xmm5, (%rsp) cmpq %rax, %r12 jne .L4 movl %r15d, %r12d andl $-4, %r12d movl %r12d, %edx cmpl %r12d, %r15d je .L5 .L3: movl %r15d, %eax subl %r12d, %eax cmpl $1, %eax je .L6 salq $6, %r12 leaq (%r14,%r12), %r13 vmovupd 16(%r13), %xmm3 vmovupd 48(%r13), %xmm0 vmovupd 64(%r13), %xmm8 vmovupd 112(%r13), %xmm10 vmovupd 0(%r13), %xmm4 vmovupd 32(%r13), %xmm2 vmovupd 80(%r13), %xmm6 vmovupd 96(%r13), %xmm1 vunpcklpd %xmm3, %xmm4, %xmm5 vunpckhpd %xmm3, %xmm4, %xmm4 vunpcklpd %xmm0, %xmm2, %xmm3 vunpckhpd %xmm0, %xmm2, %xmm2 vunpcklpd %xmm6, %xmm8, %xmm0 vunpckhpd %xmm6, %xmm8, %xmm6 vunpcklpd %xmm10, %xmm1, %xmm8 vunpckhpd %xmm10, %xmm1, %xmm1 vunpcklpd %xmm3, %xmm5, %xmm11 vunpcklpd %xmm2, %xmm4, %xmm10 vunpckhpd %xmm3, %xmm5, %xmm3 vunpckhpd %xmm2, %xmm4, %xmm2 vunpcklpd %xmm8, %xmm0, %xmm5 vunpcklpd %xmm1, %xmm6, %xmm4 vunpckhpd %xmm8, %xmm0, %xmm0 vunpckhpd %xmm1, %xmm6, %xmm1 addq %r8, %r12 vunpcklpd %xmm5, %xmm11, %xmm8 vunpckhpd %xmm0, %xmm3, %xmm7 vunpckhpd %xmm5, %xmm11, %xmm11 vunpckhpd %xmm1, %xmm2, %xmm5 vmovupd 64(%r12), %xmm12 vunpcklpd %xmm1, %xmm2, %xmm6 vmovupd 80(%r12), %xmm9 vmovupd 48(%r12), %xmm1 vmovupd 96(%r12), %xmm2 vunpcklpd %xmm4, %xmm10, %xmm14 vunpcklpd %xmm0, %xmm3, %xmm13 vunpckhpd %xmm4, %xmm10, %xmm10 vmovupd 32(%r12), %xmm3 vmovupd 16(%r12), %xmm4 vmovapd %xmm7, 64(%rsp) vmovapd %xmm5, 96(%rsp) vmovupd 112(%r12), %xmm7 vmovupd (%r12), %xmm5 movl %eax, %r12d vunpcklpd %xmm4, %xmm5, %xmm15 vunpckhpd %xmm4, %xmm5, %xmm5 vunpcklpd %xmm1, %xmm3, %xmm4 vunpckhpd %xmm1, %xmm3, %xmm3 vunpcklpd %xmm9, %xmm12, %xmm1 vunpckhpd %xmm9, %xmm12, %xmm9 vunpcklpd %xmm7, %xmm2, %xmm12 vunpckhpd %xmm7, %xmm2, %xmm2 vunpcklpd %xmm4, %xmm15, %xmm7 vunpckhpd %xmm4, %xmm15, %xmm15 vunpcklpd %xmm12, %xmm1, %xmm4 vunpckhpd %xmm12, %xmm1, %xmm1 vunpcklpd %xmm3, %xmm5, %xmm12 vunpckhpd %xmm3, %xmm5, %xmm5 vunpcklpd %xmm2, %xmm9, %xmm3 vunpckhpd %xmm2, %xmm9, %xmm2 vunpcklpd %xmm4, %xmm7, %xmm9 vunpckhpd %xmm1, %xmm15, %xmm0 vunpckhpd %xmm4, %xmm7, %xmm4 vunpcklpd %xmm3, %xmm12, %xmm7 vunpckhpd %xmm3, %xmm12, %xmm3 vunpcklpd %xmm1, %xmm15, %xmm12 vunpcklpd %xmm2, %xmm5, %xmm15 vunpckhpd %xmm2, %xmm5, %xmm2 vmulpd %xmm9, %xmm8, %xmm5 vmovapd %xmm0, 128(%rsp) vmovq %r9, %xmm0 andl $-2, %r12d addl %r12d, %edx vaddsd %xmm5, %xmm0, %xmm0 vunpckhpd %xmm5, %xmm5, %xmm5 vaddsd %xmm5, %xmm0, %xmm1 vmulpd %xmm4, %xmm11, %xmm5 vmulpd %xmm4, %xmm8, %xmm4 vmovq %xmm1, %r9 vmovq %rcx, %xmm1 vmulpd %xmm9, %xmm11, %xmm11 vaddsd %xmm5, %xmm1, %xmm1 vunpckhpd %xmm5, %xmm5, %xmm5 vmulpd %xmm7, %xmm14, %xmm9 vaddsd %xmm5, %xmm1, %xmm1 vmovsd 56(%rsp), %xmm5 vmulpd %xmm3, %xmm10, %xmm8 vaddsd %xmm4, %xmm5, %xmm5 vunpckhpd %xmm4, %xmm4, %xmm4 vmovq %xmm1, %rcx vaddsd %xmm4, %xmm5, %xmm4 vmovq %rdi, %xmm1 vmulpd %xmm3, %xmm14, %xmm14 vmovsd %xmm4, 56(%rsp) vmovsd 48(%rsp), %xmm4 vmovq %rbx, %xmm0 vaddsd %xmm11, %xmm4, %xmm4 vunpckhpd %xmm11, %xmm11, %xmm11 vmovsd 40(%rsp), %xmm3 vaddsd %xmm11, %xmm4, %xmm4 vmulpd %xmm7, %xmm10, %xmm10 vaddsd %xmm14, %xmm3, %xmm3 vmovsd %xmm4, 48(%rsp) 
vaddsd %xmm9, %xmm1, %xmm4 vunpckhpd %xmm9, %xmm9, %xmm9 vunpckhpd %xmm14, %xmm14, %xmm14 vaddsd %xmm9, %xmm4, %xmm4 vmovapd 128(%rsp), %xmm5 vmovapd 64(%rsp), %xmm11 vmovq %xmm4, %rdi vaddsd %xmm8, %xmm0, %xmm4 vunpckhpd %xmm8, %xmm8, %xmm8 vmovsd 24(%rsp), %xmm1 vaddsd %xmm8, %xmm4, %xmm4 vmovsd 16(%rsp), %xmm0 vmovq %xmm4, %rbx vaddsd %xmm14, %xmm3, %xmm4 vmovsd 32(%rsp), %xmm3 vaddsd %xmm10, %xmm3, %xmm3 vunpckhpd %xmm10, %xmm10, %xmm10 vmovsd %xmm4, 40(%rsp) vaddsd %xmm10, %xmm3, %xmm7 vmulpd %xmm12, %xmm13, %xmm3 vmulpd %xmm5, %xmm13, %xmm13 vmovsd %xmm7, 32(%rsp) vmovq %r11, %xmm7 vmulpd %xmm11, %xmm12, %xmm12 vaddsd %xmm3, %xmm7, %xmm4 vunpckhpd %xmm3, %xmm3, %xmm3 vaddsd %xmm13, %xmm1, %xmm1 vaddsd %xmm3, %xmm4, %xmm7 vmulpd %xmm5, %xmm11, %xmm3 vunpckhpd %xmm13, %xmm13, %xmm13 vmovq %xmm7, %r11 vmovq %r10, %xmm7 vaddsd %xmm12, %xmm0, %xmm0 vaddsd %xmm3, %xmm7, %xmm4 vunpckhpd %xmm3, %xmm3, %xmm3 vunpckhpd %xmm12, %xmm12, %xmm12 vaddsd %xmm3, %xmm4, %xmm7 vaddsd %xmm13, %xmm1, %xmm4 vmovq %xmm7, %r10 vmovsd %xmm4, 24(%rsp) vaddsd %xmm12, %xmm0, %xmm4 vmulpd %xmm15, %xmm6, %xmm0 vmovq %rbp, %xmm7 vmovsd %xmm4, 16(%rsp) vmovapd 96(%rsp), %xmm5 vaddsd %xmm0, %xmm7, %xmm1 vunpckhpd %xmm0, %xmm0, %xmm0 vmovq %rsi, %xmm7 vaddsd %xmm0, %xmm1, %xmm4 vmulpd %xmm5, %xmm2, %xmm0 vmulpd %xmm2, %xmm6, %xmm2 vmovq %xmm4, %rbp vmulpd %xmm5, %xmm15, %xmm15 vaddsd %xmm0, %xmm7, %xmm1 vunpckhpd %xmm0, %xmm0, %xmm0 vaddsd %xmm0, %xmm1, %xmm4 vmovsd 8(%rsp), %xmm0 vaddsd %xmm2, %xmm0, %xmm0 vunpckhpd %xmm2, %xmm2, %xmm2 vmovq %xmm4, %rsi vaddsd %xmm2, %xmm0, %xmm6 vmovsd (%rsp), %xmm0 vaddsd %xmm15, %xmm0, %xmm0 vunpckhpd %xmm15, %xmm15, %xmm15 vmovsd %xmm6, 8(%rsp) vaddsd %xmm15, %xmm0, %xmm5 vmovsd %xmm5, (%rsp) cmpl %r12d, %eax je .L5
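For reference, here is roughly what I mean by "near-perfect" code: a hand-written AVX2/FMA intrinsics sketch (untested as posted, the function name is mine) of what any of the four variants above should ideally compile down to. The inner loop is just 4 loads and 4 FMAs per group of 4 complex numbers, with no lane shuffling at all.

#include <immintrin.h>

/* Hand-written sketch: four 4-wide vector accumulators in the loop,
   then a vector add/sub of the accumulators and two horizontal sums. */
void cdot_by_hand(double* res, const double* a, const double* b, int N)
{
  __m256d acc_rere = _mm256_setzero_pd();
  __m256d acc_imim = _mm256_setzero_pd();
  __m256d acc_reim = _mm256_setzero_pd();
  __m256d acc_imre = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d a_re = _mm256_loadu_pd(&a[c*8+0]);
    __m256d a_im = _mm256_loadu_pd(&a[c*8+4]);
    __m256d b_re = _mm256_loadu_pd(&b[c*8+0]);
    __m256d b_im = _mm256_loadu_pd(&b[c*8+4]);
    acc_rere = _mm256_fmadd_pd(a_re, b_re, acc_rere);
    acc_imim = _mm256_fmadd_pd(a_im, b_im, acc_imim);
    acc_reim = _mm256_fmadd_pd(a_re, b_im, acc_reim);
    acc_imre = _mm256_fmadd_pd(a_im, b_re, acc_imre);
  }
  __m256d re = _mm256_add_pd(acc_rere, acc_imim);   /* res[0] = rere+imim */
  __m256d im = _mm256_sub_pd(acc_imre, acc_reim);   /* res[4] = imre-reim */
  __m128d re2 = _mm_add_pd(_mm256_castpd256_pd128(re), _mm256_extractf128_pd(re, 1));
  __m128d im2 = _mm_add_pd(_mm256_castpd256_pd128(im), _mm256_extractf128_pd(im, 1));
  res[0] = _mm_cvtsd_f64(_mm_add_sd(re2, _mm_unpackhi_pd(re2, re2)));
  res[4] = _mm_cvtsd_f64(_mm_add_sd(im2, _mm_unpackhi_pd(im2, im2)));
}

Of course I can keep writing this kind of thing by hand; the point of the report is that the plain C in Parts 1-4 already contains all the information needed to get there.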
All below for Part 2. Without -ffast-math you are seeing GCC using in-order reductions now while with -ffast-math the vectorizer gets a bit confused about reassociations done before, for me producing .L3: vmovupd 32(%rsi,%rax), %ymm3 vmovupd (%rdx,%rax), %ymm7 vinsertf128 $1, (%rsi,%rax), %ymm3, %ymm0 vinsertf128 $1, 32(%rdx,%rax), %ymm7, %ymm2 vmovupd 32(%rsi,%rax), %ymm5 vpermpd $136, %ymm0, %ymm4 vpermpd $40, %ymm2, %ymm7 vpermpd $221, %ymm0, %ymm1 vpermpd $125, %ymm2, %ymm3 vperm2f128 $49, (%rsi,%rax), %ymm5, %ymm0 vmovupd (%rdx,%rax), %ymm2 vperm2f128 $49, 32(%rdx,%rax), %ymm2, %ymm2 addq $64, %rax vpermpd $136, %ymm0, %ymm5 vpermpd $221, %ymm0, %ymm0 vpermpd $40, %ymm2, %ymm8 vpermpd $125, %ymm2, %ymm2 vmulpd %ymm8, %ymm5, %ymm5 vmulpd %ymm2, %ymm0, %ymm0 vfmadd132pd %ymm3, %ymm5, %ymm1 vfmadd231pd %ymm7, %ymm4, %ymm0 vaddpd %ymm0, %ymm1, %ymm0 vaddpd %ymm0, %ymm6, %ymm6 cmpq %rcx, %rax jne .L3 -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction chains and this SLP reduction ends up looking like t5.c:7:21: note: Vectorizing SLP tree: t5.c:7:21: note: node 0x4100c20 (max_nunits=4, refcnt=2) t5.c:7:21: note: stmt 0 acc_imre_158 = acc_imre_3 + _34; t5.c:7:21: note: stmt 1 acc_reim_156 = acc_reim_1 + _8; t5.c:7:21: note: stmt 2 acc_imim_154 = _21 + acc_imim_35; t5.c:7:21: note: stmt 3 acc_rere_146 = _11 + acc_rere_29; t5.c:7:21: note: children 0x3f272e0 0x4100bb0 t5.c:7:21: note: node 0x3f272e0 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)> t5.c:7:21: note: stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)> t5.c:7:21: note: stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)> t5.c:7:21: note: stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)> t5.c:7:21: note: children 0x4100c20 t5.c:7:21: note: node 0x4100bb0 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _34 = _36 + _157; t5.c:7:21: note: stmt 1 _8 = _30 + _155; t5.c:7:21: note: stmt 2 _21 = _15 + _153; t5.c:7:21: note: stmt 3 _11 = _6 + _145; t5.c:7:21: note: children 0x4100920 0x4100b40 t5.c:7:21: note: node 0x4100920 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _36 = _37 + _73; t5.c:7:21: note: stmt 1 _30 = _32 + _71; t5.c:7:21: note: stmt 2 _15 = _10 + _69; t5.c:7:21: note: stmt 3 _6 = _31 + _61; t5.c:7:21: note: children 0x41004e0 0x41008b0 t5.c:7:21: note: node 0x41004e0 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _37 = _101 + _129; t5.c:7:21: note: stmt 1 _32 = _99 + _127; t5.c:7:21: note: stmt 2 _10 = _97 + _125; t5.c:7:21: note: stmt 3 _31 = _89 + _117; t5.c:7:21: note: children 0x3f2a550 0x3f28700 t5.c:7:21: note: node 0x3f2a550 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _101 = _88 * _94; t5.c:7:21: note: stmt 1 _99 = _86 * _96; t5.c:7:21: note: stmt 2 _97 = _94 * _96; t5.c:7:21: note: stmt 3 _89 = _86 * _88; t5.c:7:21: note: children 0x40b6990 0x3f29e00 t5.c:7:21: note: node 0x40b6990 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _88 = *_87; t5.c:7:21: note: stmt 1 _96 = *_95; t5.c:7:21: note: stmt 2 _96 = *_95; t5.c:7:21: note: stmt 3 _88 = *_87; t5.c:7:21: note: load permutation { 1 5 5 1 } t5.c:7:21: note: node 0x3f29e00 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _94 = *_93; t5.c:7:21: note: stmt 1 _86 = *_85; t5.c:7:21: note: stmt 2 _94 = *_93; t5.c:7:21: note: stmt 3 _86 = *_85; t5.c:7:21: note: load permutation { 5 1 5 1 } t5.c:7:21: note: node 0x3f28700 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _129 = _116 * _122; t5.c:7:21: note: stmt 1 _127 = _114 * _124; t5.c:7:21: note: stmt 2 _125 = _122 * _124; t5.c:7:21: note: stmt 
3 _117 = _114 * _116; t5.c:7:21: note: children 0x3f287e0 0x3f28770 t5.c:7:21: note: node 0x3f287e0 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _116 = *_115; t5.c:7:21: note: stmt 1 _124 = *_123; t5.c:7:21: note: stmt 2 _124 = *_123; t5.c:7:21: note: stmt 3 _116 = *_115; t5.c:7:21: note: load permutation { 2 6 6 2 } t5.c:7:21: note: node 0x3f28770 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _122 = *_121; t5.c:7:21: note: stmt 1 _114 = *_113; t5.c:7:21: note: stmt 2 _122 = *_121; t5.c:7:21: note: stmt 3 _114 = *_113; t5.c:7:21: note: load permutation { 6 2 6 2 } t5.c:7:21: note: node 0x41008b0 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _73 = _60 * _66; t5.c:7:21: note: stmt 1 _71 = _58 * _68; t5.c:7:21: note: stmt 2 _69 = _66 * _68; t5.c:7:21: note: stmt 3 _61 = _58 * _60; t5.c:7:21: note: children 0x4100290 0x4100810 t5.c:7:21: note: node 0x4100290 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _60 = *_59; t5.c:7:21: note: stmt 1 _68 = *_67; t5.c:7:21: note: stmt 2 _68 = *_67; t5.c:7:21: note: stmt 3 _60 = *_59; t5.c:7:21: note: load permutation { 0 4 4 0 } t5.c:7:21: note: node 0x4100810 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _66 = *_65; t5.c:7:21: note: stmt 1 _58 = *_57; t5.c:7:21: note: stmt 2 _66 = *_65; t5.c:7:21: note: stmt 3 _58 = *_57; t5.c:7:21: note: load permutation { 4 0 4 0 } t5.c:7:21: note: node 0x4100b40 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _157 = _144 * _150; t5.c:7:21: note: stmt 1 _155 = _142 * _152; t5.c:7:21: note: stmt 2 _153 = _150 * _152; t5.c:7:21: note: stmt 3 _145 = _142 * _144; t5.c:7:21: note: children 0x4100990 0x4100a50 t5.c:7:21: note: node 0x4100990 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _144 = *_143; t5.c:7:21: note: stmt 1 _152 = *_151; t5.c:7:21: note: stmt 2 _152 = *_151; t5.c:7:21: note: stmt 3 _144 = *_143; t5.c:7:21: note: load permutation { 3 7 7 3 } t5.c:7:21: note: node 0x4100a50 (max_nunits=4, refcnt=1) t5.c:7:21: note: stmt 0 _150 = *_149; t5.c:7:21: note: stmt 1 _142 = *_141; t5.c:7:21: note: stmt 2 _150 = *_149; t5.c:7:21: note: stmt 3 _142 = *_141; t5.c:7:21: note: load permutation { 7 3 7 3 } which eventually shows some non-obvious permute optimization opportunities. I'm currently working on a permute optimization phase btw. but for start only handling cases that do not help here. Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I get the reduction chain variant which optimizes to .L3: vmovupd 32(%rsi,%rax), %ymm6 vmovupd 32(%rdx,%rax), %ymm7 vmovupd (%rsi,%rax), %ymm5 vfmadd231pd (%rdx,%rax), %ymm6, %ymm0 vfmadd231pd (%rdx,%rax), %ymm5, %ymm3 vfmadd231pd (%rsi,%rax), %ymm7, %ymm1 addq $64, %rax vfmadd231pd %ymm6, %ymm7, %ymm2 cmpq %rcx, %rax jne .L3 even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). 
Which means the following source change helps:

void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

the reduction epilogue ends up like

        vextractf128 $0x1, %ymm3, %xmm4
        vaddpd  %xmm3, %xmm4, %xmm3
        vunpckhpd %xmm3, %xmm3, %xmm4
        vaddpd  %xmm3, %xmm4, %xmm3
        vextractf128 $0x1, %ymm2, %xmm4
        vaddpd  %xmm2, %xmm4, %xmm4
        vunpckhpd %xmm4, %xmm4, %xmm2
        vaddpd  %xmm4, %xmm2, %xmm2
        vextractf128 $0x1, %ymm1, %xmm4
        vaddpd  %xmm1, %xmm4, %xmm4
        vaddsd  %xmm2, %xmm3, %xmm2
        vunpckhpd %xmm4, %xmm4, %xmm1
        vaddpd  %xmm4, %xmm1, %xmm1
        vextractf128 $0x1, %ymm0, %xmm4
        vaddpd  %xmm0, %xmm4, %xmm4
        vunpckhpd %xmm4, %xmm4, %xmm0
        vaddpd  %xmm4, %xmm0, %xmm0
        vsubsd  %xmm1, %xmm0, %xmm0
        vzeroupper
        vmovsd  %xmm2, (%rdi)
        vmovsd  %xmm0, 32(%rdi)

which is not optimal since we miss the opportunity to vectorize the adds of the accumulators

  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
(In reply to Richard Biener from comment #1) > All below for Part 2. > > Without -ffast-math you are seeing GCC using in-order reductions now while > with -ffast-math the vectorizer gets a bit confused about reassociations done > before, for me producing > Just to understand, when you are saying "Without -ffast-math" does it includes my set1 == "-O3 -ffast-math -fno-associative-math" ? BTW, I like your phrasing: "a bit confused about reassociations" > .L3: > vmovupd 32(%rsi,%rax), %ymm3 > vmovupd (%rdx,%rax), %ymm7 > vinsertf128 $1, (%rsi,%rax), %ymm3, %ymm0 > vinsertf128 $1, 32(%rdx,%rax), %ymm7, %ymm2 > vmovupd 32(%rsi,%rax), %ymm5 > vpermpd $136, %ymm0, %ymm4 > vpermpd $40, %ymm2, %ymm7 > vpermpd $221, %ymm0, %ymm1 > vpermpd $125, %ymm2, %ymm3 > vperm2f128 $49, (%rsi,%rax), %ymm5, %ymm0 > vmovupd (%rdx,%rax), %ymm2 > vperm2f128 $49, 32(%rdx,%rax), %ymm2, %ymm2 > addq $64, %rax > vpermpd $136, %ymm0, %ymm5 > vpermpd $221, %ymm0, %ymm0 > vpermpd $40, %ymm2, %ymm8 > vpermpd $125, %ymm2, %ymm2 > vmulpd %ymm8, %ymm5, %ymm5 > vmulpd %ymm2, %ymm0, %ymm0 > vfmadd132pd %ymm3, %ymm5, %ymm1 > vfmadd231pd %ymm7, %ymm4, %ymm0 > vaddpd %ymm0, %ymm1, %ymm0 > vaddpd %ymm0, %ymm6, %ymm6 > cmpq %rcx, %rax > jne .L3 > > -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction > chains and this SLP reduction ends up looking like > > t5.c:7:21: note: Vectorizing SLP tree: > t5.c:7:21: note: node 0x4100c20 (max_nunits=4, refcnt=2) > t5.c:7:21: note: stmt 0 acc_imre_158 = acc_imre_3 + _34; > t5.c:7:21: note: stmt 1 acc_reim_156 = acc_reim_1 + _8; > t5.c:7:21: note: stmt 2 acc_imim_154 = _21 + acc_imim_35; > t5.c:7:21: note: stmt 3 acc_rere_146 = _11 + acc_rere_29; > t5.c:7:21: note: children 0x3f272e0 0x4100bb0 > t5.c:7:21: note: node 0x3f272e0 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)> > t5.c:7:21: note: stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)> > t5.c:7:21: note: stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)> > t5.c:7:21: note: stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)> > t5.c:7:21: note: children 0x4100c20 > t5.c:7:21: note: node 0x4100bb0 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _34 = _36 + _157; > t5.c:7:21: note: stmt 1 _8 = _30 + _155; > t5.c:7:21: note: stmt 2 _21 = _15 + _153; > t5.c:7:21: note: stmt 3 _11 = _6 + _145; > t5.c:7:21: note: children 0x4100920 0x4100b40 > t5.c:7:21: note: node 0x4100920 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _36 = _37 + _73; > t5.c:7:21: note: stmt 1 _30 = _32 + _71; > t5.c:7:21: note: stmt 2 _15 = _10 + _69; > t5.c:7:21: note: stmt 3 _6 = _31 + _61; > t5.c:7:21: note: children 0x41004e0 0x41008b0 > t5.c:7:21: note: node 0x41004e0 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _37 = _101 + _129; > t5.c:7:21: note: stmt 1 _32 = _99 + _127; > t5.c:7:21: note: stmt 2 _10 = _97 + _125; > t5.c:7:21: note: stmt 3 _31 = _89 + _117; > t5.c:7:21: note: children 0x3f2a550 0x3f28700 > t5.c:7:21: note: node 0x3f2a550 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _101 = _88 * _94; > t5.c:7:21: note: stmt 1 _99 = _86 * _96; > t5.c:7:21: note: stmt 2 _97 = _94 * _96; > t5.c:7:21: note: stmt 3 _89 = _86 * _88; > t5.c:7:21: note: children 0x40b6990 0x3f29e00 > t5.c:7:21: note: node 0x40b6990 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _88 = *_87; > t5.c:7:21: note: stmt 1 _96 = *_95; > t5.c:7:21: note: stmt 2 _96 = *_95; > t5.c:7:21: note: stmt 3 _88 = *_87; > t5.c:7:21: note: load permutation { 1 5 5 1 } > t5.c:7:21: note: node 0x3f29e00 (max_nunits=4, refcnt=1) 
> t5.c:7:21: note: stmt 0 _94 = *_93; > t5.c:7:21: note: stmt 1 _86 = *_85; > t5.c:7:21: note: stmt 2 _94 = *_93; > t5.c:7:21: note: stmt 3 _86 = *_85; > t5.c:7:21: note: load permutation { 5 1 5 1 } > t5.c:7:21: note: node 0x3f28700 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _129 = _116 * _122; > t5.c:7:21: note: stmt 1 _127 = _114 * _124; > t5.c:7:21: note: stmt 2 _125 = _122 * _124; > t5.c:7:21: note: stmt 3 _117 = _114 * _116; > t5.c:7:21: note: children 0x3f287e0 0x3f28770 > t5.c:7:21: note: node 0x3f287e0 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _116 = *_115; > t5.c:7:21: note: stmt 1 _124 = *_123; > t5.c:7:21: note: stmt 2 _124 = *_123; > t5.c:7:21: note: stmt 3 _116 = *_115; > t5.c:7:21: note: load permutation { 2 6 6 2 } > t5.c:7:21: note: node 0x3f28770 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _122 = *_121; > t5.c:7:21: note: stmt 1 _114 = *_113; > t5.c:7:21: note: stmt 2 _122 = *_121; > t5.c:7:21: note: stmt 3 _114 = *_113; > t5.c:7:21: note: load permutation { 6 2 6 2 } > t5.c:7:21: note: node 0x41008b0 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _73 = _60 * _66; > t5.c:7:21: note: stmt 1 _71 = _58 * _68; > t5.c:7:21: note: stmt 2 _69 = _66 * _68; > t5.c:7:21: note: stmt 3 _61 = _58 * _60; > t5.c:7:21: note: children 0x4100290 0x4100810 > t5.c:7:21: note: node 0x4100290 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _60 = *_59; > t5.c:7:21: note: stmt 1 _68 = *_67; > t5.c:7:21: note: stmt 2 _68 = *_67; > t5.c:7:21: note: stmt 3 _60 = *_59; > t5.c:7:21: note: load permutation { 0 4 4 0 } > t5.c:7:21: note: node 0x4100810 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _66 = *_65; > t5.c:7:21: note: stmt 1 _58 = *_57; > t5.c:7:21: note: stmt 2 _66 = *_65; > t5.c:7:21: note: stmt 3 _58 = *_57; > t5.c:7:21: note: load permutation { 4 0 4 0 } > t5.c:7:21: note: node 0x4100b40 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _157 = _144 * _150; > t5.c:7:21: note: stmt 1 _155 = _142 * _152; > t5.c:7:21: note: stmt 2 _153 = _150 * _152; > t5.c:7:21: note: stmt 3 _145 = _142 * _144; > t5.c:7:21: note: children 0x4100990 0x4100a50 > t5.c:7:21: note: node 0x4100990 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _144 = *_143; > t5.c:7:21: note: stmt 1 _152 = *_151; > t5.c:7:21: note: stmt 2 _152 = *_151; > t5.c:7:21: note: stmt 3 _144 = *_143; > t5.c:7:21: note: load permutation { 3 7 7 3 } > t5.c:7:21: note: node 0x4100a50 (max_nunits=4, refcnt=1) > t5.c:7:21: note: stmt 0 _150 = *_149; > t5.c:7:21: note: stmt 1 _142 = *_141; > t5.c:7:21: note: stmt 2 _150 = *_149; > t5.c:7:21: note: stmt 3 _142 = *_141; > t5.c:7:21: note: load permutation { 7 3 7 3 } > > which eventually shows some non-obvious permute optimization opportunities. > I'm currently working on a permute optimization phase btw. but for start > only handling cases that do not help here. > > Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I > get > the reduction chain variant which optimizes to > > .L3: > vmovupd 32(%rsi,%rax), %ymm6 > vmovupd 32(%rdx,%rax), %ymm7 > vmovupd (%rsi,%rax), %ymm5 > vfmadd231pd (%rdx,%rax), %ymm6, %ymm0 > vfmadd231pd (%rdx,%rax), %ymm5, %ymm3 > vfmadd231pd (%rsi,%rax), %ymm7, %ymm1 > addq $64, %rax > vfmadd231pd %ymm6, %ymm7, %ymm2 > cmpq %rcx, %rax > jne .L3 > > even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). It's very fragile. I made a tiny (and natural for my real app) change in the source (see below) and the nice MSVC-like inner loop disappeared. 
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim -= a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_reim+acc_imre;
}

> Which means the following source change helps:
>
> void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim += a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;
> }

IMHO, options like "no-tree-reassoc", including and especially within __attribute__((optimize(""))), are for people like you. People like me, i.e. not compiler guys, don't consider them, not only for production but even for a hobby, unless the hobby is compiler research.
Also, several years ago I was told by "you" (not you as Richard Biener, but "you" as "gcc maintainers", more specifically by Manuel López-Ibáñez; you, as Richard Biener, also took part in that discussion, but appeared to hold a different opinion, see bug 70255) that __attribute__((optimize(""))) can't be relied on in production. Back then we came to the conclusion that this statement has to be in the official docs. And indeed, since GCC 7, section 6.33.1 contains the following sentences:
"The optimize attribute should be used for debugging purposes only. It is not suitable in production code."

> the reduction epilogue ends up like
>
>         vextractf128 $0x1, %ymm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vunpckhpd %xmm3, %xmm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vextractf128 $0x1, %ymm2, %xmm4
>         vaddpd  %xmm2, %xmm4, %xmm4
>         vunpckhpd %xmm4, %xmm4, %xmm2
>         vaddpd  %xmm4, %xmm2, %xmm2
>         vextractf128 $0x1, %ymm1, %xmm4
>         vaddpd  %xmm1, %xmm4, %xmm4
>         vaddsd  %xmm2, %xmm3, %xmm2
>         vunpckhpd %xmm4, %xmm4, %xmm1
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vextractf128 $0x1, %ymm0, %xmm4
>         vaddpd  %xmm0, %xmm4, %xmm4
>         vunpckhpd %xmm4, %xmm4, %xmm0
>         vaddpd  %xmm4, %xmm0, %xmm0
>         vsubsd  %xmm1, %xmm0, %xmm0
>         vzeroupper
>         vmovsd  %xmm2, (%rdi)
>         vmovsd  %xmm0, 32(%rdi)
>
> which is not optimal since we miss the opportunity to vectorize the
> adds of the accumulators
>
> res[0] = acc_rere+acc_imim;
> res[4] = acc_imre-acc_reim;

The epilogue is a tricky matter. There are many ways to do it with about the same speed, and which variant is faster could depend on fine microarchitectural details that are not necessarily the same even on such similar CPUs as Skylake and Zen2, or Skylake and Ice Lake, or maybe even Skylake and Skylake-X (well, maybe the last one is an exaggeration). The optimal sequence also depends on the surrounding code, e.g. if I change the source to

  res[0] = acc_rere+acc_imim;
  res[1] = acc_reim-acc_imre; // results adjacent in memory

it could already be different. In the latter case it would likely be

  vaddpd     ymm_rere, ymm_imim, ymm_re
  vsubpd     ymm_reim, ymm_imre, ymm_im
  vhaddpd    ymm_im, ymm_re, ymm_reim
  vperm2f128 $1, ymm_reim, ymm_reim, ymm_reimH
  vaddpd     xmm_reimH, xmm_reim, xmm_reim

Even icc can't do it perfectly right now.
It would be nice (for you personally, at least) if in this case gcc generated better code than icc, but it is far [far [far...]] less important than robust handling of the inner loop.
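Roughly, in intrinsics terms, the epilogue I have in mind for that adjacent-results variant is something like this (an untested sketch; acc_* stand for the four 4-wide vector accumulators coming out of the loop):

#include <immintrin.h>

/* Sketch only: combine the accumulators with a vector add/sub and one
   vhaddpd, then fold the two 128-bit halves and store {re, im} with a
   single 128-bit store (res[0] and res[1] are adjacent in this variant). */
static inline void cdot_epilogue_adjacent(double* res,
                                          __m256d acc_rere, __m256d acc_imim,
                                          __m256d acc_reim, __m256d acc_imre)
{
  __m256d re   = _mm256_add_pd(acc_rere, acc_imim);   /* res[0] = rere+imim */
  __m256d im   = _mm256_sub_pd(acc_reim, acc_imre);   /* res[1] = reim-imre */
  __m256d reim = _mm256_hadd_pd(re, im);              /* {re01, im01, re23, im23} */
  __m128d lo   = _mm256_castpd256_pd128(reim);
  __m128d hi   = _mm256_extractf128_pd(reim, 1);
  _mm_storeu_pd(&res[0], _mm_add_pd(lo, hi));         /* {re, im} in one store */
}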
(In reply to Michael_S from comment #2) > (In reply to Richard Biener from comment #1) > > All below for Part 2. > > > > Without -ffast-math you are seeing GCC using in-order reductions now while > > with -ffast-math the vectorizer gets a bit confused about reassociations done > > before, for me producing > > > > Just to understand, when you are saying "Without -ffast-math" does it > includes my set1 == "-O3 -ffast-math -fno-associative-math" ? No, -fno-associative-math breaks it again. > BTW, I like your phrasing: "a bit confused about reassociations" > > > .L3: > > vmovupd 32(%rsi,%rax), %ymm3 > > vmovupd (%rdx,%rax), %ymm7 > > vinsertf128 $1, (%rsi,%rax), %ymm3, %ymm0 > > vinsertf128 $1, 32(%rdx,%rax), %ymm7, %ymm2 > > vmovupd 32(%rsi,%rax), %ymm5 > > vpermpd $136, %ymm0, %ymm4 > > vpermpd $40, %ymm2, %ymm7 > > vpermpd $221, %ymm0, %ymm1 > > vpermpd $125, %ymm2, %ymm3 > > vperm2f128 $49, (%rsi,%rax), %ymm5, %ymm0 > > vmovupd (%rdx,%rax), %ymm2 > > vperm2f128 $49, 32(%rdx,%rax), %ymm2, %ymm2 > > addq $64, %rax > > vpermpd $136, %ymm0, %ymm5 > > vpermpd $221, %ymm0, %ymm0 > > vpermpd $40, %ymm2, %ymm8 > > vpermpd $125, %ymm2, %ymm2 > > vmulpd %ymm8, %ymm5, %ymm5 > > vmulpd %ymm2, %ymm0, %ymm0 > > vfmadd132pd %ymm3, %ymm5, %ymm1 > > vfmadd231pd %ymm7, %ymm4, %ymm0 > > vaddpd %ymm0, %ymm1, %ymm0 > > vaddpd %ymm0, %ymm6, %ymm6 > > cmpq %rcx, %rax > > jne .L3 > > > > -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction > > chains and this SLP reduction ends up looking like > > > > t5.c:7:21: note: Vectorizing SLP tree: > > t5.c:7:21: note: node 0x4100c20 (max_nunits=4, refcnt=2) > > t5.c:7:21: note: stmt 0 acc_imre_158 = acc_imre_3 + _34; > > t5.c:7:21: note: stmt 1 acc_reim_156 = acc_reim_1 + _8; > > t5.c:7:21: note: stmt 2 acc_imim_154 = _21 + acc_imim_35; > > t5.c:7:21: note: stmt 3 acc_rere_146 = _11 + acc_rere_29; > > t5.c:7:21: note: children 0x3f272e0 0x4100bb0 > > t5.c:7:21: note: node 0x3f272e0 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)> > > t5.c:7:21: note: stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)> > > t5.c:7:21: note: stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)> > > t5.c:7:21: note: stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)> > > t5.c:7:21: note: children 0x4100c20 > > t5.c:7:21: note: node 0x4100bb0 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _34 = _36 + _157; > > t5.c:7:21: note: stmt 1 _8 = _30 + _155; > > t5.c:7:21: note: stmt 2 _21 = _15 + _153; > > t5.c:7:21: note: stmt 3 _11 = _6 + _145; > > t5.c:7:21: note: children 0x4100920 0x4100b40 > > t5.c:7:21: note: node 0x4100920 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _36 = _37 + _73; > > t5.c:7:21: note: stmt 1 _30 = _32 + _71; > > t5.c:7:21: note: stmt 2 _15 = _10 + _69; > > t5.c:7:21: note: stmt 3 _6 = _31 + _61; > > t5.c:7:21: note: children 0x41004e0 0x41008b0 > > t5.c:7:21: note: node 0x41004e0 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _37 = _101 + _129; > > t5.c:7:21: note: stmt 1 _32 = _99 + _127; > > t5.c:7:21: note: stmt 2 _10 = _97 + _125; > > t5.c:7:21: note: stmt 3 _31 = _89 + _117; > > t5.c:7:21: note: children 0x3f2a550 0x3f28700 > > t5.c:7:21: note: node 0x3f2a550 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _101 = _88 * _94; > > t5.c:7:21: note: stmt 1 _99 = _86 * _96; > > t5.c:7:21: note: stmt 2 _97 = _94 * _96; > > t5.c:7:21: note: stmt 3 _89 = _86 * _88; > > t5.c:7:21: note: children 0x40b6990 0x3f29e00 > > t5.c:7:21: note: node 0x40b6990 (max_nunits=4, refcnt=1) > > t5.c:7:21: 
note: stmt 0 _88 = *_87; > > t5.c:7:21: note: stmt 1 _96 = *_95; > > t5.c:7:21: note: stmt 2 _96 = *_95; > > t5.c:7:21: note: stmt 3 _88 = *_87; > > t5.c:7:21: note: load permutation { 1 5 5 1 } > > t5.c:7:21: note: node 0x3f29e00 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _94 = *_93; > > t5.c:7:21: note: stmt 1 _86 = *_85; > > t5.c:7:21: note: stmt 2 _94 = *_93; > > t5.c:7:21: note: stmt 3 _86 = *_85; > > t5.c:7:21: note: load permutation { 5 1 5 1 } > > t5.c:7:21: note: node 0x3f28700 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _129 = _116 * _122; > > t5.c:7:21: note: stmt 1 _127 = _114 * _124; > > t5.c:7:21: note: stmt 2 _125 = _122 * _124; > > t5.c:7:21: note: stmt 3 _117 = _114 * _116; > > t5.c:7:21: note: children 0x3f287e0 0x3f28770 > > t5.c:7:21: note: node 0x3f287e0 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _116 = *_115; > > t5.c:7:21: note: stmt 1 _124 = *_123; > > t5.c:7:21: note: stmt 2 _124 = *_123; > > t5.c:7:21: note: stmt 3 _116 = *_115; > > t5.c:7:21: note: load permutation { 2 6 6 2 } > > t5.c:7:21: note: node 0x3f28770 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _122 = *_121; > > t5.c:7:21: note: stmt 1 _114 = *_113; > > t5.c:7:21: note: stmt 2 _122 = *_121; > > t5.c:7:21: note: stmt 3 _114 = *_113; > > t5.c:7:21: note: load permutation { 6 2 6 2 } > > t5.c:7:21: note: node 0x41008b0 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _73 = _60 * _66; > > t5.c:7:21: note: stmt 1 _71 = _58 * _68; > > t5.c:7:21: note: stmt 2 _69 = _66 * _68; > > t5.c:7:21: note: stmt 3 _61 = _58 * _60; > > t5.c:7:21: note: children 0x4100290 0x4100810 > > t5.c:7:21: note: node 0x4100290 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _60 = *_59; > > t5.c:7:21: note: stmt 1 _68 = *_67; > > t5.c:7:21: note: stmt 2 _68 = *_67; > > t5.c:7:21: note: stmt 3 _60 = *_59; > > t5.c:7:21: note: load permutation { 0 4 4 0 } > > t5.c:7:21: note: node 0x4100810 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _66 = *_65; > > t5.c:7:21: note: stmt 1 _58 = *_57; > > t5.c:7:21: note: stmt 2 _66 = *_65; > > t5.c:7:21: note: stmt 3 _58 = *_57; > > t5.c:7:21: note: load permutation { 4 0 4 0 } > > t5.c:7:21: note: node 0x4100b40 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _157 = _144 * _150; > > t5.c:7:21: note: stmt 1 _155 = _142 * _152; > > t5.c:7:21: note: stmt 2 _153 = _150 * _152; > > t5.c:7:21: note: stmt 3 _145 = _142 * _144; > > t5.c:7:21: note: children 0x4100990 0x4100a50 > > t5.c:7:21: note: node 0x4100990 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _144 = *_143; > > t5.c:7:21: note: stmt 1 _152 = *_151; > > t5.c:7:21: note: stmt 2 _152 = *_151; > > t5.c:7:21: note: stmt 3 _144 = *_143; > > t5.c:7:21: note: load permutation { 3 7 7 3 } > > t5.c:7:21: note: node 0x4100a50 (max_nunits=4, refcnt=1) > > t5.c:7:21: note: stmt 0 _150 = *_149; > > t5.c:7:21: note: stmt 1 _142 = *_141; > > t5.c:7:21: note: stmt 2 _150 = *_149; > > t5.c:7:21: note: stmt 3 _142 = *_141; > > t5.c:7:21: note: load permutation { 7 3 7 3 } > > > > which eventually shows some non-obvious permute optimization opportunities. > > I'm currently working on a permute optimization phase btw. but for start > > only handling cases that do not help here. 
> > > > Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I > > get > > the reduction chain variant which optimizes to > > > > .L3: > > vmovupd 32(%rsi,%rax), %ymm6 > > vmovupd 32(%rdx,%rax), %ymm7 > > vmovupd (%rsi,%rax), %ymm5 > > vfmadd231pd (%rdx,%rax), %ymm6, %ymm0 > > vfmadd231pd (%rdx,%rax), %ymm5, %ymm3 > > vfmadd231pd (%rsi,%rax), %ymm7, %ymm1 > > addq $64, %rax > > vfmadd231pd %ymm6, %ymm7, %ymm2 > > cmpq %rcx, %rax > > jne .L3 > > > > even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). > > It's very fragile. I made a tiny (and natural for my real app) change in the > source (see below) and the nice MSVC-like inner loop disappeared. > > void cdot(double* res, const double* a, const double* b, int N) > { > double acc_rere = 0; > double acc_imim = 0; > double acc_reim = 0; > double acc_imre = 0; > for (int c = 0; c < N; ++c) { > for (int k = 0; k < 4; ++k) { > acc_rere += a[c*8+k+0]*b[c*8+k+0]; > acc_imim += a[c*8+k+4]*b[c*8+k+4]; > acc_reim -= a[c*8+k+0]*b[c*8+k+4]; > acc_imre += a[c*8+k+4]*b[c*8+k+0]; > } > } > res[0] = acc_rere+acc_imim; > res[4] = acc_reim+acc_imre; > } wow, OK ... :/ > > Which means > > the following source change helps: > > > > void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const > > double* a, const double* b, int N) > > { > > double acc_rere = 0; > > double acc_imim = 0; > > double acc_reim = 0; > > double acc_imre = 0; > > for (int c = 0; c < N; ++c) { > > for (int k = 0; k < 4; ++k) { > > acc_rere += a[c*8+k+0]*b[c*8+k+0]; > > acc_imim += a[c*8+k+4]*b[c*8+k+4]; > > acc_reim += a[c*8+k+0]*b[c*8+k+4]; > > acc_imre += a[c*8+k+4]*b[c*8+k+0]; > > } > > } > > res[0] = acc_rere+acc_imim; > > res[4] = acc_imre-acc_reim; > > } > > > > IMHO, options like "no-tree-reassoc", including and especially within > __attribute__((optimize(""))) are for people like you. > People like me, i.e. not compiler guys, don't consider them not only for > production, but even for hobby. Unless a hobby is compiler research. That's true. > Also, several years ago I was told by (not you, as Richard Biener, but "you" > as "gcc maintainers", more specifically, by Manuel López-Ibáñez. You, as > Richard Biener, also took part in discussion, but appeared to hold different > opinion, see 70255) told me that __attribute__((optimize(""))) can't be > relied on in production. Back than we came to conclusion that this statement > has to be in official docs. And indeed since GCC7 section 6.33.1 contains > the folowing sentences: > "The optimize attribute should be used for debugging purposes only. It is > not suitable in production code." Yeah, guess from my HPC background I'd say HPC is never "production" ;) I indeed would avoid optimized("") in say a github hosted project but IMHO it's as valid as splitting out the relevant function and compiling the TU with -fno-tree-reassoc - it's tuning the setup for max performance with a specific compiler and for a specific host. 
> > > the reduction epilogue ends up like > > > > vextractf128 $0x1, %ymm3, %xmm4 > > vaddpd %xmm3, %xmm4, %xmm3 > > vunpckhpd %xmm3, %xmm3, %xmm4 > > vaddpd %xmm3, %xmm4, %xmm3 > > vextractf128 $0x1, %ymm2, %xmm4 > > vaddpd %xmm2, %xmm4, %xmm4 > > vunpckhpd %xmm4, %xmm4, %xmm2 > > vaddpd %xmm4, %xmm2, %xmm2 > > vextractf128 $0x1, %ymm1, %xmm4 > > vaddpd %xmm1, %xmm4, %xmm4 > > vaddsd %xmm2, %xmm3, %xmm2 > > vunpckhpd %xmm4, %xmm4, %xmm1 > > vaddpd %xmm4, %xmm1, %xmm1 > > vextractf128 $0x1, %ymm0, %xmm4 > > vaddpd %xmm0, %xmm4, %xmm4 > > vunpckhpd %xmm4, %xmm4, %xmm0 > > vaddpd %xmm4, %xmm0, %xmm0 > > vsubsd %xmm1, %xmm0, %xmm0 > > vzeroupper > > vmovsd %xmm2, (%rdi) > > vmovsd %xmm0, 32(%rdi) > > > > which is not optimal since we miss the opportunity to vectorize the > > adds of the accumulators > > > > res[0] = acc_rere+acc_imim; > > res[4] = acc_imre-acc_reim; > > Epilogue is a tricky matter. > There are many ways to do it with about the same speed and which variant is > faster could depend on fine microarchitectural details, that could be not > the same even on such quite similar CPUs as Skylake and Zen2 or Skylake and > Ice Lake, or may be even Skylake and Skylake-X (well, may be the last one is > an exaggeration). The optimal sequence also depends on surrounding, e.g. if > I change the source to: > res[0] = acc_rere+acc_imim; > res[1] = acc_reim-acc_imre; // results adjacent in memory > It could already be different. > In later case it likely would be > vaddpd ymm_rere,ymm_imim,ymm_re > vsubpd ymm_reim,ymm_imre,ymm_im > vhadd ymm_im,ymm_re,ymm_reim > vperm2f128 $1, ymm_reim, ymm_reim, ymm_reimH > vaddpd xmm2_reimH, xmm_reim, xmm_reim > > Even icc can't do it perfectly right now. > It would be nice (for you personally, at least) if in this case gcc will > generate better code than icc, but it is far [far [far...]] less important > than robust handling of inner loop. Yeah, the issue really is that with reassociation we miss the obviously better reduction scheme and the vectorizer cannot translate between one and the other at the moment (it doesn't itself try reassociating other than that implied by vectorizing). We could try to add this capability without too many issues I guess - the main problem will be that it's decision upfront for the reduction scheme and that it wouldn't be one based on actual costs. Thanks for the insightful loop kernels btw!